...

This document is based on the tests mentioned on the NMIS Device Troubleshooting Process page.

Table of Contents

...

Problem description

To run the Troubleshooting Wizard, we need a very detailed description from the client that gives us a good overview of the situation the client is facing.

...

e) Additional details, for example: at a minimum, the current NMIS cron settings; the current settings in the /etc/mongod.conf file; the database parameter settings in /usr/local/omk/conf/opCommon.nmis; whether any file was recently modified; and whether any configuration change made on the server or on the computers caused the incident.

Troubleshooting Wizard: Install and run

The installation script (01_TS_Wizard_OMK.sh) and the two complementary scripts it needs to run (Busqueda.pl and config_backup_LATAM.pl) can be downloaded from the following GitHub link: https://github.com/tom-tics/TS_Wizard_NMIS8_OPMANTEK

All three must be downloaded and then uploaded, using an FTP client (such as FileZilla), to the same folder on the server where the analysis will be performed.

Once the three files are on the server, the installation script must be executed with the command: sh 01_TS_Wizard_OMK.sh

...

The Main Menu is also shown, listing the options that we can access:

  1. Execute Healthcheck: we can perform a complete review of the server.
  2. NMIS Configuration Consistency: we will be able to review the consistency of the most important NMIS configuration files.
  3. Nodes Troubleshooter: we will be able to review the behavior of the nodes added to NMIS.
  4. Smart Diagnostic: creates a full system diagnostic in a .tar.gz file, which can be attached in case a ticket needs to be opened with Opmantek Support.
  5. Create System Backup File: creates a .tar.gz file that will contain a backup of the /etc/* and /usr/local/* folders.
  6. Execute Support Automation Tool: generates an NMIS and an OMK support file, which can be attached in case a ticket needs to be opened with Opmantek Support.

Troubleshooting Wizard: Features

1. Execute Healthcheck

You can choose between different options, which are shown below:


1. TOP

This command shows all the processes currently running on the server, along with the percentage of CPU and RAM utilization.

...

  • Check disk partitions.
  • Clean up log files that take up too much space.
  • Delete cache.


2. System date and time

It is very important that the server has the correct date and time configured, according to the time zone of each client. Many processes run at specific times and, likewise, the system logs and the file modification records contain timestamps that make it possible to pinpoint when an error occurred.

This section is therefore included so that the operator can confirm that the system date and time are correct. At the end, if the server does not have NTP activated, a tip is displayed to contact the system administrator and verify it.

3. Disk R/W

With this analysis, we can detect whether there is a physical failure in the server's disks.

...

Similarly, iostat -x 5 4 is run to monitor the I/O load of the system. If %util is high, it is very likely that there is a problem that could even lead to data loss; this is signaled at the end of the command execution.
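As a rough illustration of the %util check, the sketch below reads iostat -x output and prints the devices above a limit. The function name flag_busy_disks and the default limit of 80% are assumptions for this example; the wizard's own threshold is not stated here.

```shell
#!/bin/sh
# Print "device %util" for every device whose %util exceeds a limit.
# Input: the output of `iostat -x` on stdin.
# flag_busy_disks and the 80% default are hypothetical, not taken from the wizard.
flag_busy_disks() {
    limit="${1:-80}"
    awk -v limit="$limit" '
        # Header row of each report: locate the %util column by name,
        # because its position differs between sysstat versions.
        $1 ~ /^Device/ {
            col = 0
            for (i = 1; i <= NF; i++) if ($i == "%util") col = i
            next
        }
        # Blank lines and the avg-cpu block end the device table.
        NF == 0 || $1 ~ /^avg-cpu/ { col = 0; next }
        col > 0 && NF >= col && ($col + 0) > limit {
            printf "%s %s\n", $1, $col
        }
    '
}

# Possible usage on the server:
#   iostat -x 5 4 | flag_busy_disks 80
```

Note that the first report printed by iostat covers the time since boot, so the later reports are the ones that reflect the current load.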


4. Filesystem

It shows a detailed analysis of the space in each of the system's filesystems, to verify that the incident is not being caused by a lack of space on the server. It also shows a tip: if usage is above 85% on any of the filesystems, contact the administrator so that space can be freed.

It also executes a command to show the usage of the system's RAM and swap memory, with a tip to contact the administrator and find out what is happening if the usage percentage is high.
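The filesystem check described above can be approximated with df. This is a minimal sketch, assuming the POSIX df -P output layout; the function name flag_full_filesystems is hypothetical, while the 85% limit comes from the tip mentioned above.

```shell
#!/bin/sh
# Print "mountpoint usage%" for filesystems above a usage limit.
# Input: the output of `df -P` on stdin (mount points without spaces assumed).
# flag_full_filesystems is a hypothetical helper, not part of the wizard.
flag_full_filesystems() {
    limit="${1:-85}"
    awk -v limit="$limit" '
        NR == 1 { next }              # skip the header row
        {
            use = $5                  # fifth column is the capacity, e.g. "91%"
            sub(/%$/, "", use)
            if (use + 0 > limit) printf "%s %s%%\n", $6, use
        }
    '
}

# Possible usage on the server:
#   df -P | flag_full_filesystems 85
```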


5. Service status

A check of each of the system daemons is run to verify that all essential processes are running correctly.

...

If the daemon remains down, its log should be reviewed and analyzed to see what is happening.

6. Load average

This test shows the average load of the system over a defined period of time.

...

  • If the averages are 0.0, then the system is idle.
  • If the 1 minute average is higher than the 5 or 15 minute averages, then the load is increasing.
  • If the 1 minute average is lower than the 5 or 15 minute averages, then the load is decreasing.
  • If the averages are higher than the CPU count, you may have a performance issue.
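The rules above can be sketched as a small helper. The function name interpret_load is hypothetical. Two choices here are assumptions: when the rules overlap, "increasing" is tested first, and a single average above the CPU count is enough to raise the warning.

```shell
#!/bin/sh
# Interpret three load averages against a CPU count.
# Arguments: 1-minute, 5-minute, 15-minute load average, number of CPUs.
# interpret_load is a hypothetical helper, not part of the wizard.
interpret_load() {
    awk -v l1="$1" -v l5="$2" -v l15="$3" -v cpus="$4" 'BEGIN {
        if (l1 == 0 && l5 == 0 && l15 == 0) print "idle"
        else if (l1 > l5 || l1 > l15)       print "increasing"
        else if (l1 < l5 || l1 < l15)       print "decreasing"
        else                                print "steady"
        # Any average above the CPU count hints at a performance issue.
        if (l1 > cpus || l5 > cpus || l15 > cpus) print "possible performance issue"
    }'
}

# Possible usage on Linux:
#   interpret_load $(cut -d' ' -f1-3 /proc/loadavg) "$(nproc)"
```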


7. Top 5 processes by CPU and Memory

Shows the 5 processes using the highest percentage of CPU on the server, along with their CPU and memory details.

At the end, it shows a tip: if any process exceeds 85% of CPU or memory, it should be investigated, since it could be a process that is hung or not responding.
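A minimal sketch of the 85% check follows. It assumes the process list comes from ps with the columns pid, %cpu, %mem and comm, in that order; the function name flag_heavy_processes is hypothetical and the wizard may obtain the list differently.

```shell
#!/bin/sh
# Print processes whose CPU or memory percentage exceeds a limit.
# Input: lines of "PID %CPU %MEM COMMAND" on stdin, header row first.
# flag_heavy_processes is a hypothetical helper, not part of the wizard.
flag_heavy_processes() {
    limit="${1:-85}"
    awk -v limit="$limit" '
        NR == 1 { next }   # skip the ps header row
        ($2 + 0) > limit || ($3 + 0) > limit {
            printf "PID %s (%s): cpu=%s%% mem=%s%%\n", $1, $4, $2, $3
        }
    '
}

# Possible usage on Linux (procps):
#   ps -eo pid,%cpu,%mem,comm --sort=-%cpu | head -n 6 | flag_heavy_processes 85
```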



8. Tcpdump

The tcpdump command captures, in a file, the traffic on the network where the client's server is located.

...

When the command execution finishes, 2 .pcap files are created in /tmp so they can be downloaded and analyzed with Wireshark.


9. Local IP routing table

It shows the status and configuration of the IP routing tables, which determine how packets are sent across the different networks configured on the server.

10. List of logged-in users

It shows which users are using the shell at that moment. This helps keep better track of the people who access the server and, on some occasions, of those who modify an important system file.

11. Log user audit

It is important to know the login history of each user of the system. This helps determine whether any of them made a change that could have caused the system to malfunction.

...

At the end, a tip is displayed: if the operator observes many failed authentication attempts, they should contact the users to find out what is happening.


12. Show last used commands

This review goes hand in hand with the previous point and shows the last 30 commands executed on the server.

It also shows the 10 most used commands from that list of 30 and the number of times each has been executed.
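The counting step can be sketched with standard text tools. This example assumes the commands are read from a history file such as ~/.bash_history; how the wizard actually obtains them is not stated here, and top_commands is a hypothetical name.

```shell
#!/bin/sh
# Count the most frequent command names among the last N lines of a history file.
# Arguments: history file, number of lines to inspect (default 30),
#            number of commands to report (default 10).
# top_commands is a hypothetical helper, not part of the wizard.
top_commands() {
    history_file="$1"
    last="${2:-30}"
    top="${3:-10}"
    tail -n "$last" "$history_file" |
        awk 'NF { print $1 }' |      # keep only the command name
        sort | uniq -c | sort -rn |
        head -n "$top" |
        awk '{ print $1, $2 }'       # normalise to "count command"
}

# Possible usage:
#   top_commands "$HOME/.bash_history" 30 10
```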


13. Show DNS config

Reviewing the /etc/resolv.conf file is important, since it shows whether the configuration of the domain names and the redirection to the relevant IP addresses is correct.

This confirms that the structure of the file is correct.
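One simple structural check is to confirm that the resolver file (normally /etc/resolv.conf) contains at least one nameserver entry. The sketch below is only an illustration under that assumption; check_resolv_conf is a hypothetical name and the wizard may validate more than this.

```shell
#!/bin/sh
# Report how many nameserver entries a resolver configuration file contains.
# Returns 0 if at least one is found, 1 if none, 2 if the file is unreadable.
# check_resolv_conf is a hypothetical helper, not part of the wizard.
check_resolv_conf() {
    file="${1:-/etc/resolv.conf}"
    [ -r "$file" ] || { echo "cannot read $file"; return 2; }
    count=$(awk '$1 == "nameserver" && NF >= 2 { n++ } END { print n + 0 }' "$file")
    if [ "$count" -eq 0 ]; then
        echo "no nameserver entries in $file"
        return 1
    fi
    echo "$count nameserver entries in $file"
}
```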


14. Internet web test

A test sends three packets to a Google server to verify the server's internet connectivity. Connectivity is needed to download and update packages directly from the console with tools such as yum and cpan.

It also shows the public IP of the server.

2. NMIS Configuration Consistency

You can choose between different options, which are shown below:


1. Check NMIS code

Checks the syntax of the configuration files in the /usr/local/nmis8/* folder and shows whether there are any errors in the code.

A tip is displayed for the operator to review any files found to have inconsistencies.

2. Perform a configuration backup

Makes a backup copy of the configuration directories to preserve all the settings made by the client.

...

The program displays the tree of the backed up files and folders and the name of the generated .tar.gz file.


3. Compare file configurations

Allows you to compare the files:

...

The goal is to find any inconsistency in the configuration that may be causing a problem with NMIS and/or the modules.

4. Execute fixperms routine

It automatically executes the /usr/local/nmis8/admin/fixperms.pl command, which allows the operator to correct the permissions of all system files at once.

5. Model checking

Runs a variable-length check and syntax validation on the model files in the /usr/local/nmis8/models/* folder.

...

If the script finds any issue, it flags it and, at the end, gives a tip for the operator to review that inconsistency.

6. Crontab checking

Runs a configuration check of each of the cron files that NMIS and the modules work with, to verify that no routine is causing a conflict that could affect system operation.

It also runs ll in /etc/cron.d/ to check that there are no backups inside that folder, since they can cause problems for task execution. If backups are found, a tip advises moving them to another folder or deleting them.
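Depending on the cron implementation, a copy left in /etc/cron.d/ may be loaded alongside the original file. The sketch below lists files whose names look like backups. The suffix patterns (.bak, .old, .orig, .save and a trailing ~) are assumptions chosen for this example, and find_cron_backups is a hypothetical name.

```shell
#!/bin/sh
# List files in a cron directory whose names look like backup copies.
# The suffix patterns are illustrative assumptions; adjust them to local habits.
# find_cron_backups is a hypothetical helper, not part of the wizard.
find_cron_backups() {
    dir="${1:-/etc/cron.d}"
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        case "${f##*/}" in
            *.bak|*.old|*.orig|*.save|*~) echo "$f" ;;
        esac
    done
}

# Possible usage:
#   find_cron_backups /etc/cron.d
```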


7. Verify CPAN libraries

Runs a check of the CPAN libraries and displays which ones are missing so the operator can install them if needed.


8. Last changed files

Runs a search for the last modified files in different directories:

...

At the end, a tip is displayed for the operator to check if any recent file changes are causing a problem in the system.
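A search of this kind can be sketched with find. The function name recent_changes and the one-day default window are assumptions for illustration; the directories the wizard actually scans are the ones listed above.

```shell
#!/bin/sh
# List regular files under a directory that were modified within the last N days.
# Arguments: directory, number of days (default 1).
# recent_changes is a hypothetical helper, not part of the wizard.
recent_changes() {
    dir="$1"
    days="${2:-1}"
    find "$dir" -type f -mtime "-$days" 2>/dev/null | sort
}

# Possible usage:
#   recent_changes /usr/local/nmis8/conf 2
```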


3. Nodes Troubleshooter

You can choose between different options, which are shown below:


1. Polling summary

Executes the command /usr/local/nmis8/admin/polling_summary.pl, which shows how long the server takes to collect information from the nodes added to NMIS and whether any operations are failing or have never been performed (such as SNMP queries).

At the end, you can see a summary of how many nodes have a late collect and, by pressing the l key, you can send this summary to a file so that it can be downloaded from the server.

2. Traceroute

It tracks, in real time, the path taken by a packet on an IP network from source to destination, reporting the IP addresses of all the routers it passes through.

Enter the IP or hostname of the node and the script will return the result, displaying a tip for the operator if any abnormal behavior is observed.


3. MTR

Allows you to analyze the connection between the server where the command is executed and the destination host specified by the user.

Enter the IP or hostname of the node and the script will return the result, displaying a tip for the operator if any abnormal behavior is observed.


4. Ping

It allows you to test whether a particular host is reachable through the network configured on the server and measure the time it takes for packets to be sent and received.

Enter the IP or hostname of the node and the script will return the result, displaying a tip for the operator if any abnormal behavior is observed.


5. SNMP


Allows you to query the SNMP data of a device. The snmpwalk command is used because it chains the requests automatically, so the user does not have to enter a separate command for each OID or node within a subtree.

...

The script supports SNMPv1, SNMPv2 and SNMPv3 queries and, at the end, shows a tip for the operator to consult the administrator if the device has problems responding.
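For reference, the sketch below prints a Net-SNMP snmpwalk command line for each version. It is an illustration, not the wizard's code: the function name snmpwalk_cmd, the SNMP_AUTH_PASS and SNMP_PRIV_PASS variables, the default OID (the system subtree) and the SHA/AES authPriv settings for SNMPv3 are all assumptions that must be matched to the device.

```shell
#!/bin/sh
# Print the snmpwalk command line for a given SNMP version.
# Usage: snmpwalk_cmd VERSION HOST CREDENTIAL [OID]
#   VERSION 1 or 2c: CREDENTIAL is the community string.
#   VERSION 3:       CREDENTIAL is the user name; the passphrases are read from
#                    SNMP_AUTH_PASS and SNMP_PRIV_PASS (hypothetical variables).
# SHA/AES with authPriv is an assumed device configuration.
snmpwalk_cmd() {
    version="$1"; host="$2"; cred="$3"; oid="${4:-1.3.6.1.2.1.1}"
    case "$version" in
        1|2c)
            echo "snmpwalk -v$version -c $cred $host $oid" ;;
        3)
            echo "snmpwalk -v3 -l authPriv -u $cred -a SHA -A ${SNMP_AUTH_PASS:-} -x AES -X ${SNMP_PRIV_PASS:-} $host $oid" ;;
        *)
            echo "unsupported SNMP version: $version" >&2
            return 1 ;;
    esac
}

# Possible usage:
#   snmpwalk_cmd 2c 192.0.2.10 public
```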



6. Update nodes

Allows you to perform an update to a specific node, using its hostname.

The following command is executed: /usr/local/nmis8/bin/nmis.pl type=update node='node' force=1 debug=1


7. Collect nodes

Allows you to perform a collect to a specific node, using its hostname.

The following command is executed: /usr/local/nmis8/bin/nmis.pl type=collect node='node' force=1 debug=1
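The update and collect options differ only in the type argument. A minimal sketch of building either command line for a given hostname is shown below; nmis_node_cmd is a hypothetical helper name, while the path and arguments are the ones quoted above.

```shell
#!/bin/sh
# Build the nmis.pl command line for an update or a collect of a single node.
# Arguments: operation (update or collect), node hostname.
# nmis_node_cmd is a hypothetical helper, not part of the wizard.
nmis_node_cmd() {
    op="$1"; node="$2"
    case "$op" in
        update|collect) ;;
        *) echo "operation must be update or collect" >&2; return 1 ;;
    esac
    echo "/usr/local/nmis8/bin/nmis.pl type=$op node=$node force=1 debug=1"
}

# Possible usage:
#   nmis_node_cmd collect router1
```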


8. Event search

It allows searches in the /usr/local/nmis8/logs/ and /usr/local/omk/logs/ folders, which will make it easier for the operator to investigate any fact or event that is causing a failure in the server.

Enter the word or words to search for in order to carry out the operation. At the end, the search results will be stored in a text file so that they can be extracted from the server and analyzed more easily. -PENDING-
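A search of this kind, including saving the results to a file, could look like the sketch below. It is only an illustration: search_logs is a hypothetical name, the search is case-insensitive by assumption, and the output file should be placed outside the searched folders.

```shell
#!/bin/sh
# Search log directories for a term and save the matching lines to a file.
# Usage: search_logs TERM OUTFILE DIR...
# Matching is case-insensitive and literal (no regular expressions).
# search_logs is a hypothetical helper, not part of the wizard.
search_logs() {
    term="$1"; outfile="$2"; shift 2
    # grep exits non-zero when nothing matches; that is not an error here.
    grep -rinF -- "$term" "$@" > "$outfile" 2>/dev/null || true
    matches=$(wc -l < "$outfile" | tr -d ' ')
    echo "$matches matching lines saved to $outfile"
}

# Possible usage:
#   search_logs "timeout" /tmp/event_search.txt /usr/local/nmis8/logs /usr/local/omk/logs
```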


9. Nodes.nmis backup

Allows you to make a backup of the current Nodes.nmis file, located in /usr/local/nmis8/conf/.

This is very important for the operator, especially before making any changes that have to do with the equipment added to NMIS.
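A backup of this kind can be sketched as a timestamped copy. The naming scheme (original name plus date and a .bak suffix) and the function name backup_file are assumptions for this example; the wizard may name or place its backup differently.

```shell
#!/bin/sh
# Copy a file to a timestamped backup next to the original and print the new path.
# The "<name>.<YYYYmmdd_HHMMSS>.bak" naming scheme is an assumption.
# backup_file is a hypothetical helper, not part of the wizard.
backup_file() {
    src="$1"
    [ -f "$src" ] || { echo "not found: $src" >&2; return 1; }
    dest="$src.$(date +%Y%m%d_%H%M%S).bak"
    cp -p "$src" "$dest" && echo "$dest"
}

# Possible usage:
#   backup_file /usr/local/nmis8/conf/Nodes.nmis
```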


10. Support zip

Allows you to run the NMIS support tool and modules, which collects all the relevant information about the status and configuration of the server in 2 files:

...

Finally, these two files should be attached to the email sent to Opmantek Support for analysis.

4. Smart Diagnostic

It automatically executes all the tests contained in the script just by selecting the corresponding option.

At the end, a .tar.gz file is generated, which the operator must attach if a Support ticket is opened, as mentioned in the tip.

5. Create System Backup File

Makes a backup copy of the configuration directories to preserve all the settings made by the client.

...

The program displays the tree of the backed up files and folders and the name of the generated .tar.gz file.


6. Execute Support Automation Tool

Allows you to run the NMIS support tool and modules, which collects all the relevant information about the status and configuration of the server in 2 files:

...