Troubleshooting Wizard descriptive Manual for NMIS 9

This page describes how to install and use the Troubleshooting Wizard for NMIS 9, which helps customers run a full diagnostic of their server(s) to determine the root cause of a problem.

It covers everything from downloading and setting up the Troubleshooting Wizard files to a complete analysis of the server using each of the program's interactive menus.

This document is based on the tests described on the Device Troubleshooting Process in NMIS page.

Problem description

Before running the Troubleshooting Wizard, we need a detailed description from the client that gives a good overview of the situation they are experiencing.

To get it, the following questions must be answered:

a) Incident description: what is happening, and since when?

b) On which server or servers is the incident occurring?

c) Description of the server or servers on which the incident is occurring: CPUs, RAM, hard disk, total nodes.

d) On which node or nodes is the incident occurring?

e) Additional details, for example: at least the current NMIS cron settings, the current settings in the /etc/mongod.conf file, the database parameter settings in /usr/local/omk/conf/opCommon.json, whether any file was recently modified, and whether any configuration change made on the server or on the monitored devices caused the incident.

Troubleshooting Wizard: Install and Run

The necessary files (the Troubleshooting Wizard file and 2 companion scripts) are available at the following link:

Link: https://github.com/tom-tics/TS_Wizard_NMIS9_OPMANTEK

The repository also includes a README file with installation instructions.

The three main files must be uploaded to the server where the analysis will be performed, using an FTP client (such as FileZilla).

Once the three files are on the server with the necessary permissions, run the script as follows: sh TS_WIZARD_OMK-9.sh

A welcome screen is displayed; press the Enter key to access the main menu.

Pressing the Enter key shows some details of the operating system, such as its version and a short summary of the system's memory and CPU.

The Main Menu is then shown, with the different options that can be accessed:

  1. Execute Healthcheck: we can perform a complete review of the server.
  2. Review NMIS Configuration Consistency: we will be able to review the consistency of the most important NMIS configuration files.
  3. Nodes Troubleshooter: we will be able to review the behavior of the nodes added to NMIS.
  4. Execute Smart Diagnostics: creates a full system diagnostic in a .tar.gz file, which can be attached in case a ticket needs to be opened with Opmantek Support.
  5. Create System Backup File: creates a .tar.gz file that will contain a backup of the /etc/* and /usr/local/* folders.
  6. Execute Support Automation Tool: generates an NMIS and an OMK support file, which can be attached in case a ticket needs to be opened with Opmantek Support.

Troubleshooting Wizard: Features

1. Run Healthcheck

You can choose between different options, which are shown below:

1. TOP

This command shows all the processes currently running on the server, together with their CPU and RAM utilization percentages.

Pay particular attention to the load average and %CPU: if these values are high, one or more running processes are likely causing a problem.

At the end of the command's execution, the script shows a series of tips, such as:

  • Check disk partitions.
  • Clean up log files that take up too much space.
  • Delete cache.

2. System date and time

It is very important that the server has the correct date and time configured for each client's time zone. Many processes run at specific times, and the system logs and file modification records carry timestamps that are needed to trace when an error occurred.

This check lets the operator confirm that the system date and time are correct. If NTP is not enabled on the server, a tip is displayed at the end recommending that the system administrator be contacted to verify it.

3. Disk R/W

This analysis helps detect a possible physical failure in the server's disks.

The program executes the commands:

  • dd if=/dev/zero of=/data/omkTestFile bs=10M count=1 oflag=direct
  • dd if=/data/omkTestFile of=/dev/null 2>&1

It then shows the output, whose elapsed time should be compared against the following values:

  • 0.0X s, correct parameters.
  • 0.X s, there is a warning (and could cause a problem).
  • X.0 s, it is critical (and there is a problem).
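
As a minimal sketch of how these thresholds could be applied automatically, the shell functions below extract the elapsed time from a dd summary line and classify it. The function names and the exact boundaries (under 0.1 s, under 1 s, 1 s or more) are assumptions drawn from the list above, not the wizard's own code.

```shell
# Extract the elapsed seconds from a GNU dd summary line such as
# "10485760 bytes (10 MB) copied, 0.0234 s, 448 MB/s" (read from stdin).
dd_elapsed_seconds() {
    sed -n 's/.*, \([0-9.][0-9.]*\) s,.*/\1/p'
}

# Classify an elapsed time in seconds.
# Assumed boundaries: < 0.1 s is OK, 0.1 s to < 1 s is a warning, >= 1 s is critical.
classify_dd_time() {
    awk -v t="$1" 'BEGIN {
        if (t + 0 < 0.1)      print "OK"
        else if (t + 0 < 1.0) print "WARNING"
        else                  print "CRITICAL"
    }'
}
```

Note that dd writes its summary to standard error, so it has to be redirected with 2>&1 before it is piped into dd_elapsed_seconds.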

An iostat -x 5 4 is also run to monitor the I/O load of the system. A high %util very likely indicates a problem that could even lead to data loss; this is pointed out at the end of the command's execution.

4. Filesystem

It shows a detailed analysis of the space used in each of the system's filesystems, to verify that the incident is not caused by a lack of space on the server. It also shows a tip: if usage is above 85% in any of the filesystems, contact the administrator so that they can be cleaned up.

It also runs a command to report the system's RAM and swap usage, and shows a tip to contact the administrator and investigate if usage is high.
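
The 85% rule can be sketched as a small filter over df output. This is only an illustration: the function name is hypothetical, and it assumes the POSIX df -P column layout (capacity in the fifth column, mount point in the sixth).

```shell
# Print the mount points whose usage is above a threshold (default 85%).
# Reads `df -P` style output on stdin; the first line is the header.
filesystems_over_threshold() {
    awk -v limit="${1:-85}" 'NR > 1 {
        use = $5
        sub(/%/, "", use)              # strip the percent sign
        if (use + 0 > limit + 0) print $6 " " use "%"
    }'
}

# Typical use:
#   df -P | filesystems_over_threshold 85
```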

5. Service status

A check of each of the system daemons is run to verify that all essential processes are running correctly.

The following commands are executed:

  • service omkd status
  • service mongod status
  • service nmis9d status
  • service httpd status
  • service opchartsd status
  • service opeventsd status
  • service opconfigd status
  • service opflowd status
  • service crond status
  • service snmpd status
  • service iptables status

The script also checks that SELinux is disabled.
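
A minimal sketch of this kind of check is shown below. It assumes that service NAME status returns a non-zero exit code when a daemon is not running (the usual convention); the function names are hypothetical and are not taken from the wizard.

```shell
# Return success if the named service reports a running status.
service_is_up() {
    service "$1" status >/dev/null 2>&1
}

# Print the name of every service in the argument list that is not running.
down_services() {
    for svc in "$@"; do
        service_is_up "$svc" || echo "$svc"
    done
}

# Typical use:
#   down_services omkd mongod nmis9d httpd crond snmpd
```

Keeping the status query in its own function makes it easy to swap in a different command, for example on systems where services are queried another way.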

If a service that is important for system operation is detected to be down, it must be restarted as indicated by the script.

If the service remains down, the log of that daemon should be reviewed to find out what is happening.

6. Load average

This test shows the average load of the system over a defined period of time.

The script offers some interpretations to help understand what is happening on the server:

  • If the averages are 0.0, then the system is idle.
  • If the 1 minute average is higher than the 5 or 15 minute averages, then the load is increasing.
  • If the 1 minute average is lower than the 5 or 15 minute averages, then the load is decreasing.
  • If the averages are higher than the CPU count, you may have a performance issue.
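
The interpretations above can be sketched as a single function. The function name and messages are hypothetical, and the sketch assumes that "higher than the CPU count" applies when any of the three averages exceeds it.

```shell
# Interpret load averages following the rules listed above.
# Arguments: 1-minute, 5-minute and 15-minute averages, then the CPU count.
interpret_load() {
    awk -v a="$1" -v b="$2" -v c="$3" -v cpus="$4" 'BEGIN {
        if (a == 0 && b == 0 && c == 0) { print "idle"; exit }
        if (a > b || a > c)      print "load increasing"
        else if (a < b || a < c) print "load decreasing"
        if (a > cpus || b > cpus || c > cpus) print "possible performance issue"
    }'
}

# Typical use on Linux:
#   read -r l1 l5 l15 _ < /proc/loadavg
#   interpret_load "$l1" "$l5" "$l15" "$(nproc)"
```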

7. Top 5 processes by CPU and Memory

Shows the 5 processes that use the highest percentage of CPU on the server, along with their CPU and memory details.

At the end, it shows a tip to investigate any process that exceeds 85% of CPU or memory, since it may be hung or unresponsive.

8. tcpdump

The tcpdump command captures the traffic of the network where the client's server is located and saves it to a file.

With this capture, the operator can find out whether there is a communication problem between the server and the devices added to NMIS and its modules, since analyzing it reveals any packet loss in the network traffic.

When the command execution finishes, 2 .pcap files are created in the /tmp directory so they can be downloaded and analyzed with Wireshark.



9. Local IP routing table

It shows the status and configuration of the IP routing tables, which determine how packets are sent to the different networks configured on the server.

10. List of logged users

It shows which users are currently using the shell. This helps keep track of who accesses the server and, on some occasions, of who modifies an important system file.

11. Log user audit

It is important to know the login history of each user of the system, since it helps determine whether any of them made a change that could have caused the system to malfunction.

Running this section lets you review system logs, see the connected users, and search for errors, critical messages and alerts in the operating system logs.

At the end, a tip is displayed: if the operator observes many failed authentication attempts, they should contact the users to find out what is happening.

12. Show last used commands

This review goes hand in hand with the previous point and shows the last 30 commands executed on the server.

It also shows the 10 most used commands from that list of 30 and the number of times each has been executed.
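
A ranking like this can be produced with a short pipeline. The sketch below is an assumption about the approach, not the wizard's code: it reads history lines from standard input and counts only the command name (the first word of each line).

```shell
# From history lines on stdin, take the last 30 commands and print the
# 10 most frequent command names, each preceded by its count.
top_commands() {
    tail -n 30 | awk 'NF { print $1 }' | sort | uniq -c | sort -rn | head -n 10
}

# Typical use (the history file location is an assumption):
#   top_commands < ~/.bash_history
```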

13. Show DNS config

Reviewing the /etc/resolv.conf file is important, since it shows whether the domain name configuration, and the IP addresses it points to, are correct.

This check lets you confirm that the structure of the file is correct.
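
One simple structural check is to confirm that the file declares at least one nameserver. The function below is a hypothetical sketch of that idea and covers only this one aspect of the file.

```shell
# List the nameserver addresses declared in a resolv.conf style file.
# Commented-out entries are ignored. Returns non-zero if none is declared.
list_nameservers() {
    awk '$1 == "nameserver" && NF >= 2 { print $2; found = 1 }
         END { exit !found }' "$1"
}

# Typical use:
#   list_nameservers /etc/resolv.conf || echo "No nameserver configured"
```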

14. Internet web test

A test sends three packets to a Google server to verify the server's internet connectivity. Connectivity is needed to update packages that are downloaded from the internet directly from the console, for example with yum and cpan.

It also shows the public IP of the server.

2. Review NMIS Configuration Consistency

You can choose between different options, which are shown below:

1. Compare file configurations

Compares the following files:

  • /usr/local/nmis9/conf-default/Config.nmis and /usr/local/nmis9/conf/Config.nmis

The goal is to find any inconsistency in the configuration that may be causing a problem with the operation of NMIS.
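
A comparison of this kind can be sketched with diff. The function name is hypothetical; the sketch relies on diff returning exit code 0 when the two files are identical.

```shell
# Show a unified diff between a default and a customized configuration file,
# or a short message when the two files are identical.
compare_configs() {
    if diff -u "$1" "$2"; then
        echo "No differences found"
    fi
}

# Typical use:
#   compare_configs /usr/local/nmis9/conf-default/Config.nmis /usr/local/nmis9/conf/Config.nmis
```

In the output, lines starting with - exist only in the first file and lines starting with + exist only in the second.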

2. Execute fixperms routine

Automatically executes the command /usr/local/nmis9/bin/nmis-cli act=fixperms, which fixes the permissions of the NMIS files.

3. Crontab checking

Runs a configuration check of each of the cron files used by NMIS and its modules, to make sure that no routine is causing a conflict that could affect the operation of the system.

It also lists the contents of /etc/cron.d/ (ll) to check that there are no backup files inside that folder, since they can cause problems with the execution of tasks. If backups are found, a tip recommends moving them to another folder or deleting them.
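
Looking for leftover backups can also be sketched as a find command. The list of suffixes below (.bak, .old, .orig and a trailing ~) is an assumption chosen for illustration, and the function name is hypothetical.

```shell
# List files in a cron directory whose names look like backup copies.
# The directory defaults to /etc/cron.d.
find_cron_backups() {
    find "${1:-/etc/cron.d}" -maxdepth 1 -type f \
        \( -name '*.bak' -o -name '*.old' -o -name '*.orig' -o -name '*~' \) | sort
}

# Typical use:
#   find_cron_backups /etc/cron.d
```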

4. Last changed files

Runs a search for the most recently modified files in the following directories:

  • /nmis9/admin/
  • /nmis9/bin/
  • /nmis9/cgi-bin/
  • /nmis9/conf/
  • /nmis9/conf-default/
  • /nmis9/models-custom/
  • /nmis9/models-default/
  • /nmis9/lib/
  • /omk/conf/
  • /omk/lib/json/
  • /omk/public/omk/
  • /etc/cron.d/

The results are sorted from the most recently modified file to the oldest.
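
A listing sorted by modification time can be sketched as follows. This is an illustration only: it relies on the -printf option of GNU find, the function name is hypothetical, and the limit of 20 results is arbitrary.

```shell
# List files under the given directories, most recently modified first.
# Each output line shows the modification date, time and path.
recent_files() {
    find "$@" -type f -printf '%T@ %TY-%Tm-%Td %TH:%TM %p\n' 2>/dev/null |
        sort -rn |          # sort by epoch timestamp, newest first
        head -n 20 |
        cut -d' ' -f2-      # drop the epoch column used for sorting
}

# Typical use:
#   recent_files /etc/cron.d
```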

At the end, a tip is displayed for the operator to check if any recent file changes are causing a problem in the system.

5. Server Performance Tuning

It shows the different parameters that can be modified to improve the performance of the server, more specifically in the files:

  • /nmis9/conf/Config.nmis
  • /omk/conf/opCommon.json
  • /etc/mongod.conf

At the end of the execution, it shows the wiki page where this tuning is described in detail: Configuration Options for Server Performance Tuning.

3. Nodes Troubleshooter

You can choose between different options, which are shown below:

1. Polling summary Test

Executes the command /usr/local/nmis8/admin/polling_summary.pl, which reports how long the server takes to collect information from the nodes added to NMIS and whether any operations are failing or have never been performed (SNMP queries, for example).

At the end, you can see a summary of how many nodes have a late collect; pressing the l key sends this summary to a file so that it can be downloaded from the server.

2. Traceroute Test

It traces in real time the path taken by a packet on an IP network from source to destination, reporting the IP addresses of all the routers it passes through.

Enter the IP or hostname of the node and the script will return the result, displaying a tip for the operator if any abnormal behavior is observed.

3. MTR Test

Allows you to analyze the connection between the server where the command is executed and the destination host specified by the user.

Enter the IP or hostname of the node and the script will return the result, displaying a tip for the operator if any abnormal behavior is observed.

4. Ping Test

It allows you to test whether a particular host is reachable through the network configured on the server and measure the time it takes for packets to be sent and received.

Enter the IP or hostname of the node and the script will return the result, displaying a tip for the operator if any abnormal behavior is observed.

5. SNMP Test

Allows you to query the SNMP data of a device. The snmpwalk command is used because it allows the user to chain requests without having to enter unique commands for each OID or node within a subtree.

This helps to know if the node in question is responding correctly to the protocol and to verify that NMIS is collecting its metrics correctly.

The script supports SNMPv1, SNMPv2 and SNMPv3 queries. At the end, it shows a tip for the operator to consult the administrator if the device has problems responding.

6. Update nodes Test

Allows you to run an update on a specific node, using its hostname.

The following command is executed: /usr/local/nmis9/bin/nmis-cli act=schedule job.type=update job.verbosity=1 job.node=nodename job.force=1

7. Collect nodes Test

Allows you to run a collect on a specific node, using its hostname.

The following command is executed: /usr/local/nmis9/bin/nmis-cli act=schedule job.type=collect job.verbosity=1 job.node=nodename job.force=1

8. Event search

It searches the /usr/local/nmis9/logs/ and /usr/local/omk/logs/ folders, which makes it easier for the operator to investigate any event that is causing a server failure.

Enter the word or words to search for to run the operation.
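
A search of this kind can be sketched with a recursive grep. The function name is hypothetical, and treating the search text as a literal, case-insensitive string is an assumption about the desired behavior.

```shell
# Search log directories for a word or phrase, ignoring case.
# First argument: the text to search for; remaining arguments: directories.
# Each match is printed as file:line-number:matching-line.
search_logs() {
    pattern=$1
    shift
    grep -rinF -- "$pattern" "$@"
}

# Typical use:
#   search_logs "timeout" /usr/local/nmis9/logs /usr/local/omk/logs
```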

9. Backup nodes

Allows you to make a backup of the current properties of the nodes (bearing in mind that NMIS 9 no longer has a Nodes.nmis file as such).

This is very important for the operator, especially before making any changes that affect the devices added to NMIS.

4. Run Smart Diagnostics

It automatically runs all the tests contained in the script just by selecting this option.

At the end, a .tar.gz file is generated, which the operator must attach if a Support ticket is opened, as mentioned in the tip.

5. Create System Backup File

Makes a backup copy of the configuration directories to preserve all the settings made by the client.

Indicate the folder in which the backup will be created (this example uses /tmp) and the script will start the backup.

The program displays the tree of the backed up files and folders and the name of the generated .tar.gz file.
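
As a rough sketch of what such a backup involves, the function below packs one or more directories into a timestamped .tar.gz file inside a destination folder. The function name and the archive name format are hypothetical and do not reflect the file name the wizard actually generates.

```shell
# Create a timestamped .tar.gz of the given directories.
# First argument: destination folder; remaining arguments: directories to back up.
# Prints the path of the archive on success.
create_backup() {
    dest=$1
    shift
    archive="$dest/system_backup_$(date +%Y%m%d_%H%M%S).tar.gz"
    tar -czf "$archive" "$@" && echo "$archive"
}

# Typical use:
#   create_backup /tmp /etc /usr/local
```

The contents of the resulting archive can be listed afterwards with tar -tzf followed by the archive path.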

6. Run Support Automation Tool

Runs the support tool for NMIS and its modules, which collects all the relevant information about the status and configuration of the server in 2 files:

  • nmis-support.zip
  • omk-support.zip

Finally, these two files should be attached to the email sent to Opmantek Support for analysis.
