Troubleshooting Wizard: Descriptive Manual for NMIS 8
This page describes how to install and use the Troubleshooting Wizard, which helps customers run a full diagnostic of their server(s) to determine the root cause of a problem they are encountering.
It covers the whole process, from downloading and installing the Troubleshooting Wizard files to a complete analysis of the server using each of the program's interactive menus.
This document is based on the tests mentioned on the NMIS Device Troubleshooting Process page.
Problem description
Before running the Troubleshooting Wizard, we need a detailed description from the client that gives a good overview of the situation they are experiencing.
To get it, we must answer some important questions:
a) Incident description: what is happening, and since when?
b) On which server or servers is the incident occurring?
c) Description of the server or servers on which the incident is occurring: CPUs, RAM, disk, total nodes.
d) In which node or nodes is the incident occurring?
e) Additional details, for example: at least the current NMIS cron settings, the current settings in the /etc/mongod.conf file, the database parameter settings in /usr/local/omk/conf/opCommon.nmis, whether any file was recently modified, and whether any configuration change made on the server or on the equipment could have caused the incident.
Troubleshooting Wizard: Install and run
The installation script (01_TS_Wizard_OMK.sh) and the two complementary scripts it needs to run (Busqueda.pl and config_backup_LATAM.pl) can be downloaded from the following GitHub link: https://github.com/tom-tics/TS_Wizard_NMIS8_OPMANTEK.
All three must be uploaded, using an FTP client (such as FileZilla), to the same folder on the server where the analysis will be performed.
Once the three files are on the server, run the Wizard with the command: sh 01_TS_Wizard_OMK.sh
After the file is executed, the initial screen appears. It shows details of the operating system, such as the Linux version and a short summary of the system's memory and CPU.
The Main Menu is shown on the same screen, with the following options:
- Execute Healthcheck: we can perform a complete review of the server.
- NMIS Configuration Consistency: we will be able to review the consistency of the most important NMIS configuration files.
- Nodes Troubleshooter: we will be able to review the behavior of the nodes added to NMIS.
- Smart Diagnostic: creates a full system diagnostic in a .tar.gz file, which can be attached in case a ticket needs to be opened with Opmantek Support.
- Create System Backup File: creates a .tar.gz file that will contain a backup of the /etc/* and /usr/local/* folders.
- Execute Support Automation Tool: generates an NMIS and an OMK support file, which can be attached in case a ticket needs to be opened with Opmantek Support.
Troubleshooting Wizard: Features
1. Execute Healthcheck
You can choose between different options, which are shown below:
1. TOP
This command shows all the processes currently running on the server and their percentage of CPU and RAM utilization.
Pay particular attention to the load average and the %CPU: if these values are high, one or more of the running processes is very likely causing a problem.
When the command finishes, the script shows a series of tips, such as:
- Check disk partitions.
- Clean up log files that take up too much space.
- Clear the cache.
2. System date and time
It is very important that the server has the correct date and time configured for each client's time zone. Many processes run at specific times, and the system logs and file modification records carry timestamps that are used to trace when an error occurred.
This check lets the operator confirm that the system date and time are correct. If the server does not have NTP enabled, a tip at the end advises contacting the system administrator to verify it.
3. Disk R/W
This analysis helps detect whether there may be a physical failure in the server's disks.
The program executes the commands:
- dd if=/dev/zero of=/data/omkTestFile bs=10M count=1 oflag=direct
- dd if=/data/omkTestFile of=/dev/null 2>&1
It then shows the output, and the elapsed time has to be compared with these values:
- 0.0X s: normal.
- 0.X s: warning (this could cause a problem).
- X.0 s: critical (there is a problem).
An iostat -x 5 4 is also run to monitor the I/O load of the system. A high %util value very likely indicates a problem that could even lead to data loss; the script points this out when the command finishes.
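The comparison of the dd timing against the three ranges above can be sketched as a small shell function. This is only an illustration: the function name classify_dd_time is hypothetical and not part of the Wizard, it assumes the summary line format printed by GNU dd, and the 0.1 s and 1 s boundaries are our reading of the 0.0X / 0.X / X.0 patterns.

```shell
# Hypothetical helper, not part of the Wizard: classify the elapsed time
# reported in a GNU dd summary line such as
#   "10485760 bytes (10 MB) copied, 0.0123 s, 852 MB/s"
classify_dd_time() {
    printf '%s\n' "$1" | awk -F', ' '
        {
            # Find the comma-separated field that holds the elapsed seconds.
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^[0-9.eE+-]+ s$/) {
                    split($i, parts, " ")
                    t = parts[1] + 0
                    if (t < 0.1)      print "ok"        # 0.0X s
                    else if (t < 1.0) print "warning"   # 0.X s
                    else              print "critical"  # X.0 s or more
                    found = 1
                    exit
                }
            }
        }
        END { if (!found) print "unknown" }
    '
}

# Typical use: pass in the last line that dd writes to stderr, for example
#   classify_dd_time "$(dd if=/data/omkTestFile of=/dev/null 2>&1 | tail -n 1)"
```

The function only reads text, so it can be tried safely with a line copied from a previous run before pointing it at a live server.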
4. Filesystem
Shows a detailed analysis of the space used on each of the system's filesystems, to rule out a lack of space on the server as the cause of the incident. A tip advises contacting the administrator to clean up any filesystem whose usage is above 85%.
It also runs a command to show the usage of the system's RAM and swap memory, with a tip to contact the administrator and investigate if the percentage of use is high.
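As an illustration of the 85% rule, the following sketch filters df output for filesystems at or above a limit. The function name flag_full_filesystems is hypothetical, the use of df -P (POSIX output format) is an assumption about how the check could be done rather than what the Wizard runs, and mount points containing spaces are not handled.

```shell
# Hypothetical helper: read `df -P` output on stdin and print every
# filesystem whose usage is at or above a limit (default 85).
flag_full_filesystems() {
    awk -v limit="${1:-85}" '
        NR > 1 {                           # skip the header line
            pct = $5
            sub(/%/, "", pct)              # "91%" -> "91"
            if (pct + 0 >= limit + 0)
                print $6 " " pct "%"       # mount point and usage
        }
    '
}

# Typical use on a live server:
#   df -P | flag_full_filesystems 85
```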
5. Service status
A check of each of the system daemons is run to verify that all essential processes are running correctly.
The following commands are executed:
- service omkd status
- service mongod status
- service nmisd status (if applicable)
- service nmis9d status (if applicable)
- service httpd status
- service opchartsd status
- service opeventsd status
- service opconfigd status
- service opflowd status
- service crond status
- service snmpd status
- service iptables status
The script also checks that SELinux is disabled.
If a service that is important for system operation is found to be down, it must be restarted as indicated by the script.
If the service is still down afterwards, that daemon's log should be reviewed and analyzed to see what is happening.
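The service checks listed above can be approximated with a short loop. This is a sketch, not the Wizard's own code: the function name report_down_services is hypothetical, and it assumes that service NAME status returns a non-zero exit code when the service is stopped or not installed, which is the usual init-script convention.

```shell
# Hypothetical helper: print the name of every given service whose
# `service NAME status` call does not succeed.
report_down_services() {
    for svc in "$@"; do
        if ! service "$svc" status >/dev/null 2>&1; then
            echo "$svc"
        fi
    done
}

# Typical use with some of the daemons listed above:
#   report_down_services omkd mongod nmisd httpd crond snmpd
```

A service that is not installed on the server is reported in the same way as one that is stopped, so the list passed in should match the modules that are actually in use.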
6. Load average
This test shows the system's load average over a defined period of time.
The script offers some interpretations to help understand what is happening on the server:
- If the averages are 0.0, then the system is idle.
- If the 1 minute average is higher than the 5 or 15 minute averages, then the load is increasing.
- If the 1 minute average is lower than the 5 or 15 minute averages, then the load is decreasing.
- If the averages are higher than the CPU count, you may have a performance issue.
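The interpretations above can be written down as a small function. It is a sketch under a few assumptions: the name interpret_load is hypothetical, the "steady" result is our addition for the case the rules do not cover (all three averages equal and non-zero), and when the rules conflict (for example the 1 minute average is above the 5 minute but below the 15 minute average) the "increasing" rule is checked first.

```shell
# Hypothetical helper: interpret load averages using the rules listed above.
# Arguments: 1-minute, 5-minute and 15-minute averages, then the CPU count.
interpret_load() {
    awk -v a1="$1" -v a5="$2" -v a15="$3" -v cpus="$4" 'BEGIN {
        if (a1 == 0 && a5 == 0 && a15 == 0)   print "idle"
        else if (a1 > a5 || a1 > a15)         print "increasing"
        else if (a1 < a5 || a1 < a15)         print "decreasing"
        else                                  print "steady"
        # Any average above the CPU count hints at a performance issue.
        if (a1 > cpus || a5 > cpus || a15 > cpus)
            print "possible performance issue"
    }'
}

# On Linux the first three fields of /proc/loadavg are the three averages
# (nproc, from GNU coreutils, is assumed to be available for the CPU count):
#   read -r a1 a5 a15 rest < /proc/loadavg
#   interpret_load "$a1" "$a5" "$a15" "$(nproc)"
```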
7. Top 5 processes by CPU and Memory
Shows the 5 processes that are using the highest percentage of CPU on the server, along with their CPU and memory details.
A tip at the end recommends investigating any process that exceeds 85% of CPU or memory, since it may have hung or stopped responding.
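A sketch of the 85% check on a process list follows. The function name flag_heavy_processes and the four-column input layout are hypothetical, and the ps options in the usage comment assume the procps-ng version of ps found on most Linux distributions.

```shell
# Hypothetical helper: read lines of the form "PID COMMAND %CPU %MEM" on
# stdin and print the PID and command of every process above a limit
# (default 85) in either the CPU or the memory column.
flag_heavy_processes() {
    awk -v limit="${1:-85}" '
        $3 + 0 > limit + 0 || $4 + 0 > limit + 0 { print $1 " " $2 }
    '
}

# Assuming procps-ng ps, the five busiest processes by CPU could be fed in
# like this (sed removes the header line):
#   ps -eo pid,comm,%cpu,%mem --sort=-%cpu | sed 1d | head -n 5 | flag_heavy_processes 85
```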
8. Tcpdump
The tcpdump command captures the traffic on the network where the client's server is located and writes it to a file.
By analyzing the capture, the operator can tell whether there is a communication problem between the server and the equipment added to NMIS and its modules, for example packet loss in the network traffic.
When the command finishes, 2 .pcap files are created in /tmp so they can be downloaded and analyzed with Wireshark.
9. Local IP routing table
Shows the status and configuration of the IP routing tables, which determine how packets are sent to the different networks configured on the server.
10. List of logged-in users
Shows which users are using the shell at that moment. This helps keep track of who accesses the server and, on some occasions, who modifies an important system file.
11. Log user audit
It is important to know the login history of each user of the system, since it helps determine whether any of them made a change that could have caused the system to malfunction.
This option reviews the system logs, shows the connected users, and searches the operating system logs for errors, critical messages and alerts.
A tip at the end advises that, if the operator sees many failed authentication attempts, the affected users should be contacted to find out what is happening.
12. Show last used commands
This review goes hand in hand with the previous point and shows the last 30 commands executed on the server.
It also shows the 10 most used commands from that list of 30 and the number of times each has been executed.
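One way to produce such a ranking from a shell history file is sketched below. How the Wizard obtains its list is not documented here, so the function name top_recent_commands and the idea of reading a history file are assumptions made for illustration.

```shell
# Hypothetical helper: from a shell history file, take the last 30 entries
# and print the 10 most frequent command names with their counts.
top_recent_commands() {
    tail -n 30 "$1" |
        awk 'NF { print $1 }' |      # keep only the command name
        sort | uniq -c |             # count identical names
        sort -k1,1nr -k2,2 |         # highest count first, then by name
        head -n 10 |
        awk '{ print $1 " " $2 }'    # normalise spacing: "count name"
}

# Typical use (the history file location is an assumption):
#   top_recent_commands "$HOME/.bash_history"
```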
13. Show DNS config
Reviewing the /etc/resolv.conf file is important, since it shows whether the domain name settings and the name server IP addresses are configured correctly.
The check also confirms that the structure of the file is correct.
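A minimal structure check for a resolver file could look like the sketch below. The function name list_nameservers is hypothetical, and the only rule it applies (at least one nameserver line with an address) is our simplification, not the Wizard's full check.

```shell
# Hypothetical helper: print the nameserver addresses found in a
# resolv.conf-style file; the exit status is non-zero if there are none.
list_nameservers() {
    awk '$1 == "nameserver" && NF >= 2 { print $2; n++ }
         END { exit (n > 0 ? 0 : 1) }' "$1"
}

# Typical use:
#   list_nameservers /etc/resolv.conf || echo "no nameserver configured"
```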
14. Internet web test
A test sends three packets to a Google server to verify the server's internet connectivity. Internet access is needed to download packages directly from the console with tools such as yum and cpan.
It also shows the public IP of the server.
2. NMIS Configuration Consistency
You can choose between different options, which are shown below:
1. Check NMIS code
Checks the syntax of the configuration files in the /usr/local/nmis8/* folder and shows any errors found in the code.
A tip advises the operator to review the files found to have inconsistencies.
2. Perform a configuration backup
Makes a backup copy of the configuration directories to preserve all the settings made by the client.
The folder in which the backup will be created must be indicated (in this example we use /tmp), and the script then starts the backup.
The program displays the tree of the backed-up files and folders and the name of the generated .tar.gz file.
3. Compare file configurations
Compares the following pairs of files:
- /usr/local/nmis8/install/Config.nmis and /usr/local/nmis8/conf/Config.nmis
- /usr/local/omk/install/opCommon.nmis and /usr/local/omk/conf/opCommon.nmis
The aim is to find any inconsistency in the configuration that may be causing a problem with NMIS and/or the modules.
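The same comparison can be reproduced by hand with diff. The sketch below is an assumption about one reasonable way to do it, not the Wizard's implementation, and the function name compare_config is hypothetical.

```shell
# Hypothetical helper: show the differences between a shipped default file
# and the live configuration in unified format. Prints "identical" when
# diff finds nothing to report.
compare_config() {
    if diff -u "$1" "$2"; then
        echo "identical"
    fi
}

# Typical use with the first pair of paths named above:
#   compare_config /usr/local/nmis8/install/Config.nmis /usr/local/nmis8/conf/Config.nmis
```

Lines starting with - come from the install copy and lines starting with + from the live file, so intentional customisations show up as differences too and need to be told apart from mistakes by the operator.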
4. Execute fixperms routine
Automatically runs the /usr/local/nmis8/admin/fixperms.pl command, which lets the operator correct the permissions of all system files in one step.
5. Model checking
Runs a variable-length check and syntax validation on the files in the /usr/local/nmis8/models/* folder.
This matters because each of the equipment models added to NMIS needs to work correctly.
If the script finds a problem, it flags it and, at the end, gives a tip for the operator to review that inconsistency.
6. Crontab checking
Checks the configuration of each of the cron files that NMIS and the modules work with, to confirm that no scheduled job is causing a conflict that could affect system operation.
It also lists the contents of /etc/cron.d/ (with ll) to check that there are no backup files inside that folder, since they can cause problems when the tasks run. If backups are found, a tip advises moving them to another folder or deleting them.
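Spotting leftover copies in a cron directory can be sketched as follows. The function name find_cron_backups is hypothetical, and the list of suffixes treated as backups is an assumption that should be adapted to the naming habits used on the server.

```shell
# Hypothetical helper: list the files in a cron directory whose names look
# like backup copies. The suffix list is an assumption, not the Wizard's rule.
find_cron_backups() {
    for f in "$1"/*; do
        [ -f "$f" ] || continue            # skip directories and empty globs
        case "${f##*/}" in
            *.bak|*.old|*.orig|*.save|*~) echo "${f##*/}" ;;
        esac
    done
}

# Typical use:
#   find_cron_backups /etc/cron.d
```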
7. Verify CPAN libraries
Runs a check of the CPAN libraries and displays which ones are missing so the operator can install them if needed.
8. Last changed files
Runs a search for the last modified files in different directories:
- /nmis8/admin/
- /nmis8/bin/
- /nmis8/cgi-bin/
- /nmis8/conf/
- /nmis8/models/
- /nmis8/lib/
- /omk/conf/
- /etc/cron.d/
The results are sorted from the most recently modified file to the oldest.
At the end, a tip is displayed for the operator to check whether any recent file change is causing a problem in the system.
3. Nodes Troubleshooter
You can choose between different options, which are shown below:
1. Polling summary
Executes the command /usr/local/nmis8/admin/polling_summary.pl, which shows how long the server takes to collect information from the nodes added to NMIS and whether any operations are failing or have never been performed (SNMP queries, for example).
At the end, a summary shows how many nodes have a late collect. Pressing the l key sends this summary to a file so that it can be downloaded from the server.
2. Traceroute
Tracks in real time the path taken by a packet on an IP network from source to destination, reporting the IP addresses of all the routers it passes through.
Enter the IP or hostname of the node and the script will return the result, displaying a tip for the operator if any abnormal behavior is observed.
3. MTR
Allows you to analyze the connection between the server where the command is executed and the destination host specified by the user.
Enter the IP or hostname of the node and the script will return the result, displaying a tip for the operator if any abnormal behavior is observed.
4. Ping
It allows you to test whether a particular host is reachable through the network configured on the server and measure the time it takes for packets to be sent and received.
Enter the IP or hostname of the node and the script will return the result, displaying a tip for the operator if any abnormal behavior is observed.
5. SNMP
Queries the SNMP data of a device. The snmpwalk command is used because it chains the requests automatically, so the user does not have to enter a separate command for each OID or node within a subtree.
This helps determine whether the node in question is responding correctly to the protocol and verify that NMIS is collecting its metrics correctly.
The script supports SNMPv1, SNMPv2 and SNMPv3 queries. At the end, it shows a tip for the operator to consult the administrator if the device has problems responding.
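For reference, the sketch below prints the Net-SNMP snmpwalk command line for each protocol version without running it. The function name snmpwalk_cmd is hypothetical, the default OID 1.3.6.1.2.1.1 (the system subtree) is our choice, and the SNMPv3 form assumes the authPriv security level with SHA and AES, which may not match the device's configuration.

```shell
# Hypothetical helper: print (not run) a Net-SNMP snmpwalk command line.
# Usage: snmpwalk_cmd 1|2c HOST COMMUNITY [OID]
#        snmpwalk_cmd 3 HOST USER AUTHPASS PRIVPASS [OID]
snmpwalk_cmd() {
    version=$1 host=$2
    case "$version" in
        1|2c)
            # Community-based versions only need the community string.
            echo "snmpwalk -v$version -c $3 $host ${4:-1.3.6.1.2.1.1}" ;;
        3)
            # SNMPv3 with authentication and privacy (assumed SHA and AES).
            echo "snmpwalk -v3 -l authPriv -u $3 -a SHA -A $4 -x AES -X $5 $host ${6:-1.3.6.1.2.1.1}" ;;
        *)
            echo "unsupported SNMP version: $version" >&2
            return 1 ;;
    esac
}

# Example: snmpwalk_cmd 2c 192.0.2.10 public
```

Printing the command first makes it easy to review the credentials and OID before running it against the node.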
6. Update nodes
Allows you to perform an update on a specific node, using its hostname.
The following command is executed: /usr/local/nmis8/bin/nmis.pl type=update node='node' force=1 debug=1
7. Collect nodes
Allows you to perform a collect on a specific node, using its hostname.
The following command is executed: /usr/local/nmis8/bin/nmis.pl type=collect node='node' force=1 debug=1
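The update and collect commands differ only in the type argument, so both can be wrapped in one function. This is a sketch: the function name run_node_operation and the NMIS_BIN variable are hypothetical conveniences, while the path and arguments are the ones quoted in this section.

```shell
# Path quoted in this manual; set NMIS_BIN beforehand to point elsewhere.
NMIS_BIN=${NMIS_BIN:-/usr/local/nmis8/bin/nmis.pl}

# Hypothetical wrapper: run an update or a collect for a single node with
# the arguments quoted above (force=1 debug=1).
run_node_operation() {
    case "$1" in
        update|collect) ;;
        *) echo "type must be update or collect" >&2; return 1 ;;
    esac
    [ -n "$2" ] || { echo "node name required" >&2; return 1; }
    "$NMIS_BIN" type="$1" node="$2" force=1 debug=1
}

# Typical use, keeping the debug output in a file for later review:
#   run_node_operation collect router1 > /tmp/router1_collect.log 2>&1
```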
8. Event search
Searches the /usr/local/nmis8/logs/ and /usr/local/omk/logs/ folders, making it easier for the operator to investigate any event that is causing a failure on the server.
Enter the word or words to search for to run the search.
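A comparable search can be run by hand with grep. The sketch below is an assumption about an equivalent command rather than what the Wizard's search script does; the function name search_logs is hypothetical, and compressed log files are not searched.

```shell
# Hypothetical helper: case-insensitive search for a word or phrase in every
# file under the given log directories, printing the file name and line
# number of each match.
search_logs() {
    pattern=$1
    shift
    grep -rin -- "$pattern" "$@"
}

# Typical use with the two folders named above:
#   search_logs "timeout" /usr/local/nmis8/logs /usr/local/omk/logs
```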
9. Nodes.nmis backup
Allows you to make a backup of the current Nodes.nmis file, located in /usr/local/nmis8/conf/.
This is very important for the operator, especially before making any changes that have to do with the equipment added to NMIS.
10. Support zip
Allows you to run the NMIS support tool and modules, which collects all the relevant information about the status and configuration of the server in 2 files:
- nmis-support.zip
- omk-support.zip
Finally, these two files should be attached to the email sent to Opmantek Support for analysis.
4. Smart Diagnostic
Automatically runs all the tests contained in the script when this option is selected.
At the end, a .tar.gz file is generated, which the operator must attach if a Support ticket is opened, as mentioned in the tip.
5. Create System Backup File
Makes a backup copy of the configuration directories to preserve all the settings made by the client.
The folder in which the backup will be created must be indicated (in this example we use /tmp), and the script then starts the backup.
The program displays the tree of the backed-up files and folders and the name of the generated .tar.gz file.
6. Execute Support Automation Tool
Allows you to run the NMIS support tool and modules, which collects all the relevant information about the status and configuration of the server in 2 files:
- nmis-support.zip
- omk-support.zip
Finally, these two files should be attached to the email sent to Opmantek Support for analysis.