This page provides a device troubleshooting process for identifying collection problems in NMIS8 and NMIS9. It is broken down into clear steps that anyone can follow to find out what is wrong with device collection, including the case where graphs show gaps for nodes managed by NMIS.
Device Troubleshooting Process
- Identify the problem. The first step in troubleshooting a device issue is to identify the problem and determine whether it occurs in NMIS8 or NMIS9.
- Add the product version and the servers/devices/models involved to the support case.
- What kind of problem are you observing? A device issue can have any of the following causes:
- Network performance, latency in the network, layer 1,2, and 3 issues.
- Device configuration, connectivity, SNMP configuration, and others.
- Server hardware requirements, high resource utilization parameters in the server.
- Server configuration options, missing configuration items for server tuning.
- Disk performance, slow write/read times for the device collection.
- Gather information. Collect all the graphs, images, and behaviors that help explain the problem.
- Collect the support tool files (see The Opmantek Support Tool).
Execute the collect command for the support tool:
#General collection.
/usr/local/nmis8/admin/support.pl action=collect
#If the file is big, we can add the next parameter.
/usr/local/nmis8/admin/support.pl action=collect maxzipsize=900000000
#Device collection.
/usr/local/nmis8/admin/support.pl action=collect node=<node_name>
- If you are using NMIS8, provide the files from /usr/local/nmis8/var.
Go to the /usr/local/nmis8/var directory and collect the following files:
-rw-rw---- 1 nmis nmis 4292 Apr 5 18:26 <node_name>-node.json
-rw-rw---- 1 nmis nmis 2695 Apr 5 18:26 <node_name>-view.json
Obtain the update and collect outputs; this information will be uploaded to the support case:
/usr/local/nmis8/bin/nmis.pl type=update node=<node_name> model=true debug=9 force=true > /tmp/node_name_update_$(hostname).log
/usr/local/nmis8/bin/nmis.pl type=collect node=<node_name> model=true debug=9 force=true > /tmp/node_name_collect_$(hostname).log
If you are using NMIS9, include the dump files.
/usr/local/nmis9/admin/node_admin.pl act=dump {node=nodeX|uuid=nodeUUID} file=<MY PATH> everything=1
- Replicate the problem. If possible, define the steps needed to replicate the problem.
- Identify symptoms. At this point you should be able to see a specific problem and its symptoms.
- Determine whether something has changed. It is important to verify with your team whether anything has changed; a good way to spot this is to monitor the performance graphs for the devices and the server.
- Is it an individual problem? Verify whether this behavior is happening on a single device/server.
Network performance - NMIS Server.
This section covers the review and validation of the general server status. The focus is on the historical behavior of the server's main metrics, and it is important to review every metric related to good performance between the server and the devices.
Verifying Health Metrics
- Metrics are important for the server. NMIS uses Reachability, Availability, and Health to represent the network.
Reachability is the pingability of the device.
Availability is (in the context of network gear) whether the interfaces that should be up are actually up. Interfaces that are "no shutdown" (ifAdminStatus = up) should be up, so a device with 10 interfaces at ifAdminStatus = up, of which 9 have ifOperStatus = up, is 90% available.
Health is a composite metric made up of many things depending on the device, for example CPU and memory on a router. Part of the health is the inverse of interface utilisation: an interface with no utilisation has a high health component, while a highly utilised interface reduces the metric. Health therefore reflects the load on the device and is very dynamic.
The overall metric of a device is a composite of weighted values of the other metrics being collected. The formula is configurable, so you can weight Reachability higher or lower than it currently is.
For more information, see NMIS Metrics, Reachability, Availability and Health.
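The availability example and the weighted overall metric described above can be sketched in a few lines of Python. This is only an illustration: the function names and the weights are hypothetical and are not the NMIS defaults or the NMIS implementation.

```python
def availability(interfaces):
    """Percentage of admin-up interfaces that are also operationally up."""
    admin_up = [i for i in interfaces if i["ifAdminStatus"] == "up"]
    if not admin_up:
        # Assumption: if nothing is expected to be up, nothing counts as down.
        return 100.0
    oper_up = sum(1 for i in admin_up if i["ifOperStatus"] == "up")
    return 100.0 * oper_up / len(admin_up)


def overall_metric(reachability, availability_pct, health, weights):
    """Weighted sum of the three metrics; the weights are hypothetical."""
    return (weights["reachability"] * reachability
            + weights["availability"] * availability_pct
            + weights["health"] * health)


# 10 interfaces that should be up, 9 of them actually up.
interfaces = [{"ifAdminStatus": "up", "ifOperStatus": "up"} for _ in range(9)]
interfaces.append({"ifAdminStatus": "up", "ifOperStatus": "down"})
print(availability(interfaces))
```

Raising the hypothetical reachability weight makes an unreachable device pull the overall metric down faster, which is the kind of tuning the configurable formula allows.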
- It is important to validate the localhost health, including the overall Reachability, Availability, and Health. Data that does not follow the historical pattern can give a clue about where the problem is happening, or show that the abnormal behavior started before a change request made in the early hours.
- Viewing the network performance graphs (Response Time in milliseconds, IP Utilization, TCP Connection, TCP Segments) helps identify the behavior of the server/network over a period of 2 days; this period can be extended to see more data if needed.
Device configuration.
It is important to determine whether the problem occurs in the network or is related to the device configuration. To identify what is happening, run the following commands from the server console.
Ping test. The ping tool tests whether a particular host is reachable across an IP network. It measures the time it takes for packets to travel from the local host to a destination computer and back.
ping x.x.x.x #add the ip address you need to reach
Traceroute is a network diagnostic tool that tracks, in real time, the path taken by a packet on an IP network from source to destination, reporting the IP addresses of all the routers it passes through.
traceroute <ip_Node> #add the ip address you need to reach
MTR (My traceroute) is a command-line network diagnostic tool that combines the functionality of the ping and traceroute commands.
sudo mtr -r 8.8.8.8
[sample results below]
HOST: endor Loss% Snt Last Avg Best Wrst StDev
1. 69.28.84.2 0.0% 10 0.4 0.4 0.3 0.6 0.1
2. 38.104.37.141 0.0% 10 1.2 1.4 1.0 3.2 0.7
3. te0-3-1-1.rcr21.dfw02.atlas. 0.0% 10 0.8 0.9 0.8 1.0 0.1
4. be2285.ccr21.dfw01.atlas.cog 0.0% 10 1.1 1.1 0.9 1.4 0.1
5. be2432.ccr21.mci01.atlas.cog 0.0% 10 10.8 11.1 10.8 11.5 0.2
6. be2156.ccr41.ord01.atlas.cog 0.0% 10 22.9 23.1 22.9 23.3 0.1
7. be2765.ccr41.ord03.atlas.cog 0.0% 10 22.8 22.9 22.8 23.1 0.1
8. 38.88.204.78 0.0% 10 22.9 23.0 22.8 23.9 0.4
9. 209.85.143.186 0.0% 10 22.7 23.7 22.7 31.7 2.8
10. 72.14.238.89 0.0% 10 23.0 23.9 22.9 32.0 2.9
11. 216.239.47.103 0.0% 10 50.4 61.9 50.4 92.0 11.9
12. 216.239.46.191 0.0% 10 32.7 32.7 32.7 32.8 0.1
13. ??? 100.0 10 0.0 0.0 0.0 0.0 0.0
14. google-public-dns-a.google.c 0.0% 10 32.7 32.7 32.7 32.8 0.0
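When many reports have to be compared, the mtr report can be parsed rather than read by eye. The sketch below assumes the nine-column report layout shown above (hop, host, Loss%, Snt, Last, Avg, Best, Wrst, StDev); the function names are hypothetical.

```python
def parse_mtr_report(text):
    """Turn the hop rows of an `mtr -r` report into a list of dicts."""
    hops = []
    for line in text.splitlines():
        parts = line.split()
        # Hop rows start with an index such as "1." and carry 9 columns;
        # the "HOST:" header row is skipped by the digit check.
        if len(parts) != 9 or not parts[0].rstrip(".").isdigit():
            continue
        hops.append({
            "hop": int(parts[0].rstrip(".")),
            "host": parts[1],
            "loss": float(parts[2].rstrip("%")),
            "avg": float(parts[5]),
            "worst": float(parts[7]),
        })
    return hops


def lossy_hops(hops):
    """Hop numbers that reported any packet loss."""
    return [h["hop"] for h in hops if h["loss"] > 0]


def end_to_end_loss(hops):
    """Loss at the final hop, i.e. at the destination itself."""
    return hops[-1]["loss"]


SAMPLE = """HOST: endor Loss% Snt Last Avg Best Wrst StDev
 12. 216.239.46.191 0.0% 10 32.7 32.7 32.7 32.8 0.1
 13. ??? 100.0 10 0.0 0.0 0.0 0.0 0.0
 14. google-public-dns-a.google.c 0.0% 10 32.7 32.7 32.7 32.8 0.0
"""
hops = parse_mtr_report(SAMPLE)
print(lossy_hops(hops), end_to_end_loss(hops))
```

In the sample, hop 13 shows 100% loss while the destination shows none. This usually means that router does not answer the probes, not that traffic is being dropped, so the loss at the final hop is the more meaningful figure.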
snmpwalk is a Simple Network Management Protocol (SNMP) application, present on the Security Management System (SMS) CLI, that uses SNMP GETNEXT requests to query a network device for information. An object identifier (OID) may be given on the command line.
The following example CLI command returns the IPS temperature information:
Command: snmpwalk -v 2c -c tinapc <IP address> 1.3.6.1.4.1.10734.3.5.2.5.5
Command Explanation:
In this case the CLI command breaks down as follows:
snmpwalk = SNMP application
-v 2c = specifies what SNMP version to use (1, 2c, 3)
-c tinapc = specifies the community string. Note: The IPS has the SNMP read-only community string of "tinapc"
<IP address> = specifies the IP address of the IPS device
1.3.6.1.4.1.10734.3.5.2.5.5 = OID parameter for the IPS temperature information
Results:
SNMPv2-SMI::enterprises.10734.3.5.2.5.5.1.0 = INTEGER: 27
SNMPv2-SMI::enterprises.10734.3.5.2.5.5.2.0 = INTEGER: 50
SNMPv2-SMI::enterprises.10734.3.5.2.5.5.3.0 = INTEGER: 55
SNMPv2-SMI::enterprises.10734.3.5.2.5.5.4.0 = INTEGER: 0
SNMPv2-SMI::enterprises.10734.3.5.2.5.5.5.0 = INTEGER: 85
Results Explanation:
SNMPv2-SMI::enterprises.10734.3.5.2.5.5.1.0 = INTEGER: 27 = The chassis temperature (27° Celsius / 80.6° Fahrenheit)
SNMPv2-SMI::enterprises.10734.3.5.2.5.5.2.0 = INTEGER: 50 = The major threshold value for chassis temperature (50° Celsius / 122° Fahrenheit)
SNMPv2-SMI::enterprises.10734.3.5.2.5.5.3.0 = INTEGER: 55 = The critical threshold value of chassis temperature (55° Celsius / 131° Fahrenheit)
SNMPv2-SMI::enterprises.10734.3.5.2.5.5.4.0 = INTEGER: 0 = The minimum value of the chassis temperature range ( 0° Celsius / 32° Fahrenheit)
SNMPv2-SMI::enterprises.10734.3.5.2.5.5.5.0 = INTEGER: 85 = The maximum value of the chassis temperature range (85° Celsius / 185° Fahrenheit)
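The comparison between the current temperature and its thresholds can be automated once the walk output is captured. This is a minimal sketch that assumes the output format shown in the results above; the field names and the state labels are hypothetical.

```python
import re

# Matches lines such as:
# SNMPv2-SMI::enterprises.10734.3.5.2.5.5.1.0 = INTEGER: 27
LINE = re.compile(r"enterprises\.10734\.3\.5\.2\.5\.5\.(\d+)\.0 = INTEGER: (-?\d+)")

# Meaning of each sub-OID, taken from the results explanation above.
NAMES = {1: "current", 2: "major", 3: "critical", 4: "range_min", 5: "range_max"}


def parse_temperature(walk_output):
    """Extract the temperature values from snmpwalk output text."""
    values = {}
    for match in LINE.finditer(walk_output):
        index = int(match.group(1))
        if index in NAMES:
            values[NAMES[index]] = int(match.group(2))
    return values


def temperature_state(values):
    """Compare the current temperature with the reported thresholds."""
    if values["current"] >= values["critical"]:
        return "critical"
    if values["current"] >= values["major"]:
        return "major"
    return "normal"


WALK = """SNMPv2-SMI::enterprises.10734.3.5.2.5.5.1.0 = INTEGER: 27
SNMPv2-SMI::enterprises.10734.3.5.2.5.5.2.0 = INTEGER: 50
SNMPv2-SMI::enterprises.10734.3.5.2.5.5.3.0 = INTEGER: 55
SNMPv2-SMI::enterprises.10734.3.5.2.5.5.4.0 = INTEGER: 0
SNMPv2-SMI::enterprises.10734.3.5.2.5.5.5.0 = INTEGER: 85
"""
print(temperature_state(parse_temperature(WALK)))
```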
It is important to confirm that the device is pingable, shows no latency or packet loss, and that the SNMP data is being collected.
Polling summary
The OPMANTEK monitoring system includes the polling_summary tool. It helps determine whether the server is taking too long to collect information from the nodes and cannot complete its operations. It shows how many nodes have a late collection, along with a summary of the collected and uncollected nodes.
NMIS8 | NMIS9
If there are values in the x_late fields, the performance of the server needs to be validated.
Services performance (Daemons)
NMIS relies on several important services, and devices sometimes stop being collected because one of these services has been interrupted. It is always a good idea to check that they are running by executing the following commands. On Unix and Linux systems, services are also known as daemons; some of them are crucial for the operating system itself, so it is essential to validate the services that make up the OPMANTEK monitoring system (nmis).
Server hardware requirements.
This section is crucial for identifying or resolving device issues. The hardware needed depends on the number of nodes you will manage, the number of users that will access the GUIs, and how often your data needs to be updated. If updates are required every 5 minutes, the hardware must be able to meet that requirement. The OS requirements also need to be well defined: a good rule of thumb is to reserve 1 GB of RAM for the OS by default, and to use high-speed drives for the data (SAN is ideal) with separate storage for the mongo database and temp files. Anywhere between 4-8 cores with high-performing processor(s) and 16-64 GB of RAM should perform well for 1k+ nodes.
Using top/htop command
The top command shows all running processes on the server. It displays system information such as uptime, load average, running tasks, number of logged-in users, CPU usage, and RAM utilization, and it lists the processes being run by the users on your server.
1. First line: Time and Load
The first line of the top output shows the following, in order:
- current time (12:50:01)
- uptime of the machine (up 62 days, 22:56)
- user sessions logged in (5 users)
- average load on the system (load average: 4.76, 8.03, 4.34); the 3 values refer to the last minute, the last five minutes, and the last 15 minutes. High values here are not good for the manager.
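Whether a load average counts as "high" depends on the number of CPU cores. A common rule of thumb, which is an assumption here and not an NMIS rule, is that a sustained load above the core count means processes are queueing. The sketch below parses the first line of top and applies that rule to the five-minute value; the function names and the core counts are hypothetical.

```python
import re


def parse_load_average(top_line):
    """Return the (1 min, 5 min, 15 min) load averages from top's first line."""
    match = re.search(r"load average:\s*([\d.]+),\s*([\d.]+),\s*([\d.]+)", top_line)
    if match is None:
        raise ValueError("no load average found in: %r" % top_line)
    return tuple(float(value) for value in match.groups())


def is_overloaded(loads, cores):
    """Rule of thumb (assumption): 5-minute load above the number of cores."""
    return loads[1] > cores


LINE = "top - 12:50:01 up 62 days, 22:56,  5 users,  load average: 4.76, 8.03, 4.34"
loads = parse_load_average(LINE)
print(loads, is_overloaded(loads, cores=4))
```

On a running server the same three values are also available directly from `os.getloadavg()`.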
2. Second row: Tasks
The second row provides the following information:
- Total processes (412 total)
- Running processes (1 running)
- Sleeping processes (411 sleeping)
- Stopped processes (0 stopped)
- Processes that have finished but are still waiting for their parent process to collect their exit status (15 zombies). This is not good for the manager.
Zombie process: a process that has completed execution but still has an entry in the process table. The entry is kept so that the parent process can read the child's exit status.
3. CPU section
Percentage of CPU used by user processes (6.8%us)
Percentage of CPU used by system processes (3.8%sy)
Percentage of CPU used by processes with an adjusted nice priority (0.2%ni)
Percentage of the CPU not used (74.4%id)
Percentage of CPU time spent waiting for I/O operations (28.2%wa). A high value here is not good for server performance.
Percentage of the CPU serving hardware interrupts (0.1%hi, Hardware IRQ)
Percentage of the CPU serving software interrupts (0.0%si, Software Interrupts)
Percentage of CPU 'stolen' from this virtual machine by the hypervisor for other tasks, such as running another virtual machine (0.0%st, Steal Time). This will be 0 on desktops and servers that are not virtual machines.
4. Memory
These rows provide information about RAM usage: total memory, memory in use, free memory, and buffers/cache.
5. Process List
The last section lists the processes that are currently running, with the following columns:
- PID – ID of the process (26306)
- USER – the user that owns the process (root)
- PR – priority of the process (20)
- NI – the "NICE" value of the process (0)
- VIRT – virtual memory used by the process (478m)
- RES – physical memory used by the process (3.3g)
- SHR – shared memory of the process (1900)
- S – status of the process: S=sleep R=running Z=zombie (S)
- %CPU – percentage of CPU used by this process (3.9). High values are not good for server performance.
- %MEM – percentage of RAM used by the process (1.3). High values are not good for server performance.
- TIME+ – total CPU time used by this process (0:08.21). High values are not good for server performance.
- COMMAND – name of the process (exim)
It is important to monitor this command to see whether the server is working properly and executing all the internal processes it needs.
Server configuration options.
To tell the server how to manage the configured devices, all the configuration items must be set correctly. You can see the server performance during collection under System > Host Diagnostics > NMIS Runtime Graph.
If the total runtime/collect time is too high, adjust the collect parameters according to the manager version you are using.
NMIS 8 Processes
The main NMIS 8 process is called from different cron jobs to run different operations: collect, update, summary, clean jobs, etc.
The cron configuration can be found in /etc/cron.d/nmis.
For a collect or an update, the main process is set up by default to fork worker processes that perform the requested operations in parallel, which improves performance. One of each operation runs every minute (by default) and processes as many nodes as the collect polling cycle is set up to process.
Configurations that affect performance
There are some important configurations that affect performance:
- abort_after: From NMIS 8.6.8G there is a new command line option, abort_after, that prevents the main thread from running for a long time and colliding with the next cron job. By default this parameter is 60 seconds, as the cron job is set to run every minute by default.
This option must always be used together with the option mthread=true.
nmis8/bin/nmis.pl type=collect abort_after=60 mthread=true ignore_running=true;
- max_thread: The other important configuration option is max_thread, which prevents the number of children of the main process from growing too large. Considerations:
- If the collect operation has a lot of nodes to process, the number of children won't reach the limit instantly. While the main thread is forking, children complete their jobs and exit. The main process also waits for them to change state, so the number increases slowly.
- NMIS can have more than one instance of the main process running, so the total number of children can be higher than max_threads, as the limit applies per instance.
- sort_due_nodes: By default, NMIS decides what to poll in a pseudo-random order. If your server is overloaded, this means some nodes may never get polled. For heavily loaded servers, enable sort_due_nodes by adding it to the NMIS configuration with the value set to 1.
- Reference: NMIS 8 - Configuration Options for Server Performance Tuning
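The sketch below illustrates why ordering matters on an overloaded server. It is not the NMIS implementation: the function name, the data layout, and the assumption that sorting means "most overdue first" are all hypothetical, chosen only to show the starvation effect described for sort_due_nodes.

```python
import random


def pick_due_nodes(last_polled, now, interval, limit, sort_due=False, rng=None):
    """Choose up to `limit` due nodes from a {node: last_poll_time} mapping."""
    due = [node for node, ts in last_polled.items() if now - ts >= interval]
    if sort_due:
        # Assumed behaviour: the node that has waited longest goes first.
        due.sort(key=lambda node: last_polled[node])
    else:
        # Pseudo-random order: an unlucky node can be skipped repeatedly.
        (rng or random).shuffle(due)
    return due[:limit]


last_polled = {"a": 0, "b": 100, "c": 50}
print(pick_due_nodes(last_polled, now=400, interval=300, limit=2, sort_due=True))
```

With only two slots for three due nodes, the sorted variant always serves the node that has waited longest, while the shuffled variant gives no such guarantee.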
CROND file configuration (NMIS) and Config.nmis
Here we verify the data collection configuration for the devices by validating the collect, maxthreads, and mthread parameters.
In the NMIS cron file we see the following:
Crond NMIS
We then verify that the mthread value is enabled and that maxthreads has the same value in the Config.nmis file.
Config.nmis section
We can see that the mthread value is disabled and that the maxthreads value matches the one declared in the NMIS cron file, so we enable mthread and run an update and a collect on the node.
Update_Collect
Note: If the values declared in the cron file and in Config.nmis do not work, the following is recommended:
Example Crond
The value of the maxthreads parameter (it is recommended to try 50, 80, and 100) must be the same in both files (the NMIS cron file and Config.nmis).
Run the update and collect commands at the end of each test and verify the behavior in the NMIS GUI by reviewing the NMIS Runtime Graph, Network_summary, and Polling_summary.
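The requirement that maxthreads match in both files can be checked with a small script. This is a sketch under assumptions: the sample cron line, the sample Config.nmis line, and the key name `nmis_maxthreads` are illustrative, so adjust them to what your files actually contain.

```python
import re


def cron_maxthreads(cron_text):
    """All maxthreads=N values found on the cron command lines."""
    return [int(value) for value in re.findall(r"\bmaxthreads=(\d+)", cron_text)]


def config_maxthreads(config_text, key="nmis_maxthreads"):
    """Value of `key` in a Perl-hash style config, or None if it is absent."""
    pattern = r"'%s'\s*=>\s*'?(\d+)'?" % re.escape(key)
    match = re.search(pattern, config_text)
    return int(match.group(1)) if match else None


def maxthreads_consistent(cron_text, config_text):
    """True when every cron value equals the value in the config file."""
    expected = config_maxthreads(config_text)
    values = cron_maxthreads(cron_text)
    return expected is not None and bool(values) and all(v == expected for v in values)


# Hypothetical file contents, for illustration only.
CRON = "* * * * * root /usr/local/nmis8/bin/nmis.pl type=collect mthread=true maxthreads=50\n"
CONFIG = "  'nmis_maxthreads' => '50',\n"
print(maxthreads_consistent(CRON, CONFIG))
```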
Configuration items for omk products
In low-memory environments, lowering the number of omkd workers provides the biggest improvement in stability, even more than tuning mongod.conf does. The default value is 10, but in an environment with low user concurrency it can be decreased to 3-5.
Setting omkd_max_requests as well helps the threads restart gracefully before they get too big.
Process size safety limiter: if a maximum is configured, it is >= 256 MB, and we are on Linux, a process size check runs every 15 s and gracefully shuts down the worker if it is over the limit.
Process maximum number of concurrent connections (defaults to 1000).
The performance logs are really useful for debugging, but they can also affect performance, so it is recommended to turn them off when they are not needed.
NMIS8
NMIS 8 - Configuration Options for Server Performance Tuning
NMIS9
NMIS 9 - Configuration Options for Server Performance Tuning
Disk performance review.
This section is dedicated to identifying when the server is not writing all the data for the devices. This results in graphs with interruptions and causes level 2 problems (Severe impact - Unreliable production system), or on some occasions even level 1 problems (Critical for the business, complete loss of service, loss of data) for the client, so it is essential to determine what is happening and provide a diagnosis.
Server status at Service level.
The monitoring service becomes slow when accessing the GUI, and the main impact is the failure to run collects and updates on the nodes. The CPUs are saturated while the monitoring system collects information every minute or every 5 minutes; being overloaded, the system is forced to kill processes, which affects the storage of node information in the RRD files.
Node View in NMIS:
You will be able to see device graphs with gaps; this is an example of how to recognize this behavior.
NMIS Polling Summary (menu: System > Host Diagnostics > NMIS Polling Summary)
The Polling Summary provided by NMIS is very useful, since it shows the details of the node collection time, active nodes, collected nodes, etc. These values must be consistent with the number of monitored nodes, and the collection time must be within the interval in minutes configured in the NMIS cron file.
File systems
It is important to validate that the file systems have free space; if a file system is full, the tool will stop working:
echo -e "\n \e[31m Disk space information \e[0m" && df -h && echo -e "\n\n \e[31m RAM usage information \e[0m" && free -m && echo -e "\n\n \e[31m Disk details \e[0m" && fdisk -l
Result:
%wa - It is important to review the load average and iowait; high values indicate problems for the server.
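The file system check can also be scripted, so that a nearly full file system is flagged before the tool stops working. The sketch uses Python's standard `shutil.disk_usage`; the 90% threshold and the function names are assumptions, and the figure may differ slightly from the Use% column of df, which accounts for reserved blocks.

```python
import shutil


def used_percent(path="/"):
    """Percentage of the file system holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def is_nearly_full(path="/", threshold=90.0):
    """True when usage reaches the (hypothetical) alert threshold."""
    return used_percent(path) >= threshold


# Check the file system that holds the current directory.
print("%.1f%% used" % used_percent("."))
```

Pointing `path` at the NMIS data directory, for example /usr/local/nmis8/var, checks the file system that collection actually writes to.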
List of processes with uninterruptible sleep state.
The ps command provides information about the processes of a Linux or Unix system.
Sometimes tasks hang, go into a closed loop, or stop responding. In other cases they continue to run but gobble up too much CPU time or RAM, or behave in an equally antisocial manner. Sometimes tasks need to be removed as a mercy to everyone involved. The first step, of course, is to identify the process in question.
Processes in a "D" or uninterruptible sleep state are usually waiting on I/O.
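One way to pick out the processes in the "D" state is to filter the output of `ps -eo pid,stat,comm`. The sketch below does this on captured text; the sample process list is invented for illustration and the function name is hypothetical.

```python
def uninterruptible(ps_output):
    """Return (pid, command) for processes whose STAT column starts with D."""
    found = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        fields = line.split(None, 2)
        if len(fields) == 3 and fields[1].startswith("D"):
            found.append((int(fields[0]), fields[2].strip()))
    return found


# Illustrative output of `ps -eo pid,stat,comm`; not taken from a real server.
SAMPLE = """  PID STAT COMMAND
    1 Ss   systemd
 2201 D    nmis.pl
 2202 R+   ps
 2300 D<   kworker/0:1
"""
print(uninterruptible(SAMPLE))
```

If the same processes stay in this list across several runs, that points at the storage layer and not at the processes themselves.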
Test Disk I/O Performance With dd Command
The dd command must be used with great care, because wrong parameters can cause serious problems on your server. OMK uses this command to measure server performance and latency, which lets us determine the write and read speed of the disk.
Parameters:
0.0X s is correct.
0.X s is a warning (there may be an issue).
X.0 s is critical (there is a problem).
Please note that in this test 10 MB was written (one 10M block), the performance was 47 MB/s, and the time it took to write the block was 0.223301 seconds.
Where:
- if=/dev/zero (if=/dev/input.file): The name of the input file you want dd to read from.
- of=/data/omkTestFile (of=/path/to/output.file): The name of the output file you want dd to write the input to.
- bs=10M (bs=block-size): Sets the size of the block you want dd to use. Note that Linux will need free RAM space. If your test system doesn't have enough RAM available, use a smaller value for bs (like 128MB or 64MB; you can also test with 1, 2, or even 3 gigabytes).
- count=1 (count=number-of-blocks): The number of blocks you want dd to read.
- oflag=dsync: Use synchronized I/O for data. Do not skip this option. It bypasses caching and gives accurate results.
- conv=fdatasync: Tells dd to perform a complete "sync" once, right before it exits. With a single block (count=1) the effect is comparable to oflag=dsync.
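The timing thresholds and the throughput figure can be reproduced with a little arithmetic. The sketch reads "0.0X s" as under 0.1 s, "0.X s" as under 1 s, and "X.0 s" as 1 s or more; that reading, and the function names, are assumptions.

```python
def classify_write_time(seconds):
    """Map the dd elapsed time onto the three levels described above."""
    if seconds < 0.1:
        return "ok"
    if seconds < 1.0:
        return "warning"
    return "critical"


def throughput_mb_s(bytes_written, seconds):
    """Throughput in MB/s, using 1 MB = 1,000,000 bytes as dd does."""
    return bytes_written / seconds / 1_000_000


# One block of bs=10M is 10 * 1024 * 1024 bytes.
elapsed = 0.223301
print(classify_write_time(elapsed), round(throughput_mb_s(10 * 1024 * 1024, elapsed)))
```

For the sample run this gives a throughput of about 47 MB/s and places the 0.223301 s write time in the warning band.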
Viewing disk usage information
This command helps monitor the load on input/output devices by observing the time the devices are active in relation to their average transfer rates. It can also be used to compare activity between disks.
An iowait/utilization of 100% indicates a problem, in most cases a serious one that can even lead to data loss. Essentially, there is a bottleneck somewhere in the system; perhaps one of the drives is about to fail.
OMK recommends running the command as follows, since this gives a better picture of what is happening with the disks.
Example: the command takes 5 samples, one every 3 seconds. At least 3 of the samples should show values within the stable range for the server; otherwise there is a problem with the disks.
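The "at least 3 of 5 samples" rule can be written down as a simple check. The 80% utilization limit used as the "stable range" is a hypothetical value, as is the function name; substitute whatever range is considered stable for your server.

```python
def disks_look_healthy(util_samples, max_util=80.0, min_good=3):
    """True when at least `min_good` samples are within the stable range."""
    good = sum(1 for util in util_samples if util <= max_util)
    return good >= min_good


# Five utilization percentages, one per 3-second sample (illustrative values).
print(disks_look_healthy([10.0, 95.0, 20.0, 99.0, 30.0]))
```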
This problem was solved by moving the VM to an environment with solid-state disks. The client confirmed that the VM was using mechanical disks (HDD), and a lab clone of the VM did not work either, since it showed the same problem. After replacing the HDDs with solid-state disks, the VM and the monitoring services stabilized, and RAM, CPU, and disk utilization returned to normal for the number of nodes being monitored.