Page Comparison

This page is intended to provide a NMIS Device Troubleshooting Process to Identify bad behaviors in collection for NMIS8/9 products, you can break it down into clear steps that anyone can follow and identify what's wrong with the device collection also if we have Gaps In Graphs.

...

This section is focused on performing the review and validation of the server status in general, we will focus on verifying the historical behavior of the main metrics for the server, it is important to review all the metrics related to the good performance between the server and devices

Verifying Health Metrics

Metrics are important for the server, NMIS would use Reachability, Availability and Health to represent the network.
Reachability being the pingability of device,
Availability being (in the context of network gear) the interfaces which should be up, being up or not, e.g. interfaces which are "no shutdown" (ifAdminStatus = up) should be up, so a device with 10 interfaces of ifAdminStatus = up and ifOperStatus = up for 9 interfaces, the device would be 90% available.
Health is a composite metric, made up of many things depending on the device, router, CPU, memory. Something interesting here is that part of the health is made up of an inverse of interface utilisation, so an interface which has no utilisation will have a high health component, an interface which is highly utilised will reduce that metric. So the health is a reflection of load on the device, and will be very dynamic.
The overall metric of a device is a composite metric made up of weighted values of the other metrics being collected. The formula for this is based is configurable, so you can have weight Reachability to be higher than it currently is, or lower, your choice.

...

Viewing the graphs referring to the network performance as (Response Time in milliseconds, IP Utilization, TCP Connection, TCP Segments) will help us to identify the behavior of the server/network in a period of 2 days, we can modify this period time to see more data if needed.

Device configuration.

It is important to validate if the problem occurs in the network or is something related to the device configuration, in order to identify what's happening we need to validate the next commands from the console server.

...

In order to tell the server, how to manage the devices configured we need to validate that all the configuration items are well set, you can see the server performances while collecting information going to the section, system>Host Diagnostics> NMIS Runtime Graph

if the total runtime/collect time is too high, we need to adjust the collect parameters depending on the manager version you are using.

NMIS 8 Processes

The main NMIS 8 process is called from different cron jobs to run different operations: collect, update, summary, clean jobs, etc. As an example:

Code Block
* * * * * root /usr/local/nmis8/bin/nmis.pl type=collect abort_after=60 mthread=true ignore_running=true;

The cron configuration can be found in /etc/crond.d/nmis.

For a collect or an update, the main thread is set up by default to fork worker processes to perform the requested operations using threads and improving performance. One of each operation will run every minute (by default), and will process as many nodes as the collect polling cycle is set up to process.

Configurations that affect performance

There are some important configurations that affect performace:

abort_after: From NMIS 8.6.8G there is a new command line option, abort_after, that prevents the main thread to run for a long time, preventing it to collide with the next cron job. By default, this parameter is 60 seconds, as the cron job is set to run every 60 minutes by default.
Also, this option needs to always have also the option mthreads=true.
Code Block
nmis8/bin/nmis.pl type=collect abort_after=60 mthread=true ignore_running=true;
max_thread: The other important configuration option is max_thread, that will prevent the number of children of the main process to grow too big. Considerations:
- If the collect operation has a lot of nodes to process, the number of children won't reach the limit instantly. While the main thread is forking, the children complete their jobs and will exit. Also, the main process will wait for them to change their state so the number will increase slowly.
- NMIS can have more than one instance of the main process running, and the number of children could be higher that max_threads, as the limit is only per instance.

sort_due_nodes: When NMIS decides what to poll it can do so in a pseudo random order which is the default, if your server is overloaded you will likely see some nodes never getting polled, hence pseudo random, so for heavily loaded servers, enable sort_due_nodes, in the NMIS configuration add with the value set to 1.

Configuration items

In low memory environments lowering the number of omkd workers provides the biggest improvement in stability, even more than tuning mongod.conf does. The default value is 10, but in an environment with low users concurrency it can be decreased to 3-5.

Code Block
omkd_workers

Setting also omkd_max_requests, will help to have the threads restart gracefully before they get too big.

Code Block
omkd_max_requests

Process size safety limiter: if a max is configured and it's >= 256 mb and we're on linux, then run a process size check every 15 s and gracefully shut down the worker if over size.

Code Block
omkd_max_memory

Process maximum number of concurrent connections, defaults to 1000:

Code Block
omkd_max_clients

The performance logs are really useful for debugging purposes, but they also can affect performance. So, it is recommended to turn them off when they are not necessary:

Code Block
omkd_performance_logs => false

...

NMIS8

NMIS 8 - Configuration Options for Server Performance Tuning

...

Aplicar los comandos de Update y Collect al termino de cada prueba y verificar el comportamiento en la GUI de NMIS, esto consiste en revisar las gráficas de NMIS Runtime Graph, Network_summary y Polling_summary.

Health Check

It is always a good idea to run a health check of the server to validate if all is good.

...

It is important to review the load average and iowait, if we see this values are high that represents problems for the server

List of processes

The ps command provides us with information about the processes of a Linux or Unix system.
Sometimes tasks can hang, go into a closed-loop, or stop responding. For other reasons, or they may continue to run, but gobble up too much CPU or RAM time, or behave in an equally antisocial manner. Sometimes tasks need to be removed as a mercy to everyone involved. The first step. Of course, it is to identify the process in question.

Processes in a "D" or uninterruptible sleep state are usually waiting on I/O.

Code Block

[root@nmisslvcc5 log]# ps -auxf | egrep " D| Z"
Warning: bad syntax, perhaps a bogus '-'? See /usr/share/doc/procps-3.2.8/FAQ
root      1563  0.1  0.0      0     0 ?        D    Mar17  10:47  \_ [jbd2/dm-2-8]
root      1565  0.0  0.0      0     0 ?        D    Mar17   0:43  \_ [jbd2/dm-3-8]
root      1615  0.3  0.0      0     0 ?        D    Mar17  39:26  \_ [flush-253:2]
root      1853  0.0  0.0  29764   736 ?        D<sl Mar17   0:04 auditd
root     17898  0.0  0.0 103320   872 pts/5    S+   12:20   0:00  |       \_ egrep  D| Z
apache   17856 91.0  0.2 205896 76212 ?        D    12:19   0:01  |   \_ /usr/bin/perl /usr/local/nmis8/
root     13417  0.6  0.8 565512 306812 ?       D    10:38   0:37  \_ opmantek.pl webserver             -
root     17833  9.8  0.0      0     0 ?        Z    12:19   0:00      \_ [opeventsd.pl] <defunct>
root     17838 10.3  0.0      0     0 ?        Z    12:19   0:00      \_ [opeventsd.pl] <defunct>
root     17842 10.6  0.0      0     0 ?        Z    12:19   0:00      \_ [opeventsd.pl] <defunct>

Test Disk I/O Performance With dd Command

The dd command is very sensitive regarding the parameters it handles since it can cause serious problems on your server, OMK uses this command to obtain and measure server performance and latency, so with this, we determine that the writing speed and reading of the disc.

...

Parameters:
0.0X s to be correct.
0.X s, there is a warning (and there would be
issue)
X.0 s would be critical (and there would be a problem).

...

if=/dev/zero (if=/dev/input.file) : The name of the input file you want dd the read from.
of=/data/omkTestFile (of=/path/to/output.file) : The name of the output file you want dd write the input.file to.
bs=10M (bs = block-size): set the size of the block you want dd to use. Note that Linux will need free RAM space. If your test system doesn't have enough RAM available, use a smaller parameter for bs (like 128MB or 64MB, etc. or you can even test with 1, 2, or even 3 gigabytes).
count=1 (count=number-of-blocks): The number of blocks you want dd to read.
oflag=dsync (oflag=dsync) : Use synchronized I/O for data. Do not skip this option. This option get rid of caching and gives you good and accurate results
conv=fdatasyn: Again, this tells dd to require a complete “sync” once, right before it exits. This option is equivalent to oflag=dsync.

Polling summary

The OPMANTEK monitoring system has the polling_summary tool, this will help us determine if the server takes a long time to collect the information from the nodes and cannot complete any operation, here we can see how many nodes have a late collection and a summary of the collected and uncollected nodes.

...

if the values are located in the x_late fields, we need to validate the performance of the server.

Viewing disk usage information

This command helps us to monitor the load of an input and output device, observing the time that the devices are active in relation to the average of their transfer rates. It can also be used to compare activity between disks.

...

Versions Compared

Old Version 20

New Version 21

Key