This page describes an NMIS device troubleshooting process for identifying collection problems in NMIS 8/9 products. It is broken down into clear steps that anyone can follow to find out what is wrong with a device's collection, including the case where graphs show gaps.
...
For the server to manage the configured devices correctly, all configuration items must be set properly. To see how the server performs while collecting information, go to System > Host Diagnostics > NMIS Runtime Graph.
The main NMIS 8 process is called from different cron jobs to run different operations: collect, update, summary, clean jobs, etc. As an example:
Code Block |
---|
* * * * * root /usr/local/nmis8/bin/nmis.pl type=collect abort_after=60 mthread=true ignore_running=true;
|
The cron configuration can be found in /etc/cron.d/nmis.
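To confirm which NMIS jobs are actually scheduled, the active cron entries can be listed. The following is a minimal sketch: the function name `list_nmis_cron` is hypothetical, and the default path is an assumption that may differ on your install, so the file can be passed as an argument.

```shell
# List the active (non-comment) cron entries that invoke nmis.pl.
# list_nmis_cron is a hypothetical helper name; the default path below is
# an assumption and can be overridden by passing another file as $1.
list_nmis_cron() {
    cron_file="${1:-/etc/cron.d/nmis}"
    grep -v '^[[:space:]]*#' "$cron_file" | grep 'nmis\.pl'
}
```

Calling `list_nmis_cron` without arguments reads the default file. Each line printed shows the schedule fields followed by the nmis.pl options in use, such as type=collect and abort_after.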
For a collect or an update, the main thread by default forks worker processes that perform the requested operations in parallel, which improves performance. One instance of each operation runs every minute (by default) and processes as many nodes as the collect polling cycle is configured to handle.
Configurations that affect performance
There are some important configurations that affect performance:
- abort_after: Since NMIS 8.6.8G there is a command line option, abort_after, that stops the main thread from running for too long, so that it does not collide with the next cron job. By default this parameter is 60 seconds, as the cron job is set to run every minute by default.
This option must always be used together with the option mthread=true.
Code Block |
---|
nmis8/bin/nmis.pl type=collect abort_after=60 mthread=true ignore_running=true;
|
- max_thread: The other important configuration option is max_thread, which keeps the number of children of the main process from growing too large. Considerations:
- If the collect operation has a lot of nodes to process, the number of children won't reach the limit instantly. While the main thread is forking, the children complete their jobs and will exit. Also, the main process will wait for them to change their state so the number will increase slowly.
- NMIS can have more than one instance of the main process running, and the number of children could be higher than max_thread, as the limit is only per instance.
- sort_due_nodes: By default, NMIS chooses the nodes to poll in a pseudo-random order. If your server is overloaded, some nodes may therefore never get polled. For heavily loaded servers, enable sort_due_nodes by adding it to the NMIS configuration with the value set to 1.
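Because the max_thread limit applies per instance, it can be useful to count how many nmis.pl processes are running in total. The sketch below assumes a process listing is piped in on standard input; `count_nmis_procs` is a hypothetical helper name, not part of NMIS.

```shell
# Count the lines of a process listing (read from stdin) that mention nmis.pl.
# count_nmis_procs is a hypothetical helper, not an NMIS command.
# grep -c prints 0 but exits non-zero when nothing matches, hence "|| true".
count_nmis_procs() {
    grep -c 'nmis\.pl' || true
}
```

A possible invocation is `ps -eo args | count_nmis_procs`. Comparing the result with the configured limit over a few polling cycles gives a rough idea of whether several main instances are overlapping.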
Configuration items
In low-memory environments, lowering the number of omkd workers provides the biggest improvement in stability, even more than tuning mongod.conf. The default value is 10, but in an environment with low user concurrency it can be decreased to 3-5.
Code Block |
---|
omkd_workers
|
Setting omkd_max_requests as well helps the threads restart gracefully before they grow too big.
Code Block |
---|
omkd_max_requests
|
Process size safety limiter: if a maximum is configured, it is >= 256 MB, and the server runs Linux, a process size check runs every 15 s and gracefully shuts down the worker if it is over the limit.
Code Block |
---|
omkd_max_memory
|
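The size rule above can be illustrated with a small check. This is only a sketch of the described condition, not the omkd implementation: the function name `over_memory_limit` is hypothetical, and it assumes the limit is expressed in MB and the process size is the resident set size in KiB, as printed by `ps -o rss=`.

```shell
# Succeed (exit status 0) when a process is over the configured size limit.
# $1: resident set size in KiB (assumption: taken from `ps -o rss= -p PID`)
# $2: limit in MB (assumption); limits below 256 MB are treated as not
#     enforced, mirroring the rule described in the text.
over_memory_limit() {
    rss_kb=$(( $1 ))
    max_mb=$(( $2 ))
    [ "$max_mb" -ge 256 ] && [ "$rss_kb" -gt $(( max_mb * 1024 )) ]
}
```

For example, `over_memory_limit "$(ps -o rss= -p 13417)" 512` would test one worker against a 512 MB limit; the PID here is only taken from the sample process listing on this page.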
Process maximum number of concurrent connections, defaults to 1000:
Code Block |
---|
omkd_max_clients
|
The performance logs are really useful for debugging purposes, but they also can affect performance. So, it is recommended to turn them off when they are not necessary:
Code Block |
---|
omkd_performance_logs => false
|
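To review the current values of the options listed above, the configuration file can be searched for them. This is a minimal sketch: `show_omkd_settings` is a hypothetical helper, and it assumes the options appear by name in a plain-text configuration file whose path you supply, since the location depends on the installation.

```shell
# Print every line of a config file that mentions one of the omkd options
# discussed above. show_omkd_settings is a hypothetical helper; the config
# file path is passed as $1 because its location varies between installs.
show_omkd_settings() {
    grep -E 'omkd_(workers|max_requests|max_memory|max_clients|performance_logs)' "$1"
}
```

Options that do not appear in the output are not set explicitly in that file.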
If the total runtime/collect time is too high, the collect parameters need to be adjusted, depending on the manager version you are using.
...
The ps command provides us with information about the processes of a Linux or Unix system.
Sometimes tasks hang, go into a closed loop, or stop responding. In other cases they continue to run but gobble up too much CPU time or RAM, or behave in an equally antisocial manner. Sometimes tasks need to be removed as a mercy to everyone involved. The first step, of course, is to identify the process in question.
Processes in a "D" or uninterruptible sleep state are usually waiting on I/O.
Code Block |
---|
[root@nmisslvcc5 log]# ps -auxf | egrep " D| Z"
Warning: bad syntax, perhaps a bogus '-'? See /usr/share/doc/procps-3.2.8/FAQ
root 1563 0.1 0.0 0 0 ? D Mar17 10:47 \_ [jbd2/dm-2-8]
root 1565 0.0 0.0 0 0 ? D Mar17 0:43 \_ [jbd2/dm-3-8]
root 1615 0.3 0.0 0 0 ? D Mar17 39:26 \_ [flush-253:2]
root 1853 0.0 0.0 29764 736 ? D<sl Mar17 0:04 auditd
root 17898 0.0 0.0 103320 872 pts/5 S+ 12:20 0:00 | \_ egrep D| Z
apache 17856 91.0 0.2 205896 76212 ? D 12:19 0:01 | \_ /usr/bin/perl /usr/local/nmis8/
root 13417 0.6 0.8 565512 306812 ? D 10:38 0:37 \_ opmantek.pl webserver -
root 17833 9.8 0.0 0 0 ? Z 12:19 0:00 \_ [opeventsd.pl] <defunct>
root 17838 10.3 0.0 0 0 ? Z 12:19 0:00 \_ [opeventsd.pl] <defunct>
root 17842 10.6 0.0 0 0 ? Z 12:19 0:00 \_ [opeventsd.pl] <defunct>
|
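The text match on " D| Z" in the listing above can also hit unrelated lines, including the egrep command itself. As an alternative sketch, the filter below tests only the STAT column. It assumes the state is the second field of the input, as produced by `ps -eo pid,stat,user,etime,args`; the function name `filter_blocked_or_zombie` is hypothetical.

```shell
# Keep the header line plus every row whose STAT column (2nd field)
# starts with D (uninterruptible sleep) or Z (zombie).
# Assumes input in the column order pid,stat,user,etime,args.
filter_blocked_or_zombie() {
    awk 'NR == 1 || $2 ~ /^[DZ]/'
}
```

A possible invocation is `ps -eo pid,stat,user,etime,args | filter_blocked_or_zombie`. Using `-eo` with explicit columns also avoids the "bad syntax" warning that `ps -auxf` prints in the sample output.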
...