This page provides an NMIS device troubleshooting process for identifying bad collection behavior in NMIS8/9 products. It is broken down into clear steps that anyone can follow to find out what is wrong with device collection, including cases where there are gaps in graphs.
...
Device Troubleshooting Process
- Identify the problem. The first step in troubleshooting a device issue is to identify the problem and to establish whether it occurs in NMIS8 or NMIS9.
- Add the product version and the servers, devices, and models involved to the support case.
- What kind of problem are you observing? A device issue can have any of the following causes:
  - Network performance: latency in the network, or layer 1, 2, and 3 issues.
  - Device configuration: connectivity, SNMP configuration, and others.
  - Server hardware requirements: high resource utilization on the server.
  - Server configuration options: missing configuration items for server tuning.
  - Disk performance: slow write/read times for the device collection.
- Gather information. Collect all the graphs, images, and observed behaviors that help explain the problem.
- Collect the support tool files (see The Opmantek Support Tool).
Execute the collect command for the support tool:
Code Block
# General collection.
/usr/local/nmis8/admin/support.pl action=collect
# If the file is too big, add the following parameter.
/usr/local/nmis8/admin/support.pl action=collect maxzipsize=900000000
# Device collection.
/usr/local/nmis8/admin/support.pl action=collect node=<node_name> maxzipsize=900000000
- If you are using NMIS8, provide the node files from /usr/local/nmis8/var.
Go to the /usr/local/nmis8/var directory and collect the following files:
Code Block
-rw-rw---- 1 nmis nmis 4292 Apr 5 18:26 <node_name>-node.json
-rw-rw---- 1 nmis nmis 2695 Apr 5 18:26 <node_name>-view.json
Capture the update and collect output, and upload the resulting log files to the support case:
Code Block
/usr/local/nmis8/bin/nmis.pl type=update node=<node_name> model=true debug=9 force=true > /tmp/node_name_update_$(hostname).log
/usr/local/nmis8/bin/nmis.pl type=collect node=<node_name> model=true debug=9 force=true > /tmp/node_name_collect_$(hostname).log
- Replicate the problem. If possible, define the steps needed to replicate the problem.
- Identify symptoms. At this point you should be able to see the specific problem and its symptoms.
- Determine if something has changed. It is important to verify with your team whether anything has changed; a good way to spot a change is to monitor the performance graphs for the devices and the server.
...
You may see device graphs with gaps. The following shows how to recognize this behavior.
NMIS Polling Summary (menu: System > Host Diagnostics > NMIS Polling Summary)
The Polling Summary that NMIS provides is very useful: it shows the collection time of the nodes, the active nodes, the collected nodes, and so on. These values must be consistent with the number of monitored nodes, and the collection time must fit within the interval, in minutes, configured in the NMIS cron file.
Network Metrics and Health (menu: Network Status > Network Metrics and Health)
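The rule that the collection time must fit within the cron interval can be checked with simple arithmetic. The following is a minimal sketch, not an NMIS tool: the function name `collect_fits_interval` is hypothetical, and the runtime value is assumed to be read manually from the Polling Summary.

```shell
#!/bin/sh
# Hypothetical helper (not part of NMIS): succeed when a collect run,
# measured in seconds, fits inside the cron polling interval in minutes.
collect_fits_interval() {
    runtime_s=$1
    interval_min=$2
    [ "$runtime_s" -le $((interval_min * 60)) ]
}

# Example: a */5 cron schedule and a collect that took 280 seconds.
if collect_fits_interval 280 5; then
    echo "collect fits in the polling cycle"
else
    echo "collect overruns the polling cycle"
fi
```

If the collect regularly overruns the interval, the next polling cycle starts before the previous one has finished, which is one plausible source of gaps in graphs.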
Here we can validate the availability, health, and reachability of the server. These values must be above 60 for the server to be considered to be working well. Gaps in these graphs indicate that no information was collected during that period of time.
Cron file configuration (nmis) and Config.nmis
Here we verify the data collection configuration for the devices by validating the collect, maxthreads, and mthread parameters.
In the nmis cron file we see the following:
Code Block
######################################################
# NMIS8 Config
######################################################
# Run Full Statistics Collection
*/5 * * * * root /usr/local/nmis8/bin/nmis.pl type=collect maxthreads=100 mthread=true
*/5 * * * * root /usr/local/nmis8/bin/nmis.pl type=services mthread=true
#
######################################################
# Optionally run a more frequent Services-only Collection
# */3 * * * * root /usr/local/nmis8/bin/nmis.pl type=services mthread=true
######################################################
# Run Summary Update every 2 minutes
*/2 * * * * root /usr/local/nmis8/bin/nmis.pl type=summary
Next, we verify that mthread is enabled and that maxthreads has the same value in the Config.nmis file:
Code Block
'nmis_group' => 'nmis',
'nmis_host' => 'nmissTest_OMK.omk.com',
'nmis_host_protocol' => 'http',
'nmis_maxthreads' => '100',
'nmis_mthread' => 'false',
'nmis_summary_poll_cycle' => 'false',
'nmis_user' => 'nmis',
We can see that mthread is disabled, while the maxthreads value matches the one declared in the nmis cron file. We therefore enable mthread and run an update and a collect on the node:
Code Block
/usr/local/nmis8/bin/nmis.pl type=update node=<Name_Node> force=true
/usr/local/nmis8/bin/nmis.pl type=collect node=<Name_Node> force=true
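The comparison between the cron entry and Config.nmis can also be scripted. The following is a minimal sketch rather than an official NMIS check: the function names are hypothetical, and it assumes that each Config.nmis key sits on its own line in the `'key' => 'value',` form shown on this page.

```shell
#!/bin/sh
# Sketch: compare the thread settings in the NMIS cron file and Config.nmis.
# All function names are hypothetical; file paths are passed as arguments.

# Print maxthreads from the first uncommented type=collect cron entry.
cron_maxthreads() {
    grep -v '^[[:space:]]*#' "$1" | grep 'type=collect' |
        sed -n 's/.*maxthreads=\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Print the value of one key (for example nmis_maxthreads) from Config.nmis.
config_value() {
    sed -n "s/.*'$2'[[:space:]]*=>[[:space:]]*'\([^']*\)'.*/\1/p" "$1" | head -n 1
}

# Return non-zero and explain why when the two files disagree.
check_thread_settings() {
    cron_file=$1
    config_file=$2
    cron_max=$(cron_maxthreads "$cron_file")
    conf_max=$(config_value "$config_file" nmis_maxthreads)
    conf_mthread=$(config_value "$config_file" nmis_mthread)
    status=0
    if [ "$cron_max" != "$conf_max" ]; then
        echo "maxthreads mismatch: cron=$cron_max Config.nmis=$conf_max"
        status=1
    fi
    if [ "$conf_mthread" != "true" ]; then
        echo "nmis_mthread is '$conf_mthread', expected 'true'"
        status=1
    fi
    return $status
}
```

A possible invocation, assuming the common NMIS8 locations (adjust the paths to your installation), is `check_thread_settings /etc/cron.d/nmis /usr/local/nmis8/conf/Config.nmis`. With the example files shown above, it would report that nmis_mthread is 'false'.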
Note: If the values declared in the cron file and in the Config.nmis file do not work, it is recommended to do the following:
Code Block
# Example 1:
/usr/local/nmis8/bin/nmis.pl type=collect abort_after=300 mthread=true maxthreads=100 ignore_running=true
# Example 2:
/usr/local/nmis8/bin/nmis.pl type=collect abort_after=240 mthread=true maxthreads=100 ignore_running=true
The value of the maxthreads parameter (it is recommended to try 50, 80, and 100) must be the same in both files (the nmis cron file and Config.nmis).
Run the update and collect commands at the end of each test and verify the behavior in the NMIS GUI by reviewing the NMIS Runtime Graph, Network_summary, and Polling_summary graphs.
Health Check
...
The ps command provides us with information about the processes of a Linux or Unix system.
Sometimes tasks hang, go into a closed loop, or stop responding. Others continue to run but consume too much CPU time or RAM, or otherwise misbehave. Sometimes a task needs to be killed for the benefit of everything else on the system. The first step, of course, is to identify the process in question.
Processes in a "D" (uninterruptible sleep) state are usually waiting on I/O. Processes in a "Z" state are defunct (zombie) processes that have exited but have not yet been reaped by their parent.
Code Block
[root@nmisslvcc5 log]# ps -auxf | egrep " D| Z"
Warning: bad syntax, perhaps a bogus '-'? See /usr/share/doc/procps-3.2.8/FAQ
root 1563 0.1 0.0 0 0 ? D Mar17 10:47 \_ [jbd2/dm-2-8]
root 1565 0.0 0.0 0 0 ? D Mar17 0:43 \_ [jbd2/dm-3-8]
root 1615 0.3 0.0 0 0 ? D Mar17 39:26 \_ [flush-253:2]
root 1853 0.0 0.0 29764 736 ? D<sl Mar17 0:04 auditd
root 17898 0.0 0.0 103320 872 pts/5 S+ 12:20 0:00 | \_ egrep D| Z
apache 17856 91.0 0.2 205896 76212 ? D 12:19 0:01 | \_ /usr/bin/perl /usr/local/nmis8/
root 13417 0.6 0.8 565512 306812 ? D 10:38 0:37 \_ opmantek.pl webserver -
root 17833 9.8 0.0 0 0 ? Z 12:19 0:00 \_ [opeventsd.pl] <defunct>
root 17838 10.3 0.0 0 0 ? Z 12:19 0:00 \_ [opeventsd.pl] <defunct>
root 17842 10.6 0.0 0 0 ? Z 12:19 0:00 \_ [opeventsd.pl] <defunct>
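In the output above, the egrep pattern also matches its own command line (the `S+` entry), and `ps -auxf` prints a syntax warning. One possible alternative, shown here as a sketch with hypothetical function names, is to test the STAT column directly so that only processes whose state starts with D or Z are listed.

```shell
#!/bin/sh
# Sketch: select processes in uninterruptible sleep (D) or zombie (Z) state
# by testing the STAT column itself instead of grepping the whole line.

# Read "pid stat command..." lines on stdin and keep only D/Z states.
filter_blocked() {
    awk '$2 ~ /^[DZ]/'
}

# List the matching processes on the live system (empty headers via "=").
list_blocked() {
    ps -e -o pid= -o stat= -o args= | filter_blocked
}

# Demonstration on three entries taken from the output on this page:
# the D and Z entries are kept, the S+ egrep entry is dropped.
filter_blocked <<'EOF'
1563 D [jbd2/dm-2-8]
17898 S+ egrep D| Z
17833 Z [opeventsd.pl] <defunct>
EOF
```

On a live server, calling `list_blocked` prints the PID, state, and command line of each matching process. A D state that persists across several runs usually deserves a look at disk I/O.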
...