Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

This page is intended to provide a NMIS Device Troubleshooting Process to Identify bad behaviors in collection for NMIS8/9 products, you can break it down into clear steps that anyone can follow and identify what's wrong with the device collection also if we have Gaps In Graphs.

...

  • Metrics are important for the server,  NMIS would use Reachability, Availability and Health to represent the network. 
  • Reachability being the pingability of device,

  • Availability being (in the context of network gear) the interfaces which should be up, being up or not, e.g. interfaces which are "no shutdown" (ifAdminStatus = up) should be up, so a device with 10 interfaces of ifAdminStatus = up and ifOperStatus = up for 9 interfaces, the device would be 90% available.


  • Health is a composite metric, made up of many things depending on the device, router, CPU, memory. Something interesting here is that part of the health is made up of an inverse of interface utilisation, so an interface which has no utilisation will have a high health component, an interface which is highly utilised will reduce that metric. So the health is a reflection of load on the device, and will be very dynamic.


  • The overall metric of a device is a composite metric made up of weighted values of the other metrics being collected. The formula for this is based is configurable, so you can have weight Reachability to be higher than it currently is, or lower, your choice.


Image RemovedImage Added

For more references go to NMIS Metrics, Reachability, Availability and Health

...

You will be able to visualize device graphs with gaps, this is an example of how to recognize this behavior.

 Image Removed

Image Added


 

NMIS Polling Summary (menú: System > Host Diagnostics> NMIS Polling Summary)

...

The ps command provides us with information about the processes of a Linux or Unix system.
Sometimes tasks can hang, go into a closed-loop, or stop responding. For other reasons, or they may continue to run, but gobble up too much CPU or RAM time, or behave in an equally antisocial manner. Sometimes tasks need to be removed as a mercy to everyone involved. The first step. Of course, it is to identify the process in question.

Processes in a "D" or uninterruptible sleep state are usually waiting on I/O.


Code Block
[root@nmisslvcc5 log]# ps -auxf | egrep " D| Z"
Warning: bad syntax, perhaps a bogus '-'? See /usr/share/doc/procps-3.2.8/FAQ
root      1563  0.1  0.0      0     0 ?        D    Mar17  10:47  \_ [jbd2/dm-2-8]
root      1565  0.0  0.0      0     0 ?        D    Mar17   0:43  \_ [jbd2/dm-3-8]
root      1615  0.3  0.0      0     0 ?        D    Mar17  39:26  \_ [flush-253:2]
root      1853  0.0  0.0  29764   736 ?        D<sl Mar17   0:04 auditd
root     17898  0.0  0.0 103320   872 pts/5    S+   12:20   0:00  |       \_ egrep  D| Z
apache   17856 91.0  0.2 205896 76212 ?        D    12:19   0:01  |   \_ /usr/bin/perl /usr/local/nmis8/
root     13417  0.6  0.8 565512 306812 ?       D    10:38   0:37  \_ opmantek.pl webserver             -
root     17833  9.8  0.0      0     0 ?        Z    12:19   0:00      \_ [opeventsd.pl] <defunct>
root     17838 10.3  0.0      0     0 ?        Z    12:19   0:00      \_ [opeventsd.pl] <defunct>
root     17842 10.6  0.0      0     0 ?        Z    12:19   0:00      \_ [opeventsd.pl] <defunct>

...