Page Comparison

This page is intended to provide a NMIS Device Troubleshooting Process to Identify bad behaviors in collection for NMIS8/9 products, you can break it down into clear steps that anyone can follow and identify what's wrong with the device collection also if we have Gaps In Graphs.

...

Metrics are important for the server, NMIS would use Reachability, Availability and Health to represent the network.
Reachability being the pingability of device,
Availability being (in the context of network gear) the interfaces which should be up, being up or not, e.g. interfaces which are "no shutdown" (ifAdminStatus = up) should be up, so a device with 10 interfaces of ifAdminStatus = up and ifOperStatus = up for 9 interfaces, the device would be 90% available.
Health is a composite metric, made up of many things depending on the device, router, CPU, memory. Something interesting here is that part of the health is made up of an inverse of interface utilisation, so an interface which has no utilisation will have a high health component, an interface which is highly utilised will reduce that metric. So the health is a reflection of load on the device, and will be very dynamic.
The overall metric of a device is a composite metric made up of weighted values of the other metrics being collected. The formula for this is based is configurable, so you can have weight Reachability to be higher than it currently is, or lower, your choice.

Image RemovedImage Added

For more references go to NMIS Metrics, Reachability, Availability and Health

...

Aplicar los comandos de Update y Collect al termino de cada prueba y verificar el comportamiento en la GUI de NMIS, esto consiste en revisar las gráficas de NMIS Runtime Graph, Network_summary y Polling_summary.

Check de servidor

Se requiere realizar una revisión general del hardware del servidor, para obtener lo siguiente:

Sistema Operativo.

Verificación del sistema operativo en donde se encuentra trabajando el sistema de monitoreo.

Comando:

Health Check

It is always a good idea to run a health check of the server to validate if all is good.

SO

To see which OS system is runnning

Command.

cat /etc/*release* && uname -osmr && uname -vResultado:

Code Block

[root@opmantek ~]# cat /etc/*release* && uname -osmr && uname -v
CentOS release 6.10 (Final)
CentOS release 6.10 (Final)
CentOS release 6.10 (Final)
cpe:/o:centos:linux:6:GA
Linux 2.6.32-754.28.1.el6.x86_64 x86_64 GNU/Linux
#1 SMP Wed Mar 11 18:38:45 UTC 2020
[root@opmantek ~]#

Detalles de almacenamiento:

Verificación del estado de almacenamiento, memoria libre y detalles de las particiones del S.O.

ComandoFile system

It is important to validate that the file systems are free, if we have a FS full the tool will stop to work:

echo -e "\n \e[31m Información de espacio en el disco \e[0m" && df -h && echo -e "\n\n \e[31m Información de uso de RAM \e[0m" && free -m && echo -e "\n\n \e[31m Detalle de discos \e[0m" && fdisk -l

...

Code Block

collapse	true

[root@opmantek ~]# echo -e "\n \e[31m Información de espacio en el disco \e[0m" && df -h && echo -e "\n\n \e[31m Información de uso de RAM \e[0m" && free -m && echo -e "\n\n \e[31m Detalle de discos \e[0m" && fdisk -l

  Información de espacio en el disco

Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/vg_nmis64-lv_root
                       59G  2.7G   54G   5% /
tmpfs                 3.9G     0  3.9G   0% /dev/shm
/dev/sda1             477M  109M  343M  25% /boot
/dev/mapper/vg_nmis64_data-lv_data
                      321G   11G  295G   4% /data
/dev/mapper/vg_nmis64-lv_var
                      147G  1.5G  138G   2% /var

  Información de uso de RAM
             total       used       free     shared    buffers     cached
Mem:          7984       6891       1093          0        216       1077
-/+ buffers/cache:       5596       2387
Swap:         4071       1589       2482

  Detalle de discos
Disk /dev/sda: 536.9 GB, 536870912000 bytes
255 heads, 63 sectors/track, 65270 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x0008cec3

Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *           1          64      512000   83  Linux
Partition 1 does not end on cylinder boundary.
/dev/sda2              64        5222    41430016   8e  Linux LVM
/dev/sda3            5222       42570   299997810   8e  Linux LVM
/dev/sda4           42570       65256   182225295   8e  Linux LVM

Disk /dev/sdb: 42.9 GB, 42949672960 bytes
255 heads, 63 sectors/track, 5221 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000

Disk /dev/mapper/vg_nmis64-lv_root: 64.4 GB, 64432898048 bytes
255 heads, 63 sectors/track, 7833 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000

Disk /dev/mapper/vg_nmis64-lv_swap: 4269 MB, 4269801472 bytes
255 heads, 63 sectors/track, 519 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000

Disk /dev/mapper/vg_nmis64_data-lv_data: 350.1 GB, 350140497920 bytes
255 heads, 63 sectors/track, 42568 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000

Disk /dev/mapper/vg_nmis64-lv_var: 160.3 GB, 160314687488 bytes
255 heads, 63 sectors/track, 19490 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000

[root@opmantek ~]#

Análisis de TOP.

Se ejecuta el comando top y se visualiza que el servidor presenta lentitud, alto uso de CPU, valores elevados en el load average y iowait, este valor muestra cuánto tiempo pierde su CPU mientras espera que se completen las operaciones de E / S.%wa

It is important to review the load average and iowait, if we see this values are high that represents problems for the server

List of processes

The ps command provides us with information about the processes of a Linux or Unix system.
Sometimes tasks can hang, go into a closed-loop, or stop responding. For other reasons, or they may continue to run, but gobble up too much CPU or RAM time, or behave in an equally antisocial manner. Sometimes tasks need to be removed as a mercy to everyone involved. The first step. Of course, it is to identify the process in question.

Processes in a "D" or uninterruptible sleep state are usually waiting on I/O.

Code Block

[root@nmisslvcc5 log]# ps -auxf | egrep " D| Z"
Warning: bad syntax, perhaps a bogus '-'? See /usr/share/doc/procps-3.2.8/FAQ
root      1563  0.1  0.0      0     0 ?        D    Mar17  10:47  \_ [jbd2/dm-2-8]
root      1565  0.0  0.0      0     0 ?        D    Mar17   0:43  \_ [jbd2/dm-3-8]
root      1615  0.3  0.0      0     0 ?        D    Mar17  39:26  \_ [flush-253:2]
root      1853  0.0  0.0  29764   736 ?        D<sl Mar17   0:04 auditd
root     17898  0.0  0.0 103320   872 pts/5    S+   12:20   0:00  |       \_ egrep  D| Z
apache   17856 91.0  0.2 205896 76212 ?        D    12:19   0:01  |   \_ /usr/bin/perl /usr/local/nmis8/
root     13417  0.6  0.8 565512 306812 ?       D    10:38   0:37  \_ opmantek.pl webserver             -
root     17833  9.8  0.0      0     0 ?        Z    12:19   0:00      \_ [opeventsd.pl] <defunct>
root     17838 10.3  0.0      0     0 ?        Z    12:19   0:00      \_ [opeventsd.pl] <defunct>
root     17842 10.6  0.0      0     0 ?        Z    12:19   0:00      \_ [opeventsd.pl] <defunct>

...

The dd command is very sensitive regarding the parameters it handles , since it can cause serious problems on your server, OMK uses this command to obtain and measure server performance and latency, so with this, we determine that the writing speed and reading of the disc.

Code Block

[root@SRVLXLIM32 ~]# dd if=/dev/zero of=/data/omkTestFile bs=10M count=1 oflag=direct
1+0 records in
1+0 records out
10485760 bytes (10 MB) copied, 0.980106 s, 15.0 MB/s
[root@SRVLXLIM32 ~]# dd if=/data/omkTestFile of=/dev/null 2>&1
20480+0 records in
20480+0 records out
10485760 bytes (10 MB) copied, 6.23595 s, 1.7 MB/s
[root@SRVLXLIM32 ~]#

Image Removed

OMK's base scale is as followsImage Added

Parameters:
0.0X s to be correct.
0.X s, there is a warning (and there would be
issue)
X.0 s would be critical (and there would be a problem).

...

The OPMANTEK monitoring system has the polling_summary tool, this will help us determine if the server takes a long time to collect the information from the nodes and cannot complete any operation, here we can see how many nodes have a late collection and a summary of the collected and uncollected nodes.Ejecución de script

NMIS8

Code Block
/usr/local/nmis8/admin/polling_summary.pl

NMIS9

Code Block
/usr/local/nmis9/admin/polling_summary9.pl

Code Block

[root@opmantek ~]#  /usr/local/nmis8/admin/polling_summary.pl
node                     attempt   status    ping  snmp  policy     delta  snmp avgdel  poll   update  pollmessage     
ACH-AIJ-DI-AL-SA6-0202010001-01 14:10:33  ontime    up    up    default    328    300  422.31  22.40  17.89                   
ACH-AIJ-RC-ET-08K-01     --:--:--  bad_snmp  up    up    default    ---    300  403.90  10.38  14.58   snmp never successful
ACH-ANA-RC-ET-08K-01     --:--:--  bad_snmp  up    down  default    ---    300  422.57  11.39  109.09  snmp never successful
ACH-ATU-RC-ET-08K-01     --:--:--  bad_snmp  up    up    default    ---    300  391.99  0.97   62.88   snmp never successful
ACH-CAB-DI-AL-SA6-0215010001-01 14:11:21  late      up    up    default    484    300  5543888.62 31.06  74.21   1x late poll    
ACH-CAB-DR-AL-P32-01     --:--:--  bad_snmp  up    up    default    ---    300  416.30  103.46 91.28   snmp never successful
ACH-CAB-GE-GM-G30-01     14:00:54  late      up    down  default    348    300  593.93  6.06   12.53   1x late poll    
ACH-CAB-RC-ET-08K-01     --:--:--  bad_snmp  up    up    default    ---    300  411.74  10.69  7.31    snmp never successful
ACH-CAB-TT-GM-30T-01     --:--:--  bad_snmp  up    down  default    ---    300  0.00    0.00   180.42  snmp never successful
ACH-CAR-RC-ET-08K-01     14:10:20  ontime    up    up    default    314    300  9054283.23 11.15  6.47                    
ACH-CAT-CN-AL-SA6-0212070008-01 14:07:39  late      up    up    default    600    300  27253590.83 12.39  22.23   1x late poll    
ACH-CAZ-TT-GM-30T-01     --:--:--  bad_snmp  up    down  default    ---    300  414.85  3.11   165.32  snmp never successful
ACH-CHM-DR-AL-P32-01     14:05:47  late      up    up    default    456    300  2686074.17 118.55 148.58  1x late poll    
ACH-CHM-GE-GM-G20-01     --:--:--  bad_snmp  up    down  default    ---    300  413.17  4.06   238.92  snmp never successful
ACH-CHM-RC-ET-09K-01     14:12:30  late      up    up    default    633    300  1983484.93 10.49  13.07   1x late poll    
ACH-CHM-TT-GM-20T-01     --:--:--  bad_snmp  up    down  default    ---    300  412.17  3.61   287.80  snmp never successful
ACH-COX-RC-ET-09K-01     13:51:14  late      up    up    default    473    300  22141.04 9.54   4.10    1x late poll    
ACH-CSM-RC-ET-08K-01     13:51:09  late      up    up    default    444    300  539117.26 11.25  5.31    1x late poll    
ACH-CSM-TT-GM-20T-01     14:08:34  late      up    down  default    709    300  1739800.92 4.01   229.73  1x late poll    
ACH-HCC-CN-AL-SA6-0212030012-01 13:50:33  ontime    up    up    default    330    300  8131293.53 23.65  23.84                   
ACH-HCC-RC-ET-08K-01     14:07:56  late      up    up    default    635    300  1802552.50 0.65   1.61    1x late poll    
ACH-HEY-DI-AL-SA6-0211010001-01 13:50:52  late      up    up    default    425    300  571.75  25.46  17.30   1x late poll    
ACH-HEY-DR-AL-P32-01     --:--:--  bad_snmp  up    up    default    ---    300  119099.96 106.25 120.92  snmp never successful
ACH-HEY-GE-GM-G20-01     --:--:--  bad_snmp  up    down  default    ---    300  0.00    0.00   112.37  snmp never successful
ACH-HEY-RC-ET-09K-01     --:--:--  bad_snmp  up    up    default    ---    300  404.62  11.01  7.49    snmp never successful
--Snip--
--Snip--
UCA-PUC-DR-AL-P32-01     14:12:04  late      up    up    default    524    300  124010.73 135.20 124.79  1x late poll    
UCA-PUC-GE-GM-G30-01     14:11:20  late      up    down  default    475    300  3868910.82 3.68   236.48  1x late poll    
UCA-PUC-GE-GM-G30-02     14:12:32  late      up    down  default    644    300  3871900.66 4.05   209.92  1x late poll    
UCA-PUC-RC-ET-09K-01     --:--:--  bad_snmp  up    up    default    ---    300  418.17  10.83  5.76    snmp never successful
UCA-PUC-TT-GM-30A-01     --:--:--  bad_snmp  up    down  default    ---    300  397.68  4.21   215.65  snmp never successful
UCA-PUC-TT-GM-30A-02     14:13:03  late      up    down  default    720    300  329362.60 3.39   208.92  1x late poll    
CC_VITATRAC_GT_Z2_MAZATE 14:13:04  demoted   down  down  default    ---    300  0.00    2.22   0.80    s
CC_VITATRAC_GT_Z3_COBAN  14:13:12  late      up    up    default    618    300  4874416.57 1.91   4.46
CC_VITATRAC_GT_Z3_ESCUINTLA 14:13:12  late      up    up    default    604    300  4902673.92 2.17   4.8
CC_VITATRAC_GT_Z7_BODEGA_MATEO 14:15:37  late      up    up    default    642    300  3844049.73 3.25
CC_VITATRAC_GT_Z8_MIXCO  14:15:42  late      up    up    default    634    300  4959081.87 2.47   6.70
CC_VITATRAC_GT_Z9_XELA   14:16:03  late      up    up    default    634    300  3943302.62 8.95   58.61
CC_VITATRAC_GT_ZONA_PRADERA 14:17:47  demoted   up    down  default    711    300  605.21  10.91  10.28
CC_VIVATEX_GT_INTERNET_VILLA_NUEVA 14:18:49  late      up    up    default    979    300  4563376.03 1.2
CC_VOLCAN_STA_MARIA_GT_INTERNET_CRUCE_BARCENAS 14:19:44  late      up    up    default    981    300  44late poll
nmisslvcc5               14:18:55  late      up    up    default    344    300  376209.90 2.33   1.23

totalNodes=2615 totalPoll=2267 ontime=73 pingOnly=0 1x_late=2190 3x_late=3 12x_late=1 144x_late=0
time=10:10:07 pingDown=354 snmpDown=359 badSnmp=295 noSnmp=0 demoted=348
[root@opmantek ~]#

if the values are located in the x_late fields, we need to validate the performance of the server.

Viewing disk usage information

...

Versions Compared

Old Version 16

New Version 17

Key

Check de servidor

Health Check

List of processes

Viewing disk usage information