This page provides an NMIS device troubleshooting process for identifying bad collection behaviour in the NMIS 8/9 products. It breaks the process down into clear steps that anyone can follow to determine what is wrong with a device's collection, including cases where there are gaps in graphs.
...
- Metrics are important for the server. NMIS uses Reachability, Availability and Health to represent the network.
Reachability is the pingability of the device.
Availability is (in the context of network gear) whether the interfaces that should be up are actually up: interfaces which are "no shutdown" (ifAdminStatus = up) should be up, so a device with 10 interfaces of ifAdminStatus = up and ifOperStatus = up on 9 of them would be 90% available.
Health is a composite metric, made up of many things depending on the device: router, CPU, memory. Something interesting here is that part of the health is made up of the inverse of interface utilisation, so an interface with no utilisation contributes a high health component, while a highly utilised interface reduces the metric. Health is therefore a reflection of the load on the device, and will be very dynamic.
The overall metric of a device is a composite made up of weighted values of the other metrics being collected. The formula is configurable, so you can weight Reachability higher or lower than the default, as you choose.
For more detail, see NMIS Metrics, Reachability, Availability and Health.
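The availability example above can be reduced to simple arithmetic. A minimal sketch (the interface counts are the illustrative values from the example, not output of any NMIS tool):

```shell
# Illustrative only: availability, as described above, is the share of
# admin-up (ifAdminStatus = up) interfaces that are also oper-up.
admin_up=10   # interfaces configured "no shutdown"
oper_up=9     # interfaces with ifOperStatus = up
awk -v a="$admin_up" -v o="$oper_up" 'BEGIN { printf "availability: %d%%\n", o / a * 100 }'
# prints: availability: 90%
```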
...
Running the top command shows that the server is slow, with high CPU usage and elevated load average and iowait values; iowait shows how much time the CPU loses while waiting for I/O operations to complete.
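The same numbers can be read non-interactively, which is easier to script than top. A sketch assuming a Linux system with /proc mounted:

```shell
# Non-interactive checks for load and iowait on Linux:
cat /proc/loadavg                                      # 1, 5 and 15 minute load averages
nproc                                                  # CPU count to compare the load against
awk '/^cpu /{print "iowait jiffies:", $6}' /proc/stat  # cumulative time spent in iowait
```

A sustained load average well above the CPU count, combined with a fast-growing iowait counter, matches the symptoms described above.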
Analysis of slow read/write operations on disk, network, IPC
Running the script /usr/local/nmis8/admin/polling_summary.pl
Code Block |
---|
totalNodes=985 totalPoll=962 ontime=850 pingOnly=0 1x_late=111 3x_late=0 12x_late=1 144x_late=0
time=17:53:14 pingDown=14 snmpDown=250 badSnmp=23 noSnmp=0 demoted=0
[root@SRVLXLIM32 ~]#
|
...
List of processes
The ps command provides information about the processes on a Linux or Unix system.
Sometimes tasks hang, enter a closed loop, or stop responding; others keep running but gobble up too much CPU or RAM, or behave in an equally antisocial manner. Sometimes such tasks need to be terminated as a mercy to everyone involved. The first step, of course, is to identify the process in question.
Processes in a "D" or uninterruptible sleep state are usually waiting on I/O.
Code Block |
---|
[root@nmisslvcc5 log]# ps -auxf | egrep " D| Z"
Warning: bad syntax, perhaps a bogus '-'? See /usr/share/doc/procps-3.2.8/FAQ
root 1563 0.1 0.0 0 0 ? D Mar17 10:47 \_ [jbd2/dm-2-8]
root 1565 0.0 0.0 0 0 ? D Mar17 0:43 \_ [jbd2/dm-3-8]
root 1615 0.3 0.0 0 0 ? D Mar17 39:26 \_ [flush-253:2]
root 1853 0.0 0.0 29764 736 ? D<sl Mar17 0:04 auditd
root 17898 0.0 0.0 103320 872 pts/5 S+ 12:20 0:00 | \_ egrep D| Z
apache 17856 91.0 0.2 205896 76212 ? D 12:19 0:01 | \_ /usr/bin/perl /usr/local/nmis8/
root 13417 0.6 0.8 565512 306812 ? D 10:38 0:37 \_ opmantek.pl webserver -
root 17833 9.8 0.0 0 0 ? Z 12:19 0:00 \_ [opeventsd.pl] <defunct>
root 17838 10.3 0.0 0 0 ? Z 12:19 0:00 \_ [opeventsd.pl] <defunct>
root 17842 10.6 0.0 0 0 ? Z 12:19 0:00 \_ [opeventsd.pl] <defunct> |
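The `ps -auxf` form above mixes BSD and UNIX option styles, which is what triggers the "bad syntax" warning in the output. A sketch of an equivalent check using explicit output field selection instead of grepping the full listing:

```shell
# List processes whose state begins with D (uninterruptible sleep)
# or Z (zombie), using standard -eo field selection:
ps -eo pid,stat,comm | awk 'NR > 1 && $2 ~ /^[DZ]/'
```

This also avoids the grep matching its own process, as happens with the egrep line in the capture above.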
Test Disk I/O Performance With dd Command
The dd command is very sensitive to the parameters it is given and can cause serious problems on your server if misused. OMK uses this command to measure server performance and latency, which lets us determine the disk's write and read speeds.
Code Block |
---|
[root@SRVLXLIM32 ~]# dd if=/dev/zero of=/data/omkTestFile bs=10M count=1 oflag=direct
1+0 records in
1+0 records out
10485760 bytes (10 MB) copied, 0.980106 s, 15.0 MB/s
[root@SRVLXLIM32 ~]# dd if=/data/omkTestFile of=/dev/null 2>&1
20480+0 records in
20480+0 records out
10485760 bytes (10 MB) copied, 6.23595 s, 1.7 MB/s
[root@SRVLXLIM32 ~]#
|
OMK's base scale is as follows:
- 0.0X s is correct.
- 0.X s is a warning (there may be an issue).
- X.0 s is critical (there is a problem).
Please note that 10 megabytes were written for this test; the throughput was 15.0 MB/s and the server took 0.980106 seconds to write the block.
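The scale can be applied mechanically to the elapsed time dd reports. A sketch classifying the 0.980106 s value from the sample run above against the three bands:

```shell
# Classify a dd elapsed time (in seconds) against OMK's base scale
t=0.980106   # elapsed time from the sample dd run
awk -v t="$t" 'BEGIN {
  if (t < 0.1)      print "OK"        # 0.0X s
  else if (t < 1.0) print "WARNING"   # 0.X s
  else              print "CRITICAL"  # X.0 s and above
}'
# prints: WARNING
```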
Where:
- if=/dev/zero (if=/dev/input.file): The name of the input file you want dd to read from.
- of=/data/omkTestFile (of=/path/to/output.file): The name of the output file you want dd to write the input file to.
- bs=10M (bs=block-size): Sets the block size you want dd to use. Note that Linux will need free RAM for this; if your test system doesn't have enough RAM available, use a smaller value for bs (such as 128MB or 64MB), otherwise you can test with 1, 2 or even 3 gigabytes.
- count=1 (count=number-of-blocks): The number of blocks you want dd to read.
- oflag=dsync: Use synchronized I/O for data. Do not skip this option: it eliminates caching and gives you accurate results.
- conv=fdatasync: Again, this tells dd to perform a complete "sync" once, right before exiting. This option is equivalent to oflag=dsync.
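The sample run above uses oflag=direct, which measures throughput. As a complement, here is a sketch of the latency-oriented variant that the oflag=dsync description refers to: many small writes, each synced to disk, which stresses write latency rather than bandwidth. The path /tmp/omkLatencyTest is an arbitrary scratch file for illustration and is removed afterwards:

```shell
# Write 100 blocks of 512 bytes, syncing each one to disk (Linux dd):
dd if=/dev/zero of=/tmp/omkLatencyTest bs=512 count=100 oflag=dsync
rm -f /tmp/omkLatencyTest
```

On a healthy disk this completes in well under a second; a multi-second result points to high write latency.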
Polling summary
The Opmantek monitoring system includes the polling_summary tool, which helps determine whether the server is taking too long to collect information from the nodes and failing to complete operations. Here we can see how many nodes have a late collection, plus a summary of the collected and uncollected nodes.
Running the script /usr/local/nmis8/admin/polling_summary.pl
Code Block |
---|
[root@opmantek ~]# /usr/local/nmis8/admin/polling_summary.pl
node attempt status ping snmp policy delta snmp avgdel poll update pollmessage
ACH-AIJ-DI-AL-SA6-0202010001-01 14:10:33 ontime up up default 328 300 422.31 22.40 17.89
ACH-AIJ-RC-ET-08K-01 --:--:-- bad_snmp up up default --- 300 403.90 10.38 14.58 snmp never successful
ACH-ANA-RC-ET-08K-01 --:--:-- bad_snmp up down default --- 300 422.57 11.39 109.09 snmp never successful
ACH-ATU-RC-ET-08K-01 --:--:-- bad_snmp up up default --- 300 391.99 0.97 62.88 snmp never successful
ACH-CAB-DI-AL-SA6-0215010001-01 14:11:21 late up up default 484 300 5543888.62 31.06 74.21 1x late poll
ACH-CAB-DR-AL-P32-01 --:--:-- bad_snmp up up default --- 300 416.30 103.46 91.28 snmp never successful
ACH-CAB-GE-GM-G30-01 14:00:54 late up down default 348 300 593.93 6.06 12.53 1x late poll
ACH-CAB-RC-ET-08K-01 --:--:-- bad_snmp up up default --- 300 411.74 10.69 7.31 snmp never successful
ACH-CAB-TT-GM-30T-01 --:--:-- bad_snmp up down default --- 300 0.00 0.00 180.42 snmp never successful
ACH-CAR-RC-ET-08K-01 14:10:20 ontime up up default 314 300 9054283.23 11.15 6.47
ACH-CAT-CN-AL-SA6-0212070008-01 14:07:39 late up up default 600 300 27253590.83 12.39 22.23 1x late poll
ACH-CAZ-TT-GM-30T-01 --:--:-- bad_snmp up down default --- 300 414.85 3.11 165.32 snmp never successful
ACH-CHM-DR-AL-P32-01 14:05:47 late up up default 456 300 2686074.17 118.55 148.58 1x late poll
ACH-CHM-GE-GM-G20-01 --:--:-- bad_snmp up down default --- 300 413.17 4.06 238.92 snmp never successful
ACH-CHM-RC-ET-09K-01 14:12:30 late up up default 633 300 1983484.93 10.49 13.07 1x late poll
ACH-CHM-TT-GM-20T-01 --:--:-- bad_snmp up down default --- 300 412.17 3.61 287.80 snmp never successful
ACH-COX-RC-ET-09K-01 13:51:14 late up up default 473 300 22141.04 9.54 4.10 1x late poll
ACH-CSM-RC-ET-08K-01 13:51:09 late up up default 444 300 539117.26 11.25 5.31 1x late poll
ACH-CSM-TT-GM-20T-01 14:08:34 late up down default 709 300 1739800.92 4.01 229.73 1x late poll
ACH-HCC-CN-AL-SA6-0212030012-01 13:50:33 ontime up up default 330 300 8131293.53 23.65 23.84
ACH-HCC-RC-ET-08K-01 14:07:56 late up up default 635 300 1802552.50 0.65 1.61 1x late poll
ACH-HEY-DI-AL-SA6-0211010001-01 13:50:52 late up up default 425 300 571.75 25.46 17.30 1x late poll
ACH-HEY-DR-AL-P32-01 --:--:-- bad_snmp up up default --- 300 119099.96 106.25 120.92 snmp never successful
ACH-HEY-GE-GM-G20-01 --:--:-- bad_snmp up down default --- 300 0.00 0.00 112.37 snmp never successful
ACH-HEY-RC-ET-09K-01 --:--:-- bad_snmp up up default --- 300 404.62 11.01 7.49 snmp never successful
--Snip--
--Snip--
UCA-PUC-DR-AL-P32-01 14:12:04 late up up default 524 300 124010.73 135.20 124.79 1x late poll
UCA-PUC-GE-GM-G30-01 14:11:20 late up down default 475 300 3868910.82 3.68 236.48 1x late poll
UCA-PUC-GE-GM-G30-02 14:12:32 late up down default 644 300 3871900.66 4.05 209.92 1x late poll
UCA-PUC-RC-ET-09K-01 --:--:-- bad_snmp up up default --- 300 418.17 10.83 5.76 snmp never successful
UCA-PUC-TT-GM-30A-01 --:--:-- bad_snmp up down default --- 300 397.68 4.21 215.65 snmp never successful
UCA-PUC-TT-GM-30A-02 14:13:03 late up down default 720 300 329362.60 3.39 208.92 1x late poll
CC_VITATRAC_GT_Z2_MAZATE 14:13:04 demoted down down default --- 300 0.00 2.22 0.80 s
CC_VITATRAC_GT_Z3_COBAN 14:13:12 late up up default 618 300 4874416.57 1.91 4.46
CC_VITATRAC_GT_Z3_ESCUINTLA 14:13:12 late up up default 604 300 4902673.92 2.17 4.8
CC_VITATRAC_GT_Z7_BODEGA_MATEO 14:15:37 late up up default 642 300 3844049.73 3.25
CC_VITATRAC_GT_Z8_MIXCO 14:15:42 late up up default 634 300 4959081.87 2.47 6.70
CC_VITATRAC_GT_Z9_XELA 14:16:03 late up up default 634 300 3943302.62 8.95 58.61
CC_VITATRAC_GT_ZONA_PRADERA 14:17:47 demoted up down default 711 300 605.21 10.91 10.28
CC_VIVATEX_GT_INTERNET_VILLA_NUEVA 14:18:49 late up up default 979 300 4563376.03 1.2
CC_VOLCAN_STA_MARIA_GT_INTERNET_CRUCE_BARCENAS 14:19:44 late up up default 981 300 44late poll
nmisslvcc5 14:18:55 late up up default 344 300 376209.90 2.33 1.23
totalNodes=2615 totalPoll=2267 ontime=73 pingOnly=0 1x_late=2190 3x_late=3 12x_late=1 144x_late=0
time=10:10:07 pingDown=354 snmpDown=359 badSnmp=295 noSnmp=0 demoted=348
[root@opmantek ~]#
|
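The totals line at the end of the report lends itself to quick arithmetic. A sketch that parses its key=value pairs and computes the on-time percentage (the sample line is copied from the output above):

```shell
# Parse key=value pairs from the polling_summary totals line and
# compute the share of polled nodes that were polled on time:
echo 'totalNodes=2615 totalPoll=2267 ontime=73 pingOnly=0 1x_late=2190 3x_late=3 12x_late=1 144x_late=0' |
awk '{ for (i = 1; i <= NF; i++) { split($i, kv, "="); v[kv[1]] = kv[2] }
       printf "ontime: %.1f%% of polled nodes\n", 100 * v["ontime"] / v["totalPoll"] }'
# prints: ontime: 3.2% of polled nodes
```

An on-time rate this low, together with thousands of 1x-late nodes, indicates the server cannot keep up with its polling policy.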
Viewing disk usage information
The iostat command helps us monitor the load on input/output devices by observing the time the devices are active relative to their average transfer rates. It can also be used to compare activity between disks.
100% iowait/utilization indicates a problem, and in most cases a big one that can even lead to data loss: essentially, there is a bottleneck somewhere in the system. Perhaps one of the drives is preparing to fail.
OMK recommends running the command in the following way, since this gives a better picture of what is happening with the disks.
Example: the command takes 5 samples, one every 3 seconds. We want at least 3 of the samples to show values within the stable range for the server; otherwise, there is a problem with the disks.
Code Block |
---|
[root@opmantek ~]# iostat -xtc 3 5
Linux 2.6.32-754.28.1.el6.x86_64 (opmantek) 04/05/2021 _x86_64_ (8 CPU)
04/05/2021 09:23:40 PM
avg-cpu: %user %nice %system %iowait %steal %idle
12.47 0.00 0.73 10.53 0.00 86.72
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 0.00 4.50 35.50 148.00 452.00 15.00 110.98 4468.74 274.22 5000 0.60 100.00
sdb 0.00 42.50 0.00 6.50 0.00 392.00 60.31 0.13 20.00 0.00 20 0.34 92.12
dm-0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0 0.65 56.00
dm-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0 0.86 10.50
dm-2 0.00 0.00 4.50 52.00 140.00 416.00 9.84 149.56 5229.59 274.22 5658 0.21 25.03
dm-3 0.00 0.00 0.00 0.50 0.00 4.00 8.00 66.00 0.00 0.00 0 0.45 14.40
04/05/2021 09:23:43 PM
avg-cpu: %user %nice %system %iowait %steal %idle
18.17 0.00 5.29 6.31 0.00 76.82
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 50.00 9.50 19.00 596.00 260.00 30.04 130.41 2569.47 283.11 3712 0.60 92.36
sdb 0.00 36.50 0.50 59.00 8.00 764.00 12.97 25.34 425.82 18.00 429 0.25 78.82
dm-0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0 0.23 92.45
dm-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0 0.86 88.93
dm-2 0.00 0.00 8.00 163.50 440.00 1308.00 10.19 240.76 966.94 337.38 997 0.37 68.28
dm-3 0.00 0.00 0.00 33.00 0.00 264.00 8.00 48.31 0.00 0.00 0 0.18 12.75
04/05/2021 09:23:46 PM
avg-cpu: %user %nice %system %iowait %steal %idle
2.50 0.00 1.21 11.37 0.00 75.56
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 0.00 9.50 18.00 268.00 220.00 17.75 112.91 1763.73 143.42 2618 0.85 100.00
sdb 0.00 10.00 2.00 1.50 112.00 92.00 58.29 0.01 3.86 6.25 0 0.94 97.54
dm-0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0 0.45 75.39
dm-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0 0.78 24.96
dm-2 0.00 0.00 13.50 11.50 552.00 92.00 25.76 185.21 3029.96 101.85 6467 0.25 67.18
dm-3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0 0.86 43.91
04/05/2021 09:23:49 PM
avg-cpu: %user %nice %system %iowait %steal %idle
12.10 0.00 7.21 9.17 0.00 87.92
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 55.50 7.00 44.00 92.00 488.00 11.37 110.52 929.20 139.86 1054 0.75 89.54
sdb 0.00 65.00 0.50 34.00 4.00 792.00 23.07 0.83 24.09 1.00 24 0.55 93.61
dm-0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0 0.14 99.99
dm-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0 0.36 78.98
dm-2 0.00 0.00 7.00 242.50 84.00 1940.00 8.11 179.44 240.22 137.36 243 0.75 25.30
dm-3 0.00 0.00 0.00 5.00 0.00 40.00 8.00 1.30 305.90 0.00 305 0.23 45.12
04/05/2021 09:23:52 PM
avg-cpu: %user %nice %system %iowait %steal %idle
9.50 0.00 11.21 19.30 0.00 92.92
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.16 114.34 7.02 191.18 132.04 2444.27 13.00 3.60 18.18 81.41 15 0.14 99.99
sdb 0.03 205.87 2.36 70.03 31.22 2207.55 30.92 5.81 80.25 53.76 81 0.94 97.54
dm-0 0.00 0.00 0.10 1.01 11.77 8.07 17.90 0.84 755.10 72.31 822 0.60 98.36
dm-1 0.00 0.00 0.09 0.13 0.74 1.03 8.00 0.22 985.66 153.25 1580 0.47 94.48
dm-2 0.00 0.00 9.25 575.59 129.18 4604.83 8.09 6.09 9.74 74.24 8 0.61 82.37
dm-3 0.00 0.00 0.12 4.74 21.57 37.89 12.24 2.52 518.00 131.58 527 0.23 93.15
[root@opmantek ~]#
|
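To spot saturated devices at a glance, the %util column (the last field of each device line) can be filtered. A sketch over two sample lines copied from the first interval above:

```shell
# Print devices whose %util exceeds 90 in iostat -x style output:
printf '%s\n' \
  'sda 0.00 0.00 4.50 35.50 148.00 452.00 15.00 110.98 4468.74 274.22 5000 0.60 100.00' \
  'sdb 0.00 42.50 0.00 6.50 0.00 392.00 60.31 0.13 20.00 0.00 20 0.34 92.12' |
awk '$NF + 0 > 90 { print $1, $NF "%" }'
# prints:
# sda 100.00%
# sdb 92.12%
```

In practice you would pipe a captured iostat report through the awk filter rather than hard-coded lines; the hard-coded input here just keeps the sketch self-contained.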