This page describes an NMIS device troubleshooting process for identifying collection problems in NMIS8/9 products. It is broken down into clear steps that anyone can follow to find out what is wrong with a device's collection, including cases where the graphs for nodes managed by NMIS show gaps.
...
It is important to review the load average and iowait: if these values are high, the server has a problem.
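As a rough way to put a number on "high", the sketch below divides the 1-minute load average by the CPU count. The function name `load_per_cpu` is hypothetical, and the idea that a ratio persistently above 1.0 deserves a closer look is a general Linux rule of thumb, not an NMIS-defined threshold. On Linux the load average also counts tasks in uninterruptible sleep, so heavy disk waits can raise it even when the CPUs are mostly idle.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the 1-minute load average divided by the CPU count.
#   $1: one line in /proc/loadavg format, e.g. "4.00 3.50 3.00 2/345 12345"
#   $2: number of CPUs (must be greater than zero)
load_per_cpu() {
  awk -v line="$1" -v cpus="$2" 'BEGIN {
    if (cpus + 0 <= 0) exit 1      # refuse to divide by zero
    split(line, f, " ")            # f[1] is the 1-minute load average
    printf "%.2f\n", f[1] / cpus
  }'
}
```

On a live Linux server this could be called as `load_per_cpu "$(cat /proc/loadavg)" "$(nproc)"`.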
List of processes
The ps command provides us with information about the processes of a Linux or Unix system.
Sometimes tasks hang, get stuck in a loop, or stop responding. Others keep running but gobble up too much CPU time or RAM, or behave in an equally antisocial manner. Sometimes a task needs to be killed as a mercy to everyone involved. The first step, of course, is to identify the process in question.
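Processes in a "D" (uninterruptible sleep) state are usually waiting on I/O. Processes in a "Z" state are defunct (zombie) processes: they have exited but have not yet been reaped by their parent.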
Code Block |
---|
[root@nmisslvcc5 log]# ps -auxf | egrep " D| Z" |
Warning: bad syntax, perhaps a bogus '-'? See /usr/share/doc/procps-3.2.8/FAQ |
root 1563 0.1 0.0 0 0 ? D Mar17 10:47 \_ [jbd2/dm-2-8] |
root 1565 0.0 0.0 0 0 ? D Mar17 0:43 \_ [jbd2/dm-3-8] |
root 1615 0.3 0.0 0 0 ? D Mar17 39:26 \_ [flush-253:2] |
root 1853 0.0 0.0 29764 736 ? D<sl Mar17 0:04 auditd |
root 17898 0.0 0.0 103320 872 pts/5 S+ 12:20 0:00 | \_ egrep D| Z |
apache 17856 91.0 0.2 205896 76212 ? D 12:19 0:01 | \_ /usr/bin/perl /usr/local/nmis8/ |
root 13417 0.6 0.8 565512 306812 ? D 10:38 0:37 \_ opmantek.pl webserver - |
root 17833 9.8 0.0 0 0 ? Z 12:19 0:00 \_ [opeventsd.pl] <defunct> |
root 17838 10.3 0.0 0 0 ? Z 12:19 0:00 \_ [opeventsd.pl] <defunct> |
root 17842 10.6 0.0 0 0 ? Z 12:19 0:00 \_ [opeventsd.pl] <defunct> |
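The `egrep " D| Z"` filter above matches " D" or " Z" anywhere in the line, which is why the `egrep` process itself appears in the output. As a possible alternative, the sketch below filters on the process state column only. The function names `filter_dz` and `list_dz` are hypothetical, and the column layout assumes the `ps -eo pid,stat,user,args` format used here. The warning in the output above appears to come from the leading dash in `-auxf`; using explicit `-eo` columns avoids mixing the two option styles.

```shell
#!/usr/bin/env bash
# Hypothetical helper: keep only lines whose second column (STAT) starts with
# D (uninterruptible sleep) or Z (zombie/defunct).
# stdin: lines in "PID STAT USER COMMAND..." order.
filter_dz() {
  awk '$2 ~ /^[DZ]/'
}

# Hypothetical wrapper: list D and Z processes on the live system.
list_dz() {
  ps -eo pid,stat,user,args | filter_dz
}
```

Running `list_dz` a few times in a row helps distinguish processes that are briefly in "D" state from ones that stay stuck there.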
...
OMK recommends running the command as follows, since this gives a clearer picture of what is happening with the disks.
Example: the command takes 5 samples, one every 3 seconds. At least 3 of the samples should show values within the stable range for the server; otherwise, there is a problem with the disks.
Code Block |
---|
[root@opmantek ~]# iostat -xtc 3 5 |
Linux 2.6.32-754.28.1.el6.x86_64 (opmantek) 04/05/2021 _x86_64_ (8 CPU) |
04/05/2021 09:23:40 PM |
avg-cpu: %user %nice %system %iowait %steal %idle |
12.47 0.00 0.73 10.53 0.00 86.72 |
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await r_await w_await svctm %util |
sda 0.00 0.00 4.50 35.50 148.00 452.00 15.00 110.98 4468.74 274.22 5000 0.60 100.00 |
sdb 0.00 42.50 0.00 6.50 0.00 392.00 60.31 0.13 20.00 0.00 20 0.34 92.12 |
dm-0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0 0.65 56.00 |
dm-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0 0.86 10.50 |
dm-2 0.00 0.00 4.50 52.00 140.00 416.00 9.84 149.56 5229.59 274.22 5658 0.21 25.03 |
dm-3 0.00 0.00 0.00 0.50 0.00 4.00 8.00 66.00 0.00 0.00 0 0.45 14.40 |
04/05/2021 09:23:43 PM |
avg-cpu: %user %nice %system %iowait %steal %idle |
18.17 0.00 5.29 6.31 0.00 76.82 |
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await r_await w_await svctm %util |
sda 0.00 50.00 9.50 19.00 596.00 260.00 30.04 130.41 2569.47 283.11 3712 0.60 92.36 |
sdb 0.00 36.50 0.50 59.00 8.00 764.00 12.97 25.34 425.82 18.00 429 0.25 78.82 |
dm-0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0 0.23 92.45 |
dm-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0 0.86 88.93 |
dm-2 0.00 0.00 8.00 163.50 440.00 1308.00 10.19 240.76 966.94 337.38 997 0.37 68.28 |
dm-3 0.00 0.00 0.00 33.00 0.00 264.00 8.00 48.31 0.00 0.00 0 0.18 12.75 |
04/05/2021 09:23:46 PM |
avg-cpu: %user %nice %system %iowait %steal %idle |
2.50 0.00 1.21 11.37 0.00 75.56 |
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await r_await w_await svctm %util |
sda 0.00 0.00 9.50 18.00 268.00 220.00 17.75 112.91 1763.73 143.42 2618 0.85 100.00 |
sdb 0.00 10.00 2.00 1.50 112.00 92.00 58.29 0.01 3.86 6.25 0 0.94 97.54 |
dm-0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0 0.45 75.39 |
dm-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0 0.78 24.96 |
dm-2 0.00 0.00 13.50 11.50 552.00 92.00 25.76 185.21 3029.96 101.85 6467 0.25 67.18 |
dm-3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0 0.86 43.91 |
04/05/2021 09:23:49 PM |
avg-cpu: %user %nice %system %iowait %steal %idle |
12.10 0.00 7.21 9.17 0.00 87.92 |
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await r_await w_await svctm %util |
sda 0.00 55.50 7.00 44.00 92.00 488.00 11.37 110.52 929.20 139.86 1054 0.75 89.54 |
sdb 0.00 65.00 0.50 34.00 4.00 792.00 23.07 0.83 24.09 1.00 24 0.55 93.61 |
dm-0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0 0.14 99.99 |
dm-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0 0.36 78.98 |
dm-2 0.00 0.00 7.00 242.50 84.00 1940.00 8.11 179.44 240.22 137.36 243 0.75 25.30 |
dm-3 0.00 0.00 0.00 5.00 0.00 40.00 8.00 1.30 305.90 0.00 305 0.23 45.12 |
04/05/2021 09:23:52 PM |
avg-cpu: %user %nice %system %iowait %steal %idle |
9.50 0.00 11.21 19.30 0.00 92.92 |
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await r_await w_await svctm %util |
sda 0.16 114.34 7.02 191.18 132.04 2444.27 13.00 3.60 18.18 81.41 15 0.14 99.99 |
sdb 0.03 205.87 2.36 70.03 31.22 2207.55 30.92 5.81 80.25 53.76 81 0.94 97.54 |
dm-0 0.00 0.00 0.10 1.01 11.77 8.07 17.90 0.84 755.10 72.31 822 0.60 98.36 |
dm-1 0.00 0.00 0.09 0.13 0.74 1.03 8.00 0.22 985.66 153.25 1580 0.47 94.48 |
dm-2 0.00 0.00 9.25 575.59 129.18 4604.83 8.09 6.09 9.74 74.24 8 0.61 82.37 |
dm-3 0.00 0.00 0.12 4.74 21.57 37.89 12.24 2.52 518.00 131.58 527 0.23 93.15 |
[root@opmantek ~]# |
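The "at least 3 of 5 samples" check described above can be sketched as a small filter over the `iostat` output. This is only an illustration: the function name `count_stable_iowait` is hypothetical, the limit passed to it is an assumed value (this page does not define the stable range for a given server), and it looks at `%iowait` only, not at `await` or `%util`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count the samples whose %iowait is at or below a limit.
#   stdin: output of `iostat -xtc <interval> <count>`
#   $1:    %iowait limit in percent (assumed value, tune it per server)
# Prints "<stable samples> <total samples>".
count_stable_iowait() {
  awk -v limit="$1" '
    /^avg-cpu:/ {
      # Locate %iowait in the header; the values line has no "avg-cpu:" label,
      # so the matching value sits one field to the left.
      col = 0
      for (i = 2; i <= NF; i++) if ($i == "%iowait") col = i - 1
      if (col && (getline > 0)) {
        total++
        if ($col + 0 <= limit + 0) stable++
      }
    }
    END { printf "%d %d\n", stable, total }
  '
}
```

For example, `iostat -xtc 3 5 | count_stable_iowait 10` applied to the output above would print `2 5`: only two samples are at or below the assumed 10% limit, so this server would fail the 3-of-5 check. Note that the first report `iostat` prints normally reflects averages since boot rather than the first interval, which may be worth keeping in mind when counting samples.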
...