This page provides an NMIS device troubleshooting process for identifying collection problems in NMIS8/9 products. It is broken down into clear steps that anyone can follow to find out what is wrong with a device's collection, including cases where there are gaps in the graphs for nodes managed by NMIS.

...

It is important to review the load average and iowait. If these values are high, the server has a problem.
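As a rough sketch of how the load average could be put in context, the snippet below divides the 1-minute load average by the number of CPU cores. The helper name `load_per_core` is hypothetical, and the script assumes a Linux host where `/proc/loadavg` and `nproc` are available. A result that stays well above 1.0 is a common rule of thumb for an overloaded server, not an NMIS-documented threshold.

```shell
#!/bin/sh
# load_per_core LOAD CORES
# Hypothetical helper: prints LOAD divided by CORES with two decimals.
load_per_core() {
    awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.2f\n", load / cores }'
}

# Assumption: Linux, where /proc/loadavg starts with the 1-minute load
# average and nproc (coreutils) reports the number of available cores.
if [ -r /proc/loadavg ] && command -v nproc >/dev/null 2>&1; then
    read -r load1 _rest < /proc/loadavg
    echo "1-min load per core: $(load_per_core "$load1" "$(nproc)")"
fi
```

For example, a load average of 4.00 on an 8-core server gives 0.50 per core, which by this rule of thumb leaves headroom.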


List of processes

The ps command provides information about the processes running on a Linux or Unix system.
Sometimes tasks hang, enter an infinite loop, or stop responding. Others keep running but consume too much CPU time or RAM, or otherwise misbehave. Sometimes such tasks need to be terminated. The first step, of course, is to identify the process in question.

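Processes in a "D" (uninterruptible sleep) state are usually waiting on I/O. Processes in a "Z" state are zombies: they have exited but have not yet been reaped by their parent, and they appear as <defunct> in the output.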


Code Block
[root@nmisslvcc5 log]# ps -auxf | egrep " D| Z"
Warning: bad syntax, perhaps a bogus '-'? See /usr/share/doc/procps-3.2.8/FAQ
root      1563  0.1  0.0      0     0 ?        D    Mar17  10:47  \_ [jbd2/dm-2-8]
root      1565  0.0  0.0      0     0 ?        D    Mar17   0:43  \_ [jbd2/dm-3-8]
root      1615  0.3  0.0      0     0 ?        D    Mar17  39:26  \_ [flush-253:2]
root      1853  0.0  0.0  29764   736 ?        D<sl Mar17   0:04 auditd
root     17898  0.0  0.0 103320   872 pts/5    S+   12:20   0:00  |       \_ egrep  D| Z
apache   17856 91.0  0.2 205896 76212 ?        D    12:19   0:01  |   \_ /usr/bin/perl /usr/local/nmis8/
root     13417  0.6  0.8 565512 306812 ?       D    10:38   0:37  \_ opmantek.pl webserver             -
root     17833  9.8  0.0      0     0 ?        Z    12:19   0:00      \_ [opeventsd.pl] <defunct>
root     17838 10.3  0.0      0     0 ?        Z    12:19   0:00      \_ [opeventsd.pl] <defunct>
root     17842 10.6  0.0      0     0 ?        Z    12:19   0:00      \_ [opeventsd.pl] <defunct>
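A possible variation on the command above, shown as a sketch rather than the recommended OMK procedure, is to count processes by matching the state column directly. This avoids matching the filter command itself (the `egrep` line in the output above). The function name `count_dz` is hypothetical.

```shell
#!/bin/sh
# count_dz reads "PID STAT COMMAND" lines on stdin and prints how many
# processes are in state D (uninterruptible sleep) and Z (zombie).
# Only the second field (the state) is inspected, so command names that
# happen to contain " D" or " Z" are not counted.
count_dz() {
    awk '$2 ~ /^D/ { d++ }
         $2 ~ /^Z/ { z++ }
         END { printf "D=%d Z=%d\n", d, z }'
}

# The "=" after each column name suppresses the header line.
if command -v ps >/dev/null 2>&1; then
    ps -eo pid=,stat=,comm= | count_dz
fi
```

A D count that stays high across several runs suggests processes are stuck waiting on I/O, which is a reason to look at the disks next.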

...

OMK recommends running the command as shown below, since it gives a better picture of what is happening with the disks.
Example: the command takes 5 samples, one every 3 seconds. At least 3 of the samples should show values within the stable range for the server; otherwise, there is a problem with the disks.

Code Block
[root@opmantek ~]# iostat -xtc 3 5
Linux 2.6.32-754.28.1.el6.x86_64 (opmantek)     04/05/2021      _x86_64_        (8 CPU)

04/05/2021 09:23:40 PM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           12.47    0.00    0.73   10.53    0.00   86.72

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00    4.50   35.50   148.00   452.00    15.00   110.98 4468.74  274.22 5000    0.60    100.00
sdb               0.00    42.50    0.00    6.50     0.00   392.00    60.31     0.13   20.00    0.00   20    0.34    92.12
dm-0              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0    0.65    56.00
dm-1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0    0.86    10.50
dm-2              0.00     0.00    4.50   52.00   140.00   416.00     9.84   149.56 5229.59  274.22 5658    0.21    25.03
dm-3              0.00     0.00    0.00    0.50     0.00     4.00     8.00    66.00    0.00    0.00    0    0.45    14.40


04/05/2021 09:23:43 PM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           18.17    0.00    5.29    6.31    0.00   76.82

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00    50.00    9.50   19.00   596.00   260.00    30.04   130.41 2569.47  283.11 3712     0.60    92.36
sdb               0.00    36.50    0.50   59.00     8.00   764.00    12.97    25.34  425.82   18.00  429     0.25    78.82
dm-0              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0     0.23    92.45
dm-1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0     0.86    88.93
dm-2              0.00     0.00    8.00  163.50   440.00  1308.00    10.19   240.76  966.94  337.38  997     0.37    68.28
dm-3              0.00     0.00    0.00   33.00     0.00   264.00     8.00    48.31    0.00    0.00    0     0.18    12.75 

04/05/2021 09:23:46 PM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           2.50    0.00    1.21    11.37    0.00   75.56

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00    9.50   18.00   268.00   220.00    17.75   112.91 1763.73  143.42 2618     0.85    100.00
sdb               0.00    10.00    2.00    1.50   112.00    92.00    58.29     0.01    3.86    6.25    0     0.94     97.54
dm-0              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0     0.45     75.39
dm-1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0     0.78     24.96
dm-2              0.00     0.00   13.50   11.50   552.00    92.00    25.76   185.21 3029.96  101.85 6467     0.25     67.18 
dm-3              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0     0.86     43.91

04/05/2021 09:23:49 PM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           12.10    0.00    7.21    9.17    0.00   87.92

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00    55.50    7.00   44.00    92.00   488.00    11.37   110.52  929.20  139.86 1054     0.75    89.54
sdb               0.00    65.00    0.50   34.00     4.00   792.00    23.07     0.83   24.09    1.00   24     0.55    93.61
dm-0              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0     0.14    99.99
dm-1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0     0.36    78.98
dm-2              0.00     0.00    7.00  242.50    84.00  1940.00     8.11   179.44  240.22  137.36  243     0.75    25.30
dm-3              0.00     0.00    0.00    5.00     0.00    40.00     8.00     1.30  305.90    0.00  305     0.23    45.12

04/05/2021 09:23:52 PM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           9.50    0.00    11.21    19.30    0.00   92.92

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.16   114.34    7.02  191.18   132.04  2444.27    13.00     3.60   18.18   81.41   15     0.14    99.99    
sdb               0.03   205.87    2.36   70.03    31.22  2207.55    30.92     5.81   80.25   53.76   81     0.94    97.54
dm-0              0.00     0.00    0.10    1.01    11.77     8.07    17.90     0.84  755.10   72.31  822     0.60    98.36
dm-1              0.00     0.00    0.09    0.13     0.74     1.03     8.00     0.22  985.66  153.25 1580     0.47    94.48
dm-2              0.00     0.00    9.25  575.59   129.18  4604.83     8.09     6.09    9.74   74.24    8     0.61    82.37
dm-3              0.00     0.00    0.12    4.74    21.57    37.89    12.24     2.52  518.00  131.58  527     0.23    93.15

[root@opmantek ~]#
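The "at least 3 of 5 samples" check can be sketched as a small filter over the iostat output. This is an illustration under stated assumptions: the function name `count_stable_iowait` is hypothetical, and the limit of 10 used in the example is a placeholder, since the stable range for a given server is not defined in this section. Note also that the first report iostat prints covers the time since boot rather than the last interval.

```shell
#!/bin/sh
# count_stable_iowait LIMIT
# Reads `iostat -xtc` output on stdin and prints "stable/total":
# the number of avg-cpu samples whose %iowait is below LIMIT, and the
# total number of samples seen. LIMIT is an assumption chosen by the
# operator, not a value documented by NMIS.
count_stable_iowait() {
    awk -v limit="$1" '
        /^avg-cpu:/ {
            # Locate the %iowait column from the header; the values line
            # has no "avg-cpu:" label, so its index is one lower.
            for (i = 2; i <= NF; i++) if ($i == "%iowait") col = i - 1
            if ((getline) > 0 && col > 0) {
                total++
                if ($col + 0 < limit + 0) stable++
            }
        }
        END { printf "%d/%d\n", stable, total }
    '
}

# Usage (takes about 12 seconds; requires the sysstat package):
#   iostat -xtc 3 5 | count_stable_iowait 10
```

Applied to the five samples above with a placeholder limit of 10, only the samples with %iowait of 6.31 and 9.17 fall under the limit, so the result would be 2/5, which does not meet the 3-sample requirement.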

...