This page provides an NMIS Device Troubleshooting Process for identifying bad collection behavior in the NMIS8/9 products. It breaks the work into clear steps that anyone can follow to identify what is wrong with device collection, including cases where there are gaps in graphs.
Device Troubleshooting Process
Flow Diagrams
- Identify the problem. The first step in troubleshooting a device issue is to identify the problem; consider whether the issue is in the NMIS8 or NMIS9 product.
- Add the product version and the servers/devices/models involved to the support case.
- What kind of problem are you observing? A device issue can be caused by any of the following:
- Network performance: latency in the network, or layer 1, 2, and 3 issues.
- Device configuration: connectivity, SNMP configuration, and others.
- Server hardware requirements: high resource utilization on the server.
- Server configuration options: missing configuration items for server tuning.
- Disk performance: slow write/read times for the device collection.
- Gather information. Collect all the graphs, images, and behaviors that can explain what the problem is.
- Collect support tool files: The Opmantek Support Tool
Execute the collect command for the support tool
# General collection
/usr/local/nmis8/admin/support.pl action=collect

# If the file is big, we can add the next parameter
/usr/local/nmis8/admin/support.pl action=collect maxzipsize=900000000

# Device collection
/usr/local/nmis8/admin/support.pl action=collect node=<node_name> maxzipsize=900000000
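If you are running NMIS9, the support tool is assumed to live under the nmis9 tree at the path below; verify it on your installation before running it.

# Assumed NMIS9 location of the support tool (adjust if your install differs)
/usr/local/nmis9/admin/support.pl action=collect node=<node_name> maxzipsize=900000000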
- If you are using NMIS8, provide the /usr/local/nmis8/var files
Go to the /usr/local/nmis8/var directory and collect the following files:
-rw-rw---- 1 nmis nmis 4292 Apr  5 18:26 <node_name>-node.json
-rw-rw---- 1 nmis nmis 2695 Apr  5 18:26 <node_name>-view.json
Obtain the update/collect outputs; this information will be uploaded to the support case:
/usr/local/nmis8/bin/nmis.pl type=update node=<node_name> model=true debug=9 force=true > /tmp/node_name_update_$(hostname).log
/usr/local/nmis8/bin/nmis.pl type=collect node=<node_name> model=true debug=9 force=true > /tmp/node_name_collect_$(hostname).log
- Replicate the problem. If possible, define the steps needed to replicate the problem.
- Identify symptoms. At this point you should be able to see a specific problem and what its symptoms are.
- Determine if something has changed. It is important to verify with your team whether something has changed; a good way to see this is by monitoring the performance graphs for the devices and the server.
- Is it an individual problem? Verify whether this behavior is happening on a single device/server.
Network performance - Server.
Introduction.
This section focuses on reviewing and validating the overall status of the server. We will concentrate on verifying the historical behavior of the server's main metrics; it is important to review all the metrics related to good performance between the server and the devices.
Verifying Health Metrics
- Metrics are important for the server. NMIS uses Reachability, Availability and Health to represent the network.
Reachability is the pingability of the device.
Availability is, in the context of network gear, whether the interfaces which should be up are actually up; e.g. interfaces which are "no shutdown" (ifAdminStatus = up) should be up, so a device with 10 interfaces of ifAdminStatus = up and ifOperStatus = up on 9 of them would be 90% available.
Health is a composite metric, made up of many things depending on the device: router, CPU, memory. Something interesting here is that part of the health is the inverse of interface utilisation, so an interface with no utilisation will have a high health component, while a highly utilised interface will reduce that metric. So the health is a reflection of load on the device, and will be very dynamic.
The overall metric of a device is a composite made up of weighted values of the other metrics being collected. The formula is configurable, so you can weight Reachability higher or lower than it currently is, your choice; a simple illustration follows below.
For more references go to NMIS Metrics, Reachability, Availability and Health
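As a rough, purely illustrative sketch of how such a weighted composite is calculated (the per-metric values and weights below are hypothetical, not the actual NMIS configuration keys or defaults), the arithmetic looks like this:

# Illustration only: hypothetical per-metric values (percent) and weights
reach=100; avail=90; health=75
w_reach=0.4; w_avail=0.3; w_health=0.3   # weights chosen for illustration, must sum to 1
echo "$reach $avail $health $w_reach $w_avail $w_health" | \
  awk '{ printf "overall metric = %.1f\n", $1*$4 + $2*$5 + $3*$6 }'
# overall metric = 89.5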
- It is important to validate the localhost health, including overall Reachability, Availability and Health. You will be able to spot data that does not follow the historical pattern, which can give a clue as to where the problem may be happening, or even whether the abnormal behavior started before a change request in the early hours.
- Viewing the graphs related to network performance (Response Time in milliseconds, IP Utilization, TCP Connection, TCP Segments) will help identify the behavior of the server/network over a period of 2 days; we can modify this time period to see more data if needed.
Device configuration.
It is important to validate whether the problem occurs in the network or is something related to the device configuration. In order to identify what is happening, we need to run the following commands from the server console.
Ping test. The ping tool is used to test whether a particular host is reachable across an IP network. A ping measures the time it takes for packets to be sent from the local host to a destination computer and back.
ping x.x.x.x #add the ip address you need to reach
Traceroute is a network diagnostic tool used to track, in real time, the path taken by a packet on an IP network from source to destination, reporting the IP addresses of all the routers it passes through along the way.
traceroute <ip_Node> #add the ip address you need to reach
MTR (my traceroute) is a command-line network diagnostic tool that combines the functionality of the ping and traceroute commands.
sudo mtr -r 8.8.8.8

[sample results below]
HOST: endor                       Loss%   Snt   Last   Avg  Best  Wrst StDev
  1. 69.28.84.2                    0.0%    10    0.4   0.4   0.3   0.6   0.1
  2. 38.104.37.141                 0.0%    10    1.2   1.4   1.0   3.2   0.7
  3. te0-3-1-1.rcr21.dfw02.atlas.  0.0%    10    0.8   0.9   0.8   1.0   0.1
  4. be2285.ccr21.dfw01.atlas.cog  0.0%    10    1.1   1.1   0.9   1.4   0.1
  5. be2432.ccr21.mci01.atlas.cog  0.0%    10   10.8  11.1  10.8  11.5   0.2
  6. be2156.ccr41.ord01.atlas.cog  0.0%    10   22.9  23.1  22.9  23.3   0.1
  7. be2765.ccr41.ord03.atlas.cog  0.0%    10   22.8  22.9  22.8  23.1   0.1
  8. 38.88.204.78                  0.0%    10   22.9  23.0  22.8  23.9   0.4
  9. 209.85.143.186                0.0%    10   22.7  23.7  22.7  31.7   2.8
 10. 72.14.238.89                  0.0%    10   23.0  23.9  22.9  32.0   2.9
 11. 216.239.47.103                0.0%    10   50.4  61.9  50.4  92.0  11.9
 12. 216.239.46.191                0.0%    10   32.7  32.7  32.7  32.8   0.1
 13. ???                          100.0    10    0.0   0.0   0.0   0.0   0.0
 14. google-public-dns-a.google.c  0.0%    10   32.7  32.7  32.7  32.8   0.0
- snmpwalk is a Simple Network Management Protocol (SNMP) application that uses SNMP GETNEXT requests to query a network device for information. An object identifier (OID) may be given on the command line. It is important to verify that the device is pingable, does not have latency or packet loss, and that the SNMP data is being collected.
The following example CLI command will return the IPS temperature information:

Command:
snmpwalk -v 2c -c tinapc <IP address> 1.3.6.1.4.1.10734.3.5.2.5.5

Command explanation. In this case the CLI command breaks down as follows:
- snmpwalk = SNMP application
- -v 2c = specifies what SNMP version to use (1, 2c, 3)
- -c tinapc = specifies the community string. Note: the IPS has the SNMP read-only community string of "tinapc"
- <IP address> = specifies the IP address of the IPS device
- 1.3.6.1.4.1.10734.3.5.2.5.5 = OID parameter for the IPS temperature information

Results:
SNMPv2-SMI::enterprises.10734.3.5.2.5.5.1.0 = INTEGER: 27
SNMPv2-SMI::enterprises.10734.3.5.2.5.5.2.0 = INTEGER: 50
SNMPv2-SMI::enterprises.10734.3.5.2.5.5.3.0 = INTEGER: 55
SNMPv2-SMI::enterprises.10734.3.5.2.5.5.4.0 = INTEGER: 0
SNMPv2-SMI::enterprises.10734.3.5.2.5.5.5.0 = INTEGER: 85

Results explanation:
- SNMPv2-SMI::enterprises.10734.3.5.2.5.5.1.0 = INTEGER: 27 = the chassis temperature (27° Celsius / 80.6° Fahrenheit)
- SNMPv2-SMI::enterprises.10734.3.5.2.5.5.2.0 = INTEGER: 50 = the major threshold value for chassis temperature (50° Celsius / 122° Fahrenheit)
- SNMPv2-SMI::enterprises.10734.3.5.2.5.5.3.0 = INTEGER: 55 = the critical threshold value for chassis temperature (55° Celsius / 131° Fahrenheit)
- SNMPv2-SMI::enterprises.10734.3.5.2.5.5.4.0 = INTEGER: 0 = the minimum value of the chassis temperature range (0° Celsius / 32° Fahrenheit)
- SNMPv2-SMI::enterprises.10734.3.5.2.5.5.5.0 = INTEGER: 85 = the maximum value of the chassis temperature range (85° Celsius / 185° Fahrenheit)
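Before drilling into vendor-specific OIDs, a quick sanity check is to walk the standard system subtree of the node NMIS is polling. This is a minimal sketch that assumes SNMP v2c and that <community> matches the community string configured for the node:

# Minimal SNMP sanity check against a monitored node (SNMP v2c assumed)
snmpwalk -v 2c -c <community> <node_ip> system
# A reply (sysDescr, sysUpTime, ...) confirms the device answers SNMP;
# a timeout points to the community string, an ACL, or the network rather than NMIS itself.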
Service performance
NMIS relies on several important services to make the solution work; sometimes devices stop being collected because one of these services has been interrupted, so it is always a good idea to validate that they are running by executing the commands below. Some of these services are crucial for the operation of the operating system itself. On Unix or Linux systems, services are also known as daemons. In this case, it is essential to validate the services that make up the Opmantek monitoring system (NMIS).
service mongod status
service omkd status
service nmisd status
service httpd status
service opchartsd status
service opeventsd status
service opconfigd status
service opflowd status
service crond status
# If any of these daemons is stopped, run the same command with the start or restart option.
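To check all of them in one pass, a small loop such as the sketch below works on systems that use the SysV service wrapper; drop any service names that are not installed on your server:

# Check every Opmantek-related daemon in one pass (SysV "service" wrapper assumed)
for svc in mongod omkd nmisd httpd opchartsd opeventsd opconfigd opflowd crond; do
    echo "--- $svc ---"
    service "$svc" status || echo "$svc is not running - consider: service $svc start"
done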
Server hardware requirements.
This section is crucial for identifying or resolving device issues. You need to review some considerations depending on the number of nodes you will manage, the number of users that will be accessing the GUIs, and how often your data needs to be updated: if updates are required every 5 minutes, you will need hardware able to meet that requirement. The OS requirements also need to be well defined; a good rule of thumb is to reserve 1 GB of RAM for the OS by default, use high-speed drives for the data (SAN is ideal) with separate storage for the Mongo database and temp files, and anywhere between 4-8 cores with high-performing processor(s) and 16-64 GB of RAM should perform well for 1k+ nodes. The commands below give a quick way to check these figures on the server.
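To quickly compare a server against that sizing guidance, the following standard Linux commands report cores, RAM, and disk layout:

# CPU cores, memory, and disk layout at a glance
nproc                      # number of CPU cores
free -g                    # total/used/free RAM in GB
lsblk -o NAME,SIZE,ROTA    # ROTA=1 means rotational (HDD), ROTA=0 means SSD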
Using top/htop command
The top command shows all running processes on the server. It shows system information such as uptime, average load, tasks running, number of users logged in, number of CPU processes, and RAM utilization, and it lists all the processes run by the users on your server.
top
top - 12:50:01 up 62 days, 22:56,  5 users,  load average: 4.76, 8.03, 4.34
Tasks: 412 total,   1 running, 411 sleeping,   0 stopped,  15 zombie
Cpu(s):  6.8%us,  3.8%sy,  0.2%ni, 74.4%id, 28.2%wa,  0.1%hi,  0.5%si,  0.0%st
Mem:  20599548k total, 18622368k used,  1977180k free,   375212k buffers
Swap:  6669720k total,  3536428k used,  3133292k free, 10767256k cached

  PID USER      PR  NI  VIRT   RES  SHR S %CPU %MEM     TIME+  COMMAND
26306 root      20   0  478m  257m 1900 S  3.9  1.3   0:08.21  nmis.pl
15522 root      20   0  626m  373m 2776 S  2.0  1.9  71:45.09  opeventsd.pl
27285 root      20   0 15280  1444  884 R  2.0  0.0   0:00.01  top
    1 root      20   0 19356   308  136 S  0.0  0.0   1:07.65  init
    2 root      20   0     0     0    0 S  0.0  0.0   0:02.14  kthreadd
    3 root      RT   0     0     0    0 S  0.0  0.0  17359:19  migration/0
    4 root      20   0     0     0    0 S  0.0  0.0 252:25.86  ksoftirqd/0
    5 root      RT   0     0     0    0 S  0.0  0.0   0:00.00  stopper/0
    6 root      RT   0     0     0    0 S  0.0  0.0   2233:33  watchdog/0
    7 root      RT   0     0     0    0 S  0.0  0.0 340:35.60  migration/1
    8 root      RT   0     0     0    0 S  0.0  0.0   0:00.00  stopper/1
    9 root      20   0     0     0    0 S  0.0  0.0   5:23.87  ksoftirqd/1
   10 root      RT   0     0     0    0 S  0.0  0.0 214:57.35  watchdog/1
1. First line
The very first line of the top command shows, in order, the following.
top - 12:50:01 up 62 days, 22:56, 5 users, load average: 4.76, 8.03, 4.34
- current time (12:50:01)
- uptime of the machine (up 62 days, 22:56)
- users sessions logged in (5 users)
- average load on the system (load average: 4.76, 8.03, 4.34); the three values refer to the last 1, 5, and 15 minutes
2. Second row: tasks
The second row provides you the following information.
Tasks: 412 total, 1 running, 411 sleeping, 0 stopped, 15 zombie
- Total processes (412 total)
- Running Processes (1 running)
- Sleeping Processes (411 sleeping)
- Stopped Processes (0 stopped)
- Processes waiting to be reaped by the parent process (15 zombie). Note: this is not good for the manager.
Zombie process: a process that has completed execution but still has an entry in the process table. The entry remains so that the parent process can read its child's exit status.
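A quick way to list and count zombie processes with standard ps options is:

# List zombie processes (state Z) with their parent PID, then count them
ps -eo stat,pid,ppid,comm | awk '$1 ~ /^Z/'
ps -eo stat | grep -c '^Z'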
3. CPU section.
Cpu(s): 6.8%us, 3.8%sy, 0.2%ni, 74.4%id, 28.2%wa, 0.1%hi, 0.5%si, 0.0%st
Percentage of CPU used by user processes (6.8%us)
Percentage of CPU used by system processes (3.8%sy)
Percentage of CPU used by processes with an adjusted nice priority (0.2%ni)
Percentage of the CPU not used, i.e. idle (74.4%id)
Percentage of CPU spent waiting for I/O operations (28.2%wa). Note: a high value here is not good for server performance.
Percentage of CPU serving hardware interrupts (0.1%hi — Hardware IRQ)
Percentage of CPU serving software interrupts (0.5%si — Software Interrupts)
The amount of CPU "stolen" from this virtual machine by the hypervisor for other tasks (such as running another virtual machine); this will be 0 on desktops and servers without virtual machines (0.0%st — Steal Time)
4. Memory
These rows provide information about RAM usage: total memory, used, free, buffers, and cached.
Mem: 20599548k total, 18622368k used, 1977180k free, 375212k buffers
Swap: 6669720k total, 3536428k used, 3133292k free, 10767256k cached
5. Process List
The last section lists the processes that are currently using the CPU:
  PID USER      PR  NI  VIRT   RES  SHR S %CPU %MEM     TIME+  COMMAND
26306 root      20   0  478m  257m 1900 S  3.9  1.3   0:08.21  nmis.pl
15522 root      20   0  626m  373m 2776 S  2.0  1.9  71:45.09  opeventsd.pl
27285 root      20   0 15280  1444  884 R  2.0  0.0   0:00.01  top
- PID – ID of the process(26306)
- USER – The user that is the owner of the process (root)
- PR – priority of the process (20)
- NI – The “NICE” value of the process (0)
- VIRT – virtual memory used by the process (478m)
- RES – physical memory used by the process (257m)
- SHR – shared memory of the process (1900)
- S – indicates the status of the process: S=sleep R=running Z=zombie (S)
- %CPU – This is the percentage of CPU used by this process (3.9)
- %MEM – This is the percentage of RAM used by the process (1.3)
- TIME+ –This is the total time of activity of this process (0:08.21)
- COMMAND – And this is the name of the process (nmis.pl)
It is important to monitor this command to see whether the server is working properly and executing all the internal processes it needs; a snapshot of its output can also be captured for a support case, as shown below.
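When attaching evidence to a support case, top can be run once in batch mode so the snapshot can be saved to a file (the output path below is only an example):

# One-shot, non-interactive snapshot of top for attaching to a support case
top -b -n 1 | head -n 30 > /tmp/top_snapshot_$(hostname).txt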
Server configuration options.
In order for the server to manage the configured devices properly, we need to validate that all the configuration items are well set. You can see the server's collection performance by going to System > Host Diagnostics > NMIS Runtime Graph.
If the total runtime/collect time is too high, we need to adjust the collect parameters depending on the NMIS version you are using.
NMIS8
NMIS 8 - Configuration Options for Server Performance Tuning
NMIS9
NMIS 9 - Configuration Options for Server Performance Tuning
Disk performance review.
This section is dedicated to identifying when the server is not writing all the data for the devices, which results in graphs with gaps. This causes level 2 problems (Severe impact - Unreliable production system) or, on some occasions, level 1 problems (Critical for the business, complete loss of service, loss of data) for the client, so it is essential to determine what is happening and provide a diagnosis.
Server status at Service level.
When the GUI becomes slow to respond, the main impact is usually the failure to execute collects and updates for the nodes: the CPUs become saturated, and since the monitoring system runs its collection every minute or every 5 minutes, the overloaded system is forced to kill processes, which prevents node data from being written to the RRD files.
Node View in NMIS:
You will be able to see device graphs with gaps; this is an example of how to recognize this behavior.
NMIS Polling Summary (menu: System > Host Diagnostics > NMIS Polling Summary)
The polling summary provided by NMIS is very useful, since it shows the collection time details for the nodes, active nodes, collected nodes, and so on. These values should be consistent with the number of monitored nodes, and the collection time should fall within the range of minutes configured in the NMIS cron.
Network Metrics and Health (menu: Network Status > Network Metrics and Health)
Here we can validate the server's availability, health status, and reachability; these values should be above 60 to be considered healthy. Gaps in these graphs indicate that no information was collected during that period.
Cron (nmis) and Config.nmis file configuration
Here we verify the data collection configuration for the devices by validating the collect, maxthreads, and mthread parameters.
In the NMIS cron file we see the following:
######################################################
# NMIS8 Config
######################################################
# Run Full Statistics Collection
*/5 * * * * root /usr/local/nmis8/bin/nmis.pl type=collect maxthreads=100 mthread=true
*/5 * * * * root /usr/local/nmis8/bin/nmis.pl type=services mthread=true
#
######################################################
# Optionally run a more frequent Services-only Collection
# */3 * * * * root /usr/local/nmis8/bin/nmis.pl type=services mthread=true
######################################################
# Run Summary Update every 2 minutes
*/2 * * * * root /usr/local/nmis8/bin/nmis.pl type=summary
Next, verify that mthread is enabled and that maxthreads has the same value in the Config.nmis file:
'nmis_group' => 'nmis',
'nmis_host' => 'nmissTest_OMK.omk.com',
'nmis_host_protocol' => 'http',
'nmis_maxthreads' => '100',
'nmis_mthread' => 'false',
'nmis_summary_poll_cycle' => 'false',
'nmis_user' => 'nmis',
In this example mthread is disabled, while maxthreads does match the value declared in the NMIS cron, so we enable mthread and then run an update and a collect against the node:
/usr/local/nmis8/bin/nmis.pl type=update node=<Name_Node> force=true
/usr/local/nmis8/bin/nmis.pl type=collect node=<Name_Node> force=true
Note: if the values declared in the cron and Config.nmis files do not solve the problem, the following is recommended:
# Example 1:
/usr/local/nmis8/bin/nmis.pl type=collect abort_after=300 mthread=true maxthreads=100 ignore_running=true
# Example 2:
/usr/local/nmis8/bin/nmis.pl type=collect abort_after=240 mthread=true maxthreads=100 ignore_running=true
The maxthreads value (testing 50, 80, and 100 is recommended) must be the same in both files (the NMIS cron and Config.nmis).
Run the update and collect commands at the end of each test and verify the behavior in the NMIS GUI by reviewing the NMIS Runtime Graph, Network_summary, and Polling_summary graphs.
Health Check
It is always a good idea to run a health check of the server to validate if all is good.
Server OS
To see which OS the server is running:
Command.
cat /etc/*release* && uname -osmr && uname -v
[root@opmantek ~]# cat /etc/*release* && uname -osmr && uname -v
CentOS release 6.10 (Final)
CentOS release 6.10 (Final)
CentOS release 6.10 (Final)
cpe:/o:centos:linux:6:GA
Linux 2.6.32-754.28.1.el6.x86_64 x86_64 GNU/Linux
#1 SMP Wed Mar 11 18:38:45 UTC 2020
[root@opmantek ~]#
File system
It is important to validate that the file systems have free space; if a file system is full, the tool will stop working:
echo -e "\n \e[31m Disk space information \e[0m" && df -h && echo -e "\n\n \e[31m RAM usage information \e[0m" && free -m && echo -e "\n\n \e[31m Disk details \e[0m" && fdisk -l
Result:
%wa
It is important to review the load average and iowait; if these values are high, that indicates problems for the server. They can be sampled with the commands below.
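Besides reading the %wa column in top, the load average and iowait can be sampled directly from the command line; vmstat ships with the standard procps package, so it should be available on most installs:

uptime       # 1, 5 and 15 minute load averages
vmstat 3 5   # 5 samples, 3 seconds apart; the "wa" column is CPU time spent waiting on I/O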
List of processes
The ps command provides us with information about the processes of a Linux or Unix system.
Sometimes tasks hang, enter an endless loop, or stop responding; or they may keep running but consume too much CPU or RAM, or otherwise misbehave. Such tasks sometimes need to be killed as a mercy to everyone involved. The first step, of course, is to identify the process in question.
Processes in a "D" or uninterruptible sleep state are usually waiting on I/O.
[root@nmisslvcc5 log]# ps -auxf | egrep " D| Z"
Warning: bad syntax, perhaps a bogus '-'? See /usr/share/doc/procps-3.2.8/FAQ
root      1563  0.1  0.0      0      0 ?     D    Mar17  10:47  \_ [jbd2/dm-2-8]
root      1565  0.0  0.0      0      0 ?     D    Mar17   0:43  \_ [jbd2/dm-3-8]
root      1615  0.3  0.0      0      0 ?     D    Mar17  39:26  \_ [flush-253:2]
root      1853  0.0  0.0  29764    736 ?     D<sl Mar17   0:04  auditd
root     17898  0.0  0.0 103320    872 pts/5 S+   12:20   0:00  |  \_ egrep  D| Z
apache   17856 91.0  0.2 205896  76212 ?     D    12:19   0:01  |  \_ /usr/bin/perl /usr/local/nmis8/
root     13417  0.6  0.8 565512 306812 ?     D    10:38   0:37  \_ opmantek.pl webserver -
root     17833  9.8  0.0      0      0 ?     Z    12:19   0:00  \_ [opeventsd.pl] <defunct>
root     17838 10.3  0.0      0      0 ?     Z    12:19   0:00  \_ [opeventsd.pl] <defunct>
root     17842 10.6  0.0      0      0 ?     Z    12:19   0:00  \_ [opeventsd.pl] <defunct>
Test Disk I/O Performance With dd Command
The dd command is very sensitive to the parameters it is given and can cause serious problems on your server if misused. OMK uses this command to measure server performance and latency; with it, we can determine the write and read speed of the disk.
[root@SRVLXLIM32 ~]# dd if=/dev/zero of=/data/omkTestFile bs=10M count=1 oflag=direct
1+0 records in
1+0 records out
10485760 bytes (10 MB) copied, 0.980106 s, 15.0 MB/s
[root@SRVLXLIM32 ~]# dd if=/data/omkTestFile of=/dev/null 2>&1
20480+0 records in
20480+0 records out
10485760 bytes (10 MB) copied, 6.23595 s, 1.7 MB/s
[root@SRVLXLIM32 ~]#
As a rule of thumb for the copy time reported by dd:
- 0.0X s is considered healthy.
- 0.X s is a warning (there may be an issue).
- X.0 s or more is critical (there is a problem).
In a reference test where one gigabyte was written, the throughput was 47 MB/s and writing the block took 0.223301 seconds on that server; a one-gigabyte variant of the command is shown after the parameter list below.
Where:
- if=/dev/zero (if=/path/to/input.file): the name of the input file you want dd to read from.
- of=/data/omkTestFile (of=/path/to/output.file): the name of the output file you want dd to write the input file to.
- bs=10M (bs=block-size): the block size you want dd to use. Note that Linux needs enough free RAM to hold the block; if your test system does not have enough RAM available, use a smaller value for bs (such as 128M or 64M), otherwise you can test with blocks of 1, 2 or even 3 gigabytes.
- count=1 (count=number-of-blocks): the number of blocks you want dd to read.
- oflag=dsync: use synchronized I/O for data. Do not skip this option; it gets rid of caching and gives you accurate results.
- conv=fdatasync: again, this tells dd to perform a complete "sync" once, right before it exits. This option is equivalent to oflag=dsync.
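A one-gigabyte variant of the same test, closer to the figures quoted above, could look like the sketch below (the output path is only an example; point it at the file system that holds your data and remove the test file afterwards):

# Write a single 1 GB block with synchronized I/O, then clean up the test file
dd if=/dev/zero of=/data/omkTestFile bs=1G count=1 oflag=dsync
rm -f /data/omkTestFile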
Polling summary
The Opmantek monitoring system includes the polling_summary tool. It helps determine whether the server is taking too long to collect information from the nodes and cannot complete its operations; here we can see how many nodes have a late collection, along with a summary of collected and uncollected nodes.
NMIS8
/usr/local/nmis8/admin/polling_summary.pl
NMIS9
/usr/local/nmis9/admin/polling_summary9.pl
[root@opmantek ~]# /usr/local/nmis8/admin/polling_summary.pl
node attempt status ping snmp policy delta snmp avgdel poll update pollmessage
ACH-AIJ-DI-AL-SA6-0202010001-01 14:10:33 ontime up up default 328 300 422.31 22.40 17.89
ACH-AIJ-RC-ET-08K-01 --:--:-- bad_snmp up up default --- 300 403.90 10.38 14.58 snmp never successful
ACH-ANA-RC-ET-08K-01 --:--:-- bad_snmp up down default --- 300 422.57 11.39 109.09 snmp never successful
ACH-ATU-RC-ET-08K-01 --:--:-- bad_snmp up up default --- 300 391.99 0.97 62.88 snmp never successful
ACH-CAB-DI-AL-SA6-0215010001-01 14:11:21 late up up default 484 300 5543888.62 31.06 74.21 1x late poll
ACH-CAB-DR-AL-P32-01 --:--:-- bad_snmp up up default --- 300 416.30 103.46 91.28 snmp never successful
ACH-CAB-GE-GM-G30-01 14:00:54 late up down default 348 300 593.93 6.06 12.53 1x late poll
ACH-CAB-RC-ET-08K-01 --:--:-- bad_snmp up up default --- 300 411.74 10.69 7.31 snmp never successful
ACH-CAB-TT-GM-30T-01 --:--:-- bad_snmp up down default --- 300 0.00 0.00 180.42 snmp never successful
ACH-CAR-RC-ET-08K-01 14:10:20 ontime up up default 314 300 9054283.23 11.15 6.47
ACH-CAT-CN-AL-SA6-0212070008-01 14:07:39 late up up default 600 300 27253590.83 12.39 22.23 1x late poll
ACH-CAZ-TT-GM-30T-01 --:--:-- bad_snmp up down default --- 300 414.85 3.11 165.32 snmp never successful
ACH-CHM-DR-AL-P32-01 14:05:47 late up up default 456 300 2686074.17 118.55 148.58 1x late poll
ACH-CHM-GE-GM-G20-01 --:--:-- bad_snmp up down default --- 300 413.17 4.06 238.92 snmp never successful
ACH-CHM-RC-ET-09K-01 14:12:30 late up up default 633 300 1983484.93 10.49 13.07 1x late poll
ACH-CHM-TT-GM-20T-01 --:--:-- bad_snmp up down default --- 300 412.17 3.61 287.80 snmp never successful
ACH-COX-RC-ET-09K-01 13:51:14 late up up default 473 300 22141.04 9.54 4.10 1x late poll
ACH-CSM-RC-ET-08K-01 13:51:09 late up up default 444 300 539117.26 11.25 5.31 1x late poll
ACH-CSM-TT-GM-20T-01 14:08:34 late up down default 709 300 1739800.92 4.01 229.73 1x late poll
ACH-HCC-CN-AL-SA6-0212030012-01 13:50:33 ontime up up default 330 300 8131293.53 23.65 23.84
ACH-HCC-RC-ET-08K-01 14:07:56 late up up default 635 300 1802552.50 0.65 1.61 1x late poll
ACH-HEY-DI-AL-SA6-0211010001-01 13:50:52 late up up default 425 300 571.75 25.46 17.30 1x late poll
ACH-HEY-DR-AL-P32-01 --:--:-- bad_snmp up up default --- 300 119099.96 106.25 120.92 snmp never successful
ACH-HEY-GE-GM-G20-01 --:--:-- bad_snmp up down default --- 300 0.00 0.00 112.37 snmp never successful
ACH-HEY-RC-ET-09K-01 --:--:-- bad_snmp up up default --- 300 404.62 11.01 7.49 snmp never successful
--Snip--
--Snip--
UCA-PUC-DR-AL-P32-01 14:12:04 late up up default 524 300 124010.73 135.20 124.79 1x late poll
UCA-PUC-GE-GM-G30-01 14:11:20 late up down default 475 300 3868910.82 3.68 236.48 1x late poll
UCA-PUC-GE-GM-G30-02 14:12:32 late up down default 644 300 3871900.66 4.05 209.92 1x late poll
UCA-PUC-RC-ET-09K-01 --:--:-- bad_snmp up up default --- 300 418.17 10.83 5.76 snmp never successful
UCA-PUC-TT-GM-30A-01 --:--:-- bad_snmp up down default --- 300 397.68 4.21 215.65 snmp never successful
UCA-PUC-TT-GM-30A-02 14:13:03 late up down default 720 300 329362.60 3.39 208.92 1x late poll
CC_VITATRAC_GT_Z2_MAZATE 14:13:04 demoted down down default --- 300 0.00 2.22 0.80 s
CC_VITATRAC_GT_Z3_COBAN 14:13:12 late up up default 618 300 4874416.57 1.91 4.46
CC_VITATRAC_GT_Z3_ESCUINTLA 14:13:12 late up up default 604 300 4902673.92 2.17 4.8
CC_VITATRAC_GT_Z7_BODEGA_MATEO 14:15:37 late up up default 642 300 3844049.73 3.25
CC_VITATRAC_GT_Z8_MIXCO 14:15:42 late up up default 634 300 4959081.87 2.47 6.70
CC_VITATRAC_GT_Z9_XELA 14:16:03 late up up default 634 300 3943302.62 8.95 58.61
CC_VITATRAC_GT_ZONA_PRADERA 14:17:47 demoted up down default 711 300 605.21 10.91 10.28
CC_VIVATEX_GT_INTERNET_VILLA_NUEVA 14:18:49 late up up default 979 300 4563376.03 1.2
CC_VOLCAN_STA_MARIA_GT_INTERNET_CRUCE_BARCENAS 14:19:44 late up up default 981 300 44late poll
nmisslvcc5 14:18:55 late up up default 344 300 376209.90 2.33 1.23
totalNodes=2615 totalPoll=2267 ontime=73 pingOnly=0 1x_late=2190 3x_late=3 12x_late=1 144x_late=0 time=10:10:07 pingDown=354 snmpDown=359 badSnmp=295 noSnmp=0 demoted=348
[root@opmantek ~]#
If many nodes fall into the x_late fields, we need to validate the performance of the server.
Viewing disk usage information
This command helps us monitor the load on input and output devices by observing the time the devices are active relative to their average transfer rates. It can also be used to compare activity between disks.
An iowait / utilization of 100% indicates that there is a problem, and in most cases a big problem that can even lead to data loss. Essentially, there is a bottleneck somewhere in the system; perhaps one of the drives is preparing to die/fail.
OMK recommends executing the command in the following way, since it gives a better picture of what is happening with the disks.
Example: the command takes 5 samples, one every 3 seconds. We want at least 3 of the samples to show data within the stable range for the server; otherwise this indicates that there is a problem with the disks.
[root@opmantek ~]# iostat -xtc 3 5
Linux 2.6.32-754.28.1.el6.x86_64 (opmantek)   04/05/2021   _x86_64_   (8 CPU)

04/05/2021 09:23:40 PM
avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
          12.47   0.00     0.73    10.53    0.00  86.72
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 0.00 4.50 35.50 148.00 452.00 15.00 110.98 4468.74 274.22 5000 0.60 100.00
sdb 0.00 42.50 0.00 6.50 0.00 392.00 60.31 0.13 20.00 0.00 20 0.34 92.12
dm-0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0 0.65 56.00
dm-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0 0.86 10.50
dm-2 0.00 0.00 4.50 52.00 140.00 416.00 9.84 149.56 5229.59 274.22 5658 0.21 25.03
dm-3 0.00 0.00 0.00 0.50 0.00 4.00 8.00 66.00 0.00 0.00 0 0.45 14.40

04/05/2021 09:23:43 PM
avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
          18.17   0.00     5.29     6.31    0.00  76.82
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 50.00 9.50 19.00 596.00 260.00 30.04 130.41 2569.47 283.11 3712 0.60 92.36
sdb 0.00 36.50 0.50 59.00 8.00 764.00 12.97 25.34 425.82 18.00 429 0.25 78.82
dm-0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0 0.23 92.45
dm-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0 0.86 88.93
dm-2 0.00 0.00 8.00 163.50 440.00 1308.00 10.19 240.76 966.94 337.38 997 0.37 68.28
dm-3 0.00 0.00 0.00 33.00 0.00 264.00 8.00 48.31 0.00 0.00 0 0.18 12.75

04/05/2021 09:23:46 PM
avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
           2.50   0.00     1.21    11.37    0.00  75.56
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 0.00 9.50 18.00 268.00 220.00 17.75 112.91 1763.73 143.42 2618 0.85 100.00
sdb 0.00 10.00 2.00 1.50 112.00 92.00 58.29 0.01 3.86 6.25 0 0.94 97.54
dm-0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0 0.45 75.39
dm-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0 0.78 24.96
dm-2 0.00 0.00 13.50 11.50 552.00 92.00 25.76 185.21 3029.96 101.85 6467 0.25 67.18
dm-3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0 0.86 43.91

04/05/2021 09:23:49 PM
avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
          12.10   0.00     7.21     9.17    0.00  87.92
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 55.50 7.00 44.00 92.00 488.00 11.37 110.52 929.20 139.86 1054 0.75 89.54
sdb 0.00 65.00 0.50 34.00 4.00 792.00 23.07 0.83 24.09 1.00 24 0.55 93.61
dm-0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0 0.14 99.99
dm-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0 0.36 78.98
dm-2 0.00 0.00 7.00 242.50 84.00 1940.00 8.11 179.44 240.22 137.36 243 0.75 25.30
dm-3 0.00 0.00 0.00 5.00 0.00 40.00 8.00 1.30 305.90 0.00 305 0.23 45.12

04/05/2021 09:23:52 PM
avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
           9.50   0.00    11.21    19.30    0.00  92.92
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.16 114.34 7.02 191.18 132.04 2444.27 13.00 3.60 18.18 81.41 15 0.14 99.99
sdb 0.03 205.87 2.36 70.03 31.22 2207.55 30.92 5.81 80.25 53.76 81 0.94 97.54
dm-0 0.00 0.00 0.10 1.01 11.77 8.07 17.90 0.84 755.10 72.31 822 0.60 98.36
dm-1 0.00 0.00 0.09 0.13 0.74 1.03 8.00 0.22 985.66 153.25 1580 0.47 94.48
dm-2 0.00 0.00 9.25 575.59 129.18 4604.83 8.09 6.09 9.74 74.24 8 0.61 82.37
dm-3 0.00 0.00 0.12 4.74 21.57 37.89 12.24 2.52 518.00 131.58 527 0.23 93.15
[root@opmantek ~]#
This problem was solved by moving the VM to an environment with solid-state disks. The client confirmed that the VM was using mechanical disks (HDD), so cloning it to a lab VM did not help since the same problem appeared there. After replacing the HDD disks with solid-state disks, the VM and the monitoring services stabilized, and RAM, CPU, and disk utilization returned to normal for the number of nodes the monitoring system is handling.