TSW

This page provides information on installing and using the Troubleshooting Wizard, which helps customers run a full diagnostic of their server(s) to determine the root cause of a problem they are experiencing.

We will cover everything from downloading and deploying the Troubleshooting file to a complete analysis of the server using each of the program's interactive menus.

This document is based on the tests mentioned on the NMIS Device Troubleshooting Process page.

Description of the problem

Before running the Troubleshooting Wizard, we need a detailed description from the customer that gives a clear picture of the situation they are experiencing.

To obtain it, we must answer some important questions:

a) Incident description: what is happening, and since when?

b) On which server or servers is the incident occurring?

c) Description of the server or servers on which the incident is occurring: CPUs, RAM, hard disk, total nodes.

d) In which node or nodes is the incident occurring?

e) Additional details, for example: at least the current NMIS cron settings, the current settings in the /etc/mongod.conf file, the database parameter settings in /usr/local/omk/conf/opCommon.nmis, whether any file was recently modified, and whether any configuration change made on the server or on the computers caused the incident.

Troubleshooting Wizard: Install and run

The installation file (.sh) can be obtained from the following GitHub link: https://github.com/tom-tics/TS_Wizard_NMIS8_OPMANTEK

It must be downloaded and then uploaded with an FTP client (such as FileZilla) to the server you want to analyze, in the folder chosen by the customer.

Once the file is on the server, run it with the command: sh 01_TS_Wizard_OMK.sh
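As a minimal sketch of the run step, the snippet below checks that the uploaded file is present before executing it. The function name run_wizard is hypothetical and is not part of the wizard; only the file name 01_TS_Wizard_OMK.sh comes from this page.

```shell
#!/bin/sh
# Hypothetical wrapper (not part of the wizard): run the script only if
# it is actually present, so a failed upload is reported clearly.
run_wizard() {
    script="${1:-01_TS_Wizard_OMK.sh}"   # default: file name used on this page
    if [ ! -f "$script" ]; then
        echo "not found: $script" >&2
        return 1
    fi
    sh "$script"
}

# Example, from the folder where the file was uploaded:
#   run_wizard 01_TS_Wizard_OMK.sh
```

Running the wrapper from a different folder only requires passing the full path to the script as the first argument.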

Once the file has been executed, the initial screen appears, showing operating system details such as the Linux version and a brief summary of the system's memory and CPU.

The Main Menu is also shown, with the following options:

  1. Execute Healthcheck: we can perform a complete review of the server.
  2. NMIS Configuration Consistency: we will be able to review the consistency of the most important NMIS configuration files.
  3. Nodes Troubleshooter: we will be able to review the behavior of the nodes added to NMIS.
  4. Smart Diagnostic: creates a full system diagnostic in a .tar.gz file, which can be attached in case a ticket needs to be opened with Opmantek Support.
  5. Create System Backup File: creates a .tar.gz file that will contain a backup of the /etc/* and /usr/local/* folders.
  6. Execute Support Automation Tool: generates an NMIS and an OMK support file, which can be attached in case a ticket needs to be opened with Opmantek Support.

Troubleshooting Wizard: Features

1. Execute Healthcheck

You can choose between different options, which are shown below:

1. TOP

This command shows all the processes currently running on the server, along with their CPU and RAM utilization percentages.

Always pay attention to the load average and %CPU: if these values are high, one or more of the running processes is very likely causing a problem.

When the command finishes, it displays a series of tips, such as:

  • Check disk partitions.
  • Clean up log files that take up too much space.
  • Clear the cache.
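The load average check can be sketched as follows. This is an illustration only: the helper name load_status and the rule of thumb of comparing the load average with the number of CPUs are assumptions, not the wizard's own logic.

```shell
#!/bin/sh
# Hypothetical helper: compare a load average against the CPU count.
# Prints "high" when the load exceeds the number of CPUs, "ok" otherwise.
# The threshold (load > CPUs) is a common rule of thumb, assumed here.
load_status() {
    load="$1"   # e.g. the 1-minute load average, such as 3.42
    cpus="$2"   # number of CPU cores, such as 4
    awk -v l="$load" -v c="$cpus" \
        'BEGIN { if (l + 0 > c + 0) { print "high" } else { print "ok" } }'
}

# Example on a live Linux system (assumes /proc/loadavg and nproc exist):
#   load_status "$(cut -d ' ' -f 1 /proc/loadavg)" "$(nproc)"
```

A result of "high" is only a hint to look at the individual processes in the top output, not a diagnosis in itself.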

2. System date and time

It is very important that the server's date and time are configured correctly for each customer's time zone, because many processes run at specific intervals and because system logs and file modification records carry timestamps that are needed to trace an error.

This section is therefore included so that the operator can confirm that the system date and time are correct. At the end, if NTP is not enabled on the server, a tip is displayed advising the operator to contact the system administrator to verify it.
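One way to check for an active NTP service is sketched below. It assumes that time synchronization is provided by an ntpd or chronyd process and that pgrep is installed; the helper name ntp_running is hypothetical, and the wizard may perform this check differently.

```shell
#!/bin/sh
# Hypothetical helper: report which time-sync daemon is running, if any.
# Assumes ntpd or chronyd; other mechanisms (for example
# systemd-timesyncd) are not covered by this sketch.
ntp_running() {
    for daemon in ntpd chronyd; do
        if pgrep -x "$daemon" >/dev/null 2>&1; then
            echo "$daemon"
            return 0
        fi
    done
    return 1
}

# Example:
#   date
#   ntp_running || echo "No NTP daemon found: contact the system administrator"
```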

3. Disk R/W

This analysis helps reveal whether there is a physical failure in the server's disks.

The program executes the commands:

  • dd if=/dev/zero of=/data/omkTestFile bs=10M count=1 oflag=direct
  • dd if=/data/omkTestFile of=/dev/null 2>&1

And then it shows the output, which has to be compared with the following values:

  • 0.0X s: correct parameters.
  • 0.X s: there is a warning (and could cause a problem).
  • X.0 s: it is critical (and there is a problem).
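The comparison against these values can be sketched as a small classifier. The helper names dd_seconds and classify_dd_time are hypothetical, and the numeric cut-offs (below 0.1 s, below 1 s, 1 s or more) are one reading of the 0.0X / 0.X / X.0 patterns above rather than thresholds confirmed by the wizard.

```shell
#!/bin/sh
# Hypothetical helper: extract the elapsed seconds from the summary line
# that GNU dd prints, e.g. "... copied, 0.0123 s, 850 MB/s".
# Assumes an English locale for the dd output.
dd_seconds() {
    sed -n 's/.* copied, \([0-9.e-]*\) s,.*/\1/p'
}

# Hypothetical helper: map an elapsed time in seconds to a status.
# Cut-offs are an assumed reading of the 0.0X / 0.X / X.0 patterns.
classify_dd_time() {
    awk -v t="$1" 'BEGIN {
        if (t + 0 < 0.1)    { print "ok" }
        else if (t + 0 < 1) { print "warning" }
        else                { print "critical" }
    }'
}

# Example with the write test from this page (dd reports on stderr):
#   t=$(dd if=/dev/zero of=/data/omkTestFile bs=10M count=1 oflag=direct 2>&1 | dd_seconds)
#   classify_dd_time "$t"
```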

Similarly, iostat -x 5 4 is run to monitor the I/O load of the system. If %util is high, it is very likely that there is a problem that could even lead to data loss; this is indicated at the end of the command output.

4. Filesystem

It shows a detailed analysis of the space used in each of the system's filesystems, to rule out a lack of disk space on the server as the cause of the incident. It also displays a tip: if usage exceeds 85% on any filesystem, contact the administrator so that it can be cleaned up.

It also runs a command to report the system's RAM and swap usage, and displays a tip to contact the administrator and investigate if the usage percentage is high.
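The filesystem usage tip (above 85%) can be reproduced by hand with a sketch like the one below. The helper name full_filesystems is hypothetical; it relies on the POSIX df -P output format, where the fifth column is the capacity and the sixth is the mount point.

```shell
#!/bin/sh
# Hypothetical helper: read `df -P` output on stdin and print the mount
# points whose usage exceeds the given percentage (default 85).
# Mount points containing spaces are not handled by this sketch.
full_filesystems() {
    limit="${1:-85}"
    awk -v limit="$limit" 'NR > 1 {
        use = $5
        sub(/%/, "", use)                  # "90%" -> "90"
        if (use + 0 > limit + 0) { print $6 " " use "%" }
    }'
}

# Example on a live system:
#   df -P | full_filesystems 85
```

An empty result means that no filesystem is above the limit.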

5. Service status

Each of the system daemons is checked to verify that all essential processes are running correctly.

The following commands are executed:

  • service omkd status
  • service mongod status
  • service nmisd status (if applicable)
  • service nmis9d status (if applicable)
  • service httpd status
  • service opchartsd status
  • service opeventsd status
  • service opconfigd status
  • service opflowd status
  • service crond status
  • service snmpd status
  • service iptables status

It also checks that SELinux is disabled.
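The service checks can be sketched as a loop. The helper name down_services is hypothetical, and the sketch assumes that service <name> status returns a non-zero exit code when the service is not running, which is the usual init-script convention but may vary between systems.

```shell
#!/bin/sh
# Hypothetical helper: print the services from the given list that
# `service <name> status` does not report as running.
# Assumes a non-zero exit code means the service is down or unknown.
down_services() {
    for svc in "$@"; do
        if ! service "$svc" status >/dev/null 2>&1; then
            echo "$svc"
        fi
    done
}

# Example with some of the services listed on this page:
#   down_services omkd mongod httpd crond snmpd
```

A service that is not installed on the server is also reported as down by this sketch, so the result should be read against the products actually in use.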

If a service that is important for the system's operation is found to be down, it must be restarted as indicated by the script.

...