Managing a large, complex environment with ever-changing operational states is challenging. To assist, NMIS, a Network Management System that performs performance management and fault management simultaneously, monitors the health and operational status of devices and creates several individual metrics, as well as an overall metric, for each device. This article explains what those metrics are and what they mean.
Summary
Consider this in the context that a network device offers a service, and the service it offers is connectivity. While a router or switch is up and all of its interfaces are available, it is truly up, and when it has no CPU load it is healthy. As the interfaces become utilised and the CPU gets busy, it has less capacity remaining. The following statistics are considered part of the health of the device:
...
    09:08:36 runReach, Starting node meatball, type=router
    09:08:36 runReach, Outage for meatball is
    09:08:36 runReach, Getting Interface Utilisation Health
    09:08:36 runReach, Intf Summary in=0.00 out=0.00 intsumm=200 count=1
    09:08:36 runReach, Intf Summary in=0.06 out=0.55 intsumm=399.39 count=2
    09:08:36 runReach, Intf Summary in=8.47 out=5.81 intsumm=585.11 count=3
    09:08:36 runReach, Intf Summary in=0.00 out=0.00 intsumm=785.11 count=4
    09:08:36 runReach, Intf Summary in=0.06 out=0.56 intsumm=984.49 count=5
    09:08:36 runReach, Intf Summary in=0.00 out=0.00 intsumm=1184.49 count=6
    09:08:36 runReach, Intf Summary in=8.47 out=6.66 intsumm=1369.36 count=7
    09:08:36 runReach, Intf Summary in=0.05 out=0.56 intsumm=1568.75 count=8
    09:08:36 runReach, Calculation of health=96.11
    09:08:36 runReach, Reachability and Metric Stats Summary
    09:08:36 runReach, collect=true (Node table)
    09:08:36 runReach, ping=100 (normalised)
    09:08:36 runReach, cpuWeight=90 (normalised)
    09:08:36 runReach, memWeight=100 (normalised)
    09:08:36 runReach, intWeight=98.05 (100 less the actual total interface utilisation)
    09:08:36 runReach, responseWeight=100 (normalised)
    09:08:36 runReach, total number of interfaces=24
    09:08:36 runReach, total number of interfaces up=7
    09:08:36 runReach, total number of interfaces collected=8
    09:08:36 runReach, total number of interfaces coll. up=6
    09:08:36 runReach, availability=75
    09:08:36 runReach, cpu=13
    09:08:36 runReach, disk=0
    09:08:36 runReach, health=96.11
    09:08:36 runReach, intfColUp=6
    09:08:36 runReach, intfCollect=8
    09:08:36 runReach, intfTotal=24
    09:08:36 runReach, intfUp=7
    09:08:36 runReach, loss=0
    09:08:36 runReach, mem=61.5342941922784
    09:08:36 runReach, operCount=8
    09:08:36 runReach, operStatus=600
    09:08:36 runReach, reachability=100
    09:08:36 runReach, responsetime=1.32
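The interface figures in the debug output above can be reproduced by hand. The following Python sketch is not NMIS code (NMIS itself is written in Perl). It recomputes intsumm, intWeight and availability from the logged values, on the assumption (inferred from the numbers in the log, not taken from the NMIS source) that each interface adds (100 - in) + (100 - out) to intsumm, that intWeight is intsumm divided by twice the interface count, and that availability is the percentage of collected interfaces that are up. The function names are illustrative only.

```python
# Values copied from the "Intf Summary" lines of the debug log:
# (input utilisation %, output utilisation %) per collected interface.
interfaces = [
    (0.00, 0.00),
    (0.06, 0.55),
    (8.47, 5.81),
    (0.00, 0.00),
    (0.06, 0.56),
    (0.00, 0.00),
    (8.47, 6.66),
    (0.05, 0.56),
]


def interface_weight(utilisations):
    """Return (intsumm, intWeight) using the arithmetic inferred from the log."""
    intsumm = 0.0
    for util_in, util_out in utilisations:
        # Assumption: each direction contributes its unused capacity (100 - utilisation).
        intsumm += (100 - util_in) + (100 - util_out)
    # Two directions per interface, so normalise by 2 * interface count.
    int_weight = intsumm / (2 * len(utilisations))
    return intsumm, int_weight


def availability(intf_col_up, intf_collect):
    """Percentage of collected interfaces that are up (assumed relation)."""
    return 100.0 * intf_col_up / intf_collect


intsumm, int_weight = interface_weight(interfaces)
print(f"intsumm={intsumm:.2f} intWeight={int_weight:.2f}")
print(f"availability={availability(6, 8):.0f}")
```

With the logged inputs this yields intsumm=1568.75, intWeight=98.05 and availability=75, matching the values in the debug output. That agreement suggests, but does not prove, that the assumed formulas mirror what runReach does.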
Metric Example
The metric calculations are much more straightforward. They are done in a subroutine called getGroupSummary in NMIS.pm: for each node, the availability, reachability and health are extracted from the node's "reach" RRD file and then weighted according to the configuration weights.
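As a rough illustration of the weighting step, the Python sketch below combines the three per-node values into a single metric. It is not the getGroupSummary code. The weight values are hypothetical placeholders (the real ones come from the NMIS configuration), and the plain average across nodes is also an assumption made for this example.

```python
# Hypothetical weights, chosen for illustration only. In NMIS the actual
# values are read from the configuration.
WEIGHTS = {"availability": 0.2, "reachability": 0.4, "health": 0.4}


def node_metric(stats, weights=WEIGHTS):
    """Weighted sum of a node's availability, reachability and health."""
    return sum(weights[name] * stats[name] for name in weights)


def group_metric(nodes, weights=WEIGHTS):
    """Average the per-node metric over a group (assumed aggregation)."""
    if not nodes:
        return None
    return sum(node_metric(stats, weights) for stats in nodes.values()) / len(nodes)


# Per-node values as they might be read from each node's "reach" RRD file;
# the meatball figures are taken from the debug log earlier in this article.
nodes = {
    "meatball": {"availability": 75, "reachability": 100, "health": 96.11},
}

print(f"metric={group_metric(nodes):.2f}")
```

Keeping the weights in one dictionary makes it easy to see how changing a configuration weight shifts the resulting metric towards availability, reachability or health.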
...