Managing a large, complex environment with ever-changing operational states is challenging. To assist, NMIS, a Network Management System that performs performance management and fault management simultaneously, monitors the health and operational status of devices and creates several individual metrics as well as an overall metric for each device. This article explains what those metrics are and what they mean.
...
All of these metrics are weighted and a health metric is created. This metric, when compared over time, should always indicate the relative health of the device. Interfaces which aren't being used should be shut down so that the health metric remains realistic. The exact calculations can be seen in the runReachability runReach subroutine in nmis.pl.
Metric Details
Many people wanted network availability, and many tools generated availability based on ping statistics and claimed success. This, however, was a poor solution. For example, the switch connecting the management server would go down and the management server would report that the whole network was down, which of course it wasn't. Or worse, a device would be responding to a PING but many of its interfaces were down, so while it was reachable, it wasn't really available.
...
The health metric uses the configuration items starting with "weight_" to weight the individual values. The overall metric combines health, availability and reachability into a single metric for each device, for each group, and ultimately for the entire network.
If more weight should be given to interface utilisation and less to interface availability, these weights can be tuned. For example, weight_availability could become 0.05 and weight_int could become 0.25. The resulting weights (weight_*) should add up to 1, that is, 100%.
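As a minimal sketch of this tuning (written in Python for illustration, not the Perl that NMIS itself uses), the snippet below starts from the weights used in the health example later in this article, applies the two changes, and checks that the weights, treated as fractions, still total 1 (100%). The dictionary layout and the `check_weights` helper are assumptions made for the example, not NMIS configuration syntax.

```python
import math

# Weights as used in this article's health example, expressed as fractions.
weights = {
    "weight_cpu": 0.1,
    "weight_availability": 0.1,
    "weight_int": 0.2,
    "weight_mem": 0.1,
    "weight_response": 0.2,
    "weight_reachability": 0.3,
}


def check_weights(w):
    """Hypothetical helper: True when the weights sum to 1 (i.e. 100%)."""
    return math.isclose(sum(w.values()), 1.0, abs_tol=1e-9)


# Shift emphasis from interface availability to interface utilisation.
tuned = dict(weights, weight_availability=0.05, weight_int=0.25)

print(check_weights(weights), check_weights(tuned))
```

Raising one weight without lowering another would make the check fail, which is a quick way to catch a tuning mistake before it skews the health metric.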
Metric Calculations Examples
Health Example
At the completion of a poll cycle for a node, the health metrics which have been cached are ready for calculating the health metric of the node. Let's say the results for a router were:
- CPU = 20%
- Availability = 90%
- All Interface Utilisation = 10%
- Memory Free = 20%
- Response Time = 50ms
- Reachability = 100%
The first step is to weight the measured values so that they can be compared correctly. For example, a CPU load of 20% becomes a weighted value of 90%; a response time of 100ms becomes 100%, whereas a response time of 500ms becomes 60%. The subroutine weightResponseTime performs the response time calculation.
So the weighted values would become:
- Weighted CPU = 90%
- Weighted Availability = 90% (does not require weighting, already in % where 100% is good)
- Weighted Interface Utilisation = 90% (100 less the actual total interface utilisation)
- Weighted Memory = 60%
- Weighted Response Time = 100%
- Weighted Reachability = 100% (does not require weighting, already in % where 100% is good)
NB. For servers, the interface weight is divided by two, and used equally for interface utilisation and disk free.
These values are now dropped into the final calculation:
weight_cpu * 90 + weight_availability * 90 + weight_int * 90 + weight_mem * 60 + weight_response * 100 + weight_reachability * 100
which becomes "0.1 * 90 + 0.1 * 90 + 0.2 * 90 + 0.1 * 60 + 0.2 * 100 + 0.3 * 100 = 92", resulting in 92% for the health metric.
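The arithmetic above can be reproduced with a short sketch. This is Python for illustration only; the function name `health_metric` and the shortened dictionary keys are assumptions, and the real calculation lives in nmis.pl.

```python
# Weights from the formula above (weight_cpu, weight_availability, ...).
health_weights = {
    "cpu": 0.1,
    "availability": 0.1,
    "int": 0.2,
    "mem": 0.1,
    "response": 0.2,
    "reachability": 0.3,
}

# Weighted values from the router example above, in percent.
weighted_values = {
    "cpu": 90,
    "availability": 90,
    "int": 90,
    "mem": 60,
    "response": 100,
    "reachability": 100,
}


def health_metric(weights, values):
    """Sum of weight * weighted value over every health component."""
    return sum(weights[name] * values[name] for name in weights)


health = health_metric(health_weights, weighted_values)
print(round(health, 2))  # 92.0
```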
The calculations can be seen in the collect debug output, produced by running nmis.pl type=collect node=<NODENAME> debug=true, for example:
09:08:36 runReach, Starting node meatball, type=router
09:08:36 runReach, Outage for meatball is
09:08:36 runReach, Getting Interface Utilisation Health
09:08:36 runReach, Intf Summary in=0.00 out=0.00 intsumm=200 count=1
09:08:36 runReach, Intf Summary in=0.06 out=0.55 intsumm=399.39 count=2
09:08:36 runReach, Intf Summary in=8.47 out=5.81 intsumm=585.11 count=3
09:08:36 runReach, Intf Summary in=0.00 out=0.00 intsumm=785.11 count=4
09:08:36 runReach, Intf Summary in=0.06 out=0.56 intsumm=984.49 count=5
09:08:36 runReach, Intf Summary in=0.00 out=0.00 intsumm=1184.49 count=6
09:08:36 runReach, Intf Summary in=8.47 out=6.66 intsumm=1369.36 count=7
09:08:36 runReach, Intf Summary in=0.05 out=0.56 intsumm=1568.75 count=8
09:08:36 runReach, Calculation of health=96.11
09:08:36 runReach, Reachability and Metric Stats Summary
09:08:36 runReach, collect=true (Node table)
09:08:36 runReach, ping=100 (normalised)
09:08:36 runReach, cpuWeight=90 (normalised)
09:08:36 runReach, memWeight=100 (normalised)
09:08:36 runReach, intWeight=98.05 (100 less the actual total interface utilisation)
09:08:36 runReach, responseWeight=100 (normalised)
09:08:36 runReach, total number of interfaces=24
09:08:36 runReach, total number of interfaces up=7
09:08:36 runReach, total number of interfaces collected=8
09:08:36 runReach, total number of interfaces coll. up=6
09:08:36 runReach, availability=75
09:08:36 runReach, cpu=13
09:08:36 runReach, disk=0
09:08:36 runReach, health=96.11
09:08:36 runReach, intfColUp=6
09:08:36 runReach, intfCollect=8
09:08:36 runReach, intfTotal=24
09:08:36 runReach, intfUp=7
09:08:36 runReach, loss=0
09:08:36 runReach, mem=61.5342941922784
09:08:36 runReach, operCount=8
09:08:36 runReach, operStatus=600
09:08:36 runReach, reachability=100
09:08:36 runReach, responsetime=1.32
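The figures in this debug output can be cross-checked by hand. The Python sketch below is inferred from the logged numbers only, not taken from the nmis.pl source: each "Intf Summary" line appears to add 200 less the in and out utilisation to intsumm, intWeight matches intsumm divided by twice the interface count, and availability matches the collected interfaces that are up (intfColUp) over the collected interfaces (intfCollect). The weights are assumed to be the same ones used in the health example above.

```python
# (in %, out %) per collected interface, copied from the "Intf Summary" lines.
intf_util = [
    (0.00, 0.00), (0.06, 0.55), (8.47, 5.81), (0.00, 0.00),
    (0.06, 0.56), (0.00, 0.00), (8.47, 6.66), (0.05, 0.56),
]

# Each interface contributes 200 minus its in and out utilisation.
intsumm = sum(200 - (util_in + util_out) for util_in, util_out in intf_util)
int_weight = intsumm / (2 * len(intf_util))

# intfColUp=6 and intfCollect=8 in the log.
availability = 100 * 6 / 8

# Assumed weights, taken from the health example earlier in the article.
weights = {"cpu": 0.1, "availability": 0.1, "int": 0.2,
           "mem": 0.1, "response": 0.2, "reachability": 0.3}

# cpuWeight, memWeight, responseWeight and reachability as logged.
values = {"cpu": 90, "availability": availability, "int": int_weight,
          "mem": 100, "response": 100, "reachability": 100}

health = sum(weights[name] * values[name] for name in weights)

print(round(intsumm, 2), round(int_weight, 2), availability, round(health, 2))
```

Under these assumptions the results agree with the logged intsumm=1568.75, intWeight=98.05, availability=75 and health=96.11.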
Metric Example
The metric calculation is much more straightforward. It is done in a subroutine called getGroupSummary in NMIS.pm: for each node, the availability, reachability and health are extracted from the node's "reach" RRD file and then weighted according to the configuration weights.
So based on our example before, the node would have the following values:
- Health = 92%
- Availability = 90%
- Reachability = 100%
The formula would become "metric_health * 92 + metric_availability * 90 + metric_reachability * 100", resulting in "0.4 * 92 + 0.2 * 90 + 0.4 * 100 = 94.8". So the metric for this node is 94.8, which is averaged with the metrics of all the other nodes in the group, or in the whole network, to give the metric for each group and for the entire network.
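A minimal sketch of the node metric and the group average, again in Python for illustration. The weights come from the formula above; the `node_metric` and `group_metric` function names and the second node's values are hypothetical, and the actual logic is in getGroupSummary.

```python
# metric_health, metric_availability and metric_reachability from the formula above.
metric_weights = {"health": 0.4, "availability": 0.2, "reachability": 0.4}


def node_metric(health, availability, reachability, weights=metric_weights):
    """Weighted combination of a node's health, availability and reachability."""
    return (weights["health"] * health
            + weights["availability"] * availability
            + weights["reachability"] * reachability)


def group_metric(nodes):
    """Plain average of the per-node metrics in a group (or the whole network)."""
    metrics = [node_metric(**node) for node in nodes]
    return sum(metrics) / len(metrics)


example_node = {"health": 92, "availability": 90, "reachability": 100}
# Hypothetical second node in the same group, fully healthy.
other_node = {"health": 100, "availability": 100, "reachability": 100}

print(round(node_metric(**example_node), 1))                # 94.8
print(round(group_metric([example_node, other_node]), 1))   # 97.4
```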