Managing a large, complex environment with ever-changing operational states is challenging. To assist, NMIS, a Network Management System that performs performance management and fault management simultaneously, monitors the health and operational status of devices and creates several individual metrics, as well as an overall metric, for each device. This article explains what those metrics are and what they mean.

...

Code Block
'metrics' => {
  'weight_availability' => '0.1',
  'weight_cpu' => '0.2',
  'weight_int' => '0.3',
  'weight_mem' => '0.1',
  'weight_response' => '0.2',
  'weight_reachability' => '0.1',
  'metric_health' => '0.4',
  'metric_availability' => '0.2',
  'metric_reachability' => '0.4',
  'average_decimals' => '2',
  'average_diff' => '0.1',
},

...

weight_cpu * 90 + weight_availability * 90 + weight_int * 90 + weight_mem * 60 + weight_response * 100 + weight_reachability * 100

which becomes "0.2 * 90 + 0.1 * 90 + 0.3 * 90 + 0.1 * 60 + 0.2 * 100 + 0.1 * 100", resulting in 90% for the health metric.
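
The weighted sum can be checked with a short script. This is a minimal sketch in Python, not NMIS's own Perl code: the function name `weighted_health` and the dictionary layout are assumptions made for illustration. The weights are the `weight_*` values of 0.2 (cpu), 0.1 (availability), 0.3 (int), 0.1 (mem), 0.2 (response) and 0.1 (reachability), and the component values are those in the formula above.

```python
# Illustrative sketch only; names and structure are assumptions, not NMIS source.

# Health weights (the weight_* entries of the 'metrics' configuration).
WEIGHTS = {
    "cpu": 0.2,
    "availability": 0.1,
    "int": 0.3,
    "mem": 0.1,
    "response": 0.2,
    "reachability": 0.1,
}

# Normalised component values (0-100) from the worked example.
VALUES = {
    "cpu": 90,
    "availability": 90,
    "int": 90,
    "mem": 60,
    "response": 100,
    "reachability": 100,
}


def weighted_health(weights, values):
    """Return the sum of weight * value over all health components."""
    if set(weights) != set(values):
        raise ValueError("weights and values must cover the same components")
    return sum(weights[name] * values[name] for name in weights)


health = weighted_health(WEIGHTS, VALUES)
print(f"health = {health:.2f}%")  # prints: health = 90.00%
```

Because the weights add up to 1.0, the result stays on the same 0 to 100 scale as the individual components.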

The calculations can be seen in the collect debug output, which is produced by running nmis.pl type=collect node=<NODENAME> debug=true

Code Block
09:08:36 runReach, Starting node meatball, type=router
09:08:36 runReach, Outage for meatball is 
09:08:36 runReach, Getting Interface Utilisation Health
09:08:36 runReach, Intf Summary in=0.00 out=0.00 intsumm=200 count=1
09:08:36 runReach, Intf Summary in=0.06 out=0.55 intsumm=399.39 count=2
09:08:36 runReach, Intf Summary in=8.47 out=5.81 intsumm=585.11 count=3
09:08:36 runReach, Intf Summary in=0.00 out=0.00 intsumm=785.11 count=4
09:08:36 runReach, Intf Summary in=0.06 out=0.56 intsumm=984.49 count=5
09:08:36 runReach, Intf Summary in=0.00 out=0.00 intsumm=1184.49 count=6
09:08:36 runReach, Intf Summary in=8.47 out=6.66 intsumm=1369.36 count=7
09:08:36 runReach, Intf Summary in=0.05 out=0.56 intsumm=1568.75 count=8
09:08:36 runReach, Calculation of health=96.11
09:08:36 runReach, Reachability and Metric Stats Summary
09:08:36 runReach, collect=true (Node table)
09:08:36 runReach, ping=100 (normalised)
09:08:36 runReach, cpuWeight=90 (normalised)
09:08:36 runReach, memWeight=100 (normalised)
09:08:36 runReach, intWeight=98.05 (100 less the actual total interface utilisation)
09:08:36 runReach, responseWeight=100 (normalised)
09:08:36 runReach, total number of interfaces=24
09:08:36 runReach, total number of interfaces up=7
09:08:36 runReach, total number of interfaces collected=8
09:08:36 runReach, total number of interfaces coll. up=6
09:08:36 runReach, availability=75
09:08:36 runReach, cpu=13
09:08:36 runReach, disk=0
09:08:36 runReach, health=96.11
09:08:36 runReach, intfColUp=6
09:08:36 runReach, intfCollect=8
09:08:36 runReach, intfTotal=24
09:08:36 runReach, intfUp=7
09:08:36 runReach, loss=0
09:08:36 runReach, mem=61.5342941922784
09:08:36 runReach, operCount=8
09:08:36 runReach, operStatus=600
09:08:36 runReach, reachability=100
09:08:36 runReach, responsetime=1.32
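
The Intf Summary lines in the debug output above are consistent with each collected interface adding 200 minus its input and output utilisation to a running total (intsumm), and with intWeight being that total divided by twice the interface count. This reading is inferred from the numbers in the log, not taken from the NMIS source. The following Python sketch, with the hypothetical function name `interface_weight`, is therefore a way to check the figures rather than the actual implementation.

```python
# Inferred from the log figures above; not NMIS source code.

# (in, out) utilisation percentages of the eight collected interfaces in the log.
UTILISATION = [
    (0.00, 0.00),
    (0.06, 0.55),
    (8.47, 5.81),
    (0.00, 0.00),
    (0.06, 0.56),
    (0.00, 0.00),
    (8.47, 6.66),
    (0.05, 0.56),
]


def interface_weight(utilisation):
    """Return (intsumm, intWeight) for a list of (in, out) utilisation pairs."""
    if not utilisation:
        raise ValueError("at least one interface is required")
    intsumm = 0.0
    for util_in, util_out in utilisation:
        # Each interface starts at 200 (100 for each direction) and loses
        # whatever percentage is currently in use.
        intsumm += 200 - (util_in + util_out)
    # Divide by two directions per interface to get back to a 0-100 scale.
    return intsumm, intsumm / (2 * len(utilisation))


intsumm, int_weight = interface_weight(UTILISATION)
print(f"intsumm={intsumm:.2f} intWeight={int_weight:.2f}")
# prints: intsumm=1568.75 intWeight=98.05
```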

Metric Example

The metric calculation is much more straightforward. It is done in a subroutine called getGroupSummary in NMIS.pm: for each node, the availability, reachability and health are extracted from the node's "reach" RRD file and then weighted according to the configured weights.

So based on our example before, the node would have the following values:

  • Health = 90%
  • Availability = 90%
  • Reachability = 100%

The formula would become "metric_health * 90 + metric_availability * 90 + metric_reachability * 100", resulting in "0.4 * 90 + 0.2 * 90 + 0.4 * 100 = 94". So the metric for this node is 94, which is averaged with the metrics of all the other nodes in the group, or in the whole network, to give the metric for each group and for the entire network.
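
The same arithmetic can be sketched in a few lines of Python. This is an illustration only, not the getGroupSummary code: the function names `node_metric` and `group_metric` are assumptions, the second node is an invented placeholder to show the averaging step, and the simple mean over per-node metrics follows the description above rather than the NMIS source.

```python
# Illustrative sketch only; function names and the second node are hypothetical.

# Metric weights (the metric_* entries of the 'metrics' configuration).
METRIC_WEIGHTS = {"health": 0.4, "availability": 0.2, "reachability": 0.4}


def node_metric(node, weights=METRIC_WEIGHTS):
    """Weight a node's health, availability and reachability into one metric."""
    return sum(weights[name] * node[name] for name in weights)


def group_metric(nodes, weights=METRIC_WEIGHTS):
    """Average the per-node metrics over a group (or the whole network)."""
    if not nodes:
        raise ValueError("a group needs at least one node")
    return sum(node_metric(node, weights) for node in nodes) / len(nodes)


# The node from the worked example.
example_node = {"health": 90, "availability": 90, "reachability": 100}

# Invented placeholder node, present only to demonstrate the averaging.
hypothetical_node = {"health": 100, "availability": 100, "reachability": 100}

print(f"node metric  = {node_metric(example_node):.2f}")
print(f"group metric = {group_metric([example_node, hypothetical_node]):.2f}")
```

With these inputs the example node scores 94 and the two-node group averages to 97.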

...