Managing a large, complex environment with ever-changing operational states is challenging. To assist, NMIS, as a Network Management System performing performance management and fault management simultaneously, monitors the health and operational status of devices and creates several individual metrics as well as an overall metric for each device.  This article explains what those metrics are and what they mean.

...

If more weight should be given to interface utilisation and less to interface availability, these metrics can be tuned: for example, weight_availability could become 0.05 and weight_int could become 0.25. The resulting weights (weight_*) should still add up to 1.
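
As a minimal sketch, assuming these weights are set as ordinary entries in Config.nmis (the other values shown are included only for illustration of the usual defaults), the adjustment could look like this:

Code Block
'weight_availability' => '0.05',    # reduced from the default of 0.1
'weight_int' => '0.25',             # increased from the default of 0.2
'weight_cpu' => '0.1',
'weight_mem' => '0.1',
'weight_reachability' => '0.3',
'weight_response' => '0.2',
# the weight_* values should still sum to 1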

Other Metrics Configuration Options

Introduced in NMIS 8.5.2G are some additional configuration options to help control how this all works, and to make it more or less responsive. The first two options are metric_comparison_first_period and metric_comparison_second_period, which default to -8 hours and -16 hours respectively.

These are the two main variables controlling the comparisons you see in NMIS, the real-time health baselining.  NMIS compares calculations made from the current time back to metric_comparison_first_period (8 hours ago) against calculations made from metric_comparison_first_period (8 hours ago) back to metric_comparison_second_period (16 hours ago).

This means NMIS is comparing, in real time, data from the last 8 hours to the 8-hour period before that.  You can make these periods shorter or longer.  In the lab I am running -4 hours and -8 hours, which makes the metrics a little more responsive to load and change.
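
For illustration, assuming these options are ordinary Config.nmis entries, the lab settings mentioned above would be something like:

Code Block
'metric_comparison_first_period' => '-4 hours',
'metric_comparison_second_period' => '-8 hours',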

The other new configuration option is metric_int_utilisation_above, which is -1 by default.  This means that interfaces with 0 (zero) utilisation will be counted in the overall interface utilisation metrics.  So if you have a switch with 48 interfaces, all active but with basically no utilisation, and two uplinks with 5 to 10% load, the average utilisation across the 48 interfaces is very low.  NMIS now picks the higher of input and output utilisation and only adds interfaces with utilisation above this configured amount; setting it to 0.5 should produce more dynamic health metrics.
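
The selection logic could be sketched roughly as follows (illustrative pseudo-Perl only, not the actual runReach code; the variable names are assumptions):

Code Block
my $threshold = $C->{'metric_int_utilisation_above'};   # -1 by default, e.g. 0.5
my ($intsumm, $count) = (0, 0);
for my $intf (@interfaces) {
    # take the higher of input and output utilisation for this interface
    my $util = $intf->{inputUtil} > $intf->{outputUtil}
                 ? $intf->{inputUtil}
                 : $intf->{outputUtil};
    next if $util <= $threshold;    # with the default of -1 even idle interfaces count
    # per-interface contribution as seen in the debug output below:
    # 200 less the input and output utilisation
    $intsumm += 200 - $intf->{inputUtil} - $intf->{outputUtil};
    $count++;
}
my $intWeight = $count ? $intsumm / ($count * 2) : 100;   # 100 means no utilisation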

Metric Calculation Examples

...

Code Block
09:08:36 runReach, Starting node meatball, type=router
09:08:36 runReach, Outage for meatball is 
09:08:36 runReach, Getting Interface Utilisation Health
09:08:36 runReach, Intf Summary in=0.00 out=0.00 intsumm=200 count=1
09:08:36 runReach, Intf Summary in=0.06 out=0.55 intsumm=399.39 count=2
09:08:36 runReach, Intf Summary in=8.47 out=5.81 intsumm=585.11 count=3
09:08:36 runReach, Intf Summary in=0.00 out=0.00 intsumm=785.11 count=4
09:08:36 runReach, Intf Summary in=0.06 out=0.56 intsumm=984.49 count=5
09:08:36 runReach, Intf Summary in=0.00 out=0.00 intsumm=1184.49 count=6
09:08:36 runReach, Intf Summary in=8.47 out=6.66 intsumm=1369.36 count=7
09:08:36 runReach, Intf Summary in=0.05 out=0.56 intsumm=1568.75 count=8
09:08:36 runReach, Calculation of health=96.11
09:08:36 runReach, Reachability and Metric Stats Summary
09:08:36 runReach, collect=true (Node table)
09:08:36 runReach, ping=100 (normalised)
09:08:36 runReach, cpuWeight=90 (normalised)
09:08:36 runReach, memWeight=100 (normalised)
09:08:36 runReach, intWeight=98.05 (100 less the actual total interface utilisation)
09:08:36 runReach, responseWeight=100 (normalised)
09:08:36 runReach, total number of interfaces=24
09:08:36 runReach, total number of interfaces up=7
09:08:36 runReach, total number of interfaces collected=8
09:08:36 runReach, total number of interfaces coll. up=6
09:08:36 runReach, availability=75
09:08:36 runReach, cpu=13
09:08:36 runReach, disk=0
09:08:36 runReach, health=96.11
09:08:36 runReach, intfColUp=6
09:08:36 runReach, intfCollect=8
09:08:36 runReach, intfTotal=24
09:08:36 runReach, intfUp=7
09:08:36 runReach, loss=0
09:08:36 runReach, mem=61.5342941922784
09:08:36 runReach, operCount=8
09:08:36 runReach, operStatus=600
09:08:36 runReach, reachability=100
09:08:36 runReach, responsetime=1.32
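
To see where the health value of 96.11 comes from, assuming the default weight_* values (weight_reachability 0.3, weight_int 0.2, weight_response 0.2, weight_availability 0.1, weight_cpu 0.1, weight_mem 0.1), the weighted sum works out as follows. This is a worked illustration using the figures from the debug output above, not the actual NMIS code:

Code Block
my $health = 0.3 * 100       # ping / reachability
           + 0.1 * 75        # availability
           + 0.1 * 90        # cpuWeight
           + 0.1 * 100       # memWeight
           + 0.2 * 98.05     # intWeight
           + 0.2 * 100;      # responseWeight
# $health = 96.11
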
Metric Example

The metric calculation is much more straightforward. These calculations are done in a subroutine called getGroupSummary in NMIS.pm: for each node, the availability, reachability and health are extracted from the node's "reach" RRD file, and then weighted according to the configuration weights.
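
As a rough sketch of that weighting (illustrative only, not the actual getGroupSummary code; the metric_* option names and example values here are assumptions, not taken from this article):

Code Block
# for each node, combine the three values read from its reach RRD file
my $metric = $C->{'metric_reachability'} * $reach{reachability}    # e.g. 0.4
           + $C->{'metric_availability'} * $reach{availability}    # e.g. 0.2
           + $C->{'metric_health'}       * $reach{health};         # e.g. 0.4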

...