The Baseline Tool now ships with the latest versions of opCharts for NMIS8 and NMIS9.

Why we need a Dynamic Baseline and Thresholding Tool

As the proverb goes, "forewarned is forearmed"; a quick Google search tells me it means "prior knowledge of possible dangers or problems gives one a tactical advantage". We baseline and threshold our data so that we receive alerts forewarning us of issues in our environment, and can act to resolve smaller issues before they become bigger ones. Being proactive increases our Mean Time Between Failures (MTBF).

...

When a metric remains at the same level for an extended period, this is detected as a flatline. It means the standard deviation of the metric over that period is 0 (or no greater than the configured threshold_std_deviation).

  • "threshold_period": "-60 minutes" # Default: -15 minutes
  • "threshold_std_deviation": 0.001 # Or 0. The value the standard deviation (stddev) is checked against.
  • "threshold_exceeds": 2 # Optional. If not set, an event is created every time a flatline is detected.
  • "threshold_level": "critical" # Major by default
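
The flatline settings above can be illustrated with a short Python sketch. It is not the tool's implementation: it assumes the check compares the population standard deviation of the samples in threshold_period against threshold_std_deviation, and that threshold_exceeds counts consecutive detections before an event is raised. The names is_flatline and FlatlineTracker are hypothetical.

```python
import statistics


def is_flatline(values, threshold_std_deviation=0.001):
    """Return True when the samples in the window barely vary.

    Assumption: population standard deviation compared with <=, so a
    threshold of 0 still matches a perfectly flat series.
    """
    if len(values) < 2:
        return False  # not enough samples to judge
    return statistics.pstdev(values) <= threshold_std_deviation


class FlatlineTracker:
    """Hypothetical helper: report an event only after `threshold_exceeds`
    consecutive flatline detections; with no threshold_exceeds set, every
    detection reports an event."""

    def __init__(self, threshold_std_deviation=0.001, threshold_exceeds=None):
        self.threshold_std_deviation = threshold_std_deviation
        self.threshold_exceeds = threshold_exceeds
        self.count = 0  # consecutive runs that detected a flatline

    def check(self, values):
        if not is_flatline(values, self.threshold_std_deviation):
            self.count = 0  # the metric moved, so start counting again
            return False
        self.count += 1
        if self.threshold_exceeds is None:
            return True
        return self.count >= self.threshold_exceeds
```

With threshold_exceeds=2, the first flat window returns False and the second consecutive flat window returns True; any window that varies resets the count.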

Flatline example: 

...

Flatline example with threshold exceeds:

Example:

Code Block
  "ifInErrors": {
    "baseline": "flatline",
    "active": "true",
    "metric": "ifInErrors",
    "type": "pkts_hc",
    "nodeModel": "CiscoRouter|CatalystIOS|CiscoNXOS",
    "use_index": "interface",
    "event": "Proactive Output Discards (flatline)",
    "indexed": "true",
    "threshold_std_deviation": 0.001,
    "threshold_period": "-60 minutes",
    "threshold_exceeds": "20"
  },

Simple Baseline

The simple baseline detects when the average over a selected period rises above a threshold level.

...

Example:

Code Block
  "ifInErrors": {
    "baseline": "simplethreshold",
    "active": "true",
    "metric": "ifInErrors",
    "type": "pkts_hc",
    "nodeModel": "CiscoRouter|CatalystIOS|CiscoNXOS",
    "use_index": "interface",
    "event": "Proactive Output Discards (simplethreshold)",
    "indexed": "true",
    "threshold_period": "-120 minutes",
    "levels": {
      "Warning": 10,
      "Minor": 20,
      "Major": 30,
      "Critical": 40,
      "Fatal": 50
    }
  },

In the above graph, that would be a Fatal alert. 
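
How the levels map to an alert can be sketched in Python. This is an illustration, not the tool's code: it assumes the average of the samples in threshold_period is compared against each level and the highest level the average meets or exceeds wins. The function name simple_threshold_level is hypothetical.

```python
def simple_threshold_level(values, levels):
    """Map the average of the samples in threshold_period to a level name.

    Assumption: the highest level whose value the average meets or exceeds
    wins; an average below every level returns None (no event).
    """
    if not values:
        return None
    average = sum(values) / len(values)
    matched = None
    # Walk the levels from the lowest threshold to the highest.
    for name, limit in sorted(levels.items(), key=lambda item: item[1]):
        if average >= limit:
            matched = name
    return matched


# The levels from the simplethreshold example above.
LEVELS = {"Warning": 10, "Minor": 20, "Major": 30, "Critical": 40, "Fatal": 50}
```

For example, samples averaging 32 would map to Major, and anything averaging 50 or more to Fatal.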

...

The baseline tool is configured in the file /usr/local/omk/conf/Baseline.json. The default configuration should be installed along with the tool.
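
To review which baselines are switched on, a short Python sketch can read the configuration file and list the active entries. It assumes the file is a single JSON object whose values are entries like the examples on this page, with "active" stored as the string "true"; entries without a "baseline" key are reported as "(not set)" because this page does not state their default. The function names are hypothetical.

```python
import json


def load_baseline_config(path):
    """Read the baseline configuration file (JSON) into a dict."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def active_baselines(config):
    """Yield (name, baseline type, metric) for each entry with "active": "true".

    Assumption: every entry is a JSON object; anything else is skipped.
    """
    for name, entry in config.items():
        if not isinstance(entry, dict):
            continue
        if str(entry.get("active", "")).lower() != "true":
            continue
        yield name, entry.get("baseline", "(not set)"), entry.get("metric")
```

Pass the path of your configuration file to load_baseline_config, then loop over active_baselines(config) to print one line per active entry.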

...

Here is what an entry in the configuration file looks like; this example is a Same-Day Baseline:

Code Block
  "RouteNumber": {
    "active": "true",
    "metric": "RouteNumber",
    "type": "RouteNumber",
    "nodeModel": "CiscoRouter",
    "event": "Proactive Route Number Change",
    "indexed": "false",
    "threshold_exceeds": null,
    "threshold_period": "-5 minutes",
    "multiplier": 1,
    "weeks": 0,
    "hours": 8
  },

Multi-Day Dynamic Baseline Configuration Example

Another configuration option uses the BGP prefixes exchanged with BGP peers, which come from the systemHealth modelling. This is a multi-day baseline:

Code Block
  "cbgpAcceptedPrefix": {
    "active": "true",
    "metric": "cbgpAcceptedPrefix",
    "type": "bgpPrefix",
    "section": "bgpPrefix",
    "nodeModel": "CircuitMonitor|CiscoRouter",
    "event": "Proactive BGP Peer Prefix Change",
    "indexed": "true",
    "multiplier": 1,
    "weeks": 4,
    "hours": 1
  },
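
One way to picture the difference between the same-day ("weeks": 0, "hours": 8) and multi-day ("weeks": 4, "hours": 1) examples is to list the time windows a baseline might draw its data from. The Python sketch below rests on an assumption this page does not confirm: weeks=0 uses only the last `hours` hours, and weeks=N adds the same `hours`-long window at the same time of day on the same weekday in each of the previous N weeks. The function name baseline_windows is hypothetical.

```python
from datetime import datetime, timedelta


def baseline_windows(now, hours, weeks=0):
    """Return (start, end) windows a same-day or multi-day baseline could use.

    Assumption: the first window is the last `hours` hours; each extra window
    is the same span shifted back by whole weeks.
    """
    windows = [(now - timedelta(hours=hours), now)]
    for week in range(1, weeks + 1):
        end = now - timedelta(weeks=week)
        windows.append((end - timedelta(hours=hours), end))
    return windows
```

For instance, at 12:00 on a Friday with hours=1 and weeks=4, this yields the 11:00 to 12:00 window today plus the same hour on each of the four previous Fridays.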

Delta Baseline Configuration Example

Delta baselines do not currently support multi-day baselines, but the hours value can be very large if required.

Code Block
  "hrSystemProcesses": {
    "baseline": "delta",
    "active": "true",
    "metric": "hrSystemProcesses",
    "type": "Host_Health",
    "nodeModel": "net-snmp",
    "indexed": "false",
    "hours": 4,
    "threshold_period": "-15 minutes",
    "levels": {
      "Warning": 10,
      "Minor": 20,
      "Major": 30,
      "Critical": 40,
      "Fatal": 50
    }
  },
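
A delta baseline can be sketched as comparing the recent average (threshold_period) with the average over the preceding `hours` of history. The Python below is only an illustration under a stated assumption: the levels are treated as percentages of change in either direction, which this page does not confirm. The function name delta_level is hypothetical.

```python
def delta_level(recent, history, levels):
    """Hypothetical delta check: percentage change of the recent average
    against the baseline average, mapped to the highest level it meets.

    Assumption: levels are percentages of change, up or down.
    """
    if not recent or not history:
        return None
    baseline = sum(history) / len(history)
    if baseline == 0:
        return None  # no meaningful percentage change from a zero baseline
    current = sum(recent) / len(recent)
    change = abs(current - baseline) / abs(baseline) * 100
    matched = None
    # Walk the levels from the lowest threshold to the highest.
    for name, limit in sorted(levels.items(), key=lambda item: item[1]):
        if change >= limit:
            matched = name
    return matched
```

Under this assumption, with the hrSystemProcesses levels, a process count rising from a baseline of 100 to 125 (a 25% change) would map to Minor.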

Delta Baseline for Output Packets Discarded Configuration Example

Code Block
  "ifOutDiscards": {
    "baseline": "delta",
    "active": "true",
    "metric": "ifOutDiscards",
    "type": "pkts_hc",
    "use_index": "interface",
    "nodeModel": "CiscoRouter",
    "event": "Proactive Output Discards (Delta)",
    "indexed": "true",
    "hours": 1,
    "threshold_period": "-15 minutes",
    "levels": {
      "Warning": 1,
      "Minor": 2,
      "Major": 3,
      "Critical": 4,
      "Fatal": 7
    }
  },

Running the Baseline Tool

...

Code Block
/usr/local/omk/bin/baseline.exe act=run

Several debug options show more detail: debug=true, debug=2 and debug=3 are the current levels of verbosity.

...

Code Block
#
# this cron schedule runs the baseline system every 5 minutes.
#
#
# if you DON'T want any NMIS cron mails to go to root, 
# uncomment and adjust the next line
#MAILTO=preferred@domain.com
#
# m h dom month dow user command
#
# run the baseline every 5 minutes starting at 4 minutes offset from the hour.
4-59/5 * * * * root "/usr/local/omk/bin/baseline.exe" act=run > "/usr/local/omk/log/baseline.log" 2>&1
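
The minute field 4-59/5 uses cron's range-with-step syntax: every 5th minute from minute 4 through minute 59. A small Python sketch (the helper name cron_minutes is hypothetical and not part of the tool) expands such a minute field so you can confirm exactly when a schedule fires:

```python
def cron_minutes(field):
    """Expand a cron minute field such as "4-59/5", "*/5" or "7" into a list."""
    spec, _, step_text = field.partition("/")
    step = int(step_text) if step_text else 1
    if spec == "*":
        start, end = 0, 59
    elif "-" in spec:
        low, high = spec.split("-")
        start, end = int(low), int(high)
    else:
        if step_text:
            raise ValueError("a step needs a range or '*': " + field)
        start = end = int(spec)
    # Cron ranges are inclusive, so extend the end by one for range().
    return list(range(start, end + 1, step))
```

cron_minutes("4-59/5") returns 4, 9, 14, ... 59: twelve runs an hour, each 4 minutes past a multiple of 5.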

Using Group Regex and Cron for Parallel Processing

...

Code Block
# run the baseline every 5 minutes starting at 3 and 4 minutes offset from the hour.
3-58/5 * * * * root /usr/local/omk/bin/baseline.exe act=run group_regex="Core|Dist" > /usr/local/omk/log/baseline1.log 2>&1
4-59/5 * * * * root /usr/local/omk/bin/baseline.exe act=run group_regex="Access" > /usr/local/omk/log/baseline2.log 2>&1
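
When splitting the work across cron jobs, it helps to confirm that every node group is picked up by exactly one group_regex. The Python sketch below is hypothetical: it assumes group_regex is applied as an unanchored regular-expression search on the group name, which this page does not confirm, and it uses Python's re module, whose syntax matches Perl's for simple alternations such as Core|Dist. The function name jobs_for_groups is illustrative only.

```python
import re


def jobs_for_groups(groups, job_patterns):
    """Map each node group to the jobs whose group_regex matches it.

    Assumption: unanchored search, so "Core|Dist" also matches "Distribution".
    """
    compiled = {job: re.compile(pattern) for job, pattern in job_patterns.items()}
    return {
        group: [job for job, rx in compiled.items() if rx.search(group)]
        for group in groups
    }
```

Any group that maps to an empty list would not be processed by either job, and any group that maps to two jobs would be processed twice; both cases suggest adjusting the regular expressions.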

...