The Baseline Tool now ships with the latest versions of opCharts for NMIS8 and NMIS9.
Why we need a Dynamic Baseline and Thresholding Tool
Forewarned is forearmed, the proverb goes; a quick Google search tells me it means "prior knowledge of possible dangers or problems gives one a tactical advantage". The reason we want to baseline and threshold our data is so that we can receive alerts forewarning us of issues in our environment, so that we can act to resolve smaller issues before they become bigger. Being proactive increases our Mean Time Between Failure.
...
In practice this spike was brief. Using the 15 minute threshold period (the current value is the average of the last 15 minutes), the value used for calculating change would be 136 and the resulting change would be 36%, so a Major event. The threshold period dampens spikes, removing brief changes so that you can see the changes which last longer.
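The dampening effect described above can be sketched in a few lines of Python. This is only an illustration, not the tool's implementation: the sample values, the baseline of 100 (implied by 136 giving a 36% change), the function names and the "reaches the level" comparison are all assumptions, and the level values are borrowed from the `levels` blocks in the configuration examples on this page.

```python
from statistics import mean

# Severity levels as percentage change; values mirror the "levels" blocks
# used in this page's configuration examples.
LEVELS = {"Warning": 10, "Minor": 20, "Major": 30, "Critical": 40, "Fatal": 50}


def percent_change(current, baseline):
    """Percentage difference of the current value from the baseline."""
    return (current - baseline) * 100.0 / baseline


def classify(change, levels=LEVELS):
    """Return the highest level whose limit the change reaches, or None."""
    reached = [name for name, limit in levels.items() if change >= limit]
    return max(reached, key=levels.get) if reached else None


# Hypothetical one-minute samples for the last 15 minutes: one brief spike
# to 640 among otherwise normal readings of 100.
samples = [100] * 14 + [640]
baseline = 100  # ASSUMPTION: implied by "136" and "36%" in the text

current = mean(samples)                      # threshold-period average
change = percent_change(current, baseline)   # change against the baseline
print(current, change, classify(change))

# Without the averaging, the raw spike alone would be classified far higher.
print(classify(percent_change(640, baseline)))
```

With these made-up samples the 15 minute average is 136, so the change is 36% and the event is Major, whereas the unaveraged spike on its own would reach the Fatal level.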
...
Flatline Baseline
...
...
Supported from opCharts 3.6.1.
When a metric remains at the same level for an extended period, it is detected as a flatline. This means the standard deviation is 0, or very close to 0.
- "threshold_period" : "-60 minutes" # Default is -15 minutes
- "threshold_std_deviation" : 0.001 # Or 0. The standard deviation (stddev) is checked against this value
- "threshold_exceeds" : 2 # Optional. If not set, an event is created every time a flatline is detected
- "threshold_level" : "critical" # Major by default
Flatline example:
The first flatline would be detected just when threshold_std_deviation is 10 in the example.
Flatline example with threshold exceed:
Example:
Code Block |
---|
"ifInErrors" : {
"baseline" : "flatline",
"active" : "true",
"metric" : "ifInErrors",
"type" : "pkts_hc",
"nodeModel" : "CiscoRouter|CatalystIOS|CiscoNXOS",
"use_index" : "interface",
"event" : "Proactive Output Discards (flatline)",
"indexed" : "true",
"threshold_std_deviation" : 0.001,
"threshold_period" : "-60 minutes",
"threshold_exceeds" : 20
}, |
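As a rough illustration of the flatline check, the sketch below compares the standard deviation of the samples in one threshold period against `threshold_std_deviation`. It is not the tool's code: the function names are hypothetical, the use of the population standard deviation is an assumption, and so is the reading of `threshold_exceeds` as "the flatline must be seen on this many consecutive runs".

```python
from statistics import pstdev


def is_flatline(samples, threshold_std_deviation=0.001):
    """True when the samples barely vary over the threshold period."""
    return pstdev(samples) <= threshold_std_deviation


def should_raise_event(period_windows, threshold_std_deviation=0.001,
                       threshold_exceeds=None):
    """Decide whether the most recent runs warrant a flatline event.

    period_windows holds one list of samples per baseline run, oldest first.
    ASSUMPTION: threshold_exceeds means the flatline must be seen on that
    many consecutive runs; when unset, a single detection is enough.
    """
    needed = threshold_exceeds or 1
    recent = period_windows[-needed:]
    return len(recent) == needed and all(
        is_flatline(window, threshold_std_deviation) for window in recent
    )


# Hypothetical data: one noisy period followed by completely flat periods.
flat = [42.0] * 12
noisy = [40.0, 44.0, 41.0, 47.0] * 3

print(should_raise_event([noisy, flat]))                       # no threshold_exceeds
print(should_raise_event([noisy, flat], threshold_exceeds=2))  # only one flat run so far
print(should_raise_event([flat, flat], threshold_exceeds=2))   # two flat runs in a row
```

Under these assumptions, leaving `threshold_exceeds` unset raises an event on the first flat period, while setting it delays the event until the flatline has persisted.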
Simple Baseline
The simple baseline just detects when the average over a selected period exceeds a threshold level. It uses the following settings:
- threshold_period
- levels
Example:
Code Block |
---|
"ifInErrors" : {
"baseline" : "simplethreshold",
"active" : "true",
"metric" : "ifInErrors",
"type" : "pkts_hc",
"nodeModel" : "CiscoRouter|CatalystIOS|CiscoNXOS",
"use_index" : "interface",
"event" : "Proactive Output Discards (simplethreshold)",
"indexed" : "true",
"threshold_period" : "-120 minutes",
"levels" : {
"Warning" : 10,
"Minor" : 20,
"Major" : 30,
"Critical" : 40,
"Fatal" : 50
}
}, |
In the above graph, that would be a Fatal alert.
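The simple threshold logic can be sketched as follows, assuming the average of the threshold period is compared directly against each entry in `levels`. The sample values and the function name are hypothetical, and whether the tool uses "greater than" or "greater than or equal" is an assumption.

```python
from statistics import mean

# Level limits copied from the simplethreshold example above.
LEVELS = {"Warning": 10, "Minor": 20, "Major": 30, "Critical": 40, "Fatal": 50}


def simple_threshold(samples, levels=LEVELS):
    """Return (average, level) for one threshold period.

    The level is the highest one whose limit the average reaches,
    or None when the average stays below every limit.
    """
    average = mean(samples)
    level = None
    # Walk the levels from lowest to highest limit, keeping the last match.
    for name, limit in sorted(levels.items(), key=lambda item: item[1]):
        if average >= limit:
            level = name
    return average, level


# Hypothetical samples covering a 120 minute threshold period.
quiet_period = [2, 4, 3, 5]
busy_period = [48, 55, 61, 52]

print(simple_threshold(quiet_period))
print(simple_threshold(busy_period))
```

In this sketch the busy period averages 54, which is above the Fatal limit of 50, while the quiet period stays below the Warning limit and produces no level.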
Installing the Baseline Tool
The baseline tool is installed with recent versions of opCharts.
Working with the Dynamic Baseline and Thresholding Tool
The Dynamic Baseline and Threshold Tool includes various configuration options so that you can tune the algorithm to learn differently depending on the metric being used. The tool comes with several metrics already configured. The system requires that stats modelling is completed for any metric you want to baseline, because this is how the NMIS API extracts statistical information from the performance database.
...
Configuration of the baseline tool is done in the file /usr/local/omk/conf/Baseline.json; the default configuration should be installed when the tool is installed.
...
Here is what the configuration file would look like; this example is a Same-Day Baseline:
Code Block |
---|
"RouteNumber" : {
  "active" : "true",
  "metric" : "RouteNumber",
  "type" : "RouteNumber",
  "nodeModel" : "CiscoRouter",
  "event" : "Proactive Route Number Change",
  "indexed" : "false",
  "threshold_exceeds" : null,
  "threshold_period" : "-5 minutes",
  "multiplier" : 1,
  "weeks" : 0,
  "hours" : 8
}, |
Multi-Day Dynamic Baseline Configuration Example
Another configuration option uses the BGP prefixes being exchanged with BGP peers. This metric comes from systemHealth modelling, and this example is a multi-day baseline:
Code Block |
---|
"cbgpAcceptedPrefix" : {
  "active" : "true",
  "metric" : "cbgpAcceptedPrefix",
  "type" : "bgpPrefix",
  "section" : "bgpPrefix",
  "nodeModel" : "CircuitMonitor|CiscoRouter",
  "event" : "Proactive BGP Peer Prefix Change",
  "indexed" : "true",
  "multiplier" : 1,
  "weeks" : 4,
  "hours" : 1
}, |
Delta Baseline Configuration Example
Currently delta baselines do not support multi-day, but the hours value can be very large if required.
Code Block |
---|
"hrSystemProcesses" : {
  "baseline" : "delta",
  "active" : "true",
  "metric" : "hrSystemProcesses",
  "type" : "Host_Health",
  "nodeModel" : "net-snmp",
  "indexed" : "false",
  "hours" : 4,
  "threshold_period" : "-15 minutes",
  "levels" : {
    "Warning" : 10,
    "Minor" : 20,
    "Major" : 30,
    "Critical" : 40,
    "Fatal" : 50
  }
}, |
Delta Baseline for Output Packets Discarded Configuration Example
Currently delta baselines do not support multi-day, but the hours value can be very large if required.
Code Block |
---|
"ifOutDiscards" : {
  "baseline" : "delta",
  "active" : "true",
  "metric" : "ifOutDiscards",
  "type" : "pkts_hc",
  "use_index" : "interface",
  "nodeModel" : "CiscoRouter",
  "event" : "Proactive Output Discards (Delta)",
  "indexed" : "true",
  "hours" : 1,
  "threshold_period" : "-15 minutes",
  "levels" : {
    "Warning" : 1,
    "Minor" : 2,
    "Major" : 3,
    "Critical" : 4,
    "Fatal" : 7
  }
}, |
Running the Baseline Tool
...
Code Block |
---|
/usr/local/omk/bin/baseline.exe act=run |
There are some debug options to see a little more detail: debug=true, debug=2 and debug=3 are the current levels of verbosity.
...
Code Block |
---|
#
# this cron schedule runs the baseline system every 5 minutes.
#
#
# if you DON'T want any NMIS cron mails to go to root,
# uncomment and adjust the next line
#MAILTO=prefered@domain.com
#
# m h dom month dow user command
#
# run the baseline every 5 minutes starting at 4 minutes offset from the hour.
4-59/5 * * * * root "/usr/local/omk/bin/baseline.exe" act=run > "/usr/local/omk/log/baseline.log" 2>&1 |
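The minute field `4-59/5` in the schedule above is a standard cron range with a step. As a quick check of which minutes it selects, the small Python sketch below expands that one field; the helper name is hypothetical and it only handles the `start-end/step` form used here, not full cron syntax.

```python
def expand_minute_field(field):
    """Expand a cron minute field of the form 'start-end/step'."""
    span, step = field.split("/")
    start, end = span.split("-")
    # The cron range is inclusive of its end value.
    return list(range(int(start), int(end) + 1, int(step)))


minutes = expand_minute_field("4-59/5")
print(minutes)
print(len(minutes), "runs per hour")
```

The baseline therefore runs twelve times an hour, at minutes 4, 9, 14 and so on up to 59.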
Using Group Regex and Cron for Parallel Processing.
...