Root Cause Dependency Analysis for Events
Introduction
The root cause dependency analysis (RCA) feature enables opEvents to identify multiple events which have a single root cause and suppress all but the root cause event.
For example, if 20 servers depend on a switch which depends on a router, and the interface on the switch which connects to the router goes down, NMIS will be unable to poll the switch or the 20 servers and it will raise 21 Node Down events. With root cause dependency analysis enabled, opEvents will automatically acknowledge the 20 Node Down events for the servers, so that operators can focus on the root cause: the Node Down event for the switch.
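The suppression rule in this example can be sketched in a few lines of shell. This is an illustration only, not opEvents code: the node names, the deps_of lookup and the classify_down function are all hypothetical, and the sketch checks direct dependencies only.

```shell
# Illustrative sketch only; this is not how opEvents is implemented.
# deps_of NODE: print the nodes that NODE depends on (made-up topology that
# mirrors the NMIS "depend" property: servers -> switch1 -> router1).
deps_of() {
  case "$1" in
    switch1) echo "router1" ;;
    server1|server2) echo "switch1" ;;
  esac
}

# classify_down NODE...: given the nodes that are currently down, print
# "NODE root" when none of its direct dependencies is down, otherwise
# "NODE acknowledged".
classify_down() {
  down=" $* "
  for node in "$@"; do
    state="root"
    for parent in $(deps_of "$node"); do
      case "$down" in
        *" $parent "*) state="acknowledged"; break ;;
      esac
    done
    echo "$node $state"
  done
}

# The switch uplink fails: the switch and its servers go down, the router stays up.
classify_down switch1 server1 server2
```

Only switch1 is reported as the root here, because its own dependency (router1) is not in the list of down nodes.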
Steps
Overview
RCA functionality in OMK products works in a few steps, each of which can be useful on its own.
opCharts: Generate a Dependency Map from existing LLDP/CDP data. Requires:
“real life” dependency data
OR data restored from another system
opCharts: Apply the Dependency Map data to the NMIS Node in the “depend” property
opCharts: Calculate and apply polling_groups so NMIS pings dependent nodes near the same time, so that dependencies change state in the same polling cycle, minimising lag.
opEvents: Enable the RCA Policy to group Node Down events using the “depend” property, so that only the node most depended on (the root node) has an active Node Down event; all other nodes depending on this root node have their Node Down events auto-acknowledged.
All steps below assume that opEvents, opCharts and NMIS versions that support the feature have been installed. The RCA functionality is available from these versions onwards:
opEvents 4.6.0
opCharts 4.9.0
NMIS 9.6.1
Generate a dependency graph/map from existing LLDP/CDP data
If opCharts is installed on a system that has LLDP or CDP information, you are ready to build a dependency map.
The following command builds a dependency map for the whole system (note: this will overwrite any existing map named “test”):
/usr/local/omk/bin/opcharts-cli.pl act=build_map_from_inventory map_name=test poller_ip="$IP"
# Replace $IP with the IP address of the NMIS server that is polling your devices.
# This command also accepts filters, so a system can have several dependency
# maps that cover different parts of the system (instead of one large one).
The map can also be created in the GUI: navigate to opCharts → Views → Maps, create a new map, select the “Dependency Map” map type and press “Build Dependency Map From Inventory”.
Now load the opCharts GUI and navigate to the maps page; you should see a map named “test”. This GUI is here for debugging purposes, but it does allow deleting, re-arranging and saving.
Apply this graph/map to node configuration “depend” settings
First, if you have existing depend settings that are no longer valid or if you’ve run the steps before and are trying again, clear the existing depend settings from all nodes in the system:
/usr/local/omk/bin/opcharts-cli.pl act=clear_node_depends
# This command accepts filter parameters to limit what is cleared; check the help.
Now tell opCharts to apply the dependencies from a named map to the node configuration depend settings. This will also automatically update the polling_groups setting in the nodes.
/usr/local/omk/bin/opcharts-cli.pl act=apply_dependency_map_to_nodes map_name=test
Depend settings can also be applied in the GUI when editing the dependency map, using the “Apply Dependency Map to Nodes” button.
To see the changes, load the admin tool, navigate to the nodes page, find a node that should have new settings, edit it and check the Advanced tab.
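The build, clear and apply commands from this section and the previous one can be chained in a small wrapper. The following is a minimal sketch, not a supported tool: the DRY_RUN, MAP_NAME and POLLER_IP variables and the function names are hypothetical, and 192.0.2.10 is a placeholder address. It defaults to printing the commands rather than running them, because clear_node_depends removes existing depend settings.

```shell
# Sketch only: chains the opcharts-cli.pl actions documented above.
# DRY_RUN, MAP_NAME and POLLER_IP are names invented for this sketch.
OPCHARTS_CLI="/usr/local/omk/bin/opcharts-cli.pl"
MAP_NAME="${MAP_NAME:-test}"
POLLER_IP="${POLLER_IP:-192.0.2.10}"  # placeholder; use your NMIS poller's IP
DRY_RUN="${DRY_RUN:-1}"               # 1 = print the commands, anything else = run them

# run CMD ARGS...: print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Build the map, clear old depend settings, then apply the map to the nodes.
# Each step only runs if the previous one succeeded.
apply_dependency_workflow() {
  run "$OPCHARTS_CLI" act=build_map_from_inventory map_name="$MAP_NAME" poller_ip="$POLLER_IP" &&
  run "$OPCHARTS_CLI" act=clear_node_depends &&
  run "$OPCHARTS_CLI" act=apply_dependency_map_to_nodes map_name="$MAP_NAME"
}

apply_dependency_workflow
```

Review the printed commands first, then set DRY_RUN=0 to execute them.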
Calculate and apply polling_groups so NMIS pings related nodes at the same time
opCharts will do this automatically when applying a dependency map to nodes. If you manually adjust the depend settings for a node you will want to do this step to update the polling groups.
This step is not mandatory, but it should help related Node Down events arrive together.
/usr/local/omk/bin/opcharts-cli.pl act=assign-polling-groups
After running this, you can query the API to check that the polling_group setting has been set:
http://$server/omk/opCharts/v2/nodes?properties=["configuration.polling_group"]
opEvents: Group Node Down events using depend settings
opEvents requires these opCommon.json settings to be changed for RCA code to run:
# tell the system to run RCA code
"apply_rca_policy" : "true",
# amount of time to wait for down events to come before releasing Down event of "root" node
"rca_delay_seconds" : 20,
After changing the settings, restart the opEvents daemon:
systemctl restart opeventsd
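Before testing, it can help to confirm that both keys are actually present in the configuration file. The check below is a rough sketch: the check_rca_settings function is hypothetical, the file path is an assumption about the default install location, and it only looks for the key names, not their values or position in the file.

```shell
# check_rca_settings FILE: report whether both RCA keys appear in FILE.
# Returns non-zero if either key is missing. Only key names are checked.
check_rca_settings() {
  conf="$1"
  rc=0
  for key in apply_rca_policy rca_delay_seconds; do
    if grep -q "\"$key\"" "$conf"; then
      echo "$key: present"
    else
      echo "$key: missing"
      rc=1
    fi
  done
  return $rc
}

# Assumed default location of opCommon.json; adjust if your install differs.
CONF="${CONF:-/usr/local/omk/conf/opCommon.json}"
if [ -r "$CONF" ]; then
  check_rca_settings "$CONF" || echo "RCA settings incomplete in $CONF"
fi
```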
Now, if you have used the restored data and its nodes, you can fake Node Up and Node Down events to test the policy (in the map that was generated, one of these nodes depends on the other):
# up
echo `date +%s`,router1_2,Node Up,Normal,,Ping failed Time=00:00:00 >> /usr/local/nmis9/logs/event.log
echo `date +%s`,router1_3,Node Up,Normal,,Ping failed Time=00:00:00 >> /usr/local/nmis9/logs/event.log
sleep 60
# down
echo `date +%s`,router1_2,Node Down,Major,,Ping failed >> /usr/local/nmis9/logs/event.log
echo `date +%s`,router1_3,Node Down,Major,,Ping failed >> /usr/local/nmis9/logs/event.log
Navigate to the “Current Events” page in opEvents. You should see the Node Up events, followed by the Node Down events, one of which is marked “acknowledged”. If you click on that event, you will see the acknowledged action “automatically acknowledged by RCA policy, RCA node: router1_2”. If you click on the active event for router1_2, the description will contain “Root Event,Nodes Affected:” followed by a list of the affected nodes. The GUI should also show a link to view the Node Dependency Map if opCharts 4.9.0 or later is installed.
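If you need to fake events for several nodes, the echo lines above can be wrapped in a small helper. This is a sketch only: format_event is a hypothetical function, and the field layout is inferred purely from the example lines above (timestamp, node, event, level, an empty field, details).

```shell
# format_event TIMESTAMP NODE EVENT LEVEL DETAILS
# Prints one line in the same comma-separated layout as the examples above.
# The fifth field is left empty, as it is in those examples.
format_event() {
  printf '%s,%s,%s,%s,,%s\n' "$1" "$2" "$3" "$4" "$5"
}

# Print (rather than append) a Node Down line for one of the test nodes.
format_event "$(date +%s)" router1_3 "Node Down" Major "Ping failed"
```

To inject the event, redirect the output to the same log file used above, for example by appending `>> /usr/local/nmis9/logs/event.log` to the call.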