Root Cause Dependency Analysis for Events

Introduction

The root cause dependency analysis (RCA) feature enables opEvents to identify multiple events which have a single root cause and suppress all but the root cause event.

For example, if 20 servers depend on a switch which depends on a router, and the interface on the switch which connects to the router goes down, NMIS will be unable to poll the switch or the 20 servers and will raise 21 Node Down events. With root cause dependency analysis enabled, opEvents automatically acknowledges the 20 Node Down events for the servers, so that operators can focus on the root cause: the Node Down event for the switch.
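The idea in the example above can be sketched in a few lines of shell. This is only an illustration of the concept and not the opEvents implementation: the function name, the two input file formats and the ROOT/SUPPRESS labels are all assumptions made for this sketch. A down node is treated as a root cause when none of the nodes it depends on is also down.

```shell
# Illustrative sketch only; this is NOT how opEvents is implemented.
# $1: file of "node depends_on" pairs, one pair per line (hypothetical format)
# $2: file listing the nodes that are currently down, one per line
# Prints "ROOT <node>" for a down node with no down dependency and
# "SUPPRESS <node>" for a down node that depends on another down node.
classify_down_nodes() {
    awk -v downfile="$2" '
        BEGIN {
            # Load the down nodes first, remembering their input order.
            while ((getline line < downfile) > 0) {
                if (line != "") { down[line] = 1; order[++n] = line }
            }
        }
        # A down node whose dependency is also down is not the root cause.
        ($1 in down) && ($2 in down) { suppressed[$1] = 1 }
        END {
            for (i = 1; i <= n; i++) {
                tag = (order[i] in suppressed) ? "SUPPRESS" : "ROOT"
                print tag, order[i]
            }
        }
    ' "$1"
}
```

With a dependency file saying that the servers depend on the switch and the switch depends on the router, and a down list containing the switch and the servers, only the switch is reported as ROOT.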

Steps

Overview

RCA functionality in OMK products works in a few steps, each of which can be useful on its own.

  1. opCharts: Generate a Dependency Map from existing LLDP/CDP data. Requires:

    1. “real life” dependency data

    2. OR data restored from another system

  2. opCharts: Apply the Dependency Map data to the NMIS Node in the “depend” property

  3. opCharts: Calculate and apply polling_groups so that NMIS pings dependent nodes at nearly the same time. Dependencies then change state in the same polling cycle, minimising lag.

  4. opEvents: Enable the RCA Policy to group Node Down events using the “depend” property, so that only the node most depended on (the root node) has an active Node Down event. All other nodes depending on this root node have their Node Down events auto-acknowledged.

All steps below assume that opEvents, opCharts and NMIS versions that support the feature have been installed. The RCA functionality is available from these versions onwards:

  • opEvents 4.6.0

  • opCharts 4.9.0

  • NMIS 9.6.1
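A version string can be compared against these minimums with a small helper. This is a sketch that assumes GNU coreutils (for "sort -V"); the function name is hypothetical, and how you look up the installed version of each product is left to you.

```shell
# Succeeds (exit status 0) when version $1 is greater than or equal to $2.
# Assumes "sort -V" (version sort) from GNU coreutils is available.
version_ge() {
    # The smaller of the two versions sorts first; if that is $2, then $1 >= $2.
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Example use, with a version string you have looked up yourself:
#   version_ge "4.9.2" "4.9.0" && echo "opCharts version supports RCA"
```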

Generate a dependency graph/map from existing lldp/cdp data

If opCharts is installed on a system that has LLDP or CDP information, you are ready to build a dependency map.

The following command builds a dependency map for the whole system (note: this will overwrite any existing map named “test”):

# Replace $IP with the IP address of the NMIS server that is polling your devices.
# This command also allows providing filters, so a system could have many different
# dependency maps which cover different parts of the system (instead of one large one).
/usr/local/omk/bin/opcharts-cli.pl act=build_map_from_inventory map_name=test poller_ip="$IP"

The map can also be created in the GUI by navigating to opCharts → Views → Maps, creating a new map, selecting the “Dependency Map” map type and pressing “Build Dependency Map From Inventory”.

Now load the opCharts GUI and navigate to the maps page, where you should see a map named test. This GUI is here for debugging purposes, but it does allow deleting, re-arranging and saving.

Apply this graph/map to node configuration “depend” settings

First, if you have existing depend settings that are no longer valid or if you’ve run the steps before and are trying again, clear the existing depend settings from all nodes in the system:

# this command accepts filter parameters to limit what is being cleared, check the help
/usr/local/omk/bin/opcharts-cli.pl act=clear_node_depends

Now tell opCharts to apply the dependencies from a named map to the node configuration depend settings. This will also automatically update the polling_groups setting in the nodes.

/usr/local/omk/bin/opcharts-cli.pl act=apply_dependency_map_to_nodes map_name=test

Depend settings can also be applied in the GUI when editing the dependency map, using the “Apply Dependency Map to Nodes” button.

To see the changes, load the admin tool, navigate to the nodes page, find a node that should have new settings, edit it and check the advanced tab.
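The CLI steps from this section and the previous one can be collected into a single dry-run helper. The sketch below only prints the commands in order so they can be reviewed before being run by hand; the function name and the OPCHARTS_CLI variable are hypothetical, while the act= and parameter names are the ones used above. Note that clear_node_depends, as written here, applies to all nodes.

```shell
# Path to the opCharts CLI as used in this document.
OPCHARTS_CLI=/usr/local/omk/bin/opcharts-cli.pl

# Prints (does not execute) the build, clear and apply commands in order.
# $1: map name, $2: IP address of the NMIS server that polls the devices.
print_rca_map_steps() {
    if [ -z "$1" ] || [ -z "$2" ]; then
        echo "usage: print_rca_map_steps MAP_NAME POLLER_IP" >&2
        return 1
    fi
    printf '%s act=build_map_from_inventory map_name=%s poller_ip="%s"\n' "$OPCHARTS_CLI" "$1" "$2"
    printf '%s act=clear_node_depends\n' "$OPCHARTS_CLI"
    printf '%s act=apply_dependency_map_to_nodes map_name=%s\n' "$OPCHARTS_CLI" "$1"
}
```

Printing rather than executing is a deliberate choice for this sketch, because building a map overwrites any existing map with the same name.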

Calculate and apply polling_groups so NMIS pings related nodes at the same time

opCharts will do this automatically when applying a dependency map to nodes. If you manually adjust the depend settings for a node you will want to do this step to update the polling groups.

This step is not mandatory, but it should help related Node Down events arrive together:

/usr/local/omk/bin/opcharts-cli.pl act=assign-polling-groups

After running this, you can query the API to confirm that the polling_group setting has been set:

http://$server/omk/opCharts/v2/nodes?properties=["configuration.polling_group"]
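When this URL is used from a command-line HTTP client rather than a browser, the brackets and quotes in the query string usually need to be percent-encoded. The helper below is a sketch that prints the encoded form of the URL shown above; the function name is hypothetical, and authentication is not covered because it depends on your installation.

```shell
# Prints the nodes API URL with the properties filter percent-encoded:
#   [ becomes %5B, " becomes %22, ] becomes %5D
# $1: host name or address of the opCharts server.
polling_group_url() {
    printf 'http://%s/omk/opCharts/v2/nodes?properties=%%5B%%22configuration.polling_group%%22%%5D\n' "$1"
}
```

The printed URL can then be passed to whichever HTTP client you normally use against the server.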

opEvents: Group Node Down events using depend settings

opEvents requires these opCommon.json settings to be changed for RCA code to run:

# tell the system to run RCA code
"apply_rca_policy" : "true",
# amount of time to wait for down events to arrive before releasing the Down event of the "root" node
"rca_delay_seconds" : 20,

Then restart the opEvents daemon so that the new settings take effect:

systemctl restart opeventsd
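A quick check that both settings are present in the configuration file can be sketched as follows. The function names are hypothetical, the path to opCommon.json is passed in as an argument because it depends on your installation, and the patterns assume each setting sits on one line as shown above.

```shell
# Succeeds when the file sets "apply_rca_policy" to true (quoted or unquoted).
# $1: path to opCommon.json
rca_policy_enabled() {
    grep -Eq '"apply_rca_policy"[[:space:]]*:[[:space:]]*"?true"?' "$1"
}

# Prints the numeric value of "rca_delay_seconds", or nothing if it is absent.
# $1: path to opCommon.json
rca_delay_seconds() {
    sed -n 's/.*"rca_delay_seconds"[[:space:]]*:[[:space:]]*"\{0,1\}\([0-9][0-9]*\).*/\1/p' "$1"
}
```

This is a plain text match rather than a JSON parse, so it is only meant as a sanity check after editing the file.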

 

Now, if you have used the restored data and its nodes, you can fake node up/down events (one of these nodes depends on the other in the map that was generated):

# up
echo `date +%s`,router1_2,Node Up,Normal,,Ping failed Time=00:00:00 >> /usr/local/nmis9/logs/event.log
echo `date +%s`,router1_3,Node Up,Normal,,Ping failed Time=00:00:00 >> /usr/local/nmis9/logs/event.log
sleep 60
# down
echo `date +%s`,router1_2,Node Down,Major,,Ping failed >> /usr/local/nmis9/logs/event.log
echo `date +%s`,router1_3,Node Down,Major,,Ping failed >> /usr/local/nmis9/logs/event.log
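If you need to fake events for other nodes, the repeated echo lines can be wrapped in a small helper. This is a sketch whose function name is hypothetical; the field layout (epoch, node, event, level, an empty field, details) is copied from the example lines above rather than from a format specification.

```shell
# Prints one event line in the same layout as the examples above:
#   <epoch>,<node>,<event>,<level>,,<details>
# $1: node name, $2: event (e.g. "Node Down"), $3: level (e.g. Major),
# $4: details (optional, defaults to "Ping failed").
fake_event_line() {
    printf '%s,%s,%s,%s,,%s\n' "$(date +%s)" "$1" "$2" "$3" "${4:-Ping failed}"
}

# Example use, appending to the NMIS event log used in this document:
#   fake_event_line router1_2 "Node Down" Major >> /usr/local/nmis9/logs/event.log
```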

Navigate to the “Current Events” page in opEvents. You should see the Node Up events, followed by the Node Down events, one of which is marked “acknowledged”. Clicking on that event shows the acknowledged action “automatically acknowledged by RCA policy, RCA node: router1_2”. Clicking on the active event for router1_2 shows a description containing “Root Event,Nodes Affected:” and a list of the nodes affected. The GUI should also have a link to view the Node Dependency Map if opCharts 4.9.0 or later is installed.