Deduplication and storm control in opEvents

opEvents provides two mechanisms to handle repeated event occurrences in a practical fashion, namely stateful event deduplication and programmable event suppression.

Stateful Deduplication

All events that relate to stateful entities (e.g. a node or an interface, which can be in state up or down) are automatically checked against the recent event history and the known previous state of that entity. If the new event reports the same state as the one already known, it is suppressed completely: no event record is created (except for raw logging, if that is enabled).

In practice this means that when there are multiple reports of a "Node down" around the same time, then only the first event will show up on the opEvents dashboard. This type of deduplication is essential for dealing with event storms; it is therefore always active and non-adjustable.
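
The check described above can be modelled in a few lines. This is only an illustrative sketch, not opEvents code: the class name, the (node, entity) key and the state strings are assumptions made for the example.

```python
# Illustrative model of stateful deduplication; not opEvents source code.
from typing import Dict, Tuple


class StatefulDeduplicator:
    """Remembers the last known state per (node, entity) and drops repeat reports."""

    def __init__(self) -> None:
        # Hypothetical in-memory store; opEvents keeps its own event history.
        self.known_state: Dict[Tuple[str, str], str] = {}

    def accept(self, node: str, entity: str, state: str) -> bool:
        """Return True if an event record should be created, False if suppressed."""
        key = (node, entity)
        if self.known_state.get(key) == state:
            return False  # same state as already known: suppress the repeat
        self.known_state[key] = state
        return True


dedup = StatefulDeduplicator()
# Three "down" reports arrive around the same time, followed by one "up" report.
reports = [("router1", "node", "down")] * 3 + [("router1", "node", "up")]
created = [r for r in reports if dedup.accept(*r)]
print(created)  # only the first "down" and the "up" report survive
```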

Related to this is the concept of a flap, which opEvents defines as a transition to state down and back up within a short time frame. The configuration option state_flap_window defines this window (default: 90 seconds). In a flap situation, the up event is marked as a flap event and its event name is changed to "<state entity> Flap"; it is also associated with the previous down event, and any repeat events that don't convey a new state are suppressed.

This behaviour can be fine-tuned using the configuration option opevents_no_action_on_flap (default: true). When it is set to true, opEvents automatically acknowledges the related down event and sets that event's action_required property to false, which stops any policy actions defined for the down event. If opevents_no_action_on_flap is false, the down event is not modified and remains open when a flap is detected.
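
As a rough illustration of the flap handling described above, the following sketch models a down event followed by an up event. It is not taken from opEvents: the Event class, the handle_up function and the inclusive treatment of the window boundary are assumptions; only the two option names and their defaults come from the text.

```python
# Illustrative model of flap detection; not opEvents source code.
from dataclasses import dataclass
from typing import Optional

STATE_FLAP_WINDOW = 90     # seconds, the documented default of state_flap_window
NO_ACTION_ON_FLAP = True   # documented default of opevents_no_action_on_flap


@dataclass
class Event:
    # Hypothetical, simplified event record.
    name: str
    time: float
    acknowledged: bool = False
    action_required: bool = True
    flap: bool = False
    related_to: Optional["Event"] = None


def handle_up(down: Event, up: Event, entity: str,
              window: float = STATE_FLAP_WINDOW,
              no_action_on_flap: bool = NO_ACTION_ON_FLAP) -> Event:
    """Mark the up event as a flap if it follows the down event within the window."""
    if up.time - down.time <= window:  # boundary handling is an assumption
        up.flap = True
        up.name = f"{entity} Flap"
        up.related_to = down
        if no_action_on_flap:
            # Acknowledge the down event and stop its policy actions.
            down.acknowledged = True
            down.action_required = False
    return up


down = Event("Node Down", time=1000.0)
up = handle_up(down, Event("Node Up", time=1030.0), entity="Node")
print(up.name, down.acknowledged, down.action_required)
```

With no_action_on_flap set to False, the same call still renames and links the up event but leaves the down event untouched.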

Programmable Suppression

To provide fine-grained control of how to handle repeated events of any kind, opEvents also supports programmable event suppression. Using this facility the administrator can define flexible rules for when to suppress repeat events, based on the recent event history and some further refinement criteria. Please note, however, that programmable suppression applies to classes or groups of events and cannot be enabled for a single node only.

The configuration file EventRules.nmis can contain any number of user-defined event synthesis and suppression directives given in a simple, almost self-explanatory format.

A suppression rule consists of:

  • a rule name, which serves display purposes only where suppression is concerned,
  • a list of events (more precisely, their names), which are the events to consider for suppression,
  • an optional list of groupby clauses, which define whether thresholds are to be interpreted globally for all named events, or separately within smaller groups,
  • a window parameter, which defines the time window to examine,
  • optional delayedaction and autoacknowledge parameters (in opEvents 2.0.4 and newer),
  • and a suppress clause with a min and/or a max occurrence parameter.

Note that this configuration file can also contain rules for Event Synthesis, which differ just slightly (they have a count parameter and no suppress clause).

Here is an example rule:

'5' => {
    name     => "suppressing repeats",   # name not relevant for suppression
    events   => ['Node Configuration Change'],
    groupby  => ['node.name'],
    window   => 120,
    suppress => { min => 2, max => 8 },
},

All such rules are applied independently to an event (until one indicates suppression or the end of the rule list is reached).

All named events that are listed in a suppression rule and that have occurred in the preceding window seconds are checked and counted together; listing multiple events in one rule therefore lumps them together as far as occurrence counting is concerned. If groupby is used, these recent events are first apportioned to groups, and the event count is then compared to the min/max occurrence parameters. If the count falls within the min to max range, the new event is marked as a duplicate (of the oldest event that was counted) and its action_checked property is set to 1, which prevents any future policy actions (e.g. escalations) from being executed; the event is nevertheless shown in the opEvents GUI.

If the suppress clause contains no min parameter, a minimum of 1 is assumed; if no max is present, infinity is used. Both min and max include the current event, so a min of 2 suppresses the first repeat and any further repeats.
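
The counting logic can be sketched as follows. This is a simplified model rather than the opEvents implementation: the function name and arguments are hypothetical, grouping is left out, and treating both min and max as inclusive bounds is an assumption based on the statement that both include the current event.

```python
# Illustrative model of the min/max occurrence check; not opEvents source code.
import math
from typing import Iterable, Optional


def should_suppress(recent_times: Iterable[float], now: float, window: float,
                    min_count: Optional[int] = None,
                    max_count: Optional[int] = None) -> bool:
    """Decide whether the event arriving at `now` counts as a suppressed duplicate.

    recent_times holds the timestamps of earlier events named in the rule.
    """
    lo = 1 if min_count is None else min_count          # missing min: 1
    hi = math.inf if max_count is None else max_count   # missing max: infinity
    # Earlier matching events inside the window, plus the current event itself.
    count = 1 + sum(1 for t in recent_times if now - window <= t <= now)
    return lo <= count <= hi  # inclusive bounds are an assumption


# Values from the example rule: window 120, min 2, max 8.
print(should_suppress([], now=0, window=120, min_count=2, max_count=8))     # first occurrence
print(should_suppress([0.0], now=30, window=120, min_count=2, max_count=8)) # first repeat
```

In this model the first occurrence is not suppressed (count 1 is below min), the first repeat within 120 seconds is (count 2), and a ninth occurrence within the window would again pass through because the count exceeds max.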

Delaying and Closing of Trigger Events

In opEvents 2.0.4 and newer, suppression rules can optionally specify a number for the delayedaction property, which delays all policy action processing for potential trigger events. If the criteria for suppression are met within the delay period, all action processing is aborted and skipped for these suppressed events. If the autoacknowledge property is also set, the suppression additionally marks the event as acknowledged.
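
One possible reading of this behaviour is sketched below. It is a deliberately simplified model and not opEvents code: the function name and return keys are hypothetical, the unit of delayedaction is assumed to be seconds, and max, window and groupby are ignored.

```python
# Illustrative model of delayed action processing; not opEvents source code.
from typing import Dict, Iterable


def resolve_trigger(trigger_time: float, repeat_times: Iterable[float],
                    delayedaction: float, min_count: int = 2,
                    autoacknowledge: bool = False) -> Dict[str, bool]:
    """Decide what happens to a trigger event once its delay period has expired."""
    # Count the trigger itself plus every repeat that arrived during the delay.
    count = 1 + sum(1 for t in repeat_times
                    if trigger_time < t <= trigger_time + delayedaction)
    suppressed = count >= min_count
    return {
        "run_actions": not suppressed,                  # actions are skipped if suppressed
        "acknowledged": suppressed and autoacknowledge, # acknowledged only if requested
    }


# A repeat arrives 10 seconds after the trigger, inside a 30 second delay.
print(resolve_trigger(0.0, [10.0], delayedaction=30, autoacknowledge=True))
```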

Grouping

If no groupby clause is present, the set of matching events is counted as a whole, which may be too generic for many common scenarios; for example, suppressing events only for a particular customer or service group wouldn't be possible. Grouping solves this problem: the set is split into groups with matching property values, and the thresholds are applied to each group separately.

The groupby clause is a list of node.X or event.Y property specifications (e.g. node.customer or node.group), which are used to sort events into buckets for counting: only events that share the same values for all the listed grouping properties are counted together. For example, the groupby clause [ 'node.customer', 'event.priority' ] causes the suppression rule to be applied independently for each combination of customer and event priority. The clause in the example rule above suppresses 2-8 node configuration events for any individual node within 120 seconds; without the groupby clause, repeated node configuration events would be suppressed regardless of which node they occurred on.
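
The bucketing can be illustrated with a short sketch. It is not opEvents code: the nested dictionary layout of an event and the function names are assumptions chosen for the example, and the property values are made up.

```python
# Illustrative model of groupby bucketing; not opEvents source code.
from collections import Counter
from typing import Dict, Iterable, List, Tuple

# Hypothetical event layout: {"node": {...node properties...}, "event": {...event properties...}}
EventRecord = Dict[str, Dict[str, str]]


def group_key(ev: EventRecord, groupby: List[str]) -> Tuple:
    """Build the bucket key from specifications such as 'node.customer'."""
    key = []
    for spec in groupby:
        scope, prop = spec.split(".", 1)  # "node.customer" -> ("node", "customer")
        key.append(ev[scope].get(prop))
    return tuple(key)


def count_by_group(events: Iterable[EventRecord], groupby: List[str]) -> Counter:
    """Count events per bucket; an empty groupby puts everything into one bucket."""
    counts: Counter = Counter()
    for ev in events:
        counts[group_key(ev, groupby)] += 1
    return counts


events = [
    {"node": {"customer": "acme"}, "event": {"priority": "3"}},
    {"node": {"customer": "acme"}, "event": {"priority": "3"}},
    {"node": {"customer": "acme"}, "event": {"priority": "5"}},
    {"node": {"customer": "globex"}, "event": {"priority": "3"}},
]
counts = count_by_group(events, ["node.customer", "event.priority"])
print(counts)
```

Each bucket count would then be compared to the min/max thresholds on its own, so two priority 3 events for one customer do not affect the count for another customer.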

The common node properties and the standard event properties are each documented on their own pages.
