Incident Management System
  • 02 Feb 2024


Description

Incident management is how Netreo automatically reduces "alert noise" when a downed host is the cause of one or more failed monitoring checks.

By suppressing alert notifications for down services and exceeded thresholds for a device, incident management not only reduces alert noise, but automates root-cause discovery by alerting only on the most likely cause of the problem.

Netreo administrators may also set up custom rules within incident management to suppress alerts based on a wide variety of arbitrary criteria. (See Incident Management Rulesets below.)

Details

When a monitoring check in Netreo fails and generates an alarm, that alarm always tries to open a new incident for itself. This incident then (typically) causes alert notifications to be sent by the action groups assigned to that check.

Ordinarily, this arrangement would result in a deluge of alerts every time a host went down and caused all of its individual checks to fail. Worse yet, if a host that other devices rely on for network access went down and took those devices down with it, all of the checks on all of those hosts would fail as well, creating a veritable flood of alerts.

This is where Netreo's incident management system comes in.

Incident management in Netreo is primarily concerned with:

  • Preventing a deluge of alert notifications brought on by a host-down condition causing all monitoring checks on all affected hosts to fail.
  • Automating root-cause discovery by correlating all related monitoring check alarms under a single incident that only allows alerts to be sent for the actual cause of the problem.

The Incident Management Process

Any time a Netreo monitoring check fails and generates an alarm, incident management checks that alarm against a set of rules before the alarm is allowed to open a new incident.

  1. If the alarm is from a service check, perform a host check to determine whether a downed host is responsible for the failure, and which host it is.
    1. If a downed host is found, suppress any alert notifications from action groups assigned to this check and add the failed check alarm as a related alarm to the incident opened by the host availability service check for the downed host.
    2. If no downed host is found, proceed to the next item in the list.
  2. If the alarm is from a failed service check whose host is up, or is from any other type of monitoring check, look to see if an incident already exists for this alarm.
    1. If an incident does exist, suppress any alert notifications from action groups assigned to this check and add the failed check alarm as a related alarm to the existing incident.
    2. If no incident exists, proceed to the next item in the list.
  3. Process the alarm against each rule in the Incident Management Rules table on the Incident Criteria Administration page (see below) to determine if it should have its alerts suppressed and be added to an existing, arbitrarily related, incident.
    1. If the alarm matches the criteria for an existing incident in the Incident Criteria list, suppress any alert notifications from action groups assigned to this check and add the failed check alarm as a related alarm to the existing incident.
    2. If the alarm does not match any criteria, proceed to the next item in the list.
  4. If the alarm does not conform to any of the above circumstances, it is allowed to open a new incident and process any alert notifications from the failed check's action groups.
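
The four steps above can be sketched as a single decision function. This is a minimal illustration and not Netreo's implementation: the names `Alarm`, `Incident`, `triage`, and `AVAILABILITY` are hypothetical, rules are modeled as plain callables, and the host check is simplified to a lookup of the alarm's own host in a set of downed hosts (the real host check may also consider upstream devices).

```python
from dataclasses import dataclass, field

AVAILABILITY = "availability"  # hypothetical name for the host availability check


@dataclass
class Alarm:
    host: str
    check: str
    is_service_check: bool = True


@dataclass
class Incident:
    opened_by: Alarm
    related: list = field(default_factory=list)


def triage(alarm, down_hosts, incidents, rules):
    """Return (incident, alerts_sent) for one incoming alarm.

    down_hosts -- set of host names currently considered down
    incidents  -- dict mapping (host, check) to the open Incident
    rules      -- callables taking an Alarm and returning an Incident or None
    """
    key = (alarm.host, alarm.check)

    # Step 1: a service check failing on a downed host joins the host's incident.
    if (alarm.is_service_check and alarm.check != AVAILABILITY
            and alarm.host in down_hosts):
        host_incident = incidents.get((alarm.host, AVAILABILITY))
        if host_incident is not None:
            host_incident.related.append(alarm)
            return host_incident, False

    # Step 2: an incident already exists for this alarm.
    existing = incidents.get(key)
    if existing is not None:
        existing.related.append(alarm)
        return existing, False

    # Step 3: a custom rule correlates the alarm with an existing incident.
    for rule in rules:
        matched = rule(alarm)
        if matched is not None:
            matched.related.append(alarm)
            return matched, False

    # Step 4: nothing applied, so open a new incident and allow alerts.
    incident = Incident(opened_by=alarm)
    incidents[key] = incident
    return incident, True


incidents = {}
host_down, alerted = triage(Alarm("web01", AVAILABILITY), {"web01"}, incidents, [])
http_fail, alerted_http = triage(Alarm("web01", "http"), {"web01"}, incidents, [])
print(alerted, alerted_http)  # True False
```

In this sketch only the host availability alarm sends alerts; the HTTP failure on the same host is attached to that incident as a related alarm.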

Incident Management Rulesets

Incident management rules are managed on the Incident Criteria Administration page (Administration > Alerts > Incident Management from the main menu).

Administrators can create custom rulesets that incident management uses when examining incoming alarms to decide on a course of action (send an alert notification, bundle with other alarms, etc.).

These rules may use compound conditional statements combined with regular expressions to cover a variety of circumstances.

One very good use of incident management rules is to allow an alarm to be correlated with another (ostensibly unrelated) alarm and bundled into that alarm's incident. This provides a convenient way to condense multiple (otherwise unrelated) alarms that are related only according to your particular organizational system.

When an incoming alarm is examined, the rule compares it against the listed parameters from the following three categories to decide on its actions.

Parameter Categories and Parameters

Many of the above parameters let you test whether the alarm is or is not the value specified for the parameter. Regular expressions are also acceptable values and use the match and not match comparisons.
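
The four comparisons can be illustrated with a short sketch. It uses Python's `re` module as a stand-in for whatever regular expression engine Netreo uses, and the function name `evaluate_condition` and the sample device names are hypothetical. Whether Netreo anchors patterns or searches anywhere in the value is not stated here, so the unanchored search below is an assumption.

```python
import re


def evaluate_condition(value, operator, expected):
    """Compare one alarm field against a rule parameter."""
    if operator == "is":
        return value == expected
    if operator == "is not":
        return value != expected
    if operator == "match":
        # Unanchored search; anchor the pattern with ^ and $ for a full match.
        return re.search(expected, value) is not None
    if operator == "not match":
        return re.search(expected, value) is None
    raise ValueError(f"unknown operator: {operator!r}")


print(evaluate_condition("core-sw-01", "match", r"^core-"))      # True
print(evaluate_condition("core-sw-01", "not match", r"^edge-"))  # True
print(evaluate_condition("core-sw-01", "is", "core-sw-02"))      # False
```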

Each rule functions as a basic sequence of IF/ELSE statements.

  • IF any of the sets of conditional statements evaluates to true, the rule can take one of the following actions (AND conditional statements may be added to a rule to narrow it down to the specific desired circumstances):
    • Continue on to the correlation section of this rule
    • Create a new incident and add an extra alert contact
    • Create a new incident and send a configured alert notification
    • Create a new incident and send a custom alert notification
    • Suppress alert and add to the suppressed alarms incident
    • Go to the next rule
  • ELSE, if the conditions evaluate to false, the rule can take an alternate action (also from the list above).

Additional ELSE/IF conditional statements may be added to provide more complex flow control.
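
A rule's flow control can be pictured as an ordered list of branches with a fallback. The sketch below is an illustration under stated assumptions: the `Action` enum reuses the action names listed above, while `evaluate_rule`, the dictionary-based alarm, and its `host`, `check`, and `site` fields are hypothetical. Each branch holds conditions joined by AND, and later branches play the role of ELSE/IF statements.

```python
from enum import Enum


class Action(Enum):
    CORRELATE = "Continue on to the correlation section of this rule"
    NEW_EXTRA_CONTACT = "Create a new incident and add an extra alert contact"
    NEW_CONFIGURED_ALERT = "Create a new incident and send a configured alert notification"
    NEW_CUSTOM_ALERT = "Create a new incident and send a custom alert notification"
    SUPPRESS = "Suppress alert and add to the suppressed alarms incident"
    NEXT_RULE = "Go to the next rule"


def evaluate_rule(alarm, branches, else_action):
    """Return the action of the first branch whose conditions all hold."""
    for conditions, action in branches:
        # Conditions within a branch are combined with AND.
        if all(condition(alarm) for condition in conditions):
            return action
    # No IF or ELSE/IF branch matched, so take the ELSE action.
    return else_action


branches = [
    ([lambda a: a["host"].startswith("lab-")], Action.SUPPRESS),
    ([lambda a: a["check"] == "ping", lambda a: a["site"] == "hq"], Action.CORRELATE),
]

lab_alarm = {"host": "lab-07", "check": "cpu", "site": "hq"}
print(evaluate_rule(lab_alarm, branches, Action.NEXT_RULE).name)  # SUPPRESS
```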

If you select the Continue on to the correlation section of this rule option, the incoming alarm will be correlated with another alarm as a related alarm (meaning no alert notifications will be sent for it).

If you select the Create a new incident and send a custom alert notification option, you must then specify a custom alert template to use for sending that alert notification. Use this option only for highly specific circumstances, because it forces you to select a single action group that overrides all other action groups selected in the check configuration of the alarm source. As a result, only contacts in the action group selected in the rule receive alerts from the opened incident.
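
The override behavior can be shown with a small sketch. The function `alert_recipients`, the group names, and the contact addresses are hypothetical and only illustrate the effect described above: when the rule selects an action group, the groups assigned in the check configuration are ignored.

```python
def alert_recipients(check_groups, action_groups, rule_group=None):
    """Return the contacts alerted for a newly opened incident.

    check_groups  -- names of action groups assigned in the check configuration
    action_groups -- dict mapping group name to a list of contact addresses
    rule_group    -- group selected in the rule's custom alert action, if any
    """
    # A group chosen in the rule replaces every group from the check.
    names = [rule_group] if rule_group is not None else check_groups
    recipients = []
    for name in names:
        for contact in action_groups[name]:
            if contact not in recipients:  # avoid alerting a contact twice
                recipients.append(contact)
    return recipients


groups = {
    "network": ["noc@example.com"],
    "servers": ["ops@example.com", "noc@example.com"],
    "dba": ["dba@example.com"],
}

print(alert_recipients(["network", "servers"], groups))
# ['noc@example.com', 'ops@example.com']
print(alert_recipients(["network", "servers"], groups, rule_group="dba"))
# ['dba@example.com']
```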

If you select the Suppress alert and add to the suppressed alarms incident option, the alarm is immediately bundled into the Suppress Alarms incident and no alert notifications are sent. This option is not recommended unless you are absolutely sure that you never want to receive alert notifications for alarms processed by the rule.

