Incident Management System
  • 27 Jun 2024

Description

Incident management is how Netreo automatically reduces "alert noise" when a downed host is the cause of one or more failed monitoring checks.

By suppressing alert notifications for down services and exceeded thresholds on a device, incident management not only reduces alert noise but also automates root-cause discovery by alerting only on the most likely cause of the problem.

Netreo administrators may also set up custom rules within incident management to suppress alerts based on arbitrary criteria. (See Incident Management Rulesets below.)

Incident Publishing

By default, Netreo publishes all processed incidents to a private Netreo-controlled cloud database for the purpose of collecting metrics used to improve Netreo's products and services.

If you do not wish your incidents to be published to the Netreo cloud, you can deactivate this feature on the Feature Toggle page.

Details

When a monitoring check in Netreo fails and generates an alarm, that alarm always tries to open a new incident for itself. This incident then (typically) causes alert notifications to be sent by the action groups assigned to that check.

Ordinarily, this arrangement would result in a deluge of alerts every time a host went down and caused all of its individual checks to fail. Worse yet, if a host relied upon by other devices for network access went down and took those devices down with it, all of the checks on all of those hosts would fail as well, creating a veritable flood of alerts.

This is where Netreo's incident management system comes in.

Incident management in Netreo is primarily concerned with:

  • Preventing a deluge of alert notifications brought on by a host-down condition causing all monitoring checks on all affected hosts to fail.
  • Automating root-cause discovery by correlating all related monitoring check alarms under a single incident that only allows alerts to be sent for the actual cause of the problem.

The Incident Management Process

Any time a Netreo monitoring check fails and generates an alarm, incident management checks that alarm against a set of rules before the alarm is allowed to open a new incident.

  1. If the alarm is from a service check, perform a host check to determine if a downed host is responsible for the failure and which host it is.
    1. If a downed host is found, suppress any alert notifications from action groups assigned to this check and add the failed check alarm as a related alarm to the incident opened by the host availability service check for the downed host.
    2. If no downed host is found, proceed to the next item in the list.
  2. If the alarm is from a failed service check whose host is up, or is from any other type of monitoring check, look to see if an incident already exists for this alarm.
    1. If an incident does exist, suppress any alert notifications from action groups assigned to this check and add the failed check alarm as a related alarm to the existing incident.
    2. If no incident exists, proceed to the next item in the list.
  3. Process the alarm against each rule in the Incident Management Rules table on the Incident Criteria Administration page (see below) to determine if it should have its alerts suppressed and be added to an existing, arbitrarily related incident.
    1. If the alarm matches the criteria for an existing incident in the Incident Criteria list, suppress any alert notifications from action groups assigned to this check and add the failed check alarm as a related alarm to the existing incident.
    2. If the alarm does not match any criteria, proceed to the next item in the list.
  4. If the alarm does not conform to any of the above circumstances, it is allowed to open a new incident and process any alert notifications from the failed check's action groups.
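The four steps above can be sketched as a single decision function. This is a minimal illustration under stated assumptions, not Netreo's actual code: the Alarm and Incident classes, the find_down_host callback, the two lookup dictionaries, and the idea that a rule is a callable returning a matching incident (or None) are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Alarm:
    check_id: str           # hypothetical identifier of the failed check
    host: str
    is_service_check: bool


@dataclass
class Incident:
    primary: Alarm
    related: list = field(default_factory=list)


def process_alarm(alarm, find_down_host, host_incidents, open_incidents, rules):
    """Return (incident, alerts_sent) for an incoming alarm."""
    # Step 1: for service checks, look for a downed host behind the failure.
    if alarm.is_service_check:
        down_host = find_down_host(alarm.host)
        if down_host is not None:
            # Assumes the host availability check already opened an incident.
            incident = host_incidents[down_host]
            incident.related.append(alarm)
            return incident, False  # alerts suppressed

    # Step 2: does an incident already exist for this alarm?
    existing = open_incidents.get(alarm.check_id)
    if existing is not None:
        existing.related.append(alarm)
        return existing, False

    # Step 3: try each administrator-defined rule in order.
    for rule in rules:
        matched = rule(alarm)  # hypothetical: returns an Incident or None
        if matched is not None:
            matched.related.append(alarm)
            return matched, False

    # Step 4: nothing applied, so open a new incident and allow alerts.
    incident = Incident(primary=alarm)
    open_incidents[alarm.check_id] = incident
    return incident, True
```

In this sketch, only an alarm that falls through to step 4 returns alerts_sent as True; every earlier exit attaches the alarm to an existing incident as a related alarm.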

Incident Management Rulesets

Incident management rules are managed on the Incident Criteria Administration page (Administration > Alerts > Incident Management from the main menu).

Administrators can create custom rulesets that incident management consults when it examines an incoming alarm and decides on a course of action (send an alert notification, bundle the alarm with other alarms, etc.).

These rules may use compound conditional statements combined with regular expressions to cover a variety of circumstances.

One very good use of incident management rules is to allow an alarm to be correlated with another (ostensibly unrelated) alarm and bundled into that alarm's incident. This provides a convenient way to condense multiple (otherwise unrelated) alarms that are related only according to your particular organizational system.

When an incoming alarm is examined, the rule compares it against the listed parameters from the following three categories to decide on its actions.

Parameter Categories and Parameters

Many of the above parameters let you test whether the alarm is or is not the value specified for the parameter. Regular expressions are also acceptable values; these are tested with match and not match.

Each rule functions as a basic sequence of IF/ELSE statements.

  • IF the results of any of the sets of conditional statements evaluate to true, the rule can take one of the following actions (AND conditional statements may be added to a rule to narrow it down to the specific desired circumstances):
    • Continue on to the correlation section of this rule
    • Create a new incident and add an extra alert contact
    • Create a new incident and send a configured alert notification
    • Create a new incident and send a custom alert notification
    • Suppress alert and add to the suppressed alarms incident
    • Go to the next rule
  • ELSE, if the conditions evaluate to false, the rule can take an alternate action (also from the list above).

Additional ELSE/IF conditional statements may be added to provide more complex flow control.
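The IF / ELSE IF / ELSE flow described above can be sketched roughly as follows. This is only an illustration, not Netreo's implementation: the class names (Action, Condition, Branch, Rule), the alarm fields "host" and "check", and the choice of re.search for "match" are all assumptions made for the example. Only the six action labels are taken from the list above.

```python
import re
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    # Labels copied from the action list above; the enum itself is hypothetical.
    CORRELATE = "Continue on to the correlation section of this rule"
    NEW_EXTRA_CONTACT = "Create a new incident and add an extra alert contact"
    NEW_CONFIGURED = "Create a new incident and send a configured alert notification"
    NEW_CUSTOM = "Create a new incident and send a custom alert notification"
    SUPPRESS = "Suppress alert and add to the suppressed alarms incident"
    NEXT_RULE = "Go to the next rule"


@dataclass
class Condition:
    parameter: str        # hypothetical alarm field name
    pattern: str          # regular expression to test against that field
    negate: bool = False  # False = "match", True = "not match"

    def holds(self, alarm: dict) -> bool:
        value = str(alarm.get(self.parameter, ""))
        found = re.search(self.pattern, value) is not None
        return found != self.negate


@dataclass
class Branch:
    conditions: list  # all conditions in one branch are ANDed together
    action: Action


@dataclass
class Rule:
    branches: list  # first entry is the IF, later entries are ELSE/IF
    else_action: Action = Action.NEXT_RULE

    def evaluate(self, alarm: dict) -> Action:
        # Return the action of the first branch whose conditions all hold.
        for branch in self.branches:
            if all(c.holds(alarm) for c in branch.conditions):
                return branch.action
        return self.else_action


rule = Rule(
    branches=[
        Branch([Condition("host", r"^core-"),
                Condition("check", r"interface", negate=True)],
               Action.NEW_CONFIGURED),
        Branch([Condition("check", r"interface")], Action.SUPPRESS),
    ]
)
print(rule.evaluate({"host": "core-sw1", "check": "cpu"}).value)
```

Modeling each ELSE/IF as its own branch keeps the evaluation order explicit: the first branch whose conditions all hold decides the action, and the ELSE action applies only when none do.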

If you select the Continue on to the correlation section of this rule option, the incoming alarm will be correlated with another alarm as a related alarm (meaning no alert notifications will be sent for it).

If you select the Create a new incident and send a custom alert notification option, you must then specify a custom alert template to use for sending that alert notification. Use this option only for highly specific circumstances, because it forces you to select a single action group that overrides all other action groups selected in the check configuration of the alarm source. As a result, only contacts in the action group selected in the rule receive alerts from the opened incident.
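The override behavior of the custom alert option can be illustrated with a short sketch. The function name and arguments are hypothetical and exist only to show the effect described above: once a rule supplies its own action group, the action groups configured on the check are ignored.

```python
def alert_recipients(check_action_groups, rule_action_group=None):
    """Return the action groups alerted for a newly opened incident.

    rule_action_group is the single group chosen in a rule that uses the
    custom alert option (hypothetical parameter); None means no override.
    """
    if rule_action_group is not None:
        # The rule's group replaces every group configured on the check.
        return [rule_action_group]
    return list(check_action_groups)


print(alert_recipients(["NetOps", "DBA"]))            # no override
print(alert_recipients(["NetOps", "DBA"], "OnCall"))  # rule overrides
```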

If you select the Suppress alert and add to the suppressed alarms incident option, the alarm is immediately bundled into the Suppress Alarms incident and no alert notifications are sent. It is not recommended to use this option unless you are absolutely sure that you never want to receive alert notifications for alarms processed by the rule using this option.

The Suppress Alarms Incident

The Suppress Alarms incident is intended as a generic incident management container for alarms that you know will be generated occasionally but for which you never want to receive alert notifications (for example, a particular network interface going up or down).

To use the Suppress Alarms incident, create an incident management rule and select the Suppress alert and add to the suppressed alarms incident option for the action. (You may create as many incident management rules using this option as required to cover the alarm types for which you do not wish to be alerted.)

The first time an alarm occurs that is processed by a rule using this option, a new "Suppress Alarms" incident is created in the active incidents list and immediately set to the ACKNOWLEDGED state, and the processed alarm is added to that incident as a related alarm (there is never a primary alarm for the Suppress Alarms incident). Any additional alarms that are then processed by any incident management rules using this option also get bundled into the Suppress Alarms incident as related alarms. This ensures that no alert notifications are ever sent for alarms processed by these rules.

If all alarms bundled into the Suppress Alarms incident recover, the incident moves into the ALARMS CLEARED state and then the CLOSED state, just like any other incident. If a new alarm then occurs that is processed by one of the above-mentioned rules, the closed Suppress Alarms incident is re-opened and re-used (and, of course, set to the ACKNOWLEDGED state). This allows the Suppress Alarms incident to keep its incident ID permanently, making it easy to find at any time.
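The lifecycle described above can be pictured as a small state machine. The sketch below is an assumption-laden illustration, not Netreo's data model: the class, its method names, and the explicit close() step are hypothetical (the article does not say what triggers the move from ALARMS CLEARED to CLOSED). Only the three state names come from the text.

```python
from dataclasses import dataclass, field


@dataclass
class SuppressAlarmsIncident:
    incident_id: int
    state: str = "ACKNOWLEDGED"  # created directly in the ACKNOWLEDGED state
    active_alarms: set = field(default_factory=set)

    def add_alarm(self, check_id: str) -> None:
        # A new suppressed alarm re-opens a closed incident under the same ID.
        self.state = "ACKNOWLEDGED"
        self.active_alarms.add(check_id)

    def recover(self, check_id: str) -> None:
        self.active_alarms.discard(check_id)
        if not self.active_alarms:
            self.state = "ALARMS CLEARED"

    def close(self) -> None:
        # Hypothetical explicit step; only a cleared incident moves to CLOSED.
        if self.state == "ALARMS CLEARED":
            self.state = "CLOSED"


incident = SuppressAlarmsIncident(incident_id=42)
incident.add_alarm("if:Gi0/1")
incident.recover("if:Gi0/1")
incident.close()
incident.add_alarm("if:Gi0/2")  # re-opened, still incident 42
print(incident.incident_id, incident.state)
```

Because the same object is re-used rather than replaced, the incident ID never changes, which mirrors the point made above about the incident being easy to find.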

Once it has been created, you can find and view the Suppress Alarms incident on the Active Incident page, just like any other incident.

It is not recommended to use this incident management rules option unless you are absolutely sure that you never want to receive alert notifications for alarms processed by the rule using this option.

Note: Once an alarm has been associated with the Suppress Alarms incident, the status of the monitoring check that experienced the alarm will be permanently associated with that incident, regardless of that check's status. The only way to remove the association is by manually closing the Suppress Alarms incident. This can be done using the Close Incident button found at the top right of the Incident Detail panel of the Incident View page for that incident.

