Incident
  • 22 Jan 2024

Description

When a monitoring check attached to a managed device experiences a state that generates an alarm, Netreo needs a way to centralize and monitor all of the information surrounding that event. This is what an incident is for.

An incident acts as a record of the events surrounding the cause of an alarm, and contains all of the associated data and history relevant to it. It collects in one place any and all information related to the alarm event. This information remains archived in the Netreo logs for a period of three years before it is deleted by the system. Any incident from within that period can be looked up using its unique Incident ID number.

Incidents may be viewed on the Incident Dashboard.

Incident Publishing

By default, Netreo publishes all processed incidents to a private Netreo-controlled cloud database for the purpose of collecting metrics used to improve Netreo's products and services.

If you do not wish your incidents to be published to the Netreo cloud, this feature may be deactivated on the Feature Toggle page.

Details

Incidents are named using a combination of the title of the alarm involved and the name of the managed device for which the alarm is occurring.

Incident States

Incidents exist in one of four states:

State - Description
OPEN - Indicates that the alarm that opened the incident is ongoing and has not been addressed. When an incident is first opened, it immediately attempts to run all of the actions in the action groups assigned to the monitoring check that generated the alarm. The device for which the incident was opened shows the current alarm state in every dashboard that displays it, and the incident continues to run relevant actions and escalate in accordance with the settings in the failed check's configuration.
ACKNOWLEDGED - Indicates that although an alarm is ongoing, someone is aware of the problem and is currently working on it. Once an incident has been acknowledged, it is this state that is displayed in the dashboards instead of the current alarm state. ACKNOWLEDGED incidents never escalate.
ALARMS CLEARED - If the condition that caused an alarm clears and the failed check returns to an OK state, the associated incident enters the ALARMS CLEARED state. It remains in that state for a configurable period (5 minutes by default) until Netreo is sure that the check that generated the original alarm is in a stable OK state, at which point the incident is automatically CLOSED. The ALARMS CLEARED period helps prevent new incidents from being created by a flapping alarm condition. If the same alarm recurs while the incident is in the ALARMS CLEARED state, the incident simply returns to the OPEN or ACKNOWLEDGED state, whichever it was in before changing to ALARMS CLEARED.
CLOSED - Indicates that the incident is closed and archived for historical and reporting purposes. The alarm associated with the incident has cleared and the failed check that caused the alarm has remained in the OK state for longer than the ALARMS CLEARED time setting. No further events will be recorded within this incident.

Any incident that is not CLOSED is considered to be an "active incident."
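
The state descriptions above can be read as a small state machine. The following Python sketch is only an illustration of that reading, not Netreo's implementation: the class, field, and method names (Incident, previous_state, clear_hold_seconds, tick, and so on) are hypothetical, and the 300-second hold simply mirrors the documented 5-minute default.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class State(Enum):
    OPEN = "OPEN"
    ACKNOWLEDGED = "ACKNOWLEDGED"
    ALARMS_CLEARED = "ALARMS CLEARED"
    CLOSED = "CLOSED"


@dataclass
class Incident:
    # Hypothetical model; these fields are illustrative, not Netreo's schema.
    state: State = State.OPEN
    previous_state: State = State.OPEN   # state held before ALARMS CLEARED
    cleared_at: Optional[float] = None   # when the check returned to OK
    clear_hold_seconds: float = 300.0    # documented default is 5 minutes

    def acknowledge(self) -> None:
        if self.state is State.OPEN:
            self.state = State.ACKNOWLEDGED

    def alarm_cleared(self, now: float) -> None:
        # Remember where we came from so a flapping alarm can return there.
        if self.state in (State.OPEN, State.ACKNOWLEDGED):
            self.previous_state = self.state
            self.state = State.ALARMS_CLEARED
            self.cleared_at = now

    def alarm_recurred(self) -> None:
        # Same alarm again during the hold period: no new incident is created.
        if self.state is State.ALARMS_CLEARED:
            self.state = self.previous_state
            self.cleared_at = None

    def tick(self, now: float) -> None:
        # Close once the check has stayed OK for longer than the hold period.
        if (self.state is State.ALARMS_CLEARED
                and now - self.cleared_at > self.clear_hold_seconds):
            self.state = State.CLOSED

    @property
    def is_active(self) -> bool:
        # Any incident that is not CLOSED counts as active.
        return self.state is not State.CLOSED
```

In this sketch, an incident that was ACKNOWLEDGED before its alarm cleared goes back to ACKNOWLEDGED (not OPEN) if the alarm recurs during the hold period, which matches the behavior described for the ALARMS CLEARED state.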

All active incidents can be viewed on the "Active View" tab of the Incident Dashboard (see below).

If the failed monitoring check whose alarm opened an incident had action groups assigned to it when the incident was opened, those action groups are executed every time the incident changes state (including to CLOSED). The original actions from those action groups also become locked to that incident, and any changes to their methods only affect future incidents (see Action Group for more information).

Selecting an incident in the Incident Dashboard opens the Incident View Dashboard for that incident. The Incident View Dashboard lets you manually run any desired action groups on that incident. Note that any "active response" methods contained in executed actions are run only when the incident first opens or when the group is run manually (see Action Group for more information).
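
To make the action group behavior concrete, here is a minimal Python sketch under stated assumptions. The Action and Incident classes, the method labels, and the log format are all hypothetical. The sketch only models three points from the text: actions run on every state change, the actions present at open time are locked to the incident, and "active response" methods run only at open or on a manual run.

```python
import copy
from dataclasses import dataclass, field
from typing import List


@dataclass
class Action:
    # Hypothetical shape of an action; methods are plain labels here.
    name: str
    notify_methods: List[str] = field(default_factory=list)
    active_response_methods: List[str] = field(default_factory=list)


class Incident:
    def __init__(self, check_actions: List[Action]):
        # Deep copy "locks" the actions: later edits to the check's
        # actions affect only incidents opened afterwards.
        self.locked_actions = copy.deepcopy(check_actions)
        self.log: List[str] = []
        self._run(self.locked_actions, "OPEN", include_active_response=True)

    def change_state(self, new_state: str) -> None:
        # Runs on every state change (including CLOSED), but without
        # the active response methods.
        self._run(self.locked_actions, new_state, include_active_response=False)

    def run_manually(self, actions: List[Action]) -> None:
        # A manual run may use any action group and includes active response.
        self._run(actions, "MANUAL", include_active_response=True)

    def _run(self, actions, trigger, include_active_response):
        for action in actions:
            methods = list(action.notify_methods)
            if include_active_response:
                methods += action.active_response_methods
            for method in methods:
                self.log.append(f"{trigger}:{action.name}:{method}")
```

For example, if an "sms" method is added to the check's action after the incident opens, a later state change on that incident still uses only the methods that were locked in at open time.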

Incident Management

To keep incidents as efficient as possible, Netreo includes a set of incident management tools that correlate multiple alarms within a single incident, rather than allowing every individual alarm to open its own incident. Alarms bundled into an incident are organized into the "primary alarm" (i.e., the alarm that is the root cause of the incident) and "related alarms" (i.e., alarms that are a result of the primary alarm). Any alert notifications or actions in the action groups of related alarms are always suppressed. This way, alerts are only ever sent out for the primary alarm, reducing alert noise. (Monitored devices must be properly parented in Netreo for the incident management system to work optimally.)

Acknowledging an incident acknowledges the primary alarm as well as all of the related alarms contained within that incident.
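
The primary/related relationship can be sketched as follows. This is an illustrative Python model, not Netreo's correlation logic: it assumes, purely for the example, that an alarm is correlated only by looking up the device's parent, and the names Alarm, Incident, correlate, parent_of, and notify are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Alarm:
    device: str
    title: str
    acknowledged: bool = False


@dataclass
class Incident:
    primary: Alarm
    related: List[Alarm] = field(default_factory=list)

    def acknowledge(self) -> None:
        # Acknowledging the incident covers the primary and all related alarms.
        for alarm in [self.primary, *self.related]:
            alarm.acknowledged = True


def correlate(alarm: Alarm,
              parent_of: Dict[str, str],
              incidents_by_device: Dict[str, Incident],
              notify: Callable[[Alarm], None]) -> Incident:
    """Attach the alarm to its parent's incident, or open a new incident.

    Assumption: correlation follows device parenting only; real rules
    may be richer than this.
    """
    parent = parent_of.get(alarm.device)
    incident = incidents_by_device.get(parent)
    if incident is not None:
        incident.related.append(alarm)   # related alarm: alert is suppressed
        return incident
    incident = Incident(primary=alarm)
    incidents_by_device[alarm.device] = incident
    notify(alarm)                        # alerts go out for the primary only
    return incident
```

With a switch parented to a router, an alarm on the router opens the incident and sends one alert, and a subsequent alarm on the switch joins that incident as a related alarm without sending another.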

Netreo administrators may manually add rules to the incident management system to forcibly correlate otherwise unrelated alarms into the same incident.

The Suppress Alarms Incident

The Suppress Alarms incident is intended as a generic incident management container for alarms that you know will be generated occasionally, but for which you never want to receive alert notifications (for example, a particular network interface going up or down).

To use the Suppress Alarms incident, create an incident management rule and, for the action, select the "Suppress alert and add to the suppressed alarms incident" option. (You may create as many incident management rules using this option as required to cover the alarm types for which you do not wish to be alerted.)

The first time an alarm occurs that is processed by a rule using this option, a new "Suppress Alarms" incident is created in the active incidents list and immediately set to the ACKNOWLEDGED state, and the processed alarm is added to that incident as a related alarm (there is never a primary alarm for the Suppress Alarms incident). Any additional alarms that are then processed by any incident management rules using this option also get bundled into the Suppress Alarms incident as related alarms. This ensures that no alert notifications are ever sent for alarms processed by these rules.

If all alarms bundled into the Suppress Alarms incident recover, the incident moves into the ALARMS CLEARED and then CLOSED states, just like any other incident. If a new alarm occurs that is processed by one of the above-mentioned rules, the closed Suppress Alarms incident is re-opened and re-used (and, of course, set to the ACKNOWLEDGED state). This allows the Suppress Alarms incident to keep its incident ID permanently, making it easy to find at any time.
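
The lifecycle of the Suppress Alarms incident might be modeled as in the sketch below. All names here (SuppressAlarmsIncident, SuppressRouter, the ID sequence starting at 1000) are hypothetical and chosen only for illustration. The sketch shows the points described above: the incident has only related alarms, it is always ACKNOWLEDGED while alarms are present, and it is re-used with the same ID after closing.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import Optional, Set

_ids = count(1000)  # hypothetical incident ID sequence


@dataclass
class SuppressAlarmsIncident:
    incident_id: int = field(default_factory=lambda: next(_ids))
    state: str = "ACKNOWLEDGED"
    related: Set[str] = field(default_factory=set)  # never a primary alarm

    def add(self, alarm_key: str) -> None:
        self.related.add(alarm_key)
        self.state = "ACKNOWLEDGED"   # also applies when re-opened

    def recover(self, alarm_key: str) -> None:
        self.related.discard(alarm_key)
        if not self.related and self.state == "ACKNOWLEDGED":
            self.state = "ALARMS CLEARED"

    def close_after_hold(self) -> None:
        # Stands in for the ALARMS CLEARED hold period expiring.
        if self.state == "ALARMS CLEARED":
            self.state = "CLOSED"


class SuppressRouter:
    """Routes alarms matched by suppress rules into one re-usable incident."""

    def __init__(self) -> None:
        self.incident: Optional[SuppressAlarmsIncident] = None

    def route(self, alarm_key: str) -> SuppressAlarmsIncident:
        if self.incident is None:
            self.incident = SuppressAlarmsIncident()
        self.incident.add(alarm_key)  # no alert notification is sent
        return self.incident
```

Because the router keeps a single incident object and never creates a second one, the incident ID stays the same across close and re-open cycles.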

Once it has been created, you can find and view the Suppress Alarms incident in the Incident Dashboard (see below), just like any other incident.

Do not use this incident management rule option unless you are absolutely sure that you never want to receive alert notifications for the alarms processed by the rule.

