Incident
  • 27 Jun 2024


Description

When a monitoring check attached to a managed device experiences a state that generates an alarm, Netreo needs a way to centralize and monitor all of the information surrounding that event. This is what an incident is for.

An incident acts as a record of the events surrounding the cause of an alarm and collects in one place all of the data and history relevant to it. This information remains archived in the Netreo logs for three years before the system deletes it. Any incident from within that period can be looked up using its unique Incident ID number.

Incidents may be viewed on the Incident Dashboard.

Details

Incidents are named using a combination of the title of the alarm involved and the name of the managed device for which the alarm is occurring.

Incident States

Incidents exist in one of four states:

State - Description
OPEN - Indicates that the alarm that opened the incident is ongoing and has not been addressed. When an incident is first opened, it immediately attempts to run all of the actions in the action groups assigned to the monitoring check that generated the alarm. The device for which the incident was opened shows the current alarm state in every dashboard that displays it. The incident continues to run relevant actions and perform escalation in accordance with the settings in the failed check's configuration.
ACKNOWLEDGED - Indicates that although an alarm is ongoing, someone is aware of the problem and is currently working on it. Once an incident has been acknowledged, it is this state that is displayed in the dashboards instead of the current alarm state. ACKNOWLEDGED incidents never escalate.
ALARMS CLEARED - If the condition that caused an alarm clears, and the failed check returns to an OK state, the associated incident enters the ALARMS CLEARED state. It then remains in that state for a specified period of time (default is 5 minutes, but this is configurable) until Netreo is sure that the check that generated the original alarm is in a stable OK state - at which point the associated incident is automatically CLOSED. The ALARMS CLEARED period helps prevent additional new incidents from being created due to a flapping alarm condition. If the same alarm reoccurs while the incident is in the ALARMS CLEARED state, the incident simply returns to either the OPEN or ACKNOWLEDGED state - whichever state it was last in before changing to ALARMS CLEARED.
CLOSED - Indicates that the incident is closed and archived for historical and reporting purposes. The alarm associated with the incident has cleared, and the failed check that caused the alarm has remained in the OK state for longer than the ALARMS CLEARED time setting. No further events will be recorded within this incident.

Any incident that is not CLOSED is considered to be an "active incident."
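
The state transitions described above can be sketched as a small state machine. This is only an illustration of the documented behavior, not Netreo's implementation: the `Incident` class, its method names, and the `hold_seconds` parameter are hypothetical, and the 300-second default simply mirrors the 5-minute default mentioned above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class State(Enum):
    OPEN = "OPEN"
    ACKNOWLEDGED = "ACKNOWLEDGED"
    ALARMS_CLEARED = "ALARMS CLEARED"
    CLOSED = "CLOSED"


@dataclass
class Incident:
    """Hypothetical model of an incident's lifecycle (names are assumptions)."""
    state: State = State.OPEN
    # State held before entering ALARMS CLEARED; restored if the alarm reoccurs.
    previous: State = State.OPEN
    cleared_at: Optional[float] = None

    def acknowledge(self) -> None:
        if self.state is State.OPEN:
            self.state = State.ACKNOWLEDGED

    def alarm_cleared(self, now: float) -> None:
        # The failed check returned to OK: start the ALARMS CLEARED hold period.
        if self.state in (State.OPEN, State.ACKNOWLEDGED):
            self.previous = self.state
            self.state = State.ALARMS_CLEARED
            self.cleared_at = now

    def alarm_reoccurred(self) -> None:
        # Flapping alarm: return to whichever state preceded ALARMS CLEARED.
        if self.state is State.ALARMS_CLEARED:
            self.state = self.previous
            self.cleared_at = None

    def tick(self, now: float, hold_seconds: float = 300.0) -> None:
        # Close only after the check has stayed OK longer than the hold period.
        if (
            self.state is State.ALARMS_CLEARED
            and self.cleared_at is not None
            and now - self.cleared_at > hold_seconds
        ):
            self.state = State.CLOSED

    @property
    def is_active(self) -> bool:
        # Any incident that is not CLOSED counts as an active incident.
        return self.state is not State.CLOSED
```

In this sketch, an acknowledged incident whose alarm clears and then reoccurs within the hold period goes back to ACKNOWLEDGED rather than opening a new incident.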

All active incidents can be viewed on the "Active View" tab of the Incident Dashboard (see below).

If the failed monitoring check whose alarm opened an incident had action groups assigned to it when the incident was opened, those action groups are executed every time the incident changes state (including to CLOSED). The original actions from those action groups also become locked to that incident, and any changes to their methods only affect future incidents (see Action Group for more information).
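
One way to picture the "locking" behavior is as a snapshot taken when the incident opens. The sketch below is an assumption-laden illustration, not Netreo's internal design: `ActionGroup`, `Incident`, `open_incident`, and the method names are hypothetical, and "running" a group is reduced to appending to a log.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class ActionGroup:
    name: str
    methods: list  # e.g. ["email"]; a stand-in for real action methods


@dataclass
class Incident:
    locked_groups: list            # private copies taken when the incident opened
    state: str = "OPEN"
    log: list = field(default_factory=list)

    def run_groups(self) -> None:
        # Record which methods would run in the current state.
        for group in self.locked_groups:
            self.log.append((self.state, group.name, tuple(group.methods)))

    def change_state(self, new_state: str) -> None:
        # Action groups are executed on every state change, including CLOSED.
        self.state = new_state
        self.run_groups()


def open_incident(check_groups: list) -> Incident:
    # Deep-copy so later edits to the check's groups do not affect this incident.
    incident = Incident(locked_groups=copy.deepcopy(check_groups))
    incident.run_groups()  # actions are attempted immediately on open
    return incident
```

Editing a group's methods after `open_incident` has returned changes what a later incident runs, while the already-open incident keeps using its original copy.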

Selecting an incident in the Incident Dashboard opens the Incident View Dashboard for that incident, which provides a means to manually run any desired action groups on that incident. Note that any "active response" methods contained in executed actions are only run when the incident first opens or when the group is run manually (see Action Group for more information).
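
The rule for "active response" methods can be expressed as a simple filter. The following is a minimal sketch under stated assumptions: the `Trigger` values, the `Method` type, and `methods_to_run` are hypothetical names chosen for illustration and do not correspond to any documented Netreo API.

```python
from dataclasses import dataclass
from enum import Enum


class Trigger(Enum):
    INCIDENT_OPENED = "incident_opened"   # the incident first opens
    STATE_CHANGE = "state_change"         # any later automatic state change
    MANUAL_RUN = "manual_run"             # group run by hand from the Incident View


@dataclass(frozen=True)
class Method:
    name: str
    active_response: bool = False


def methods_to_run(methods, trigger):
    """Return the names of the methods that would run for this trigger."""
    # Active response methods run only on first open or on a manual run.
    allow_active = trigger in (Trigger.INCIDENT_OPENED, Trigger.MANUAL_RUN)
    return [m.name for m in methods if allow_active or not m.active_response]
```

With one notification method and one active response method in a group, an automatic state change would run only the notification, while opening the incident or running the group manually would run both.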

Incident Management

To keep incidents as efficient as possible, Netreo includes a set of incident management tools that correlate multiple alarms within a single incident rather than allowing every individual alarm to open its own incident. Alarms bundled into an incident are organized into the "primary alarm" (i.e., the alarm that is the root cause of the incident) and "related alarms" (i.e., alarms that are a result of the primary alarm). Any alert notifications or actions in the action groups of related alarms are always suppressed. This way, alerts are only ever sent out for the primary alarm, reducing alert noise. (Monitored devices must be properly parented in Netreo for the incident management system to work optimally.)

Acknowledging an incident acknowledges the primary alarm and all of the related alarms contained within that incident.
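
The primary/related relationship can be sketched as follows. This article does not describe Netreo's actual correlation logic, so the example assumes one plausible approach: an alarm is attached to the incident of its topmost alarming ancestor in the device parent hierarchy. All names (`Alarm`, `Incident`, `correlate`, `parent_of`) are hypothetical, and the sketch allows only one alarm per device for simplicity.

```python
from dataclasses import dataclass, field


@dataclass
class Alarm:
    device: str
    title: str
    acknowledged: bool = False


@dataclass
class Incident:
    primary: Alarm
    related: list = field(default_factory=list)

    def alarms_to_notify(self):
        # Alerts go out for the primary alarm only; related alarms are suppressed.
        return [self.primary]

    def acknowledge(self):
        # Acknowledging the incident acknowledges every alarm it contains.
        for alarm in [self.primary, *self.related]:
            alarm.acknowledged = True


def correlate(alarms, parent_of):
    """Group alarms by topmost alarming ancestor (an assumed heuristic)."""
    by_device = {a.device: a for a in alarms}
    incidents = {}
    for alarm in alarms:
        root = alarm.device
        seen = {root}                      # guards against parent cycles
        node = parent_of.get(root)
        while node is not None and node not in seen:
            if node in by_device:
                root = node                # an alarming ancestor is the likelier cause
            seen.add(node)
            node = parent_of.get(node)
        incident = incidents.setdefault(root, Incident(primary=by_device[root]))
        if alarm.device != root:
            incident.related.append(alarm)
    return list(incidents.values())
```

For example, if a router and two servers behind it all alarm at once, this sketch produces a single incident with the router's alarm as primary and the servers' alarms as related, so only one alert would be sent.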

Netreo administrators can manually add rules to the incident management system to forcibly correlate otherwise unrelated alarms into the same incident.

