Incident
- Updated on 20 Sep 2023
Description
When a monitoring check attached to a managed device experiences a state that generates an alarm, Netreo needs a way to centralize and monitor all of the information surrounding that event. This is what an incident is for.
An incident acts as a record of the events surrounding the cause of an alarm and collects, in one place, all of the associated data and history relevant to it. This information remains archived in the Netreo logs for three years before the system deletes it. Any incident from within that period can be looked up using its unique Incident ID number.
Incident Publishing
By default, Netreo publishes all processed incidents to a private Netreo-controlled cloud database for the purpose of collecting metrics used to improve Netreo's products and services.
If you do not wish your incidents to be published to the Netreo cloud, this feature may be deactivated on the Feature Toggle page.
Details
Incidents are named using a combination of the title of the alarm involved and the name of the managed device for which the alarm is occurring.
Incident States
Incidents exist in one of four states:
State | Description |
---|---|
OPEN | Indicates that the alarm that opened the incident is ongoing and has not been addressed. When an incident is first opened, it immediately attempts to run all of the actions in the action groups assigned to the monitoring check that generated the alarm. The device for which the incident has been opened shows the current alarm state in every dashboard in which it is displayed. The incident continues to run relevant actions and perform escalation in accordance with the settings in the failed check's configuration. |
ACKNOWLEDGED | Indicates that although an alarm is ongoing, someone is aware of the problem and is currently working on it. Once an incident has been acknowledged, this state is displayed in the dashboards instead of the current alarm state. ACKNOWLEDGED incidents never escalate. |
ALARMS CLEARED | If the condition that caused an alarm clears and the failed check returns to an OK state, the associated incident enters the ALARMS CLEARED state. It remains in that state for a specified period of time (5 minutes by default, but this is configurable) until Netreo is sure that the check that generated the original alarm is in a stable OK state, at which point the incident is automatically CLOSED. The ALARMS CLEARED period helps prevent new incidents from being created by a flapping alarm condition. If the same alarm reoccurs while the incident is in the ALARMS CLEARED state, the incident returns to the state it was in before changing to ALARMS CLEARED (either OPEN or ACKNOWLEDGED). |
CLOSED | Indicates that the incident is closed and archived for historical and reporting purposes. The alarm associated with the incident has cleared, and the failed check that caused the alarm has remained in the OK state for longer than the ALARMS CLEARED time setting. No further events will be recorded within this incident. |
Any incident that is not CLOSED is considered to be an "active incident."
All active incidents can be viewed on the "Active View" tab of the Incident Dashboard (see below).
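The state transitions described above can be sketched as a small state machine. This is only an illustrative model, not Netreo's implementation: the class name `Incident`, the method names, and the use of seconds for the ALARMS CLEARED period are all assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class State(Enum):
    OPEN = "OPEN"
    ACKNOWLEDGED = "ACKNOWLEDGED"
    ALARMS_CLEARED = "ALARMS CLEARED"
    CLOSED = "CLOSED"


@dataclass
class Incident:
    """Hypothetical model of the four incident states (not Netreo code)."""
    state: State = State.OPEN
    # State held before entering ALARMS CLEARED; restored if the alarm reoccurs.
    previous: State = State.OPEN
    cleared_at: Optional[float] = None

    @property
    def is_active(self) -> bool:
        # Any incident that is not CLOSED counts as an active incident.
        return self.state is not State.CLOSED

    def acknowledge(self) -> None:
        if self.state is State.OPEN:
            self.state = State.ACKNOWLEDGED

    def alarm_cleared(self, now: float) -> None:
        if self.state in (State.OPEN, State.ACKNOWLEDGED):
            self.previous = self.state
            self.state = State.ALARMS_CLEARED
            self.cleared_at = now

    def alarm_reoccurred(self) -> None:
        # A flapping alarm returns the incident to its last state
        # instead of opening a new incident.
        if self.state is State.ALARMS_CLEARED:
            self.state = self.previous
            self.cleared_at = None

    def tick(self, now: float, clear_period: float = 300.0) -> None:
        # Assumed default of 300 seconds, matching the 5-minute default.
        if (self.state is State.ALARMS_CLEARED
                and now - self.cleared_at >= clear_period):
            self.state = State.CLOSED
```

In this sketch, an acknowledged incident whose alarm clears and then reoccurs within the period goes back to ACKNOWLEDGED, and it only becomes CLOSED once `tick` is called after the period has elapsed with no reoccurrence.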
If the failed monitoring check whose alarm opened an incident had action groups assigned to it when the incident was opened, those action groups are executed every time the incident changes state (including to CLOSED). The original actions from those action groups also become locked to that incident, and any changes to their methods only affect future incidents (see Action Group for more information).
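One way to picture the "locking" behavior is as a snapshot taken when the incident opens. The sketch below is an assumption-laden illustration: `LockedIncident`, `check_action_groups`, and the method names are hypothetical, and Netreo's internal mechanism may differ.

```python
import copy

# Hypothetical check configuration: action group name -> list of method names.
check_action_groups = {"ops-oncall": ["email", "sms"]}


class LockedIncident:
    """Illustrative only: snapshots the check's action groups at open time."""

    def __init__(self, action_groups):
        # Deep copy, so later edits to the check's groups do not reach
        # an incident that is already open.
        self.locked_groups = copy.deepcopy(action_groups)
        self.log = []

    def change_state(self, new_state):
        # The locked action groups run on every state change,
        # including the change to CLOSED.
        for group, methods in self.locked_groups.items():
            for method in methods:
                self.log.append((new_state, group, method))


first = LockedIncident(check_action_groups)
check_action_groups["ops-oncall"].append("webhook")  # edit after opening
second = LockedIncident(check_action_groups)

first.change_state("ACKNOWLEDGED")
second.change_state("ACKNOWLEDGED")
```

Here `first` still runs only the two methods that existed when it opened, while `second`, opened after the edit, runs all three.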
Selecting an incident in the Incident Dashboard opens the Incident View Dashboard for that incident. The Incident View Dashboard lets you manually run any desired action groups on that incident. Note that any "active response" methods contained in the executed actions run only when the incident first opens or when the group is run manually (see Action Group for more information).
Incident Management
To keep incidents as efficient as possible, Netreo includes a useful set of incident management tools allowing multiple alarms to be correlated within a single incident, rather than allowing every individual alarm to open its own incident. Alarms bundled into an incident are organized into the "primary alarm" (i.e., the alarm that is the root cause of the incident) and "related alarms" (i.e., alarms that are a result of the primary alarm). Related alarms always have any alert notifications or actions included in their related action groups suppressed. This way, alerts are only ever sent out for the primary alarm, reducing alert noise. (Monitored devices must be properly parented in Netreo for the incident management system to work optimally.)
Acknowledging an incident acknowledges the primary alarm as well as all of the related alarms contained within that incident.
Netreo administrators may manually add rules to the incident management system to forcibly correlate otherwise unrelated alarms into the same incident.
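As a rough illustration of why device parenting matters for correlation, the sketch below bundles an alarm into an existing incident when one of the device's upstream parents already has one. The parent map, the function names, and the "nearest alarming ancestor" rule are assumptions made for this example; the actual correlation logic in Netreo is not described here and is likely more sophisticated.

```python
from dataclasses import dataclass, field

# Hypothetical parent map: device -> upstream parent (None for the root).
PARENTS = {
    "core-router": None,
    "switch-1": "core-router",
    "server-a": "switch-1",
}


@dataclass
class CorrelatedIncident:
    primary: str                                  # root-cause alarm
    related: list = field(default_factory=list)   # alarms caused by the primary


def ancestors(device):
    """Yield the device's parents, nearest first."""
    parent = PARENTS.get(device)
    while parent is not None:
        yield parent
        parent = PARENTS.get(parent)


def correlate(alarming_devices):
    """Process alarms in arrival order; return (incidents, notified devices)."""
    incidents = {}
    notified = []
    for device in alarming_devices:
        owner = next((a for a in ancestors(device) if a in incidents), None)
        if owner is not None:
            # Related alarm: bundled into the existing incident, no alert sent.
            incidents[owner].related.append(device)
        else:
            # Primary alarm: opens a new incident and triggers notification.
            incidents[device] = CorrelatedIncident(primary=device)
            notified.append(device)
    return incidents, notified


incidents, notified = correlate(["core-router", "switch-1", "server-a"])
```

With the three devices above alarming in that order, only `core-router` generates a notification; the other two alarms become related alarms of the same incident. This simplified version depends on arrival order, which a real system would need to handle.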
The Suppress Alarms Incident
The Suppress Alarms incident is intended as a generic incident management container for alarms that you know will be generated occasionally, but for which you never want to receive alert notifications (for example, a particular network interface going up or down).
To use the Suppress Alarms incident, create an incident management rule and select the "Suppress alert and add to the suppressed alarms incident" option for the action. (You may create as many incident management rules using this option as required to cover the alarm types for which you do not wish to be alerted.)
The first time an alarm occurs that is processed by a rule using this option, a new "Suppress Alarms" incident is created in the active incidents list and immediately set to the ACKNOWLEDGED state, and the processed alarm is added to that incident as a related alarm (there is never a primary alarm for the Suppress Alarms incident). Any additional alarms that are then processed by any incident management rules using this option also get bundled into the Suppress Alarms incident as related alarms. This ensures that no alert notifications are ever sent for alarms processed by these rules.
If all alarms bundled into the Suppress Alarms incident recover, the incident moves into the ALARMS CLEARED and then the CLOSED state, just like any other incident. If a new alarm then occurs that is processed by one of the above-mentioned rules, the closed Suppress Alarms incident is re-opened and re-used (and, as before, set to the ACKNOWLEDGED state). This allows the Suppress Alarms incident to keep its incident ID permanently, making it easy to find at any time.
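The lifecycle of the Suppress Alarms incident can be sketched as a single reusable container. The class and method names below are hypothetical and the incident ID is an arbitrary example value; the sketch only mirrors the behavior described above.

```python
class SuppressAlarmsIncident:
    """Illustrative model of the reusable Suppress Alarms container."""

    def __init__(self, incident_id):
        self.incident_id = incident_id  # kept for the life of the container
        self.state = "CLOSED"
        self.related = set()            # there is never a primary alarm

    def add_alarm(self, alarm):
        # Every suppressed alarm is added as a related alarm, and the
        # incident is (re)opened directly into ACKNOWLEDGED.
        self.related.add(alarm)
        self.state = "ACKNOWLEDGED"

    def alarm_recovered(self, alarm):
        self.related.discard(alarm)
        if not self.related and self.state == "ACKNOWLEDGED":
            self.state = "ALARMS CLEARED"

    def clear_period_elapsed(self):
        if self.state == "ALARMS CLEARED":
            self.state = "CLOSED"


suppressed = SuppressAlarmsIncident(incident_id=1001)  # example ID only
suppressed.add_alarm("eth0 down")
suppressed.alarm_recovered("eth0 down")
suppressed.clear_period_elapsed()    # incident is now CLOSED
suppressed.add_alarm("eth1 down")    # same incident re-opened and re-used
```

After the last call, the incident is ACKNOWLEDGED again and still carries the same ID it was created with.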
Once it has been created, you can find and view the Suppress Alarms incident in the Incident Dashboard (see below), just like any other incident.
Use this incident management rule option only if you are absolutely sure that you never want to receive alert notifications for the alarms that the rule processes.
The Incident Dashboard
The Incident Dashboard may be viewed by users logged in at any access level.
To navigate to the Incident Dashboard, go to the main menu and select Quick Views > Active Incidents.
The tabs at the top of the page allow you to view the active incidents in different ways, as well as search for specific incidents.
ActiveView Tab
The "ActiveView" tab of the Incident Dashboard displays a list of all currently active incidents. At the top right of the page is a small message indicating when the list was last refreshed.
By default, the incident list is sorted by timestamp, but it can also be sorted by incident title, current state, or incident ID. (Note: Once the list has been sorted by another column, it cannot be sorted by timestamp again directly. However, navigating to another tab and then back again restores the timestamp sort.)
At the top left are easy-to-read circular status indicators showing the number of TOTAL, OPEN, ACKNOWLEDGED, and ALARMS CLEARED incidents. If an indicator is dimmed, it means that no incidents of that type currently exist.
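The indicator values amount to a per-state count over the active incidents. A minimal sketch, assuming incident states are available as plain strings (the function names are hypothetical):

```python
from collections import Counter

ACTIVE_STATES = ("OPEN", "ACKNOWLEDGED", "ALARMS CLEARED")


def indicator_counts(states):
    """Count active incidents per state; CLOSED incidents are ignored."""
    counts = Counter(s for s in states if s in ACTIVE_STATES)
    result = {"TOTAL": sum(counts.values())}
    for state in ACTIVE_STATES:
        result[state] = counts[state]  # Counter returns 0 for missing keys
    return result


def dimmed(counts):
    """Indicators with a zero count would be shown dimmed."""
    return [name for name, n in counts.items() if n == 0]


counts = indicator_counts(["OPEN", "OPEN", "ACKNOWLEDGED", "CLOSED"])
```

For this sample input, TOTAL is 3, and the only dimmed indicator is ALARMS CLEARED.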
This tab also provides tools to acknowledge and de-acknowledge incidents, both individually and en masse. Tick the checkbox next to the incident(s) that you want to acknowledge/de-acknowledge and then select the appropriate option. (Please note that the All button at the top right of the table does not select "all" incidents, but rather switches from a paginated list showing 25 incidents per page to a list that displays all incidents on a single page.)
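To make the distinction concrete: the All button changes which rows are displayed, not which rows are selected. The following sketch assumes a simple list of incidents; `visible_rows` and its parameters are hypothetical names used only for illustration.

```python
PAGE_SIZE = 25  # incidents per page in the paginated view


def visible_rows(incidents, page=1, show_all=False):
    """Return the rows displayed; selection is tracked separately."""
    if show_all:
        # "All" shows every incident on a single page.
        return list(incidents)
    start = (page - 1) * PAGE_SIZE
    return incidents[start:start + PAGE_SIZE]


incidents = list(range(60))   # stand-in for 60 active incidents
selected = set()              # toggling "All" leaves this untouched

first_page = visible_rows(incidents)
last_page = visible_rows(incidents, page=3)
everything = visible_rows(incidents, show_all=True)
```

With 60 incidents, the first page shows 25 rows, the third page shows the remaining 10, and the All view shows all 60, while `selected` stays empty until checkboxes are ticked.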
Selecting the magnifying glass icon for an incident in the ACTIONS column opens the Incident View Dashboard for that incident where you can view it in detail.