- 08 Sep 2023
- 11 Minutes to read
- Updated on 08 Sep 2023
Use a threshold check to monitor a single statistic of a managed device or application (CPU utilization, hardware temperature, SQL deadlocks, and so forth) and measure it against a set of high or low threshold values, providing insight into imminent failures and detection of actual failures.
Threshold checks provide two modes of monitoring:
- Static threshold evaluation
- Dynamic threshold evaluation (anomaly detection)
Either may be used independently of the other, or both may be used simultaneously.
The statuses of threshold checks are displayed in a variety of locations in Netreo, but by far the most common is in a Tactical Overview dashboard widget.
Threshold checks may be configured to monitor for both high and low values, high values only, or low values only. A threshold check detecting unacceptable values for its monitored statistic displays as failed in Netreo dashboards and generates an alarm. Additionally, a threshold check can be configured to run commands on the device or application it is monitoring when the check fails (such as reboot or restart-service commands). This provides the possibility of resolving some issues automatically, without the need to involve personnel.
By default, Netreo automatically adds several preconfigured threshold checks to each managed device to provide basic monitoring services. However, Netreo collects many other statistics to which you can add a threshold check to suit your specific monitoring needs.
The threshold check measures a statistic against a set of static threshold values and determines whether the value of the collected statistic has exceeded any of those thresholds. It is concerned with the current statistic. This functionality provides basic performance monitoring of a statistic. For more advanced monitoring, see Dynamic Thresholds (Anomaly Detection).
Dynamic Thresholds (Anomaly Detection)
Anomaly detection is a more advanced function of a threshold check. It uses adaptive, dynamic threshold values to determine if a statistic is experiencing values that are inconsistent with what is normally expected from that statistic at this time, given its history.
As an example, say CPU utilization for server X has been 85% at this time of the day, for this day of the week, for the past eight weeks, but is currently showing 55%. Given its history, that's not the expected utilization value for this time. This could indicate that an important process on that server has stopped running, or that clients are unable to connect and utilize the server resources.
Since neither of the reported values is particularly high or low, static threshold monitoring would not indicate anything unusual happening. But the dynamic threshold values of anomaly detection look at the collected values in the context of their history and develop an understanding of what is typical for that statistic at any given time.
When used together, static thresholds and anomaly detection are extremely powerful. However, you are free to use them independently of each other in any given threshold check. Either may be configured alone without requiring the other to be configured.
Threshold Check States
Threshold checks display one of the following states when viewed in dashboard widgets.
|State|Description|
|---|---|
|OK|(Green) The value of the statistic is within the user-determined acceptable operating range.|
|WARNING|(Yellow) The value of the statistic is higher than the configured high WARNING value or lower than the configured low WARNING value, but has not yet exceeded the configured CRITICAL value for either.|
|CRITICAL|(Red) The value of the statistic is higher than the configured high CRITICAL value or lower than the configured low CRITICAL value.|
|ACKNOWLEDGED|(Blue) Indicates a threshold check in a CRITICAL state that has been acknowledged by a user. This state is technically an incident state, not a check state, but its display in the dashboards helps to distinguish between problems that are new and problems that are already being addressed.|
The Tactical Overview dashboard widget is useful for displaying threshold check status for devices and groups of devices.
In Netreo, many collected statistics are configured and monitored as pairs (for example, bandwidth utilization pairs inbound traffic with outbound traffic). Threshold checks are also configured in these same pairs, with the opposing statistics identified as VARIABLE ONE and VARIABLE TWO in the check (the actual statistic name is also identified alongside the variable label). The static threshold and anomaly detection settings for each variable are independent, but all other configuration settings for the check are shared between the two.
When configuring static threshold values, the units for the high and low fields will automatically be appropriate for the type of statistic selected (e.g. CPU utilization would show percent, while latency would show seconds). A pull-down selector next to the value allows you to specify a multiplier prefix for the entered value. This allows you to configure the check using values with which you are comfortable and have Netreo do the math for you.
Netreo always collects and stores the retrieved values for each managed device and application statistic to which it has access, maintaining a historical record of the performance of that statistic. This record provides the data set that a threshold check uses to evaluate the current state of the statistic. This data set is used differently by static thresholds and anomaly detection, as explained below.
(The performance history of each statistic is also graphed on the Performance tab of the Device Dashboard for any given device. So, even without a threshold check, a statistic could still be monitored manually.)
Threshold checks generate an alarm immediately upon reaching a CRITICAL state. To generate an alarm on a WARNING state, select Administration > System > Feature Toggle and set the feature toggle for Incidents on warning thresholds exceeded to ON. By default, this feature is configured to OFF. For more information about toggling the Incidents on warning thresholds exceeded feature to ON or OFF, see Feature Toggle.
A threshold check continues to retrieve and evaluate statistic values according to its configuration even after an alarm has been generated. If the value returns to normal at any time, the check immediately recovers to the OK state, clears its alarm, and signals any opened incident that it has recovered.
When the latest value for a statistic is collected, the threshold check calculates an average value made up of the most recently collected values. (The number of values averaged is determined when configuring the check.) This averaged value is then compared against the configured static high and low threshold values to see if any have been exceeded.
The reason for averaging the most recent values instead of directly using the last raw value collected is to avoid generating an unnecessary alarm due to a momentary spike in the value. Averaging smooths the values into a more reliable indicator of that statistic's current condition. (If a collected value is a NaN, that value is ignored by the check, resulting in one fewer value being averaged until the NaN ages out of the data set.)
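The averaging step described above can be sketched in a few lines of Python. This is an illustration only, not Netreo's implementation: the function name `averaged_value`, the `window` parameter, and the choice to return `None` when every value in the window is NaN are all assumptions made for the example.

```python
import math

def averaged_value(recent_values, window):
    """Average the last `window` polled values, skipping NaN entries.

    `recent_values` is ordered oldest to newest. Returning None when no
    usable value remains is an assumption for this sketch.
    """
    tail = recent_values[-window:]
    usable = [v for v in tail if not math.isnan(v)]
    if not usable:
        return None
    return sum(usable) / len(usable)

# A single spike (90) is softened by the two normal readings around it.
smoothed = averaged_value([10.0, 20.0, 90.0], window=3)

# A NaN is ignored, so only two values are averaged here.
with_gap = averaged_value([10.0, float("nan"), 30.0], window=3)
```

With a window of three, the spike to 90 yields an average of 40, which may stay under a threshold that the raw value would have crossed.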
Static thresholds allow independent values to be set to trigger WARNING and CRITICAL states for both high and low value conditions.
If the averaged value exceeds any high or low configured warning thresholds, the check enters the WARNING state. This state displays in the dashboards, but no other action is taken by Netreo.
If the averaged value exceeds any high or low configured critical thresholds, the check enters the CRITICAL state. This state displays in the dashboards and generates an alarm.
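As a rough sketch of the evaluation order described above, the following Python function maps an averaged value to a check state. The function and parameter names are hypothetical, and treating "exceeds" as a strict comparison (a value equal to a threshold stays in the lower state) is an assumption; the product may behave differently at the boundary.

```python
def static_state(value, high_warning=None, high_critical=None,
                 low_warning=None, low_critical=None):
    """Return "OK", "WARNING", or "CRITICAL" for an averaged value.

    A threshold left as None is not monitored, which allows checks for
    high values only or low values only.
    """
    # Critical thresholds are tested first so they take precedence.
    if high_critical is not None and value > high_critical:
        return "CRITICAL"
    if low_critical is not None and value < low_critical:
        return "CRITICAL"
    if high_warning is not None and value > high_warning:
        return "WARNING"
    if low_warning is not None and value < low_warning:
        return "WARNING"
    return "OK"

# High-only check: warn above 80, go critical above 90.
state = static_state(85.0, high_warning=80.0, high_critical=90.0)
```

Leaving the low thresholds unset, as in the example call, corresponds to a check configured for high values only.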
An anomaly check compares the most recently polled value of a statistic against a set of dynamic threshold values computed from a data set that samples eight previously polled values.
The check may be configured to look for upper boundary and/or lower boundary anomalies (similar to high and low static threshold values). These upper and lower boundaries represent dynamic threshold values that are computed as deviations from the mean of the eight sampled values. As each execution of the check drops older samples from the data set and adds newer samples, these boundaries are continuously recomputed to establish what should be considered "normal" for the statistic at the time of polling.
The amount that the upper and lower boundary thresholds deviate from the mean is controlled by the check's anomaly sensitivity. A lower sensitivity places the boundary values further from the computed mean of the samples, so a polled value must be more abnormal to be considered an anomaly. A higher sensitivity places the boundary values closer to the mean, so a less abnormal polled value is enough to be considered an anomaly.
The appropriate sensitivity for anomaly detection is highly subjective, depending on circumstances. Trial and error will be required for each situation to achieve optimal performance. (Hint: A higher sensitivity can detect smaller deviations, while a lower sensitivity can only detect larger deviations.)
Anomaly detection includes independent sensitivity settings to trigger both WARNING and CRITICAL states.
If the current value of the statistic exceeds the computed warning upper or lower boundary values, the current value is considered a potential anomaly and the check enters the WARNING state. This state displays in the dashboards, but no other action is taken by Netreo.
If the current value of the statistic exceeds the computed critical upper or lower boundary values, the current value is considered an anomaly and the check enters the CRITICAL state. This state displays in the dashboards and generates an alarm.
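The exact formula Netreo uses for the dynamic boundaries is not given here, so the following Python sketch makes a loud assumption: each boundary sits a chosen number of standard deviations away from the mean of the eight samples. The `warning_width` and `critical_width` parameters are hypothetical stand-ins for the sensitivity settings (a higher sensitivity would correspond to a smaller width), and the use of the population standard deviation is likewise assumed.

```python
import statistics

def anomaly_state(current, samples, warning_width, critical_width,
                  check_upper=True, check_lower=True):
    """Classify `current` against boundaries around the sample mean.

    ASSUMPTION: boundaries are mean +/- width * standard deviation.
    This is an illustrative model, not the product's documented formula.
    """
    mean = statistics.mean(samples)
    spread = statistics.pstdev(samples)

    def outside(width):
        upper = mean + width * spread
        lower = mean - width * spread
        return (check_upper and current > upper) or \
               (check_lower and current < lower)

    if outside(critical_width):
        return "CRITICAL"
    if outside(warning_width):
        return "WARNING"
    return "OK"

# Eight seasonal samples hovering around 85% CPU utilization.
history = [85, 84, 86, 85, 83, 87, 85, 85]
state = anomaly_state(55, history, warning_width=2, critical_width=3)
```

Under this model, a reading of 55% against a history centered on 85% falls well below the lower critical boundary, even though 55% would look unremarkable to a static threshold.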
Anomaly Detection Samples
The eight previous data values sampled are not simply the eight most recent values polled. The range of time between the samples is adjustable, but each sample is always from at least one hour earlier than the next. These eight samples are always taken from the same relative timestamp as the current polled value, and are called a season.
So, an anomaly check with a season setting of Hour that polls a statistic at 8:05 p.m. samples values from 7:05 p.m., 6:05 p.m., 5:05 p.m., 4:05 p.m., 3:05 p.m., 2:05 p.m., 1:05 p.m., and 12:05 p.m. These are all exactly one hour apart.
Five minutes later, when the statistic is polled again at 8:10 p.m., values from 7:10 p.m., 6:10 p.m., 5:10 p.m., 4:10 p.m., 3:10 p.m., 2:10 p.m., 1:10 p.m., and 12:10 p.m. are used. Again, all exactly one hour apart.
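The sample selection in the two examples above amounts to stepping back from the poll time by one season interval, eight times. A minimal Python sketch follows; the names `SEASON_STEP` and `season_sample_times` are hypothetical, and the date used is arbitrary.

```python
from datetime import datetime, timedelta

# Interval between samples for each season choice.
SEASON_STEP = {
    "Hour": timedelta(hours=1),
    "Day": timedelta(days=1),
    "Week": timedelta(weeks=1),
}

def season_sample_times(poll_time, season, count=8):
    """Return the timestamps of the previous `count` seasonal samples,
    newest first, relative to `poll_time`."""
    step = SEASON_STEP[season]
    return [poll_time - step * i for i in range(1, count + 1)]

# A poll at 8:05 p.m. with the Hour season samples 7:05 p.m. back to 12:05 p.m.
poll = datetime(2023, 9, 8, 20, 5)
hourly = season_sample_times(poll, "Hour")
weekly = season_sample_times(poll, "Week")
```

With the Week season, the same call reaches back eight weeks, which is why a single unusual sample can influence the check for a long time.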
Selecting a different season simply changes the amount of time between the sampled values. Your choices are Hour, Day, or Week. (Be aware that an anomaly alarm remains active until either the offending data sample has cleared from the sample values or the dynamic threshold values have been recomputed such that the offending data sample no longer falls outside the check's sensitivity range. This could take quite a long time if the Week season is selected, so remember to acknowledge any resulting incidents.)
Previous Anomaly Samples
If one of the previous eight data values being sampled was itself an anomaly, that sample is excluded from the detection calculations. However, if more than one of the previous data values was an anomaly, those values are used in the detection calculations. This is how the threshold check dynamically adapts to gradual changes in behavior that are, in fact, perfectly normal.
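The exclusion rule above can be expressed as a small filter. In this Python sketch, each sample is a hypothetical `(value, was_anomaly)` pair; how Netreo actually records whether a past sample was anomalous is not described here, so that representation is an assumption.

```python
def samples_for_detection(samples):
    """Apply the previous-anomaly rule to a list of (value, was_anomaly) pairs.

    Exactly one anomalous sample: drop it as an outlier.
    More than one: keep them all, so a sustained shift becomes the new normal.
    """
    anomalous_count = sum(1 for _, flagged in samples if flagged)
    if anomalous_count == 1:
        return [value for value, flagged in samples if not flagged]
    return [value for value, _ in samples]

# One past anomaly is excluded from the calculation.
one_outlier = samples_for_detection(
    [(85, False), (84, False), (40, True), (86, False)]
)

# Two past anomalies are both kept.
shifted = samples_for_detection(
    [(85, False), (60, True), (58, True), (86, False)]
)
```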
Minimum Sample Values
Each sampled data point must also meet a minimum value to be checked for anomalous behavior (configured in the check using the Min Value field). Values below this setting are never used by the anomaly engine, even as previous data samples (they are simply dropped from the data set). This prevents a sequence of extremely low values from producing false positives from only minor deviations. (Changing from an average of 1 to 0.001 is a thousandfold change in relative terms, but still not likely to be any kind of problem.)
However, if a polled value is below the minimum setting after an anomaly is detected and an alarm generated, the current alarm is automatically cleared and the opened incident notified. This is for housekeeping purposes to prevent a detected upper boundary anomaly from causing the check to be stuck in an alarm state due to never processing the current data point if the polled values remain below the minimum.
Unfortunately, this same logic also causes a lower boundary anomaly alarm to clear, negating the value of setting a lower boundary check. It is therefore recommended to exercise caution when using a minimum value with a lower boundary check, as polled values below this setting will both prevent a lower boundary anomaly from triggering an alarm and cause any existing alarm to be cleared.
(The minimum value setting is for anomaly detection only, and does not interfere with static threshold check operations.)
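The interaction between the minimum value and an active anomaly alarm can be summarized in a short Python sketch. The function name, its parameters, and the returned dictionary are hypothetical and exist only to make the described behavior concrete; treating a value exactly equal to the minimum as usable is an assumption.

```python
def apply_min_value(current, samples, min_value, alarm_active):
    """Model the Min Value rules for one poll of an anomaly check.

    Returns whether the current value should be evaluated, the resulting
    alarm flag, and the samples that remain usable.
    """
    # Samples below the minimum are dropped from the data set.
    usable = [s for s in samples if s >= min_value]

    if current < min_value:
        # The current value is never evaluated, and any existing
        # anomaly alarm is cleared (upper or lower boundary alike).
        return {"evaluate": False, "alarm_active": False, "samples": usable}

    return {"evaluate": True, "alarm_active": alarm_active, "samples": usable}

# An active alarm is cleared once the polled value drops below the minimum.
result = apply_min_value(0.5, [0.2, 3.0, 4.0], min_value=1.0, alarm_active=True)
```

The cleared alarm in this example is the behavior that undermines a lower boundary check: the very low value that might have been the anomaly is the one that is never evaluated.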
It is highly recommended that threshold checks be added to devices and managed through device templates, and not directly on devices. Even in unique device-specific circumstances, threshold checks for that device can still be managed using a device template that includes the desired threshold checks and is assigned directly to the device.
The only circumstance under which a threshold check should ever be added to a device directly is when that device has had its device template functionality turned off completely.
Static Threshold Check Time Periods
Since Netreo polls and records a statistic's value every five minutes, selecting a TIME PERIOD of 5 Min when configuring a static threshold check means that it would only take one poll that exceeded the warning or critical threshold values to trigger a change in state. However, selecting a period of 15 Min would require three consecutive polls (with an average value exceeding the warning or critical threshold values) to trigger a state change. This field is an important adjustment for reducing false alarms.
Divide the TIME PERIOD value configured in the check by 5 to figure out the number of recent samples that will be averaged before being compared to the threshold values.
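The division above is simple enough to do by hand, but a short Python sketch makes the relationship explicit. The five-minute polling interval comes from the text; the function name and the rejection of values that are not multiples of five are assumptions for the example.

```python
POLL_INTERVAL_MINUTES = 5  # polling interval stated in the text

def samples_in_time_period(time_period_minutes):
    """Return how many recent samples are averaged for a TIME PERIOD."""
    if time_period_minutes <= 0 or time_period_minutes % POLL_INTERVAL_MINUTES != 0:
        raise ValueError("TIME PERIOD must be a positive multiple of 5 minutes")
    return time_period_minutes // POLL_INTERVAL_MINUTES

# A TIME PERIOD of 15 Min averages the three most recent polls.
count = samples_in_time_period(15)
```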