Threshold Check
  • 25 Oct 2024

Description

Netreo always collects all available statistics from all managed devices and applications (CPU utilization, hardware temperature, SQL deadlocks, and so forth).

Use a threshold check to actively monitor a single statistic and measure it against a set of high or low threshold values. This provides insight into imminent failures and the detection of actual failures.

A threshold check provides two modes of monitoring for a given statistic:

  • Static threshold evaluation
  • Dynamic threshold evaluation (anomaly detection)

Each mode can be used independently of the other, or both can be used simultaneously.

The states of threshold checks are displayed in various locations in Netreo, but the most common is a Tactical Overview dashboard widget.

Threshold checks can be configured to monitor for both high and low values, only high values, or only low values. A threshold check detecting unacceptable values for its monitored statistic displays as failed in Netreo dashboards and generates an alarm. Additionally, a threshold check can be configured to run commands on the device or application it is monitoring when the check fails (such as reboot or restart-service commands). This provides the possibility of resolving some issues automatically without the need to involve personnel.

By default, Netreo automatically adds several preconfigured threshold checks to each managed device to provide basic monitoring and alerting services. However, Netreo collects many other statistics to which you can add a threshold check to suit your specific monitoring needs.

Static Thresholds

The threshold check measures a statistic's current value against a set of static threshold values and determines if the value of the collected statistic has exceeded any of those thresholds. This functionality provides basic performance monitoring of a statistic. For more advanced monitoring, see Dynamic Thresholds (Anomaly Detection).

Dynamic Thresholds (Anomaly Detection)

Anomaly detection is a more advanced function of a threshold check. It uses adaptive, dynamic threshold values to determine if a statistic is experiencing values that are inconsistent with what is normally expected from that statistic at this time, given its history.

As an example, say CPU utilization for server X has been 85% at this time of day, for this day of the week, for the past eight weeks, but is now showing 55%. Given its history, that's not the expected utilization value for this time. This could indicate that an important process on that server has stopped running or that clients are unable to connect and utilize the server resources.

Since neither of the reported values is particularly high or low, static threshold monitoring would not indicate anything unusual happening. However, the dynamic threshold values of anomaly detection look at the collected values in the context of their history and develop an understanding of what is typical for that statistic at any given time.

When used together, static thresholds and anomaly detection are extremely powerful. However, you are free to use them independently of each other in any given threshold check. Each can be configured alone without requiring the other to be configured.

Details

Data Collection

Netreo always collects and stores the retrieved values for each statistic of a managed device and application to which it has access, maintaining a historical record of the performance of that statistic. A threshold check uses a specific subset of that data to evaluate the current state of the statistic. Note, however, that the specific subset of data used is different for static thresholds and anomaly detection, as explained in each relevant section below.

(Note: The performance history of each collected statistic is graphed on the Performance tab of the Device Dashboard for any given managed device and does not require a threshold check for it to be viewable.)

Statistic Pairs

In Netreo, many collected statistics are configured and monitored as pairs (for example, bandwidth utilization pairs inbound traffic with outbound traffic). Threshold checks are also configured in these same pairs, with the opposing statistics identified as VARIABLE ONE and VARIABLE TWO in the check (the actual statistic name is also identified alongside the variable label). The static threshold and anomaly detection settings for each variable are independent, but all other configuration settings for the check are shared between the two.

When configuring static threshold values, the units for the high and low fields will automatically be appropriate for the type of statistic selected (for example, CPU utilization would show percent, while latency would show seconds). A pull-down selector next to the value allows you to specify a multiplier prefix for the entered value. This allows you to configure the check using values with which you are comfortable and have Netreo do the math for you.

Errors Per Second
Netreo always measures errors per second values as milli-errors per second, which allows for err/sec measurements of less than one. See Understanding Errors per Second for more information on calculating err/sec values that can be used in threshold checks. When entering err/sec values, remember to select the milli (m) multiplier prefix.
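
As a rough illustration of the multiplier prefix arithmetic, the sketch below converts a value in base units into the number you would enter once a prefix is selected. The function name and the prefix table are hypothetical and are not part of Netreo; the set of prefixes actually offered by the pull-down selector may differ.

```python
# Hypothetical prefix table (assumption): power-of-ten exponent per prefix.
PREFIX_EXPONENTS = {"m": -3, "": 0, "k": 3, "M": 6}

def to_prefixed_value(base_value, prefix):
    """Return the number to type into the field when `prefix` is selected."""
    return base_value * 10 ** (-PREFIX_EXPONENTS[prefix])

# 0.25 err/sec corresponds to 250 milli-errors per second,
# so 250 is entered with the milli (m) prefix selected.
print(to_prefixed_value(0.25, "m"))
```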

Threshold Check States

Threshold checks display one of the following states when viewed in dashboards.

State | Description
OK (Green) | The value of the statistic is within the user-determined acceptable operating range.
WARNING (Yellow) | The value of the statistic is higher than the configured high WARNING value or lower than the configured low WARNING value but has not yet exceeded the configured CRITICAL value for either.
CRITICAL (Red) | The value of the statistic is higher than the configured high CRITICAL value or lower than the configured low CRITICAL value. Generates an alarm.
ACKNOWLEDGED (Blue) | Indicates a threshold check in a CRITICAL state that a user has acknowledged. This state is technically an incident state, not a check state, but its display in the dashboards helps to distinguish between problems that are new and problems that are already being addressed.

Threshold checks always generate an alarm immediately upon reaching the CRITICAL state but normally do not generate an alarm when reaching the WARNING state. To generate an alarm for threshold checks reaching the WARNING state, the "Incidents on warning thresholds exceeded" feature must be activated on Netreo's Feature Toggle administration page (this feature is turned off by default).

A threshold check continues to retrieve and evaluate statistic values according to its configuration even after an alarm has been generated. If the value returns to normal at any time, the check immediately recovers to the OK state, clears its alarm, and signals any opened incident that it has recovered.

The Tactical Overview dashboard widget is useful for displaying threshold check status in dashboards.

Static Thresholds

To activate static threshold monitoring for a threshold check during configuration, enter values into the HIGH and/or LOW fields for warning (yellow) and/or critical (red) states for either or both variables (if applicable). Leaving a field empty prevents Netreo from monitoring that aspect of the statistic. This is useful if you want to monitor for only high or low values for the statistic. Leaving all fields empty will leave static threshold monitoring off. This is useful to configure only anomaly detection for the statistic.

When the latest value for a statistic is collected, the threshold check calculates an average value made up of the most recently collected values. (The number of values averaged is determined when configuring the check.) This averaged value is then compared against the configured static high and low threshold values to see if any have been exceeded.

The most recent values are averaged, instead of directly using the last raw value collected, to avoid generating an unnecessary alarm due to a momentary spike in the value. Averaging smooths the values into a more reliable indicator of that statistic's current condition. (If a collected value is a NaN, that value is ignored by the check, resulting in one fewer value being averaged until the NaN ages out of the data set.)

Static thresholds allow independent values to be set to trigger WARNING (yellow) and CRITICAL (red) states for both high- and low-value conditions.

If the averaged value exceeds any high or low configured warning thresholds, the check enters the WARNING state. This state is displayed in dashboards, but Netreo does not take any other action.

If the averaged value exceeds any high or low configured critical thresholds, the check enters the CRITICAL state. This state is displayed in dashboards and generates an alarm.
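
The static evaluation described above can be sketched roughly as follows. This is an illustrative model only, not Netreo's implementation: the function name, the use of None for an empty field, and the choice to return OK when every value is NaN are all assumptions.

```python
import math

def evaluate_static(recent_values, high_warn=None, high_crit=None,
                    low_warn=None, low_crit=None):
    """Average the recent values (skipping NaN) and classify the result.

    A threshold left as None models an empty field: that aspect of the
    statistic is not monitored.
    """
    usable = [v for v in recent_values if not math.isnan(v)]
    if not usable:
        return "OK"  # assumption: nothing to evaluate, so no state change
    avg = sum(usable) / len(usable)
    if (high_crit is not None and avg > high_crit) or \
       (low_crit is not None and avg < low_crit):
        return "CRITICAL"
    if (high_warn is not None and avg > high_warn) or \
       (low_warn is not None and avg < low_warn):
        return "WARNING"
    return "OK"

# A single spike to 98 is smoothed by averaging: the average is 81,
# which exceeds the warning value (80) but not the critical value (90).
print(evaluate_static([70, 75, 98], high_warn=80, high_crit=90))  # WARNING
```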

Anomaly Detection

An anomaly check compares the most recently polled value of a statistic against a set of dynamic threshold values computed from a data set that samples eight previously polled values.

To activate anomaly detection for a threshold check during configuration, select a value for the Boundary field (in the ANOMALY section) other than None.

The check can be configured to look for upper boundary and/or lower boundary anomalies (similar to high and low static threshold values). These upper and lower boundaries represent dynamic threshold values that are computed as deviations from the mean of the eight sampled values. As each execution of the check drops older samples from the data set and adds newer samples, these boundaries are continuously recomputed to establish what should be considered "normal" for the statistic at the time of polling.

The amount by which the upper and lower boundary thresholds deviate from the mean is controlled by the anomaly sensitivity setting of each individual threshold check. A lower sensitivity places the boundary values further from the computed mean of the samples, so a polled value must be more abnormal to be considered an anomaly. A higher sensitivity places the boundary values closer to the mean, so a polled value can be less abnormal and still be considered an anomaly.

The amount by which each boundary deviates from the mean is calculated using the formula (standard deviation of sample set * 4) + (mean of sample set * sensitivity factor).

  • High sensitivity factor = 0.0
  • Medium sensitivity factor = 0.1
  • Low sensitivity factor = 0.5
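
A minimal sketch of this calculation is shown below. It assumes that the formula yields the distance of each boundary from the mean and that the population standard deviation is used; the article does not specify which standard deviation Netreo computes, and the function and dictionary names are hypothetical.

```python
from statistics import mean, pstdev

# Sensitivity factors as listed above.
SENSITIVITY_FACTORS = {"high": 0.0, "medium": 0.1, "low": 0.5}

def anomaly_bounds(samples, sensitivity):
    """Return (lower, upper) dynamic boundaries for the sampled values."""
    m = mean(samples)
    # Assumption: population standard deviation of the sample set.
    deviation = pstdev(samples) * 4 + m * SENSITIVITY_FACTORS[sensitivity]
    return m - deviation, m + deviation

samples = [100, 102, 98, 101, 99, 100, 103, 97]
for level in ("high", "medium", "low"):
    low, high = anomaly_bounds(samples, level)
    print(f"{level:>6}: {low:.1f} to {high:.1f}")
```

In this model, a lower sensitivity always produces a wider band around the mean, which matches the behavior described above.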

The appropriate sensitivity for anomaly detection is highly subjective, depending on the circumstances. Trial and error will be required for each situation to achieve optimal performance. (A higher sensitivity can detect smaller deviations, while a lower sensitivity can only detect larger deviations.)

Anomaly detection includes independent sensitivity settings to trigger both WARNING and CRITICAL states.

If the current statistic value exceeds the computed warning upper or lower boundary values, it is considered a potential anomaly, and the check enters the WARNING state. This state is displayed in the dashboards, but Netreo does not take any other action.

If the current statistic value exceeds the computed critical upper or lower boundary values, it is considered an anomaly, and the check enters the CRITICAL state. This state is displayed in the dashboards and generates an alarm.

Limitations in Anomaly Detection

It is not recommended to activate anomaly detection for statistics that use percentages for their values, such as CPU utilization or interface bandwidth utilization. Due to the way anomaly boundary thresholds are calculated, it is possible to create a situation in which anomalies are never detected when monitoring percentage-based statistics. To monitor percentage-based statistics for unacceptable performance values, it is recommended to use static thresholds, or at least to use static thresholds in addition to anomaly detection.
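
One way such a situation could arise is sketched below, assuming the boundary formula gives the distance from the mean and using a population standard deviation (both are assumptions, as is the function name). With a steady history of 85% and the Low sensitivity factor of 0.5, the computed upper boundary lies above 100%, a value a percentage statistic can never reach.

```python
from statistics import mean, pstdev

def upper_boundary(samples, sensitivity_factor):
    """Upper dynamic boundary under the assumed reading of the formula."""
    m = mean(samples)
    return m + pstdev(samples) * 4 + m * sensitivity_factor

history = [85.0] * 8                 # CPU utilization steady at 85%
print(upper_boundary(history, 0.5))  # 127.5, above the 100% ceiling
```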

Anomaly Detection Samples

The eight previous data values sampled are not simply the eight most recent values polled. The amount of time between the samples is adjustable, but consecutive samples are always at least one hour apart. The samples are always taken from the same relative time stamp as the current polled value, and the interval between them is called a season.

So, an anomaly check with a season setting of Hour that polls a statistic at 8:05 P.M. samples values from 7:05 P.M., 6:05 P.M., 5:05 P.M., 4:05 P.M., 3:05 P.M., 2:05 P.M., 1:05 P.M., and 12:05 P.M. These are all exactly one hour apart.

Five minutes later, when the statistic is polled again at 8:10 P.M., values from 7:10 P.M., 6:10 P.M., 5:10 P.M., 4:10 P.M., 3:10 P.M., 2:10 P.M., 1:10 P.M., and 12:10 P.M. are used. Again, they are all exactly one hour apart.

Selecting a different season simply changes the amount of time between the sampled values. Your choices are Hour, Day, or Week. Be aware that an anomaly alarm remains active until either the offending data sample has cleared from the sample values or the dynamic threshold values have been recomputed so that the offending data sample is no longer outside the check's sensitivity range. This could take quite a long time if the Week season is selected, so remember to acknowledge any resulting incidents.
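
The sampling schedule can be illustrated with a short sketch. The function and dictionary names are hypothetical, and the date used is arbitrary; only the season names and the count of eight samples come from the description above.

```python
from datetime import datetime, timedelta

# Interval between samples for each season setting.
SEASON_STEPS = {
    "Hour": timedelta(hours=1),
    "Day": timedelta(days=1),
    "Week": timedelta(weeks=1),
}

def sample_times(poll_time, season, count=8):
    """Return the time stamps of the previous samples, newest first."""
    step = SEASON_STEPS[season]
    return [poll_time - step * i for i in range(1, count + 1)]

poll = datetime(2024, 10, 25, 20, 5)  # polled at 8:05 P.M.
for t in sample_times(poll, "Hour"):
    print(t.strftime("%I:%M %p"))     # 07:05 PM down to 12:05 PM
```

With the Week season, the same function reaches back eight weeks, which is why an offending sample can take a long time to age out of the data set.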

Previous Anomalous Samples

If any single previous value of the eight data values being sampled was itself an anomaly, that sample is excluded from the detection calculations. However, if more than one of the previous data values were anomalous, those values are used in the detection calculations. This is how the threshold check dynamically adapts to gradual changes in behavior that are, in fact, perfectly normal.
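
The exclusion rule can be modeled as follows. This is a sketch of the rule as stated above, with hypothetical names; how Netreo records whether a past sample was anomalous is not described here, so the flags are simply passed in.

```python
def samples_for_detection(samples, was_anomalous):
    """Apply the exclusion rule for previously anomalous samples.

    samples:       the sampled values
    was_anomalous: one boolean per sample, True if it was itself an anomaly
    """
    if sum(was_anomalous) == 1:
        # Exactly one anomalous sample: treat it as an outlier and drop it.
        return [s for s, bad in zip(samples, was_anomalous) if not bad]
    # None, or more than one: keep everything, so that a sustained change
    # in behavior gradually becomes the new normal.
    return list(samples)

print(samples_for_detection([10, 11, 50, 10], [False, False, True, False]))
```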

Minimum Sample Values

Each sampled data point must also be of at least a minimum value to even be checked for anomalous behavior (configured in the check using the Min Value field). The anomaly engine will never use values below this setting (they will simply be dropped from the data set). This is to prevent a sequence of extremely low values from producing false positives from only minor deviations. (Changing from an average of 1 to 0.001 is a thousandfold change in relative terms, but is still not likely to be any kind of problem.)

However, if a polled value is below the minimum setting after an anomaly is detected and an alarm generated, the current alarm is automatically cleared, and the opened incident is notified. This is for housekeeping purposes to prevent a detected upper boundary anomaly from causing the check to be stuck in an alarm state due to never processing the current data point if the polled values remain below the minimum.

Unfortunately, this same logic also causes a lower boundary anomaly alarm to clear, negating the value of setting a lower boundary check. It is therefore recommended to exercise caution when using a minimum value with a lower boundary check, as polled values below this setting will both prevent a lower boundary anomaly from triggering an alarm and cause any existing alarm to be cleared.

(The minimum value setting is only for anomaly detection and does not interfere with static threshold check operations.)
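
The interaction between the minimum value and an existing alarm can be summarized in a small model. This is only a sketch of the behavior described in this section, with hypothetical names and return shape; it is not Netreo's code.

```python
def apply_min_value(polled, samples, min_value, alarm_active):
    """Return (evaluate, usable_samples, alarm_active).

    evaluate:       whether the polled value is checked for anomalies
    usable_samples: sampled values at or above the minimum
    alarm_active:   alarm state after the minimum value rule is applied
    """
    usable = [s for s in samples if s >= min_value]
    if polled < min_value:
        # Not evaluated, and any existing anomaly alarm is cleared.
        # This applies to lower boundary alarms as well.
        return False, usable, False
    return True, usable, alarm_active

print(apply_min_value(0.5, [3, 0.2, 4], min_value=1.0, alarm_active=True))
```

The first branch shows why a lower boundary check loses much of its value when a minimum is set: a value that drops below the minimum is never evaluated and also clears the alarm.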

Configuration and Management

Add a Threshold Check to a Device

Use Device Templates for Monitoring Checks
In general, it is not recommended to add monitoring checks directly to devices. Instead, use device templates to add these checks. Device templates allow you to automate the process of adding the appropriate monitoring checks to the right devices.

To add a threshold check directly to a managed device, follow the procedure below.

(Note: When a threshold check is added to a device, it cannot be removed. It can only be disabled.)

See also: Add a Threshold Check to a Device Template.

  1. Log in to Netreo as a user with the Admin access level or higher.
  2. Locate the device to which you would like to add a threshold check and select it to open its device dashboard.
    • Specific devices can be located in Netreo by either drilling into a Tactical Overview dashboard widget or searching for the device by name using the search feature at the top of the main menu.
  3. Select the gear icon in the top right of the dashboard to open the dashboard administrative view.
  4. Select the Instances tab.
  5. Locate the panel for the statistic type containing the statistic you would like to monitor and select it to open the panel.
    • If the statistic you would like to monitor is in the Network panel, use the pull-down menu at the top right and select Thresholds to display the network interfaces.
  6. Locate the specific statistic to which you would like to add a threshold check and select its add threshold icon (+) in the ACTIONS column.
  7. In the ACTION GROUP field, select the action group(s) to receive alert notifications before escalation.
  8. In the ESCALATION GROUP field, select the action group(s) to receive alert notifications after escalation.
  9. In the RENOTIFICATION INTERVAL field, enter the number of minutes for Netreo to wait before sending another alert notification if the problem is not acknowledged by a user.
    • Alert notifications are sent to the action groups in the ACTION GROUP field.
    • The default value of 1440 minutes (24 hours) is recommended to minimize alert noise.
    • Setting a value of 0 (zero) will disable renotifications.
  10. In the ESCALATE AT field, enter the number of alert notifications after the first for Netreo to wait before sending alert notifications to the action groups in the ESCALATION GROUP field, as well as to the groups in the ACTION GROUP field.
    • The default value of 1 means that a total of 2 alerts must be sent before escalation groups start receiving them.
  11. In the STATISTICAL GROUP field, select the type that has the greatest relevance to the check. This field determines which statistical calculations this check contributes to for reports.
  12. (Optional) In the SUBSTRING field, enter a string or regular expression to include or exclude specific interfaces from this check using a match to the interface name/description.
    • If this field is left empty, Netreo attempts to add the configured threshold check to every interface of the device it is applied to.
  13. If you would like to configure static threshold monitoring (repeat these steps for each variable if two variables are present):
    1. (Optional) In the HIGH warning field (yellow), enter the exact value at which the check should enter the WARNING state for high values.
      • Next to the value type, select the multiplier prefix.
    2. (Optional) In the HIGH critical field (red), enter the exact value at which the check should enter the CRITICAL state for high values.
      • Next to the value type, select the multiplier prefix.
    3. (Optional) In the LOW warning field (yellow), enter the exact value at which the check should enter the WARNING state for low values.
      • Next to the value type, select the multiplier prefix.
    4. (Optional) In the LOW critical field (red), enter the exact value at which the check should enter the CRITICAL state for low values.
      • Next to the value type, select the multiplier prefix.
    5. In the TIME PERIOD field, select the time period over which data values will be sampled for the calculated average.
      • See Best Practices below for best practices regarding threshold check time periods.
  14. If you would like to configure anomaly detection (repeat these steps for each variable if two variables are present):
    1. In the Boundary field, select whether to check for upper boundary anomalies, lower boundary anomalies, or both.
    2. In the Sensitivity warning field (yellow), select the desired sensitivity. (This should always be at least one setting higher than the critical sensitivity field so that the warning state occurs first.)
    3. In the Sensitivity critical field (red), select the desired sensitivity. (This should always be at least one setting lower than the warning sensitivity field so that the warning state occurs first.)
    4. In the Season field, select the desired season for the data samples.
    5. (Optional) In the Min Value field, set the minimum value that a polled value must be to qualify for anomaly detection.
      • The value entered in this field should be specified in the same base unit displayed in the static threshold configuration without the prefix (for example, bytes, not megabytes; seconds, not milliseconds). Note: For bandwidth monitoring (only), the value must be specified in bits per second and not as a percentage.
  15. Select Create Threshold.
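
The RENOTIFICATION INTERVAL and ESCALATE AT settings from steps 9 and 10 can be modeled roughly as below. This is one reading of those steps, not a description of Netreo internals: the function names are hypothetical, and the model assumes that notifications are numbered from 1 and that escalation groups start receiving alerts only after 1 + ESCALATE AT alerts have already been sent.

```python
def recipients(notification_number, escalate_at=1):
    """Groups that receive a given notification (1 = first alert)."""
    groups = ["ACTION GROUP"]
    if notification_number > 1 + escalate_at:
        groups.append("ESCALATION GROUP")
    return groups

def renotification_offsets(interval_minutes, unacknowledged_minutes):
    """Minutes after the first alert at which notifications are sent."""
    if interval_minutes == 0:
        return [0]  # a value of 0 disables renotifications
    return list(range(0, unacknowledged_minutes + 1, interval_minutes))

# Defaults: 1440-minute interval, ESCALATE AT of 1, unacknowledged for 2 days.
for n, offset in enumerate(renotification_offsets(1440, 2880), start=1):
    print(offset, recipients(n))
```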

Disable a Threshold Check on a Single Device

Currently being revised.

Disabling a threshold check prevents that specific check from monitoring its statistic, but the statistic is still polled for values and those values are still recorded by Netreo.

Disable Threshold Checks on Multiple Devices

To disable multiple specific threshold checks on multiple specific managed devices, follow the procedure below.

Disabling a threshold check prevents that specific check from monitoring its statistic, but the statistic is still polled for values and those values are still recorded by Netreo.

  1. Log in to Netreo as a user with the Admin access level or higher.
  2. Go to the main menu and select Administration > Change Devices > Turn On/Off Thresholds to open the Deactivate Thresholds page.
  3. Select a functional group that contains the devices you would like to affect.
  4. Place a check next to the specific devices on which you would like to disable specific threshold checks.
  5. Select the Select Device button.
  6. Place a check next to the specific threshold checks that you would like to disable.
  7. Select Update Thresholds.

Best Practices

Device Templates

It is highly recommended that threshold checks be added to devices and managed through device templates and not directly on devices. Even in unique device-specific circumstances, threshold checks for that device can still be managed using a device template that includes the desired threshold checks and is assigned directly to the device.

The only circumstance under which a threshold check should ever be added to a device directly is when that device has had its device template functionality turned off completely.

Time Periods for Static Threshold Checks

Because Netreo polls and records a statistic's value every five minutes, selecting a TIME PERIOD of 5 Min when configuring a static threshold check means that it would take only one poll that exceeded the warning or critical threshold values to trigger a change in state. However, selecting a period of 15 Min would require three consecutive polls (with an average value exceeding the warning or critical threshold values) to trigger a state change. This field is an important adjustment for reducing false alarms.

Divide the TIME PERIOD value configured in the check by 5 to figure out the number of recent samples that will be averaged before being compared to the threshold values.
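
The effect of the TIME PERIOD setting can be seen in a short sketch. The function name and the sample history are made up for illustration; only the five-minute polling interval and the divide-by-5 rule come from the text above.

```python
def averaged_value(history, time_period_minutes, poll_interval_minutes=5):
    """Average the most recent polls covered by the configured time period."""
    n = max(1, time_period_minutes // poll_interval_minutes)
    recent = history[-n:]
    return sum(recent) / len(recent)

history = [50, 50, 100]  # the latest poll is a momentary spike
critical = 90

print(averaged_value(history, 5) > critical)   # True: one poll is enough
print(averaged_value(history, 15) > critical)  # False: the spike is averaged out
```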

