Service Check
  • 08 Aug 2024

Description

A service check is an application-level monitoring check added to a managed device in Netreo that monitors the status of an individual process or resource (application, interface, etc.). The statuses of service checks are displayed in various locations in Netreo, but by far the most common is in a Tactical Overview dashboard widget.

Each service check runs repeatedly according to a configurable schedule (default timing is 3 minutes between executions). Netreo staggers the execution of service checks to prevent large numbers of checks from all running simultaneously, which could produce prohibitive amounts of traffic on a network.

When run, a service check queries its target (the host running the service) for a response code. If the target returns a failure code (or no code) the check displays as failed in Netreo dashboards and (typically) generates an alarm. Additionally, a service check can be configured to run commands on its target when the check fails (such as reboot or restart-service commands). This provides the possibility of resolving some issues automatically without the need to involve personnel.

For issues that cannot be resolved automatically, users have the ability to acknowledge failing service checks to indicate that they are currently working on the problem. Service checks that have been acknowledged by a user are visually indicated in dashboards.

By default, Netreo automatically adds several service checks to every managed device (through the "Default" and "Windows Default" device templates) to provide basic monitoring services. However, many more service checks may be added to suit your specific monitoring needs.

Service checks are categorized into the following basic types.

  • Cloud Checks - For monitoring cloud-based resources.
  • Firewall Checks - For monitoring firewall resources.
  • Generic Passive Checks - For monitoring resources via an indirect source.
  • HP Insight Manager Agent Checks - For monitoring certain hardware systems.
  • Interface Checks - For monitoring interfaces.
  • Network Application Checks - For monitoring network application resources.
  • Network Connectivity Checks - For monitoring various types of connectivity.
  • Netreo Checks - For monitoring elements of Netreo itself.
  • System Checks - For monitoring core processes.
  • Web Checks - For monitoring web-based resources.

Details

Service Check States

Service checks always display one of the following states when viewed in dashboard widgets.

  • OK (Green) - The check query has returned a success code.
  • WARNING (Yellow) - Very rare. The check query has returned a warning code. For all operations involving service checks, warning codes are treated as failure codes. Soft WARNING states mean the check is still determining if a problem is real or just momentary. Hard WARNING states generate an alarm (unlike the WARNING state of a threshold check). (See "Service Check Alarms" below.)
  • CRITICAL (Red) - The check query has returned a failure code or no code. Soft CRITICAL states mean the check is still determining if a problem is real or just momentary. Hard CRITICAL states generate an alarm. (See "Service Check Alarms" below.)
  • ACKNOWLEDGED (Blue) - Indicates a service check that is in a CRITICAL or WARNING state, but that has been acknowledged by a user. This state is technically an incident state (not a check state) but its display in the dashboards helps to distinguish between problems that are new and problems that are already being addressed.
  • UNKNOWN (Orange) - The check query has returned a value that the check cannot understand. This is likely due to an error in the check's configuration. Generates an alarm. (See "Service Check Alarms" below.)

The Tactical Overview dashboard widget is useful for displaying service check statuses for groups of devices.

Active and Passive Service Checks

Service checks are either active or passive. This is discussed in more detail below, but briefly:

  • Active service checks create their own processes in memory while they do their work and follow their own timing schedule for their query.
  • Passive service checks wait for some other process to update them (usually an active service check or some other active process). This means that they update according to the schedule of whatever process is updating them.

Whether a given service check is active or passive can be seen on the Service Check Administration page (Administration > Change Devices > Manage Service Checks from the main menu).

Active Service Checks

Active service checks create their own process in memory and actively query their specific target process or resource for a response. The response code returned to the service check determines the state of that check.

If the response is a success code, the check remains in the OK state and continues to run its query according to its configured schedule.

If the response is a failure code, the service check enters what is called a soft CRITICAL state. While in this soft state, the service check retries its query several times (typically at a faster frequency). When it reaches a set number of failure responses (default is 3), the check then enters what is called a hard CRITICAL state and generates an alarm (new alarms open an incident in Netreo). Warning response codes follow this same pattern for WARNING states. (There is no visual distinction between hard and soft states in the dashboard indicators, but the Services tab of the Device Dashboard for the failing device shows the history of soft and hard states for the check - including its current state.)

A service check in a hard state continues to retry its query at the (typically increased) configured frequency. If, at any time, it again receives a success code, it immediately recovers to the OK state, clears its alarm, and signals any opened incident that it has recovered.

Service checks are designed to retry their query several times before generating an alarm in order to prevent them from immediately alerting users at the slightest temporary glitch. The retry schedule and the number of failures required to generate an alarm is adjustable in the configuration options of each service check.
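
The soft and hard state behavior described above can be sketched as a small state machine. This is only an illustrative model, not Netreo's implementation: the `ActiveCheck` class, its field names, and the `record` method are hypothetical, and the defaults simply mirror the values quoted in this article (3-minute interval, 1-minute retry, 3 failures).

```python
from dataclasses import dataclass


@dataclass
class ActiveCheck:
    """Toy model of an active service check's soft/hard state logic."""
    check_interval: int = 3          # minutes between queries while OK
    retry_interval: int = 1          # minutes between queries after a failure
    failures_before_alert: int = 3   # failures needed to reach a hard state
    state: str = "OK"                # "OK" or "CRITICAL"
    hard: bool = False               # True once an alarm would be generated
    failures: int = 0

    def record(self, success: bool) -> int:
        """Apply one query result; return minutes until the next query."""
        if success:
            # Any success recovers the check immediately and clears the alarm.
            self.state, self.hard, self.failures = "OK", False, 0
            return self.check_interval
        self.failures += 1
        self.state = "CRITICAL"
        if self.failures >= self.failures_before_alert:
            self.hard = True         # hard state: an alarm is generated here
        return self.retry_interval


check = ActiveCheck()
for result in (True, False, False, False, True):
    delay = check.record(result)
    kind = "hard" if check.hard else "soft" if check.state != "OK" else "-"
    print(f"{check.state:8} {kind:4} next query in {delay} min")
```

In this model the check keeps retrying at the retry interval even after it reaches the hard state, and a single success returns it to the normal schedule.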

Passive Service Checks

Passive service checks do not create their own process in memory. They do nothing until they receive a response code from the active process that updates them. This means that they always remain in their current state (whatever that state might be) until the active process that updates them provides them with a response code that changes their state.

If a passive service check is updated with a success code, the check remains in the OK state and waits for the next update.

If a passive service check is updated with a failure code, it increments its "exception counter" and enters the soft CRITICAL state. When a set number of exceptions occur (default is 3), the check enters the hard CRITICAL state and generates an alarm (new alarms open an incident in Netreo). Warning response codes follow this same pattern for WARNING states. (There is no visual distinction between hard and soft states in the dashboard indicators, but the Services tab of the Device Dashboard for the failing device shows the history of soft and hard states for the check - including its current state.)

If the check is then updated with a success code, it immediately recovers to the OK state, clears its exception counter to zero, clears its alarm, and signals any opened incident that it has recovered. (Note: A passive service check can become stuck in its current state if it is never updated.)

Like active service checks, passive service checks are designed to require a number of exceptions before generating an alarm in order to prevent them from immediately alerting users at the slightest temporary glitch. The number of exceptions required to generate an alarm is adjustable in the configuration options of each passive check.
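
As a rough illustration of the exception counter, the sketch below models a passive check that changes only when something updates it. The `PassiveCheck` class and its `update` method are hypothetical names chosen for this example; they are not part of any Netreo API.

```python
class PassiveCheck:
    """Toy model of a passive check: it changes only when it is updated."""

    def __init__(self, alert_after: int = 3):
        self.alert_after = alert_after   # exceptions required for a hard state
        self.exceptions = 0              # the "exception counter"
        self.state = "OK"
        self.hard = False

    def update(self, success: bool) -> None:
        """Called by the external (active) process that feeds this check."""
        if success:
            # A success resets the counter and recovers the check.
            self.exceptions = 0
            self.state, self.hard = "OK", False
            return
        self.exceptions += 1
        self.state = "CRITICAL"
        self.hard = self.exceptions >= self.alert_after


check = PassiveCheck()
for code in (False, False):
    check.update(code)
print(check.state, check.hard, check.exceptions)  # CRITICAL False 2
```

Because nothing happens between calls to `update`, the check in this example stays in the soft CRITICAL state indefinitely if no further update arrives, which is the "stuck" behavior noted above.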

For information on using passive service checks with Netreo APIs, see Generic Passive Check (Service Check).

Service Check Alarms

It is important to remember that although a service check may be showing as failed in a dashboard, an alarm is not generated (and thus, an incident is not opened) until the check reaches a hard failed state (as explained in Active and Passive Service Checks above).

A newly generated alarm always attempts to open a new incident in Netreo. However, the incident may be suppressed by the check's host-checking logic (see Host Check below) or by Netreo's incident management system for housekeeping reasons, such as when an incident already exists for the current issue.

Host Availability Check

In order for Netreo to monitor the network availability of any given managed device, a service check must be added to it specifically for that purpose.

The "Default" device template automatically adds a "Ping this host" network connectivity service check (named "PING") to all managed devices for the purpose of availability monitoring. Any time you see a reference to a "host availability check," it is referring to the specific service check added to a given resource to monitor its availability on the network.

Certain monitored resources, however, may not respond to the "Ping this host" service check and require you to add a different service check to monitor that resource's availability (such as "Check TCP Port," used when ICMP requests are not allowed). This can be easily accomplished by using a customized device template applied to the particular device type of the resource. (Note: The service check you add to your custom device template to act as the host availability check must be given the name "PING" in its description field so that it overrides the "Ping this host" check provided in the "Default" device template.)

Non-"pingable" resources
If the "Ping this host" service check doesn't work for a particular managed device, you will need to add a service check of an appropriate type for that resource to act as the host availability check (for example, a "Check TCP Port" service check). This should preferably be done using a device template.

Whatever service check is ultimately used for a given resource as the host availability check, that is the service check that is executed during a "host check" (see Host Check below).

Host Check

(Note: A host check is not a type of service check. It refers to an internal system behavior related to service checks that is used to assist Netreo's incident management system.)

The term "host check" refers to the unscheduled execution of a normally scheduled host availability service check (see Host Availability Check above) already assigned to a particular resource for the purpose of immediately determining if that host is up or down. (Netreo is simply checking a host for its availability status at that moment, thus the term "host check.")

The host-checking behavior is used by Netreo's incident management system for the purposes of alert-noise reduction and root-cause analysis and relies heavily on proper device parenting.

A "host check" is triggered automatically when any service check enters a hard CRITICAL or WARNING state and generates an alarm. The hosts to be checked include the owner of the failed service check and, if that host is down, all of that host's immediate parents in the network hierarchy.

If the host owning the failed service check is determined to be "up," that information is provided to incident management.

If the host owning the failed service check is determined to be "down," then that information is provided to incident management, and the host availability service checks of any immediate parents of that host are also checked.

This process continues up the network hierarchy until Netreo discovers a parent that is up or a host that does not have any parents configured. The downed host that is the child of the available host is the root cause of the problem. That information (and the list of downed hosts) is then provided to incident management, which uses it to suppress unnecessary alert notifications.
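
The upward walk can be sketched as follows. This is a simplified model under stated assumptions, not Netreo's actual logic: the `host_check` function, the `parents` mapping, and the `up` status table are all hypothetical, and treating a down host with no configured parents as a root cause is an assumption made for the example.

```python
from collections import deque
from typing import Dict, List, Tuple


def host_check(
    owner: str,
    parents: Dict[str, List[str]],
    up: Dict[str, bool],
) -> Tuple[List[str], List[str]]:
    """Walk up the hierarchy from `owner`; return (down_hosts, root_causes)."""
    down: List[str] = []
    root_causes: List[str] = []
    seen = set()
    queue = deque([owner])
    while queue:
        host = queue.popleft()
        if host in seen:
            continue
        seen.add(host)
        if up[host]:
            continue  # an available host ends the walk on this branch
        down.append(host)
        host_parents = parents.get(host, [])
        # Assumption: a down host with no parents, or with an available
        # parent, is treated as a root cause.
        if not host_parents or any(up[p] for p in host_parents):
            root_causes.append(host)
        queue.extend(host_parents)
    return down, root_causes


# Hypothetical three-level hierarchy: web01 -> switch-a -> core-router.
parents = {"web01": ["switch-a"], "switch-a": ["core-router"], "core-router": []}
up = {"web01": False, "switch-a": False, "core-router": True}
print(host_check("web01", parents, up))  # (['web01', 'switch-a'], ['switch-a'])
```

Here the failed check belongs to web01, but the walk identifies switch-a as the root cause, because it is the downed host directly beneath an available parent.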

Due to its nature, the host-checking behavior is triggered only by failing service checks. No other type of monitoring check triggers it.

See Incident Management for more information on how host checking is used by the incident management system.

Configuration

The configurable fields for a service check can be separated into two categories:

  • Fields that are common to all service checks
  • Fields that are specific to that service check

Fields that are common to all service checks are covered below. Fields that are specific to an individual service check are covered in the documentation for that service check. See the Service Check List for individual service checks.

Fields Common to All Service Checks

These configuration fields are included on all service checks, usually after the check-specific fields.

  • DESCRIPTION - This field specifies a name for this check. It is used to identify this specific check from among other service checks of the same type that may have been added to the same host. The name entered must be unique among service check names on the host to which it is added (the name entered here may be used again only on a different host). See also Best Practices below. There are a small number of service checks that do not include this field.
  • ALERT AFTER
    • If the check is a passive service check, select the number of failures the check is allowed to experience before sending an alert notification (default is 3).
    • If the check is an active service check, select one of the preset timers that determine how long Netreo will wait after the first detection of a problem by this check before sending an alert notification. (The default value of 5 Minutes is recommended.) Or select Custom to use the custom alert timing fields below. See also Best Practices below.
      • CHECK INTERVAL - Enter the number of minutes to wait between executions of the service check under normal conditions (default is 3).
      • ON FAILURE, RETRY EVERY - Enter the number of minutes to wait between executions of the service check after a failure (default is 1).
      • TOTAL FAILURES BEFORE ALERT - Enter the total number of failures the check is allowed to experience before sending an alert notification (default is 3).
  • RENOTIFICATION INTERVAL - Enter the number of minutes for Netreo to wait before sending additional alert notifications if the problem is not acknowledged by a user.
    • The default value of 1440 minutes (24 hours) is recommended to minimize alert noise.
    • Setting a value of 0 (zero) will disable renotifications.
    • Alert notifications are sent to the action group(s) selected in the ACTION GROUP field below.
  • ESCALATE AT - Enter the number of alert notifications after the first for Netreo to wait before sending alert notifications to the action group(s) in the ESCALATION GROUP field, as well as to the groups in the ACTION GROUP field.
    • The default value of 1 means that a total of 2 alerts must be sent before escalation groups start receiving them.
  • ACTION GROUP - Select the action group(s) to receive alert notifications before and during escalation.
  • ESCALATION GROUP - Select the action group(s) to receive alert notifications after escalation.
  • STATISTICAL GROUP - Select the type that has the greatest relevance to the check. This field is used to group results in reports.
  • NOTES - Enter any notes that you would like included in any alert notifications about this check.
Delays in receiving alert notifications
The time value used for the RENOTIFICATION INTERVAL setting is an approximation. It is subject to delays based on the performance of your hardware and network, and your current volume of active incidents. A large number of active incidents can considerably delay Netreo's alert renotifications. Therefore, it is best not to leave large numbers of incidents in the open or acknowledged states. Try to address incidents as soon as possible so that they will close, or disable managed devices with chronic issues.

Best Practices

Device Templates

It is highly recommended that service checks be added to devices and managed through device templates and not directly on devices. Even in unique device-specific circumstances, service checks for that device can still be managed using a device template that includes the desired service checks and is assigned directly to the device.

The reason for this is that device template settings always override any settings made directly on a device. If a device template applied to the device includes a service check with an identical DESCRIPTION field, that template check overrides the service check added to the device directly.

The only circumstance under which a service check should ever be added to a device directly is when that device has had its device template functionality turned off completely.

Service Check Names

Two or more service checks on a single device cannot have the same DESCRIPTION field value. This value acts as the service check name in dashboards and alert notifications, so be sure to provide unique names for your service checks when creating them.

Best practice here is to enter a descriptive name that indicates what the check does and what it targets. As an extremely basic example, suppose you have two TCP port checks being added to the same device, one checking port 80 and the other checking port 110. Best practice would be to name the first check "TCP port 80 check" and the other check "TCP port 110 check." This way, each check will be clearly identifiable everywhere, from dashboards to alert notifications.

Unique service check names are particularly important if you intend to override service check settings using device templates. A service check in a device template will only override another service check if the DESCRIPTION field matches exactly. So, be aware of this when configuring service checks in your device templates.
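
The exact-match rule can be pictured as a dictionary merge keyed on the DESCRIPTION value. This is a conceptual sketch only: the `effective_checks` function and the settings dictionaries are hypothetical, and it assumes that "matches exactly" means a character-for-character comparison.

```python
from typing import Dict

Settings = Dict[str, int]


def effective_checks(
    device_checks: Dict[str, Settings],
    template_checks: Dict[str, Settings],
) -> Dict[str, Settings]:
    """Merge checks by DESCRIPTION; template entries win on an exact match."""
    merged = dict(device_checks)
    merged.update(template_checks)  # same key -> template settings replace
    return merged


# Checks added directly on the device (hypothetical settings).
device = {
    "TCP port 80 check": {"check_interval": 1},
    "TCP 110 check": {"check_interval": 1},
}
# Checks supplied by a device template.
template = {
    "TCP port 80 check": {"check_interval": 3},
    "TCP port 110 check": {"check_interval": 3},
}

for name, settings in sorted(effective_checks(device, template).items()):
    print(name, settings)
```

In this example the port 80 check is overridden because the names match, while the mismatched names "TCP 110 check" and "TCP port 110 check" leave two separate checks on the device, which is usually not what was intended.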

Custom Alert Timing

The following only applies to active service checks, as passive service checks have their schedule controlled by whatever is updating them.

Setting ALERT AFTER
With the default 5 Minutes selection, Netreo will execute the service check query every 3 minutes until a failure is detected. Once a failure is detected, it will execute the query two more times at 1-minute intervals, leading to a worst-case alert notification response time of five minutes (up to 3 minutes to detect the failure, plus two 1-minute retries). Although you certainly may use the Custom selection for this field, it's highly recommended that you do not do so without a very specific reason. The selection of choices available for the ALERT AFTER field should be adequate for most situations.

Setting CHECK INTERVAL
This field defines how often (in minutes) this service check will be executed under normal circumstances. After every successful query, Netreo will wait for this interval before it executes the query again. This field has a significant performance impact: if you're executing 10,000 service checks at 1-minute intervals, Netreo will have to execute about 167 checks per second, adding significant network traffic and system load. Use common sense and try to select a reasonable interval. Netreo spreads the queries out so they don't all run at the same time, but you can still overwhelm your network by configuring too many service checks.

The lowest that you'll generally ever want to set this setting to is 3 minutes (especially if the system is very heavily utilized). You may go lower, but the more frequently you execute the query, the heavier the load on the system and the more network overhead required.

Service Check Overload
It is recommended to limit the number of service checks configured with CHECK INTERVAL settings below 3 minutes, especially in large environments. A few 1-minute CHECK INTERVAL settings on a moderately loaded server are no big deal, but configuring 10,000 service checks at 1-minute intervals is going to create massive load and network traffic. That's about 167 queries per second!
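
The load figure is simple arithmetic, sketched below. The `queries_per_second` helper is a hypothetical name, and the estimate assumes the queries are spread perfectly evenly and ignores the extra retries that failing checks generate.

```python
import math


def queries_per_second(num_checks: int, check_interval_minutes: float) -> float:
    """Average query rate if checks are spread evenly across the interval."""
    return num_checks / (check_interval_minutes * 60)


print(math.ceil(queries_per_second(10_000, 1)))  # 167
print(math.ceil(queries_per_second(10_000, 3)))  # 56
```

Moving the same 10,000 checks from a 1-minute to a 3-minute interval cuts the average rate to roughly a third.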

Setting ON FAILURE, RETRY EVERY
This field defines the amount of time (in minutes) Netreo will wait to retry the query after an initial failure (during the soft state). This period should generally be considerably shorter than the CHECK INTERVAL period. Netreo will continue to retry the query at this interval even after an alert has been sent. If any of the retry queries return a success code, the check will stop retrying, clear any current alarm, and return to the normal CHECK INTERVAL schedule.

Setting TOTAL FAILURES BEFORE ALERT
This field defines the number of failed queries required to trigger an alert. When the total number of failed queries (initial failure plus retries) reaches this number, the service check enters a hard state and an alert notification is sent. The service check will continue to query according to the ON FAILURE, RETRY EVERY timer value. It is recommended that you do not set this option to 1, as that will generate a significantly higher number of false alarms.

Common Values
If you need to be alerted to an outage immediately, you'll probably want to go with the following custom settings.

  • CHECK INTERVAL = 3
  • ON FAILURE, RETRY EVERY = 1
  • TOTAL FAILURES BEFORE ALERT = 1

However, such a configuration means no soft state. This means that Netreo won't do any verification to ensure that a problem is real before it sends an alert notification. Users have done this in the past and then complained that Netreo was spamming them with alert notifications. So be careful.

Another common configuration is as follows.

  • CHECK INTERVAL = 2
  • ON FAILURE, RETRY EVERY = 1
  • TOTAL FAILURES BEFORE ALERT = 2

If you do the math for such a configuration, the maximum possible time between a service outage and an alert is 3 minutes (up to 2 minutes until the next scheduled query detects the failure, plus one retry 1 minute later). It works well, but remember the potential load problems of setting the CHECK INTERVAL below 3 minutes.
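
The worst-case timing for any combination of the three custom fields can be estimated with a short calculation. This is a back-of-the-envelope sketch: the function name is hypothetical, and it ignores query execution time, timeouts, and notification delivery delays.

```python
def worst_case_minutes_to_alert(
    check_interval: int, retry_every: int, total_failures: int
) -> int:
    """Longest time from an outage starting to the alert being generated.

    Worst case: the outage begins just after a successful query, so the
    first failure is seen one full CHECK INTERVAL later; each remaining
    failure then takes one ON FAILURE, RETRY EVERY period.
    """
    if total_failures < 1:
        raise ValueError("total_failures must be at least 1")
    return check_interval + retry_every * (total_failures - 1)


print(worst_case_minutes_to_alert(3, 1, 3))  # 5 (the default 5 Minutes preset)
print(worst_case_minutes_to_alert(3, 1, 1))  # 3 (immediate alert, no soft state)
print(worst_case_minutes_to_alert(2, 1, 2))  # 3
```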

It is always recommended to avoid a TOTAL FAILURES BEFORE ALERT setting of 1, as any little hiccup on the network (like a lost ping packet) will immediately send an alert notification, which is probably not what you want if you're looking to minimize false alarms.

