- 09 Jan 2023
How do I monitor a cluster of Windows Servers?
- Updated on 09 Jan 2023
Monitoring a cluster can be confusing, because the resources you care about can move from one server to another. First, let's get a high-level overview of how clusters work.
Usually there are two or more physical servers, each running the cluster service and grouped together logically, if not physically. Each server has its own network identity and its own running services.
Then, through the cluster manager, one or more virtual IP addresses are assigned, along with “cluster-aware services.” These can be database instances, web servers, shared SAN storage, or a host of other things. The idea is that these are the critical resources that need to survive if any member of the cluster experiences a hardware failure. If the server that currently hosts a cluster-aware service fails, the service is supposed to transition seamlessly to a different available cluster member. This presents a unique challenge when monitoring the servers.
If you monitor just the physical servers, then cluster instances such as drives, open ports, and cluster interfaces appear only on the server currently hosting them. After a failover event, intentional or not, you can no longer poll these instances on that server, because they no longer exist there. This causes problems with the historical continuity of statistics, threshold check settings, and a number of other things. Additionally, you would need to repoll all cluster members right after the failover event to make sure that the server the resources moved to now sees those instances and is polling them. You would also have to set alerting and threshold check parameters on them again.
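To make the problem concrete, here is a minimal sketch in Python of a two-node cluster. It is a toy model only: the node names, instance names, and the `Cluster` class are hypothetical and are not part of any Netreo or Windows API. It shows which instances a poller would lose on a physical server after a failover.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Cluster:
    """Toy model: node-local instances plus cluster instances owned by one node."""

    local: dict[str, set[str]]  # node name -> instances local to that node
    clustered: set[str]         # cluster-aware instances that move on failover
    owner: str                  # node currently hosting the clustered instances

    def visible_on(self, node: str) -> set[str]:
        """Return the instances a poller would discover on one physical node."""
        seen = set(self.local[node])
        if node == self.owner:
            seen |= self.clustered
        return seen

    def fail_over(self, new_owner: str) -> None:
        """Move all clustered instances to another node."""
        self.owner = new_owner


# Hypothetical two-node cluster; all names are made up for illustration.
cluster = Cluster(
    local={"node-a": {"C:", "cpu"}, "node-b": {"C:", "cpu"}},
    clustered={"Q: (quorum)", "S: (SAN)", "tcp/1433"},
    owner="node-a",
)

before = cluster.visible_on("node-a")
cluster.fail_over("node-b")
after = cluster.visible_on("node-a")

# Instances the poller was tracking on node-a that vanished after the failover.
lost = before - after
print(sorted(lost))
```

In this model, every cluster-aware instance disappears from `node-a` and shows up on `node-b`, which is why history and threshold settings attached to the physical server do not follow the resource.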
If you choose to monitor only the virtual IP address, you will lose visibility into server-level faults and threshold checks, since you will effectively be monitoring only the server that currently has control of the virtual interface. This renders CPU, interface, memory, and (local) drive utilization stats useless: one day they reflect the values of one cluster member, and after a failover event they reflect the values of a different cluster member.
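The effect on server-level statistics can be illustrated with a short sketch. The CPU readings and node names below are invented for the example. The point is only that a series collected through the virtual IP silently switches from one server's values to another's.

```python
# Hypothetical CPU readings (percent) per node for five polling intervals.
cpu = {
    "node-a": [20, 22, 21, 23, 20],
    "node-b": [70, 72, 71, 69, 73],
}

# Node that owned the virtual IP during each interval
# (a failover happens before the third interval).
vip_owner = ["node-a", "node-a", "node-b", "node-b", "node-b"]

# What a poller records when it only queries the virtual IP.
vip_series = [cpu[owner][i] for i, owner in enumerate(vip_owner)]
print(vip_series)  # [20, 22, 71, 69, 73]
```

The jump from roughly 20% to roughly 70% is not a load change on any one server. It is simply the point where the series started describing a different machine.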
What we recommend is a hybrid of both: monitor the physical servers and the virtual cluster interfaces. On the physical servers, set fault and threshold check alerting only for instances that are local to the physical server (CPU, memory, bandwidth, errors, and drives and services that are local to the server itself and not cluster aware).
Then disable any threshold checks that were automatically created on the physical servers for cluster instances. On the virtual server, do the opposite: configure your status checks, alert contacts, and threshold check alerting only for those services and drives (like the Quorum drive and SAN drives) that are associated with the cluster and will transition to different cluster members in the event of a failover.
This is easiest to do by creating two device templates (one for physical cluster members, and one for virtual cluster addresses) and applying them appropriately.
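The split between the two templates can be sketched as a simple partition of discovered checks. This is an illustration of the idea only, not Netreo's template format: the check names, the `CLUSTER_AWARE` set, and the `split_checks` helper are all hypothetical.

```python
# Hypothetical set of check names that belong to the cluster, not to one server.
CLUSTER_AWARE = {"quorum_drive", "san_drive", "sql_instance"}


def split_checks(discovered):
    """Partition discovered checks between the two device templates."""
    physical = sorted(c for c in discovered if c not in CLUSTER_AWARE)
    virtual = sorted(c for c in discovered if c in CLUSTER_AWARE)
    return {"physical_member": physical, "virtual_address": virtual}


# Hypothetical checks discovered across the cluster.
discovered = [
    "cpu", "memory", "bandwidth", "interface_errors", "local_drive_c",
    "quorum_drive", "san_drive", "sql_instance",
]

templates = split_checks(discovered)
for name, checks in templates.items():
    print(name, checks)
```

Each check ends up alerting in exactly one place: server-local checks on the physical members, and cluster-aware checks on the virtual address that follows the resources through a failover.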
You can then include the cluster members in a strategic group to visualize the cluster environment, as seen in this example:
Optionally, you can upload a cluster functional diagram and have it reflect Netreo availability data using our Custom Maps functionality, like this: