High Availability (HA) Deployment
  • 17 Apr 2024
  • 9 Minutes to read


Description

A standard Netreo high availability (HA) setup consists of three Netreo appliances deployed in a cluster arrangement: the primary appliance, the replica appliance, and the arbitrator appliance.

The primary appliance is the "main" Netreo appliance that would normally be providing production services. The replica appliance is the "backup" Netreo appliance that will be activated in the event that the primary fails. The arbitrator is a third Netreo appliance (with much lower resource requirements than the primary and replica appliances) that acts to provide quorum for the cluster in the event of a failure. The arbitrator also helps to reduce the stress of database replication on the primary during initial HA data synchronization.

A Netreo HA setup is not the same as other network HA setups you may be familiar with. It is not a group of "equal" nodes sharing common storage and negotiating which node should be in charge and provide production services. The terms primary, replica, and arbitrator are used for a reason. Netreo HA is very much focused on the primary being the only intended working Netreo system, with the replica acting specifically as a backup to the primary. The intention is that if the primary fails, monitoring will be handled by the replica temporarily, and only until the primary can be brought back up to resume its role. (This is a manual process that typically requires the assistance of a Netreo support engineer. There is no automatic failback process to reinstate the primary as the production server.)

The replica should never be in a position in which it is acting as the "main" Netreo system, because it does not synchronize its SQL database with the other nodes. (The SQL database contains all of Netreo's configuration data.) The arbitrator, on the other hand, will never act as a Netreo system at all. Its only operational function is to provide third-party arbitration to determine whether the primary has actually failed and whether the replica should take over monitoring duties. From this, it should be clear that the HA nodes are not equal. When the primary goes down, HA capability is lost.

HA Preparation

In order for the HA cluster nodes to communicate properly, they must be able to connect with each other using the following ports.

Protocol | Port  | HA Service
TCP      | 443   | High Availability Communication
TCP      | 4444  | Database Clustering
TCP      | 4567  | Database Clustering
TCP      | 4568  | Database Clustering
TCP      | 48100 | Instance Backing File Replication

It is important to remember that in order for HA to function properly, all managed devices in the customer environment must allow access from the IP addresses of both the primary and the replica systems. Typical protocols to consider here are SNMP, WMI, WinRM and SSH. However, depending upon the configuration of particular environments—and the feature sets in use—that list could be different. Contact Netreo Support for more information.
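As a pre-flight check, reachability of these ports can be verified from each node before initializing HA. The sketch below is a minimal, hypothetical helper (not part of Netreo) that attempts a TCP connection to each required port:

```python
import socket

# Ports required between Netreo HA cluster nodes (from the table above).
HA_PORTS = [443, 4444, 4567, 4568, 48100]

def check_ports(host, ports, timeout=2.0):
    """Return a dict mapping each port to True if a TCP connection to
    host:port succeeds within the timeout, False otherwise."""
    results = {}
    for port in ports:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                results[port] = True
        except OSError:
            results[port] = False
    return results

# Example (address is hypothetical): run from the replica toward the primary.
# blocked = [p for p, ok in check_ports("10.0.0.20", HA_PORTS).items() if not ok]
```

Any port reported as closed here points at a firewall or routing rule that must be fixed before the cluster can form.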

Netreo High Availability configurations do not provide virtual IP functionality. Both the primary and the replica systems must have static/permanent IP addresses. In the event of a failover, end-users must access the replica system via its own IP address. VIP-style functionality can be achieved externally; however, the setup and configuration of that architecture (e.g., DNS, load balancing) are the responsibility of the customer. Contact Netreo Support for more information about this topic.
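Because there is no VIP, any script or integration that talks to Netreo must know the real addresses of both appliances and fall back explicitly. A minimal sketch of that client-side pattern (the helper names are illustrative, not a Netreo API):

```python
import socket

def reachable(addr, timeout=2.0):
    """addr is a (host, port) pair; True if a TCP connection succeeds."""
    try:
        with socket.create_connection(addr, timeout=timeout):
            return True
    except OSError:
        return False

def active_appliance(primary, replica):
    """Prefer the primary; fall back to the replica's own address after a
    failover. Netreo HA provides no virtual IP, so both real addresses
    must be known to the caller."""
    if reachable(primary):
        return primary
    if reachable(replica):
        return replica
    raise ConnectionError("neither HA appliance is reachable")
```

DNS-based schemes (short-TTL records repointed at the replica) achieve the same effect, but as noted above their setup is the customer's responsibility.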

Deployment

Before attempting to configure HA capability, all three appliances should be deployed and running, with the primary configured for production services. Deploy the primary appliance first, according to the standard instructions. Then deploy the arbitrator and replica appliances. The arbitrator should be deployed physically near the primary, and the two should share the fastest link practical. The replica may be deployed anywhere required (typically offsite for disaster prevention and recovery), but the network latency between the primary and the replica must be no more than 20 ms to ensure reliable operation. To deploy an arbitrator or replica, follow the same installation instructions until the setup wizard starts, then complete only the Network Configuration and License Activation sections (be sure to note the IP address of each appliance, as you will need these to initialize HA on the primary). Enter the necessary information along with the correct PIN for the arbitrator/replica. Once the appliance has been licensed, you may close the setup wizard. Once all three appliances have been deployed, you are ready to configure HA on the primary.
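One way to sanity-check the 20 ms requirement before committing to a replica site is to time TCP handshakes from the primary's location to a host at the candidate site. This is a rough, illustrative sketch (the hostname in the comment is hypothetical), not a Netreo tool:

```python
import socket
import time

def tcp_rtt_ms(host, port, samples=5, timeout=2.0):
    """Estimate round-trip latency by timing several TCP connection
    handshakes and averaging, returning milliseconds."""
    times = []
    for _ in range(samples):
        start = time.monotonic()
        with socket.create_connection((host, port), timeout=timeout):
            pass
        times.append((time.monotonic() - start) * 1000.0)
    return sum(times) / len(times)

# Netreo HA requires no more than 20 ms between primary and replica:
# rtt = tcp_rtt_ms("replica-site.example.net", 443)  # hostname is hypothetical
# print(f"avg handshake RTT: {rtt:.1f} ms")
```

A TCP handshake time is only an approximation of link latency, but it is usually close enough to flag a site that clearly cannot meet the requirement.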

The Arbitrator
The arbitrator is not an operational Netreo system, and will not have an accessible user interface. Only the primary and replica may be accessed through a UI.

Configuration

Once the primary, arbitrator and replica appliances have all been deployed and started, you're ready to configure HA. In the UI of the primary Netreo appliance, navigate to the "High Availability Configuration" page (Administration > System > High Availability). Enter the IP addresses of the arbitrator and replica into the appropriate fields and click the Add button. The HA initialization process (see below) will immediately begin.

An Initialize HA button will also appear, in case you need to re-initialize HA after it has already been configured (such as after a failover). Click this button after a failback has been performed to restart the HA initialization process.

The HA Initialization Process

When the Add or Initialize HA button is pressed, the primary appliance begins synchronizing its MySQL configuration database to the arbitrator. The primary is typically providing production services at this point, so initial synchronization is done only with the arbitrator to reduce the load on the primary. When the synchronization from primary to arbitrator is finished, the arbitrator then synchronizes the database to the replica. Due to the size of the database, and because the replica appliance may be located anywhere, this synchronization can potentially take in excess of 24 hours. While the synchronization is taking place, the "# OF CLUSTER MEMBERS" display on the "High Availability Configuration" page of the primary will count up. When it reaches the total number of Netreo appliances in the HA cluster, synchronization is complete and HA is running.
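If you want to script the wait for initialization rather than watching the page, the polling logic is simple. In this hypothetical sketch the member count is read by a callable you supply (how you obtain the count from your environment is up to you; none of these names are a Netreo API):

```python
import time

def wait_for_cluster(get_member_count, expected=3, poll_interval=60, max_wait=None):
    """Poll a caller-supplied member-count callable until the whole HA
    cluster (primary, replica, arbitrator) has synchronized.
    Raises TimeoutError if max_wait seconds pass without reaching it."""
    waited = 0
    while True:
        count = get_member_count()
        if count >= expected:
            return count          # synchronization complete; HA is running
        if max_wait is not None and waited >= max_wait:
            raise TimeoutError(f"cluster stuck at {count}/{expected} members")
        time.sleep(poll_interval)
        waited += poll_interval
```

Given that initial synchronization can exceed 24 hours, a generous (or absent) `max_wait` is appropriate.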

HA Operations

When Netreo is configured for HA you will see a new icon in the icon group at the top right of Netreo. This icon displays the current HA status of that appliance. It can be seen in the UI of both the primary and the replica (arbitrators are not operational Netreo systems and have no UI to display states). The following tables show the icons and respective states for the primary and replica appliances. Some states apply only to the primary or the replica.


Primary Appliance HA Link Status

Color  | State  | Description
Green  | ACTIVE | Primary is polling and its MySQL database is correctly synchronizing with the replica.
Yellow | ACTIVE | Replica is inactive. HA initialization is not yet complete, or may need to be re-initialized.
Red    | FAILED | Primary has failed, causing it to stop polling. It will remain in the FAILED HA state until HA is re-initialized.


Replica Appliance HA Link Status

Color | State    | Description
Green | ACTIVE   | Replica is passively running and its MySQL database is correctly synchronizing from the primary.
Gray  | INACTIVE | Replica is licensed for HA but hasn't been configured, or HA has been stopped from the primary's UI.
Red   | TAKEOVER | Primary has (at some point) failed. Replica is now polling, authorized by the arbitrator. There is at least a 60-second delay while the replica confirms the loss of the primary before it takes over polling.

During HA cluster initialization, no monitored device performance data is written to the RRDs of the replica. So, device historical data on the replica begins at the point that HA becomes fully operational.

During normal operation, while the HA icon on the primary is green, updates to the system configuration data (MySQL database) of the primary are replicated synchronously to the replica and the arbitrator, keeping the configuration of the replica identical to that of the primary. However, updates to the historical device performance data (RRDs) of the primary are replicated asynchronously to the replica. (This means that while the system configuration data is always 100% in sync between cluster members at any given moment, collected device performance historical data may experience slight delays before it is fully synchronized between the primary and the replica.)

The basic difference between synchronous and asynchronous replication is that synchronous replication guarantees that a change committed on one node of the cluster has been applied on all other nodes at the same time. The consequence of this is that, if the connection between the primary and the other nodes (typically the replica) is slow, the performance of the primary will be adversely affected, as it waits for the write operation to finish on the other appliances before continuing. This condition applies to the replication of the MySQL database only, as no other data is replicated synchronously (historical data updates are written to the replica's RRDs asynchronously). MySQL updates to the arbitrator are purely for backup purposes.
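The trade-off described above can be illustrated with a toy model: a deliberately slow replica stalls a synchronous write but not an asynchronous one. This is a simulation for intuition only, not Netreo's actual replication code:

```python
import threading
import time

class Node:
    """Toy cluster node; write_delay simulates a slow link or disk."""
    def __init__(self, name, write_delay=0.0):
        self.name = name
        self.write_delay = write_delay
        self.data = {}

    def write(self, key, value):
        time.sleep(self.write_delay)
        self.data[key] = value

def replicate_sync(primary, replicas, key, value):
    """Synchronous: the write does not complete until every replica has
    applied it, so the slowest replica sets the primary's write speed."""
    primary.write(key, value)
    for r in replicas:
        r.write(key, value)       # block until each replica is done

def replicate_async(primary, replicas, key, value):
    """Asynchronous: the primary returns immediately; replicas catch up
    in the background and may briefly lag behind."""
    primary.write(key, value)
    for r in replicas:
        threading.Thread(target=r.write, args=(key, value)).start()
```

With a replica that takes 200 ms per write, `replicate_sync` makes the primary wait the full 200 ms, while `replicate_async` returns at once; the cost of the asynchronous path is the brief window in which the replica's copy is stale, which is exactly the behavior described for the RRD data.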

Traffic Flow Monitoring

If you are using Netreo in an HA configuration and you wish to collect traffic flow statistics (NetFlow, etc.), this must be done using a service engine running the Netreo Traffic Collector service. Additionally, the deployed service engine must not be within the HA cluster.

Log Monitoring

If you are using Netreo in an HA configuration and you wish to collect log statistics (syslog, event logs, etc.), this must be done using a service engine running the Netreo Log Collector service. As with traffic flow monitoring, the deployed service engine must not be within the HA cluster.

HA Failures

While HA is operational, the primary continuously synchronizes its configuration settings to the replica so that the replica is ready to take over production services immediately (called a failover) should the primary fail. A failure of the primary will immediately cause the replica to check whether the primary is still a member of the HA cluster. If not, the replica will wait for 60 seconds and then check again. If the primary is still not present, the replica will check for quorum with the arbitrator. If quorum exists, the replica enters the TAKEOVER state and takes over production services. Once a failover has occurred, the primary will not resume control of production services even if it rejoins the cluster. HA must be re-initialized manually (repair the issue, bring the primary back up and click the Initialize HA button). Note: This runs the entire HA initialization process over again, so however long it took the first time, expect initialization to take at least that long now.
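The replica's decision sequence described above (check membership, wait 60 seconds, re-check, then consult the arbitrator for quorum) can be sketched as follows. The two callables stand in for environment-specific cluster checks; this is an illustration of the documented logic, not Netreo's implementation:

```python
import time

def should_take_over(primary_in_cluster, quorum_with_arbitrator, recheck_delay=60):
    """Replica-side failover decision. `primary_in_cluster` and
    `quorum_with_arbitrator` are callables returning the current cluster
    view (hypothetical stand-ins for the real cluster checks)."""
    if primary_in_cluster():
        return False                  # primary healthy: nothing to do
    time.sleep(recheck_delay)         # 60-second confirmation window
    if primary_in_cluster():
        return False                  # transient blip: primary came back
    # Primary is confirmed gone; take over only if the arbitrator grants
    # quorum. A lone replica never takes over (prevents split-brain).
    return quorum_with_arbitrator()
```

The final quorum check is what makes the arbitrator essential: without it, the replica cannot distinguish "the primary died" from "I am the one who is partitioned off".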

While the primary is in the ACTIVE state (above), a failure of the replica won't affect monitoring, but will cause HA capability to be lost (since the replica is the backup appliance). A failure of the arbitrator won't affect anything until a failure of the primary also occurs, at which point there will be a total HA failure, since the primary will have failed and the replica will not attempt to take over if it is the only remaining member of the cluster.

If, during a failure, the primary is not available on the network (link down, crash, etc.), the replica will not be able to write updates to the historical data files of the primary. This means that there will typically be a gap in the primary's historical data for the entire time it was down. If the primary becomes available on the network again, the replica will resume attempting to write updates to the primary's RRDs. The primary will not, however, attempt to resume control of production services.

