Configuring your system for high availability

To recover from outages, whether planned (such as maintenance and backups) or unplanned (such as software failures, hardware failures, power failures, and disasters), you can configure an IBM® Watson IoT Platform - Message Gateway server to serve as a backup.

You might want to configure your system for high availability (HA) in case the primary server experiences a failure. If your server is configured to be highly available, the standby server automatically takes over as the new primary server if the original primary server fails. After the original primary is available again, it acts as the standby server.

Connecting a pair of servers for high availability

Each server in an HA pair is identified by one of the following roles:

Primary: The server that is processing messages.
Standby: The backup server to which the primary server is replicated.

Configuring server parameters to support HA

To use the HA feature, you must enable high availability and assign a group name on both servers. You must consider the values that you want to assign to the following parameters before you configure a pair of servers for HA:

Group

Group is used to identify which server to pair with. The value must be the same on both servers. The maximum length of this value is 128 characters.
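
The pairing requirements above (HA enabled on both servers, the same group name on both, and a group name of at most 128 characters) can be sketched as a small pre-flight check. This is a minimal illustration only: the HASettings class and the pairing_problems function are hypothetical names introduced here, not part of the product's API.

```python
from dataclasses import dataclass
from typing import List

MAX_GROUP_LENGTH = 128  # maximum group name length stated in this topic


@dataclass
class HASettings:
    """Hypothetical holder for one server's HA settings (not a product API)."""
    enabled: bool
    group: str


def pairing_problems(a: HASettings, b: HASettings) -> List[str]:
    """Return the reasons why servers A and B cannot form an HA pair."""
    problems = []
    if not (a.enabled and b.enabled):
        problems.append("HA must be enabled on both servers")
    for name, settings in (("A", a), ("B", b)):
        if not settings.group:
            problems.append(f"server {name} has no group name")
        elif len(settings.group) > MAX_GROUP_LENGTH:
            problems.append(f"server {name} group name is too long")
    if a.group != b.group:
        problems.append("group names differ")
    return problems
```

An empty list means that the two sets of values satisfy the requirements described in this section; it does not prove that the servers can reach each other.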

Startup mode

A node can be set in either auto-detect or stand-alone mode.

In auto-detect mode, two nodes must be started. The nodes automatically try to detect one another and establish an HA pair.
Use stand-alone mode only when you are starting a single node. Stand-alone mode is used to bring up a single node with the intention to later bring up another node that synchronizes with the first node and creates an HA pair.

Local Discovery Address, Local Replication Address, and Remote Discovery Address

In an HA environment, each server must have two network interfaces: a replication interface and a discovery interface.

The remote discovery address is the IP address of the discovery interface on the remote node in the HA pair.

The local discovery and local replication addresses are the IP addresses of the discovery and replication interfaces of the local node.

You can choose the IP addresses that you assign to these interfaces, provided that the following criteria are met:

The IP addresses that are assigned to the discovery interface on server A and server B are on the same subnet (for example, subnet 1).
The IP addresses that are assigned to the replication interface on server A and server B are on the same subnet (for example, subnet 2).
Subnet 1 and subnet 2 are not the same.

For example, you can configure the discovery and replication interfaces for server A and server B in the following way:

server A: Discovery interface: 192.0.20.10/24; Replication interface: 192.0.30.11/24
server B: Discovery interface: 192.0.20.20/24; Replication interface: 192.0.30.21/24

where 192.0.20.0/24 is the discovery subnet and 192.0.30.0/24 is the replication subnet.
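
The three addressing criteria can be checked before you apply a configuration. The following sketch uses the Python standard library ipaddress module; the function name check_ha_interfaces is a hypothetical helper introduced for illustration, and the addresses in the usage note are the example values from this topic.

```python
import ipaddress


def check_ha_interfaces(disc_a, repl_a, disc_b, repl_b):
    """Check the addressing criteria for an HA pair.

    Each argument is an interface address in CIDR form, for example
    "192.0.20.10/24". Returns a list of violated criteria, which is
    empty when all criteria are met.
    """
    da, ra, db, rb = (
        ipaddress.ip_interface(addr)
        for addr in (disc_a, repl_a, disc_b, repl_b)
    )
    problems = []
    # Criterion 1: both discovery interfaces are on the same subnet.
    if da.network != db.network:
        problems.append("discovery interfaces are on different subnets")
    # Criterion 2: both replication interfaces are on the same subnet.
    if ra.network != rb.network:
        problems.append("replication interfaces are on different subnets")
    # Criterion 3: the discovery subnet differs from the replication subnet.
    if da.network == ra.network or db.network == rb.network:
        problems.append("discovery and replication share a subnet")
    return problems
```

For the example values, check_ha_interfaces("192.0.20.10/24", "192.0.30.11/24", "192.0.20.20/24", "192.0.30.21/24") returns an empty list, because each interface type shares a subnet across the two servers and the two subnets differ.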

Timeouts

The discovery timeout is the amount of time, in seconds, within which a server that is started in auto-detect mode must connect to the other server in the HA pair. The valid range is 10 - 2147483647. The default is 600. If the connection is not made within that time, the server starts in maintenance mode.

The heartbeat timeout is the amount of time, in seconds, within which a server must detect that the other server in the HA pair has failed. The valid range is 1 - 2147483647. The default is 10. If the primary server does not receive a heartbeat from the standby server within this time, it continues to work as the primary server, but the data synchronization process is stopped. If the standby server does not receive a heartbeat from the primary server within this time, the standby server becomes the primary server.
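
The timeout rules can be summarized in a short sketch. It models only the ranges, defaults, and outcomes stated in this topic; the function names and the returned strings are hypothetical and do not correspond to product commands or status values.

```python
DISCOVERY_RANGE = (10, 2147483647)   # valid discovery timeout, in seconds
HEARTBEAT_RANGE = (1, 2147483647)    # valid heartbeat timeout, in seconds
DISCOVERY_DEFAULT = 600
HEARTBEAT_DEFAULT = 10


def validate_timeouts(discovery=DISCOVERY_DEFAULT, heartbeat=HEARTBEAT_DEFAULT):
    """Raise ValueError if either timeout is outside its valid range."""
    checks = (
        ("discovery", discovery, DISCOVERY_RANGE),
        ("heartbeat", heartbeat, HEARTBEAT_RANGE),
    )
    for name, value, (low, high) in checks:
        if not low <= value <= high:
            raise ValueError(f"{name} timeout {value} is not in {low} - {high}")
    return discovery, heartbeat


def startup_outcome(connected_within_timeout):
    """Outcome for a server that is started in auto-detect mode."""
    return "HA pair established" if connected_within_timeout else "maintenance mode"


def on_heartbeat_timeout(role):
    """What a server does when no heartbeat arrives within the timeout."""
    if role == "primary":
        return "remain primary; stop data synchronization"
    if role == "standby":
        return "become primary"
    raise ValueError(f"unknown role: {role}")
```

Calling validate_timeouts() with no arguments accepts the defaults of 600 and 10 seconds, whereas validate_timeouts(discovery=5) raises ValueError because 5 is below the minimum of 10.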

Clock synchronization and MaxMessageTimeToLive

Consider the value of the MaxMessageTimeToLive parameter on your messaging policies when you synchronize the clocks on the servers in the HA pair. This parameter specifies the maximum time that a put or published message can exist in IBM Watson IoT Platform - Message Gateway. The smaller the value of this parameter, the more closely you need to synchronize the clocks on the HA pair.

For example, if you set a MaxMessageTimeToLive value of 3600, which equates to 1 hour, you can synchronize the clocks on the HA pair to within 1 or 2 minutes of each other. If you set a MaxMessageTimeToLive value of 1, which equates to 1 second, synchronize the clocks on the HA pair to within 100 milliseconds. Use a Network Time Protocol (NTP) server to keep the clocks synchronized.

After the servers are configured, you must stop and restart the IBM Watson IoT Platform - Message Gateway server.