Replacing a node in a high availability group

If an appliance that belongs to a high availability (HA) group fails, you can replace the appliance and then restore the HA group by following this procedure. You can also follow this procedure if you want to replace a functioning appliance (for example, with a later model).

Before you begin

When a node in an HA group fails (or is shut down), the queue managers fail over to the remaining appliance in the group. You can continue running the queue managers on the remaining appliance while you replace the failed or shut down appliance.

This procedure preserves only the HA queue managers that belong to the HA group on the failed appliance. It does not preserve standalone queue managers, and you must take steps to recreate the configuration of any disaster recovery queue managers that existed on the failed appliance (see dspdrsecondary).

To restore high availability function after you replace or repair the failed appliance, you configure the appliance so that it looks like the one it is replacing and then run a recreate HA group command on the remaining appliance in the group.

You must ensure that both appliances are running the same level of firmware. If your new appliance is running a later version of the firmware, you must downgrade your new appliance.
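
As an illustration of the firmware check, the following Python sketch compares two firmware levels that you have noted by hand. It assumes that levels are written as dotted numeric strings with the same number of fields (for example 9.3.5.0). The function names and the returned strings are hypothetical and are not part of any appliance interface.

```python
def parse_level(level: str) -> tuple:
    """Split a dotted numeric level such as '9.3.5.0' into a tuple of integers."""
    return tuple(int(part) for part in level.strip().split("."))


def firmware_action(new_appliance: str, surviving_appliance: str) -> str:
    """Suggest what to do so that both appliances run the same firmware level.

    Assumption: both strings use the same dotted numeric format.
    """
    new = parse_level(new_appliance)
    surviving = parse_level(surviving_appliance)
    if new == surviving:
        return "match"
    if new > surviving:
        # The new appliance is at a later level and must be downgraded.
        return "downgrade new appliance"
    # The new appliance is at an earlier level; install the matching firmware.
    return "install matching firmware on new appliance"
```

For example, calling firmware_action("9.3.5.1", "9.3.5.0") reports that the new appliance must be downgraded before the HA group is recreated.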

Note: This procedure preserves HA queue managers that were running on the failed or replaced appliance. You must take steps to back up and manually restore any standalone queue managers that were running on that appliance.

Procedure

  • If the appliance you are replacing is running, you can prepare for its replacement by completing the following steps:
    1. Back up the appliance configuration. See Secure backup for release 9.3.5, or Backing up or saving the appliance configuration for earlier releases.
    2. For releases before 9.3.5, back up details of the messaging users, see Backing up messaging users.
    3. Take a note of the Ethernet settings (this information is contained in the configuration backup file).
    4. Use the dsphalink command to check whether the configuration uses the default eth21 interface for replication. If a custom link is used, make a note of it.
    5. Take a note of the appliance name (this information is contained in the configuration backup file).
    6. Take a note of the exact version of firmware that the appliance is running.
    7. Optionally, create a test HA queue manager that can be used for initial testing after the replacement appliance is set up in the HA group.
    8. Shut down the appliance (failing over queue managers to the other appliance in the HA group).
  • To configure the replacement appliance to match the original appliance, complete the following steps:
    1. Install the replacement appliance, ensuring that you connect all the Ethernet cables as they were connected on the old appliance. See Installation of the appliance in a rack.
    2. Configure the appliance by using one of the following methods:
      • Restore from the configuration backup, if you were able to take one. See Secure restore.
        Note: If you restore to a later model of appliance, resources that are not applicable to the replacement appliance model are disregarded but might cause startup errors (see startup errors). After you correct the errors, you might need to run write mem in the configuration console and restart the appliance.
      • Run the installation wizard, see Initializing the appliance.
      • Use the command line interface, see Configuring the appliance.
      • Use the web UI, see Configuring the appliance.
      You must configure the appliance to match its predecessor. (If you do not know the appliance name, use the dsphagrp command from the mqcli command line of the other appliance to discover it.)
    3. Install the version of the firmware that the original appliance was running, see Installing new firmware.
    4. For release 9.3.5, restore the backup. See Secure restore. (If you were unable to take backups from your original appliance, you can take backups from the other appliance in the HA group and restore them to your new appliance, see Secure backup.)
    5. Verify that the Ethernet IP addresses and system name configured on the new appliance match those on the original appliance.
  • To prepare the surviving appliance in the HA group:
    1. Back up the queue managers. See Backing up a queue manager.
    2. Change the preferred location setting on all the HA queue managers to this appliance. See Managing queue manager locations in a high availability group.
    3. If your HA queue managers are also configured for disaster recovery (DR), ensure that they have the DR Primary role on this appliance. See Viewing the status of a disaster recovery queue manager. (If any of these queue managers have the Partitioned status, resolve that now rather than waiting until the other appliance is restored.)
  • To recreate the HA group:
    1. On the new appliance, issue the prepareha command from the mqcli command line:
      prepareha -s secret_key -a IP_address
      Where secret_key specifies a string that is used to generate a short-lived password, and IP_address specifies the IP address of the HA group primary interface on the other appliance in the group.
    2. On the other appliance, issue the crthagrp command from the mqcli command line to recreate the HA group:
      crthagrp -s secret_key -r
      The HA group is recreated. The HA queue managers continue to run on the existing appliance while you restore the HA group. They do not fail back to the new appliance unless you have designated the new appliance as the preferred location, or other failover conditions are met (see Causes of HA failover).
    3. If your original configuration used a custom replication link, use the sethalink command on both appliances to configure the custom link, see Configuring custom HA replication interfaces.
  • To validate the HA group:
    1. Check the output of the crthagrp command and ensure that all the HA queue managers were successfully recreated on the new appliance. (If any of the HA queue managers also had DR configured you should also see messages about DR.)
    2. Check the status of each of the queue managers on the surviving appliance and on the replacement appliance. The HA status should be Normal. The DR status should be Normal if the queue manager is an HA primary; no DR status field is shown if it is an HA secondary. If the status shows that synchronization is in progress, repeat the status check.
    3. If you have DR configured, run the dspdrlink command on the replacement appliance and check that there are no errors in the output.
    4. If you created a test queue manager as part of your preparation, try failing it over. Also try creating a new HA queue manager.
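
The step that verifies that the Ethernet IP addresses and system name on the new appliance match the original can be sketched in Python as a comparison of two sets of notes. This is only an illustration: the setting names, interface names other than eth21, and addresses shown are hypothetical values that you would replace with the details recorded from your own appliances.

```python
def find_mismatches(original: dict, replacement: dict) -> dict:
    """Return {setting: (original_value, replacement_value)} for each difference.

    A setting that is missing on one side is reported with None for that side.
    """
    mismatches = {}
    for key in sorted(set(original) | set(replacement)):
        if original.get(key) != replacement.get(key):
            mismatches[key] = (original.get(key), replacement.get(key))
    return mismatches


# Hypothetical notes taken from the original and the replacement appliance.
original_notes = {
    "system_name": "CASTOR",
    "eth0": "10.0.0.11",
    "eth21": "10.0.2.11",
}
replacement_notes = {
    "system_name": "CASTOR",
    "eth0": "10.0.0.11",
    "eth21": "10.0.2.12",
}

differences = find_mismatches(original_notes, replacement_notes)
```

An empty result means that the recorded settings agree. Any entry in the result identifies a setting to correct on the replacement appliance before you recreate the HA group.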

Example

In this example, you configure a new appliance named 'CASTOR'. The surviving appliance is named 'POLLUX', and POLLUX is running the HA queue manager 'HA_QM1' and the HA/DR queue manager 'DRHA_QM1'. You prepare CASTOR by running the following command:

prepareha -s SuperSecretPassword -a 192.168.123.200

You then run the following command on POLLUX to recreate the group:

crthagrp -s SuperSecretPassword -r

POLLUX outputs the following messages as the HA group is recreated:

Creating high availability configuration on appliance 'CASTOR'.
Recreating high availability configuration for queue manager 'HA_QM1' on appliance 'CASTOR'.
Recreation completed for queue manager 'HA_QM1' on appliance 'CASTOR'. 
Recreating high availability configuration for queue manager 'DRHA_QM1' on appliance 'CASTOR'.
Recreating disaster recovery configuration for queue manager 'DRHA_QM1' on appliance 'CASTOR'.
Recreation completed for queue manager 'DRHA_QM1' on appliance 'CASTOR'. 
This Appliance: Online
Appliance CASTOR: Online
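
If you capture the command output to a file, a short script can help with the first validation step by listing any queue manager whose recreation started but did not report completion. The following Python sketch assumes that the messages have exactly the wording shown in this example. The wording might differ in other releases, and the function name is hypothetical.

```python
import re

# Patterns taken from the example messages above (an assumption about wording).
STARTED = re.compile(
    r"Recreating high availability configuration for queue manager '([^']+)'"
)
COMPLETED = re.compile(r"Recreation completed for queue manager '([^']+)'")


def incomplete_queue_managers(output: str) -> set:
    """Return the queue managers that started recreation but never completed."""
    started = set(STARTED.findall(output))
    completed = set(COMPLETED.findall(output))
    return started - completed


SAMPLE = """\
Creating high availability configuration on appliance 'CASTOR'.
Recreating high availability configuration for queue manager 'HA_QM1' on appliance 'CASTOR'.
Recreation completed for queue manager 'HA_QM1' on appliance 'CASTOR'.
Recreating high availability configuration for queue manager 'DRHA_QM1' on appliance 'CASTOR'.
Recreating disaster recovery configuration for queue manager 'DRHA_QM1' on appliance 'CASTOR'.
Recreation completed for queue manager 'DRHA_QM1' on appliance 'CASTOR'.
This Appliance: Online
Appliance CASTOR: Online
"""

missing = incomplete_queue_managers(SAMPLE)
```

An empty set indicates that every queue manager named in a recreation message also has a completion message. The script does not replace the status checks described in the procedure.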

What to do next

After you validate that the appliance replacement is successful and that the new appliance is fully functional, reset the preferred locations of your HA queue managers as required and resume normal operation.