Failover steps when active site is inaccessible

Disable network access to your failed active data center, and promote your warm-standby data center to active.

About this task

Follow the steps in this procedure when your active site is inaccessible, or the API Connect subsystems in your active site are not responding to apicup operations.

Procedure

  1. Update the network configuration in your failed data center to isolate the API Connect subsystems in this data center, so that they cannot communicate with each other or with the warm-standby data center. Network isolation is necessary to prevent a split-brain situation, which occurs if your active site recovers unexpectedly and starts communicating with your other subsystems. One illustrative way to apply the isolation is sketched after the example CR output below.
    Both the active and passive sites show the following HA status in the management CR:
    HA status PeerNotReachable - see HAStatus in CR for details
    Example Passive site CR output:
    NAME     READY   STATUS    VERSION    RECONCILED VERSION   MESSAGE                                                       AGE
    mgmtdr   7/23    Blocked   10.0.8.3   10.0.8.3-2844        HA status PeerNotReachable - see HAStatus in CR for details   36d
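    How you isolate the subsystems depends on your network design; many deployments block the traffic at a data center firewall or network switch. The following is a minimal sketch only, assuming you control host-level firewalls on the DC1 machines and using hypothetical example subnets; substitute your own addresses, or apply equivalent rules on your network equipment instead.
    # Hypothetical subnets: 198.51.100.0/24 = DC1 subsystems, 203.0.113.0/24 = DC2 subsystems.
    # Run on each DC1 host, or apply the equivalent rules on your firewall.
    iptables -I OUTPUT -d 203.0.113.0/24 -j DROP    # block DC1 -> DC2 traffic
    iptables -I INPUT -s 203.0.113.0/24 -j DROP     # block DC2 -> DC1 traffic
    iptables -I OUTPUT -d 198.51.100.0/24 -j DROP   # block DC1 subsystem-to-subsystem traffic
    iptables -I INPUT -s 198.51.100.0/24 -j DROP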

  2. Promote your warm-standby management subsystem to active:
    apicup subsys set <mgmt_subsystem> multi-site-ha-mode=active
    Apply the change with:
    apicup subsys install <mgmt_subsystem> --force-promotion --skip-health-check
    Monitor the progress of the promotion with:
    apicup subsys health-check <mgmt_subsystem>
    Expected CR output after successful promotion:
    NAME     READY   STATUS    VERSION    RECONCILED VERSION   MESSAGE                                                                            AGE
    mgmtdr   23/23   Running   10.0.8.3   10.0.8.3-2844        Management is ready. HA status PeerNotReachable - see HAStatus in CR for details   36d
    Important:
    • The process of promoting the passive site to active can take a long time: up to 30 minutes in an average environment, and longer in slower environments.
    • The health-check output displays multiple error messages before the promotion completes.
    • Do not proceed to the next step until the promotion to active has completed and the health-check command returns that the system is Running.
    • Even after completion, the system still shows an error state because the network is down.
    Use the -v flag to see more information:
    apicup subsys health-check <mgmt_subsystem> -v
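    For example, if your management subsystem is named mgmtv10 (the example name that is used later in this topic; substitute your own subsystem name), the promotion sequence looks like this:
    apicup subsys set mgmtv10 multi-site-ha-mode=active
    apicup subsys install mgmtv10 --force-promotion --skip-health-check
    # Repeat until the subsystem reports Running; errors are expected while the promotion is in progress.
    apicup subsys health-check mgmtv10 -v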
  3. Promote your warm-standby portal subsystem to active:
    apicup subsys set <portal_subsystem> multi-site-ha-mode=active
    Apply the change with:
    apicup subsys install <portal_subsystem>
    Monitor the progress of the promotion with:
    apicup subsys health-check <portal_subsystem>
    When the command returns no output, the promotion to active is complete.
    Use the -v flag to see more information:
    apicup subsys health-check <portal_subsystem> -v
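    For example, for a portal subsystem with the hypothetical name portalv10:
    apicup subsys set portalv10 multi-site-ha-mode=active
    apicup subsys install portalv10
    # No output from the health check means the promotion is complete.
    apicup subsys health-check portalv10 -v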
  4. Update your dynamic router to redirect all traffic to DC2 instead of DC1.
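    The redirection commands depend entirely on your routing technology. As one illustrative sketch only, if routing is controlled by a DNS record that resolves to the active site and your DNS server accepts BIND dynamic updates, you could repoint the record at DC2 (hypothetical key file, zone, record, and address):
    nsupdate -k /etc/dns/ddns.key <<'EOF'
    server ns1.example.com
    update delete api.example.com A
    update add api.example.com 300 A 203.0.113.10
    send
    EOF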

What to do next

After failover, select the appropriate recovery path based on whether the failed data center can be recovered.
If the failed data center cannot be recovered
Do not leave your remaining data center as an active 2DCDR deployment without a functioning warm-standby; either remove the 2DCDR configuration so that the remaining data center runs as a stand-alone deployment, or configure a new warm-standby data center.
If you are able to recover your failed data center
Before re-enabling network access, convert the management and portal subsystems in the failed data center to warm-standby:
  1. Set the multi-site-ha-mode property to passive for the management subsystem in DC1:
    apicup subsys set <DC1 management> multi-site-ha-mode=passive
  2. CAUTION:
    This is a destructive command that will cause data loss if used incorrectly or in the wrong environment. Ensure you are demoting the correct environment before proceeding.
    Apply the update to convert the management subsystem in DC1 to passive:
    apicup subsys install <DC1 management> --skip-health-check --accept-dr-data-deletion --force-demotion
    Note:
    • --skip-health-check bypasses health validation, because the network to the peer site is down.
    • --accept-dr-data-deletion acknowledges that all contents of the management database in DC1 are deleted when the subsystem is converted to warm-standby, to be replaced by the contents from the other data center when synchronization resumes. The flag is your acknowledgment that you accept this temporary loss of data.
    • --force-demotion ensures that the subsystem is demoted even if the peer is unreachable.
  3. Monitor the progress of the conversion to warm-standby:
    apicup subsys health-check <DC1 management> -v
    Note: Demotion can take time. Monitor using the health-check command until the output looks similar to:
    ./apicup subsys health-check mgmtv10
    Error: Cluster not in good health:
    ManagementCluster (specified multi site ha mode: passive, current ha mode: WarmStandbyError, ha status: PeerNotReachable, ha message: multi site peer is not reachable. Warm Standby is enabled, cannot proceed further without peer communication. requeuing after 10 sec) is not Ready or Complete | State: 21/22 | Phase: Blocked | Message: HA status PeerNotReachable - see HAStatus in CR for details
    Use the --validate option to confirm that the environment is now set to passive:
    ./apicup subsys get mgmtv10 --validate
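    For example, with a DC1 management subsystem named mgmtv10, the full demotion sequence is:
    apicup subsys set mgmtv10 multi-site-ha-mode=passive
    apicup subsys install mgmtv10 --skip-health-check --accept-dr-data-deletion --force-demotion
    # Monitor until the output reports warm-standby with PeerNotReachable, then confirm the setting.
    apicup subsys health-check mgmtv10 -v
    apicup subsys get mgmtv10 --validate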
  4. Set the multi-site-ha-mode property to passive for the portal subsystem in DC1:
    apicup subsys set <DC1 portal> multi-site-ha-mode=passive
  5. Apply the update to DC1:
    apicup subsys install <DC1 portal> --skip-health-check
  6. Monitor the progress of the conversion to warm-standby:
    apicup subsys health-check <DC1 portal> -v
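    For example, for a DC1 portal subsystem with the hypothetical name portalv10:
    apicup subsys set portalv10 multi-site-ha-mode=passive
    apicup subsys install portalv10 --skip-health-check
    apicup subsys health-check portalv10 -v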
  7. When you have confirmed that both management and portal subsystems in DC1 are set to warm-standby, you can re-enable the network to the DC1 API Connect subsystems, and the two data centers should synchronize their API Connect data.
    Monitor the health at both data centers with:
    apicup subsys health-check <DC1 management> -v
    apicup subsys health-check <DC2 management> -v
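    If you isolated DC1 with host-level firewall rules as in the earlier sketch, re-enabling the network is the inverse operation (again, hypothetical subnets; remove the equivalent rules from your data center firewall if that is where you applied them):
    iptables -D OUTPUT -d 203.0.113.0/24 -j DROP
    iptables -D INPUT -s 203.0.113.0/24 -j DROP
    iptables -D OUTPUT -d 198.51.100.0/24 -j DROP
    iptables -D INPUT -s 198.51.100.0/24 -j DROP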