How to fail over API Connect from the active to the warm-standby data center

How to complete service failover in a 2DCDR deployment on Kubernetes, OpenShift, and Cloud Pak for Integration.

Before you begin

Ensure that you have read and understand the 2DCDR concepts and reviewed the failure scenarios that are described in Key concepts of 2DCDR and failure scenarios. Do not proceed with the failover until you confirm that it is the correct course of action for your situation.

About this task

Carefully follow the steps in this topic. Be aware of the following important points:
  • The first step in the process is to set the active data center to warm-standby. When the active data center is set to warm-standby, all data is deleted from the active data center's management database, to be replaced by data copied from the warm-standby when it becomes active. Do not proceed with failover if you suspect there is also a problem on the warm-standby data center and you are unsure it has the most recent management data. See Verifying replication between data centers, and consider restoring your active site from backup instead of attempting a failover: Backup and restore requirements for a two data center deployment.
  • If the warm-standby data center is offline for more than 24 hours, there can be issues with the disk space on the active data center. In this case, you should revert your deployment to a single data center topology. For more information, see Removing a two data center deployment.
  • Do not allow both data centers to be set to active at the same time while the network links between them are enabled; this causes split-brain.
  • If the failure in the active data center prevents you from updating its API Connect custom resources (CRs), disable the network links that API Connect uses between the data centers.
Note: For OpenShift users: The steps that are detailed in this topic use the Kubernetes kubectl command. On OpenShift, use the equivalent oc command in its place. If you are using a top-level CR, you must edit the multiSiteHA section for the subsystem in the top-level CR, instead of directly in the subsystem CRs. If the subsystem section is not included in the top-level CR, copy and paste the section from the subsystem CR to the top-level CR.
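In a top-level CR, the multiSiteHA section sits under the management and portal sections of spec, rather than at the top level of each subsystem CR. The following is a minimal sketch of that layout only, assuming the apiconnect.ibm.com/v1beta1 API version; the CR name and namespace are placeholders, only the mode field is shown, and your CR contains many other settings:

  apiVersion: apiconnect.ibm.com/v1beta1
  kind: APIConnectCluster
  metadata:
    name: <top-level-cr-name>
    namespace: <namespace>
  spec:
    management:
      multiSiteHA:
        mode: active        # or passive, depending on the role of this data center
    portal:
      multiSiteHA:
        mode: active        # or passive, depending on the role of this data center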

Procedure

  • Management subsystem failover. In this example, DC1 is the active data center and DC2 is the warm-standby.

    To initiate a failover from DC1 to DC2, you must first set DC1 to warm-standby before you set DC2 to active. This sequence prevents split-brain, because two data centers must never be active at the same time. The following instructions show how to fail over the Management subsystem from DC1 to DC2.

    1. Set the DC1 Management subsystem to be warm-standby.
      Edit the DC1 ManagementCluster custom resource (CR) by running the following command:
      kubectl edit mgmt -n <namespace>
      and change the mode to passive:
      multiSiteHA:
        mode: passive
      
      Note: If API Connect on DC1 is down, such that you can’t set it to be warm-standby, you must ensure that the network links between DC1 and DC2 are disabled before you set DC2 to be active. You must then not restore the network links until you can set DC1 to be warm-standby.
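      If you prefer a non-interactive change, a merge patch achieves the same edit. This is a sketch only; <mgmt-cr-name> is a placeholder for your ManagementCluster CR name, which you can list with kubectl get mgmt -n <namespace>:
      kubectl patch mgmt <mgmt-cr-name> -n <namespace> --type=merge \
        -p '{"spec":{"multiSiteHA":{"mode":"passive"}}}'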
    2. Run the following command on DC1 to check that the Management subsystem HA mode is no longer active:
      kubectl describe mgmt -n <namespace>
      During transition to warm-standby, the Ha mode part of the Status output shows progressing to passive. When the transition is complete, it shows passive.
      Status:
        ...
        Ha mode:                            passive
        ...

      You can proceed to set DC2 to active while DC1 is still progressing to passive.
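      Alternatively, you can query only the HA mode from the CR status (the status field is named haMode), which is convenient for scripting; <mgmt-cr-name> is a placeholder for your ManagementCluster CR name:
      kubectl get mgmt <mgmt-cr-name> -n <namespace> -o jsonpath='{.status.haMode}{"\n"}'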

    3. Change the DC2 Management subsystem from warm-standby to active.
      Note: If API Connect is installed as part of Cloud Pak for Integration, then before you convert your warm-standby to active, complete the following checks:
      1. Check whether a configurator job exists:
        oc -n <namespace> get jobs -l "app.kubernetes.io/managed-by=apiconnect-operator" | grep configurator
        If a configurator job exists, delete it:
        oc -n <namespace> delete job <configurator job name>
      2. Remove the spec.disabledServices.configurator property from your APIConnectCluster CR. You added this property in step 4 of Installing API Connect on the warm-standby data center.
      Check the status.haMode of DC1:
      • If the DC1 status is progressing to passive, then edit the DC2 ManagementCluster CR and set spec.multiSiteHA.mode to active.
      • If the DC1 status is passive, or if DC1 is unavailable, then edit the DC2 ManagementCluster CR and set spec.multiSiteHA.mode to active, and add the annotation annotations.apiconnect-operator/skip-ha-status-ready-check: "true".
      If you see the error message:
      Cannot change mode from active to passive or from passive to active when multiSiteHA system status is not ready.
      then edit the DC2 ManagementCluster CR and add the annotation annotations.apiconnect-operator/skip-ha-status-ready-check: "true".
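      For reference, the following sketch shows only the relevant fragments of the DC2 ManagementCluster CR after this change, with the optional annotation included; all other fields are omitted:
      metadata:
        annotations:
          apiconnect-operator/skip-ha-status-ready-check: "true"
      spec:
        multiSiteHA:
          mode: active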
    4. Run the following command on DC2 to check that the Management subsystem is ready for traffic:
      kubectl describe mgmt -n <namespace>
      The services are ready for traffic when the Ha mode part of the Status object is set on DC2 to active. For example:
      Status:
        ...
        Ha mode:                            active
        ...
      Note: Failover can take 10 minutes or more to complete. If there are any problems, you can check the API Connect operator pod log for errors (search for the word Multi).
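      For example, assuming the API Connect operator runs as a Deployment named ibm-apiconnect (adjust the name and namespace to match your installation), you could scan its log with:
      kubectl logs deployment/ibm-apiconnect -n <operator-namespace> | grep Multi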
    5. Update your dynamic router to redirect all traffic to DC2 instead of DC1. Until the DC2 site becomes active, the UIs might not be operational.
  • Developer Portal subsystem failover

    If you have multiple portal services, you can fail over a specific portal service, or all of the portal services that are running in a particular data center.

    The following instructions show how to fail over the Developer Portal subsystem from DC1 to DC2. If you have multiple Developer Portal services, you must repeat these steps for each Developer Portal subsystem that you want to fail over.

    Note:
    • Developer Portal failover results in a temporary Developer Portal website outage until the warm-standby data center is ready to accept traffic.
    • If you want to monitor the failover process, you can run the following command to check the status:
      kubectl describe ptl -n <namespace>
    • Do not edit the portal CR file during the failover process.
    1. Set the DC1 Developer Portal subsystem to be warm-standby.
      Edit the DC1 PortalCluster custom resource (CR) by running the following command:
      kubectl edit ptl <ptl-cluster-name> -n <namespace>
      and change the mode to passive:
      multiSiteHA:
        mode: passive
      
      Where <ptl-cluster-name> is the name of your portal cluster if you have more than one. To see a list of your portal clusters, run:
      kubectl get PortalCluster -n <namespace>
      Note: If API Connect on DC1 is down, such that you can’t set it to be warm-standby, you must ensure that the network links between DC1 and DC2 are disabled before you set DC2 to be active. You must then not restore the network links until you can set DC1 to be warm-standby.
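      As with the Management subsystem, a non-interactive merge patch is an equivalent sketch of this edit:
      kubectl patch ptl <ptl-cluster-name> -n <namespace> --type=merge \
        -p '{"spec":{"multiSiteHA":{"mode":"passive"}}}'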
    2. Run the following command on DC1 to check the Developer Portal subsystem status:
      kubectl describe ptl <ptl-cluster-name> -n <namespace>
      You can continue to the next step when the Ha mode part of the Status object is set to progressing to passive, or any of the passive states. For example:
      Status:
        ...
        Ha mode:                            progressing to passive
        ...
      Important: You must not set DC2 to active until DC1 is either in passive or progressing to passive state. In other words, DC1 and DC2 must never both be in active state at the same time.
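      To query only the HA mode rather than reading the full describe output, assuming the PortalCluster status exposes the same haMode field as the ManagementCluster:
      kubectl get ptl <ptl-cluster-name> -n <namespace> -o jsonpath='{.status.haMode}{"\n"}'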
    3. Change the DC2 Developer Portal subsystem from warm-standby to active.
      Edit the DC2 PortalCluster custom resource (CR) by running the following command:
      kubectl edit ptl <ptl-cluster-name> -n <namespace>
      and change the mode to active:
      multiSiteHA:
        mode: active
      
      Where <ptl-cluster-name> is the name of your portal cluster if you have more than one. To see a list of your portal clusters, run:
      kubectl get PortalCluster -n <namespace>
    4. Update your dynamic router to redirect all traffic to DC2 instead of DC1.
    5. Run the following command to check that the Developer Portal services on DC1 and DC2 are ready for traffic:
      kubectl describe ptl <ptl-cluster-name> -n <namespace>
      The services are ready for traffic when the Ha mode part of the Status object is set on DC1 to passive, and is set on DC2 to active. For example, on DC2:
      Status:
        ...
        Ha mode:                            active
        ...
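      If you have kubeconfig contexts for both data centers (the context names dc1 and dc2 below are assumptions, not defaults), a small loop such as the following sketch checks both sites in one pass:
      for ctx in dc1 dc2; do
        echo -n "$ctx: "
        kubectl --context "$ctx" get ptl <ptl-cluster-name> -n <namespace> -o jsonpath='{.status.haMode}{"\n"}'
      done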
  • Failover of both Management and Portal subsystems

    Fail over the Management subsystem first, followed by the Portal subsystem.

    How long it takes to complete the failover varies, and depends on hardware speed, network latency, and the size of the databases. Approximate timings are:

    For the Management subsystem:
    • warm-standby to active: 5 minutes
    • active to warm-standby: 15 minutes
    For the Developer Portal:
    • warm-standby to active: 15 - 40 minutes
    • active to warm-standby: 10 minutes

What to do next

As soon as the failure is resolved, bring the failed data center back online and relink it to the currently active data center to restore the highest availability; see Recovering from a failover of a two data center deployment for more information.