Failing over to the warm-standby

How to complete service failover of API Connect in a two data center disaster recovery (2DCDR) deployment on Kubernetes, OpenShift, and Cloud Pak for Integration.

Before you begin

Ensure that you have read and understood the 2DCDR concepts, and that you have reviewed the failure scenarios that are described in Key concepts of 2DCDR and failure scenarios. Do not proceed with the failover operation until you confirm that failover is the correct course of action for your situation.

About this task

Be aware of the following important points:
  • The first step in the process is to convert the active data center to warm-standby. When the active data center is converted to warm-standby, all data is deleted from the active data center's management database, to be replaced by data that is copied from the warm-standby when it becomes active.

    Do not proceed with failover if you suspect the warm-standby data center also has problems, and you are unsure it has the most recent management data. See Verifying replication between data centers, and consider restoring your active site from backup instead of attempting a failover: Backup and restore requirements for a 2DCDR deployment.

  • In all scenarios, avoid an active-active configuration, in which the API Connect subsystems in both data centers are configured as active. This situation is commonly known as split-brain. In an active-active configuration, the subsystem databases in the two data centers diverge from each other, and both management subsystems attempt to manage the other API Connect subsystems.
  • If the failure of the active data center prevents you from converting it to warm-standby, then you must disable network connectivity to and from the failed management and portal subsystems in that data center, to prevent an accidental active-active situation if the failed data center recovers unexpectedly.
  • If you are doing an operational failover, the process causes a temporary management and portal UI outage, until the new warm-standby completes the conversion to active.
Note: For OpenShift users: the steps that are detailed in this topic use the Kubernetes kubectl command. On OpenShift, use the equivalent oc command in its place. If you are using a top-level CR, edit the multiSiteHA section for the subsystem in the top-level CR, instead of directly in the individual subsystem CRs.
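For reference, the following minimal sketch shows where the multiSiteHA mode setting sits in an individual subsystem CR compared with the top-level APIConnectCluster CR. Only the mode field is shown; it reflects the patch examples that are used later in this topic.
  # Individual subsystem CR (for example, the management CR):
  spec:
    multiSiteHA:
      mode: active        # set to passive for warm-standby

  # Top-level APIConnectCluster CR:
  spec:
    management:
      multiSiteHA:
        mode: active      # set to passive for warm-standby
    portal:
      multiSiteHA:
        mode: active      # set to passive for warm-standby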

Procedure

  • Management subsystem failover

    In these example steps, DC1 is the active data center and DC2 is the warm-standby data center.

    To complete a failover from DC1 to DC2, you first set DC1 to warm-standby before you set DC2 to active. This order is required to prevent an active-active situation.

    1. Set the DC1 management subsystem to be warm-standby.
      • On Kubernetes and OpenShift individual subsystem CR deployments: Create a file called demote-to-warm-standby.yaml, and paste in the following text:
        metadata:
          annotations:
            apiconnect-operator/dr-data-deletion-confirmation: "true"
        spec:
          multiSiteHA:
            mode: passive
        Apply the updates that are specified in the demote-to-warm-standby.yaml file to your management CR:
        kubectl patch mgmt <mgmt cr name> --type merge --patch "$(cat demote-to-warm-standby.yaml)"
      • On Cloud Pak for Integration and OpenShift top-level CR deployments: Create a file called demote-to-warm-standby.yaml, and paste in the following text:
        metadata:
          annotations:
            apiconnect-operator/dr-data-deletion-confirmation: "true"
        spec:
          management:
            multiSiteHA:
              mode: passive
        Apply the updates that are specified in the demote-to-warm-standby.yaml file to your API Connect CR:
        oc patch apiconnectcluster <apic cr name> --type merge --patch "$(cat demote-to-warm-standby.yaml)"
      Note: If API Connect on DC1 is down, such that you cannot set it to be warm-standby, you must ensure that the network connectivity to and from the management subsystem in DC1 is disabled, to prevent an accidental active-active situation if DC1 recovers unexpectedly. Do not restore network connectivity until DC1 is set to warm-standby.
    2. Run the following command on DC1 to check that the management subsystem HA mode is no longer active:
      kubectl get mgmt -o yaml
      During transition to warm-standby, the status.haMode shows BlockedWarmStandbyConversion. You can monitor the transition with:
      kubectl get mgmt -o=jsonpath='{"Status: "}{.status.phase}{"\n - HA Mode: "}{.status.haMode} {"\n - Ha Status: "}{.status.haStatus[?(@.status=="True")].type}{"\n"}'
      Status: Blocked
       - HA Mode: BlockedWarmStandbyConversion
       - Ha Status: BlockedWarmStandbyConversion
    3. Change the DC2 management subsystem from warm-standby to active.
      Check the DC2 management subsystem haStatus:
      kubectl get mgmt -o yaml
      status:
        ...
        haMode: ReadyForPromotion
        haStatus:
        - ...
          type: ReadyForPromotion
        ...
      Wait until the haStatus shows ReadyForPromotion before you proceed. You can monitor the transition with:
      kubectl get mgmt -o=jsonpath='{"Status: "}{.status.phase}{"\n - HA Mode: "}{.status.haMode} {"\n - Ha Status: "}{.status.haStatus[?(@.status=="True")].type}{"\n"}'
      Status: Blocked
       - HA Mode: ReadyForPromotion 
       - Ha Status: ReadyForPromotion
      • On Kubernetes and OpenShift individual subsystem CR deployments:
        Edit the DC2 management CR:
        kubectl edit mgmt
        Change spec.multiSiteHA.mode to active.
      • On Cloud Pak for Integration and OpenShift top-level CR deployments:
        Edit the DC2 APIConnectCluster CR:
        oc edit apiconnectcluster
        Change spec.management.multiSiteHA.mode to active.
    4. Run the following command on both data centers to confirm when management subsystem failover is complete:
      kubectl get mgmt
      Example output:
      production-mgmt   n/n   Running   10.0.x.y   10.0.x.y        Management is ready. HA status Ready - see HAStatus in CR for details
      Note: Failover can take 10 minutes or more to complete. If failover does not complete, you can check the API Connect operator pod log for errors (search for the word Multi). An example command is shown after these steps.
    5. Update your dynamic router to redirect all traffic to DC2 instead of DC1. Until the DC2 site becomes active, the UIs might not be operational.
    6. Cloud Pak for Integration only: Check the APIConnectCluster CR status:
      oc get apiconnectcluster
      If the status returns Warning:
      NAME         READY   STATUS    VERSION    RECONCILED VERSION   MESSAGE                                      AGE
      production   4/5     Warning   10.0.8.0   10.0.8.0-7178        Warning - see status condition for details   7d9h
      then check the status conditions:
      oc get apiconnectcluster <apiconnectcluster cr name> -o json | jq -r '.status.conditions[] | select(.status=="True")'
      If the output shows the following condition:
      {
        "lastTransitionTime": "2024-05-29T18:15:16Z",
        "message": "an error occured while running job apic/production-configurator",
        "reason": "na",
        "status": "True",
        "type": "Warning"
      }
      then run the following command on your new active data center:
      oc -n <namespace> delete job -l app.kubernetes.io/component=configurator
      where <namespace> is the namespace of your APIConnectCluster.
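    If the management failover does not complete (see the note in step 4), the following sketch shows one way to search the API Connect operator pod log for multi-site errors. The operator Deployment name ibm-apiconnect and the namespace placeholder are assumptions; adjust both to match your installation.
      # Assumption: the operator Deployment is named ibm-apiconnect. Verify first with:
      #   kubectl get deployments -n <operator namespace>
      kubectl -n <operator namespace> logs deployment/ibm-apiconnect | grep -i multi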
  • Developer Portal subsystem failover

    In these example steps, DC1 is the active data center and DC2 is the warm-standby data center.

    If you have multiple Developer Portal services, you must repeat these steps for each Developer Portal subsystem that you want to fail over.

    1. Set the DC1 Developer Portal subsystem to be warm-standby.
      • On Kubernetes and OpenShift individual subsystem CR deployments:
        Edit the DC1 PortalCluster CR:
        kubectl edit ptl
        Change spec.multiSiteHA.mode to passive. If you prefer to script the change, a patch-based alternative is sketched after these steps.
      • On Cloud Pak for Integration and OpenShift top-level CR deployments:
        Edit the DC1 APIConnectCluster CR:
        oc edit apiconnectcluster
        Change spec.portal.multiSiteHA.mode to passive.
      Note: If API Connect on DC1 is down, such that you cannot set it to be warm-standby, you must ensure that the network connectivity to and from the portal subsystem in DC1 is disabled, to prevent an accidental active-active situation if DC1 recovers unexpectedly. Do not restore network connectivity until DC1 is set to warm-standby.
    2. Run the following command on DC1 to check the Developer Portal subsystem status:
      kubectl get ptl -o yaml
      Continue to the next step when status.haMode is set to progressing to passive, or any of the passive states.
      For example:
      status:
        ...
        haMode: progressing to passive
        ...
    3. Change the DC2 Developer Portal subsystem from warm-standby to active.
      • On Kubernetes and OpenShift individual subsystem CR deployments:
        Edit the DC2 PortalCluster CR:
        kubectl edit ptl
        Change spec.multiSiteHA.mode to active.
      • On Cloud Pak for Integration and OpenShift top-level CR deployments:
        Edit the DC2 APIConnectCluster CR:
        oc edit apiconnectcluster 
        Change spec.portal.multiSiteHA.mode to active.
    4. Update your dynamic router to redirect all traffic to DC2 instead of DC1.
    5. Run the following command to check that the Developer Portal services on DC1 and DC2 are ready for traffic:
      kubectl get ptl -o yaml
      The services are ready for traffic when the status.haMode is set to passive on DC1, and is set to active on DC2.
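    As an alternative to kubectl edit, you can apply the portal mode changes with a patch, following the same pattern as the management steps. The following sketch is an assumption based on that pattern rather than a documented procedure; it demotes the DC1 Developer Portal to warm-standby on an individual subsystem CR deployment. For DC2, change the mode to active; for top-level CR deployments, patch spec.portal.multiSiteHA.mode in the APIConnectCluster CR instead.
      # Sketch only: portal-mode.yaml (hypothetical file name), mirroring the management patch approach
      spec:
        multiSiteHA:
          mode: passive
      Apply it to your portal CR:
      kubectl patch ptl <portal cr name> --type merge --patch "$(cat portal-mode.yaml)"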
  • Failover of both management and portal subsystems

    Fail over the management subsystem first, followed by the portal subsystem.

    How long it takes to complete the failover varies, and depends on hardware speed, network latency, and the size of the databases. Approximate timings are:

    For the management subsystem:
    • warm-standby to active: 5 minutes
    • active to warm-standby: 15 minutes
    For the Developer Portal:
    • warm-standby to active: 15 - 40 minutes
    • active to warm-standby: 10 minutes

What to do next

If the original active data center was successfully converted to warm-standby, then verify that replication is working: Verifying replication between data centers. If replication is working, you can either:
  • Revert your 2DCDR deployment to the original active and warm-standby data center designations. To revert your deployment, follow the same failover steps in this topic.
  • Do nothing, and continue with your current active and warm-standby data center designations.

If your failed data center cannot be updated to warm-standby, then ensure that the network links to and from your management and portal subsystems in the failed data center are disabled. If the network links remain enabled, then an accidental active-active situation might occur if your failed data center recovers unexpectedly.
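How you disable those network links depends on your environment, and is often done at an external firewall or load balancer. Purely as an illustration, the following sketch shows one possible in-cluster option on Kubernetes: a default-deny NetworkPolicy applied to the namespace that contains the failed subsystems. The namespace name is a placeholder, and the policy is enforced only if your cluster's network plugin supports NetworkPolicy; treat this as one option, not a required step.
  # Illustrative only: blocks all ingress and egress for every pod in the namespace,
  # provided the cluster's network plugin enforces NetworkPolicy.
  apiVersion: networking.k8s.io/v1
  kind: NetworkPolicy
  metadata:
    name: isolate-failed-dc
    namespace: <apic namespace>   # placeholder for the failed data center's API Connect namespace
  spec:
    podSelector: {}               # selects every pod in the namespace
    policyTypes:
    - Ingress
    - Egress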

When you are able to recover the failed data center, ensure that the management and portal subsystems are set to warm-standby before you restore the network connectivity to prevent an active-active situation.
Note: The longer your original active data center was in a failed state, the longer data replication takes when it is recovered to warm-standby.

If you expect your failed data center to be down for a long time, then convert your active data center to a stand-alone deployment. See Removing a two data center deployment.