Key concepts of 2DCDR and failure scenarios

Understand how API Connect 2DCDR works and what to do when a data center fails.

This topic discusses the concepts around the operation and maintenance of an API Connect 2DCDR deployment. For information on the planning and installation of a 2DCDR deployment, see: Two data center warm-standby deployment on Kubernetes and OpenShift.

Failover

Failover is a manual operation in which you update the active management subsystem, portal subsystem, or both, to become warm-standby, and update the warm-standby subsystems to become active. The typical situations for completing a failover operation are:
  • Operational failover: Where your active API Connect subsystems and data center are functioning normally, but you want to switch data centers for operational reasons, or to test the failover process.

    In this scenario, the active management and portal subsystems are converted to warm-standby first, and then the original warm-standby subsystems are converted to active.

  • System down failover: Where your active data center, or the API Connect subsystems in that data center are in a failed state, and you must failover to the warm-standby to restore your API Connect service.

    In this scenario, the subsystems in the warm-standby data center are converted to active, and network connectivity to and from the failed active subsystems is disabled until it is possible to convert them to warm-standby.
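
In both scenarios, the conversion is applied to the management and portal subsystem CRs in each data center. The following is a minimal sketch only, assuming that the 2DCDR mode is set through the mode property of the multiSiteHA section in the CR spec (the property name is an assumption based on the installation configuration); see Failing over to the warm-standby for the authoritative steps and ordering.
# Sketch for an operational failover: first set the currently active management subsystem to warm-standby...
kubectl -n <namespace> patch mgmt <mgmt-cr-name> --type merge -p '{"spec":{"multiSiteHA":{"mode":"passive"}}}'
# ...then, in the other data center, promote the warm-standby management subsystem to active.
kubectl -n <namespace> patch mgmt <mgmt-cr-name> --type merge -p '{"spec":{"multiSiteHA":{"mode":"active"}}}'
# In this sketch, the portal subsystem (ptl) CR is converted in the same way.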

Important:
  • In all scenarios, an active-active configuration must be avoided. An active-active configuration, commonly known as split-brain, is where the API Connect subsystems in both data centers are configured as active. In that situation, the subsystem databases in the two data centers diverge from each other, and two management subsystems both attempt to manage the other API Connect subsystems.
  • When the active management subsystem is set to warm-standby, all data is deleted from its database. The management data from the new active data center (previously the warm-standby) is then replicated to the new warm-standby data center. Do not set the active data center to warm-standby until you either confirm that data replication is working (see Verifying replication between data centers) or have backups of your management data.
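
Before and after a failover, a quick check of the reported role and replication status in each data center helps to confirm that exactly one site is active and that replication is healthy. The following one-line check is a sketch that reads the status fields shown later in this topic; see Verifying replication between data centers for the full verification procedure.
# Run in each data center. Expected result: "active Ready" on the active site, "passive Ready" on the warm-standby site.
kubectl -n <namespace> get mgmt -o jsonpath='{.items[0].status.haMode}{" "}{.items[0].status.haStatus}{"\n"}'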

Reverting to a stand-alone deployment

If you want to revert to a stand-alone API Connect deployment, before you start the process, ensure that the data center where you choose to keep API Connect is set to active.
  • When API Connect is reverted to be a stand-alone deployment on the warm-standby data center, all management and portal data is deleted from that data center. It becomes an empty stand-alone deployment.
  • When API Connect on the active data center is reverted to be a stand-alone deployment, all management and portal data remains.
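
The reversion itself removes the 2DCDR configuration from the subsystem CRs. The following is a sketch only, assuming that the configuration is held in the multiSiteHA section of the CR spec (the section that Table 3 later in this topic refers to); follow the documented reverting procedure for the correct order and for which data center to start with.
# Sketch: delete the spec.multiSiteHA section from each subsystem CR, then save.
kubectl -n <namespace> edit mgmt <mgmt-cr-name>
kubectl -n <namespace> edit ptl <ptl-cr-name>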

Failure scenarios

The action that you should take when a failure occurs depends on the type of failure and which data center the failure occurs in:
API Connect failure in the active data center
When the active data center is still working correctly, and still has connectivity to the warm-standby data center, but API Connect is not working correctly:
  1. It might be possible to recover API Connect on the active data center. Review the troubleshooting documentation, and see the quick health check that is sketched after these steps.
  2. Follow the failover steps to make the warm-standby the active data center. If your failed data center cannot be changed to warm-standby, then disable network connectivity to and from it, to avoid an active-active situation if the data center recovers unexpectedly.
  3. Gather API Connect logs from the failed data center and open a support request: Gathering logs.
  4. Do not restore the network connectivity to and from the API Connect subsystems in the failed data center until they are set to warm-standby.
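A quick check of the pods and subsystem CRs in the active data center (mentioned in step 1) can help you decide between recovery and failover. This is a sketch only; use the namespace of your API Connect installation.
# Check for pods that are not Running or Ready, and review the reported status of the subsystem CRs.
kubectl -n <namespace> get pods
kubectl -n <namespace> get mgmt,ptl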
Complete active data center failure
When the active data center has a failure, and API Connect is not accessible, follow these steps:
  1. Disable the network connections from and to your active subsystems. Network disablement is required to prevent an active-active scenario if your failed data center recovers before you can change it to warm-standby.
  2. Follow the failover steps to convert the warm-standby data center to active: Failing over to the warm-standby.
  3. Do not restore the network connectivity to and from the API Connect subsystems in the failed data center until they are set to warm-standby.
Network failure that prevents API calls or user access to the management and portal subsystems in the active data center
In this scenario, the API Connect 2DCDR deployment is working correctly, but is unusable because the active data center is not accessible for API calls or for Management and Portal UI/CLI/REST API users.
  1. Complete a failover, setting the warm-standby to active and the active to warm-standby: Failing over to the warm-standby.
API Connect failure on the warm-standby data center
When the API Connect pods on the warm-standby data center are not running as expected:
  1. Review the troubleshooting documentation: Troubleshooting two data center replication problems.
  2. Gather API Connect logs from the failed data center and open a support case: Gathering logs. A basic status snapshot that you can capture first is sketched after these steps.
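A status snapshot of the warm-standby pods and subsystem CRs, captured before or alongside the full logs, can be a useful addition to the support case. This is a sketch only; the output file names are examples.
kubectl -n <namespace> get pods -o wide > warm-standby-pods.txt
kubectl -n <namespace> get mgmt,ptl -o yaml > warm-standby-2dcdr-status.yaml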
Warm-standby data center failure
The data center where API Connect was running as warm-standby fails. In this case, when the data center is recovered:
  1. Verify that API Connect is running correctly as warm-standby and that the data from the active is replicated to it: Verifying replication between data centers.
Network failure between data centers
When the network connection between data centers is recovered, verify that 2DCDR database replication resumes: Verifying replication between data centers.
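While the connection is down, the management haStatus typically reports Warning or PeerNotReachable (see Table 2 later in this topic); after the connection is restored it is expected to return to Ready. A minimal sketch for watching the status from either data center:
# Poll the replication status every 30 seconds until it reports Ready (Ctrl+C to stop).
while true; do kubectl -n <namespace> get mgmt -o jsonpath='{.items[0].status.haStatus}{"\n"}'; sleep 30; done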

Management 2DCDR operational modes and statuses

The management CR status section includes information on the current 2DCDR mode of the management subsystem, and its status. The status section is included in the output of:
kubectl -n <namespace> get mgmt -o yaml
...
status:
...
  haMode: active
  ...
  haStatus: 
Table 1. 2DCDR Management subsystem operational modes (status.haMode)
Operational mode  Description
active            The subsystem pods are ready and in the correct disaster recovery state for the active data center.
passive           The subsystem pods are ready and in the correct disaster recovery state for the warm-standby data center.
standalone        The management subsystem is not configured for 2DCDR.
Table 2. Management subsystem 2DCDR statuses (status.haStatus)
status.haStatus               Description
Pending                       2DCDR initial state.
Warning                       Replication is failing.
Ready                         Replication is running.
Error                         Error in 2DCDR configuration.
PeerNotReachable              The subsystem cannot contact the subsystem on the other site, so replication is failing.
BlockedWarmStandbyConversion  The warm-standby subsystem is in a blocked state, waiting for the other site to become active. This state occurs when you convert the existing warm-standby to active.
PromotionBlocked              Promotion of the current warm-standby site to active is blocked.
ReadyForPromotion             The current warm-standby site is ready for promotion.
PGStateSetupBlocked           Appears on the active data center only. This status means that the warm-standby has not yet initiated operator-level communication to the active. The active management subsystem is blocking incoming database-level connection attempts from the warm-standby.
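
To check these fields without reading the full CR output, a custom-columns view can be used. This is a sketch; the field paths are the ones shown in the status example above.
# On a healthy deployment, expect HAMODE=active, HASTATUS=Ready on the active site and HAMODE=passive, HASTATUS=Ready on the warm-standby.
kubectl -n <namespace> get mgmt -o custom-columns='NAME:.metadata.name,HAMODE:.status.haMode,HASTATUS:.status.haStatus'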

Portal 2DCDR operational states

The portal subsystem has various transitory states that it passes through during startup and failover. The current 2DCDR operational state can be observed from the CR status section, for example:
kubectl get ptl -o yaml
...
status:
...
  haMode: active
  ...
  haStatus: 
Table 3. 2DCDR Portal subsystem operational states
Operational state                            Description
progressing to active                        Pods are progressing to the active state, but the subsystem is not capable of serving traffic.
progressing to active (ready for traffic)    At least one pod of each type is ready for traffic. The dynamic router can be linked to this service.
active                                       The subsystem pods are ready and in the correct disaster recovery state for the active data center.
progressing to passive                       Pods are progressing to the warm-standby state. The subsystem is not capable of serving traffic.
progressing to passive (ready for traffic)   At least one pod of each type is ready for traffic.
passive                                      The subsystem pods are ready and in the correct disaster recovery state for the warm-standby data center.
progressing to down                          The portal is preparing to do a factory reset. This happens to the warm-standby portal when the multiSiteHA section is removed.
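
During a failover, the portal typically passes through the "progressing to" states before settling on active or passive. A minimal sketch for watching the transition, assuming the state is reported in status.haMode as in the example output above:
# Watch the portal 2DCDR state until it settles on active or passive (Ctrl+C to stop).
kubectl -n <namespace> get ptl -w -o custom-columns='NAME:.metadata.name,STATE:.status.haMode'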