Key concepts of 2DCDR and failure scenarios

Understand how API Connect 2DCDR works and what to do when a data center fails.

This topic discusses the concepts around the operation and maintenance of an existing API Connect 2DCDR deployment. For information on the planning and installation of a 2DCDR deployment, see: Two data center deployment strategy on Kubernetes and OpenShift.

Failover of services and reverting to a stand-alone deployment

Failover means that the API Connect management and portal subsystems on the warm-standby data center become the active management and portal subsystems. Whether the original active data center can be converted to warm-standby at the same time depends on the nature of the problems that required the failover. However, the failed active data center must be set to warm-standby before you re-establish its connectivity to the new active data center.
Important: When the active data center is set to warm-standby, all data is deleted from its management database. The management data from the new active data center (the former warm-standby) is then replicated to the new warm-standby data center. Do not set the active data center to warm-standby until you confirm that data replication is working (see Verifying replication between data centers), or until you have backups of your management data.
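Failover is typically performed by changing the 2DCDR mode in the multiSiteHA section of the management and portal subsystem CRs; see the linked failover topic for the full procedure. The fragment below is a sketch only, and it assumes a mode property with the values active and passive (matching the Ha Mode value shown under Operational states later in this topic), so verify the property names against your release before you edit anything:

spec:
  multiSiteHA:
    mode: active    # change to passive on the warm-standby data center

You can confirm which mode a data center is currently running in with kubectl describe mgmt -n <namespace> and checking the Ha Mode field in the status section.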
If you want to revert to a stand-alone API Connect deployment, ensure that the data center where you choose to keep API Connect is set to active before you start the process (a sketch of the CR edit involved is shown after the following list).
  • When API Connect is reverted to be a stand-alone deployment on the warm-standby data center, all management and portal data is deleted from that data center. It becomes an empty stand-alone deployment.
  • When API Connect on the active data center is reverted to be a stand-alone deployment, all management and portal data remains.
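If your removal procedure deletes the multiSiteHA section from the subsystem CRs (the same section whose removal triggers the factory reset described in Table 2 later in this topic), the edit can be sketched as follows, where <name> is the name of your management CR and the portal CR needs the equivalent change:

kubectl edit mgmt <name> -n <namespace>

In the editor, delete the spec.multiSiteHA section and save. Treat this as a sketch only, and confirm the documented removal procedure for your release before you make the change.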

Failure scenarios

The action that you should take when a failure occurs depends on the type of failure and which data center the failure occurs in:
API Connect failure in the active data center
When the active data center is still working correctly, and still has connectivity to the warm-standby data center, but API Connect is not working correctly:
  1. It might be possible to recover API Connect on the active data center. Review the troubleshooting documentation (a quick pod health check is sketched after this list).
  2. If API Connect cannot be recovered, follow the failover steps to make the warm-standby the active data center. If your failed active data center cannot be changed to warm-standby, then disable the network link used by API Connect between the data centers.
  3. Gather API Connect post-mortem logs from the failed data center and open a support request: Gathering post-mortem logs.
  4. Do not restore the network connection between data centers until API Connect on the failed data center is recovered and set to warm-standby.
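For step 1, a quick first check of the API Connect pods and CR status on the active data center can be made with standard kubectl commands, for example:

kubectl get pods -n <namespace>
kubectl describe mgmt -n <namespace>

Pods that are not in the Running or Completed state, or error conditions in the CR status, indicate where to focus the troubleshooting and what to include in a support case.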
Complete active data center failure
When the active data center has a failure, and API Connect is not accessible, follow these steps:
  1. Disable the network connection between the active and the warm-standby data center.
  2. Follow the failover steps to make the warm-standby the active data center: How to failover API Connect from the active to the warm-standby data center.
  3. Do not restore the network connection between data centers until API Connect on the failed data center is recovered and set to warm-standby.
Network failure that prevents API calls or user access to Management and Portal subsystems on the active data center
In this scenario, the API Connect 2DCDR deployment is working correctly, but unusable because the active data center is not accessible for API calls or for Management and Portal UI/CLI/REST API users.
  1. Complete a failover, setting warm-standby to active, and the active to warm-standby: How to failover API Connect from the active to the warm-standby data center.
API Connect failure on the warm-standby data center
In this scenario, the API Connect pods on the warm-standby data center are not running as expected.
  1. Review the troubleshooting documentation: Troubleshooting two data center replication problems.
  2. Gather API Connect post-mortem logs from the failed data center and open a support case: Gathering post-mortem logs.
Warm-standby data center failure
The data center where API Connect was running as warm-standby fails. In this case, when the data center is recovered:
  1. Verify that API Connect is running correctly as warm-standby and that the data from the active data center is replicated to it: Verifying replication between data centers.
Network failure between data centers
When the network connection between data centers is recovered, verify that 2DCDR database replication resumes: Verifying replication between data centers.

Operational states

API Connect has various transitory states that it passes through during startup and failover. The current 2DCDR operational state can be observed from the CR status section, for example:
kubectl describe mgmt -n <namespace>

...
Status:
...
  Ha Mode: active
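If you want to retrieve just the state value, for scripting or monitoring, a jsonpath query can be used. The status field name below (haMode) is an assumption based on the Ha Mode label in the describe output, so verify it against your CR before relying on it:

kubectl get mgmt -n <namespace> -o jsonpath='{.items[0].status.haMode}'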
Table 1. 2DCDR Management subsystem operational states
  • progressing to active: Pods are progressing to the active state, but the subsystem is not capable of serving traffic.
  • active: All of the pods are ready and in the correct disaster recovery state for the active data center.
  • progressing to passive: Pods are progressing to the warm-standby state. The subsystem is not capable of serving traffic.
  • passive: All of the pods are ready and in the correct disaster recovery state for the warm-standby data center.
Table 2. 2DCDR Portal subsystem operational states
  • progressing to active: Pods are progressing to the active state, but the subsystem is not capable of serving traffic.
  • progressing to active (ready for traffic): At least one pod of each type is ready for traffic. The dynamic router can be linked to this service.
  • active: All of the pods are ready and in the correct disaster recovery state for the active data center.
  • progressing to passive: Pods are progressing to the warm-standby state. The subsystem is not capable of serving traffic.
  • progressing to passive (ready for traffic): At least one pod of each type is ready for traffic.
  • passive: All of the pods are ready and in the correct disaster recovery state for the warm-standby data center.
  • progressing to down: Portal is preparing to do a factory reset. This is what happens to a warm-standby portal data center when the multiSiteHA section is removed.
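The portal subsystem state can be checked in the same way as the management subsystem, by describing the portal CR. The ptl short name is assumed here; if it is not defined in your cluster, use the full PortalCluster resource name instead:

kubectl describe ptl -n <namespace>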