Key concepts of 2DCDR and failure scenarios
Understand how API Connect 2DCDR works and what to do when a data center fails.
This topic discusses the concepts around the operation and maintenance of an API Connect 2DCDR deployment. For information on the planning and installation of a 2DCDR deployment, see: Two data center warm-standby deployment on Kubernetes and OpenShift.
Failover
- Operational failover: Where your active API Connect subsystems and data center are functioning normally, but you want to switch data centers for operational reasons, or to test the failover process.
In this scenario, the active management and portal subsystems are converted to warm-standby first, and then the original warm-standby subsystems are converted to active.
- System down failover: Where your active data center, or the API Connect subsystems in that data center, are in a failed state, and you must fail over to the warm-standby to restore your API Connect service.
In this scenario, the subsystems in the warm-standby data center are converted to active, and the network connectivity to and from the failed active subsystems is disabled until it is possible to convert them to warm-standby.
- In all scenarios, avoid an active-active configuration, where the API Connect subsystems in both data centers are configured as active. This situation is commonly known as split-brain. In an active-active configuration, the subsystem databases in the two data centers diverge from each other, and two management subsystems both attempt to manage the other API Connect subsystems.
- When the active management subsystem is set to warm-standby, all data is deleted from its database. The management data from the new active (what was the warm-standby) is replicated to the new warm-standby data center. Do not set the active data center to warm-standby until you confirm that data replication is working (see: Verifying replication between data centers), or until you have backups of your management data. A minimal status check is sketched after this list.
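The authoritative verification steps are in Verifying replication between data centers. As a minimal sketch, assuming the management subsystem reports the status.haMode and status.haStatus fields that are described later in this topic, you can quickly confirm the 2DCDR state of the data center you are about to convert:

```
# Run against the currently active data center before you convert it to warm-standby.
# Expect haMode: active, and an haStatus that indicates replication is running (Ready).
kubectl -n <namespace> get mgmt -o yaml | grep -E 'haMode|haStatus'
```

If the status indicates that replication is failing, take a backup of your management data before you continue.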
Reverting to a stand-alone deployment
- When API Connect on the warm-standby data center is reverted to a stand-alone deployment, all management and portal data is deleted from that data center. It becomes an empty stand-alone deployment.
- When API Connect on the active data center is reverted to a stand-alone deployment, all management and portal data remains.
Failure scenarios
- API Connect failure in the active data center.
- When the active data center is still working correctly, and still has connectivity to the warm-standby data center, but API Connect is not working correctly:
- It might be possible to recover API Connect on the active data center. Review the troubleshooting documentation.
- Follow the failover steps to make the warm-standby the active data center (an illustrative sketch of the conversion follows this list). If your failed data center cannot be changed to warm-standby, then disable network connectivity to and from it, to avoid an active-active situation if the data center recovers unexpectedly.
- Gather API Connect logs from the failed data center and open a support request: Gathering logs.
- Do not restore the network connectivity to and from the API Connect subsystems in the failed data center until they are set to warm-standby.
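The authoritative procedure is in Failing over to the warm-standby; the following is only an illustrative sketch of what the role change can look like. It assumes that the 2DCDR mode is controlled by the multiSiteHA section of the subsystem CR (the section referenced at the end of this topic) and that the mode field accepts the values active and passive; verify both against the linked procedure before use.

```
# Illustrative sketch only: promote the warm-standby management subsystem to active.
# <namespace> and <mgmt-cr-name> are placeholders, and the spec.multiSiteHA.mode
# path is an assumption that must be checked against the documented failover steps.
kubectl -n <namespace> patch mgmt <mgmt-cr-name> --type merge \
  -p '{"spec":{"multiSiteHA":{"mode":"active"}}}'
```

Only promote the warm-standby after the failed subsystems are isolated or converted, to avoid the active-active situation described earlier.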
- Complete active data center failure
- When the active data center has a failure, and API Connect is not accessible, follow these steps:
- Disable the network connections from and to your active subsystems (an illustrative sketch follows this list). Network disablement is required to prevent an active-active scenario if your failed data center recovers before you can change it to warm-standby.
- Follow the failover steps to convert the warm-standby data center to active: Failing over to the warm-standby.
- Do not restore the network connectivity to and from the API Connect subsystems in the failed data center until they are set to warm-standby.
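How you disable the network connections depends on your environment, and is usually done at the firewall or load-balancer level. Purely as an illustration of pod-level isolation, a deny-all NetworkPolicy applied to the API Connect namespace in the failed data center might look like the following sketch; the policy name and namespace are placeholders, and it is only effective if your CNI plug-in enforces NetworkPolicy:

```
# Illustrative only: block all ingress and egress traffic for every pod in the
# API Connect namespace of the failed data center.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: apic-2dcdr-isolate        # hypothetical name
  namespace: <apic-namespace>     # replace with your API Connect namespace
spec:
  podSelector: {}                 # selects every pod in the namespace
  policyTypes:
    - Ingress
    - Egress
  # No ingress or egress rules are defined, so all such traffic is denied.
```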
- Network failure that prevents API calls or user access to the Management and Portal subsystems on the active data center.
- In this scenario, the API Connect 2DCDR deployment is working correctly, but unusable because the active data center is not accessible for API calls or for Management and Portal UI/CLI/REST API users.
- Complete a failover, setting warm-standby to active, and the active to warm-standby: Failing over to the warm-standby.
- API Connect failure on the warm-standby data center
- Where the API Connect pods on the warm-standby data center are not running as expected.
- Review the troubleshooting documentation: Troubleshooting two data center replication problems.
- Gather API Connect logs from the failed data center and open a support case: Gathering logs.
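Before opening the support case, it can help to capture a snapshot of the pod state on the warm-standby data center. A minimal sketch using standard kubectl commands; the namespace and pod name are placeholders, and the full log collection procedure is in Gathering logs:

```
# List the API Connect pods and capture details of any pod that is not running as expected.
kubectl -n <namespace> get pods -o wide
kubectl -n <namespace> describe pod <failing-pod-name>
# If the container has restarted, --previous returns the logs of the prior instance.
kubectl -n <namespace> logs <failing-pod-name> --previous
```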
- Warm-standby data center failure
- The data center where API Connect was running as warm-standby fails. In this case, when the data center is recovered:
- Verify that API Connect is running correctly as warm-standby and that the data from the active is replicated to it: Verifying replication between data centers.
- Network failure between data centers
- When the network connection between data centers is recovered, verify that 2DCDR database replication resumes: Verifying replication between data centers.
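One way to confirm that replication has resumed and that the two sites still agree on their roles is to compare the 2DCDR status of the management and portal subsystems in each data center. A small sketch, assuming you have one kubectl context per data center; the context names dc1 and dc2 and the namespace are placeholders:

```
# Expect one site to report haMode: active and the other haMode: passive,
# with an haStatus that indicates replication is running.
for ctx in dc1 dc2; do
  echo "== $ctx =="
  kubectl --context "$ctx" -n <namespace> get mgmt,ptl -o yaml | grep -E 'haMode|haStatus'
done
```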
Management 2DCDR operational modes and statuses
The status section includes information on the current 2DCDR mode of the management subsystem, and its status. The status section is included in the output of kubectl -n <namespace> get mgmt -o yaml:
...
status:
...
haMode: active
...
haStatus:
Operational mode | Description |
---|---|
active | The subsystem pods are ready and in the correct disaster recovery state for the active data center. |
passive | The subsystem pods are ready and in the correct disaster recovery state for the warm-standby data center. |
standalone | The management subsystem is not configured for 2DCDR. |
status.haStatus | Description |
---|---|
Pending | 2DCDR initial state. |
Warning | Replication is failing. |
Ready | Replication is running. |
Error | Error in the 2DCDR configuration. |
PeerNotReachable | The subsystem cannot contact the subsystem on the other site, and so replication is failing. |
BlockedWarmStandbyConversion | The warm-standby subsystem is in the Blocked state, waiting for the other site to become active. This state occurs when a user converts the existing warm-standby to active. |
PromotionBlocked | Promotion of the current warm-standby site to active is blocked. |
ReadyForPromotion | The current warm-standby site is ready for promotion. |
PGStateSetupBlocked | Appears on the active data center only. This status means that the warm-standby has not yet initiated operator-level communication to the active. The active management subsystem is blocking incoming database-level connection attempts from the warm-standby. |
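If you monitor the 2DCDR role programmatically, you can extract only the haMode value rather than reading the full YAML. A minimal sketch, assuming a single management CR in the namespace:

```
# Prints active, passive, or standalone for the management subsystem.
kubectl -n <namespace> get mgmt -o jsonpath='{.items[0].status.haMode}'
```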
Portal 2DCDR operational states
The portal status section shows the current 2DCDR operational state of the portal subsystem, and is included in the output of kubectl get ptl -o yaml:
...
status:
...
haMode: active
...
haStatus:
Operational state | Description |
---|---|
progressing to active | Pods are progressing to the active state, but the subsystem is not capable of serving traffic. |
progressing to active (ready for traffic) | At least one pod of each type is ready for traffic. The dynamic router can be linked to this service. |
active | The subsystem pods are ready and in the correct disaster recovery state for the active data center. |
progressing to passive | Pods are progressing to the warm-standby state. The subsystem is not capable of serving traffic. |
progressing to passive (ready for traffic) | At least one pod of each type is ready for traffic. |
passive | The subsystems are ready and in the correct disaster recovery state for the warm-standby data center. |
progressing to down | The portal is preparing to do a factory reset. This happens to the warm-standby portal when the multiSiteHA section is removed. |
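The same kind of check can be used for the portal subsystem, for example to watch it move through these states during a failover. A minimal sketch, again assuming a single portal CR in the namespace and that the states above are reported in status.haMode as in the YAML fragment earlier:

```
# Prints the current 2DCDR operational state of the portal subsystem.
kubectl -n <namespace> get ptl -o jsonpath='{.items[0].status.haMode}'
```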