Failure handling of a two data center deployment
How to detect failures, and perform service failover, in a two data center disaster recovery deployment on Kubernetes and OpenShift.
Before you begin
Ensure that you understand the concepts of two data center disaster recovery in API Connect. For more information, see A two data center deployment strategy on Kubernetes and OpenShift.
About this task
Failure detection means identifying that a failure has occurred, and then alerting the relevant administrator so that they can take action. Failure detection is different from system monitoring. Failure detection checks run often and complete quickly (for example, every 5 seconds), so they are normally limited in what they check. In API Connect, failure checks detect only whether the database and web server are online, and nothing more. Monitoring checks, by contrast, typically run less frequently, but are more likely to check resources such as CPU, memory, and disk space usage. The results of those checks can, for example, be recorded to perform historical trend analysis and spot memory leaks.
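As a minimal sketch of such a check (the health URL, the 2-second timeout, and the use of curl are assumptions, not part of the product), a 5-second polling loop might look like:

```shell
#!/bin/sh
# Minimal failure-detection sketch. The endpoint URL is an assumption:
# point it at the location-specific health endpoint of each data center,
# addressed directly so that the check bypasses any global traffic routing.

check_endpoint() {
  # Succeeds (exit 0) only if the endpoint answers within 2 seconds.
  curl --silent --fail --max-time 2 "$1" > /dev/null
}

monitor() {
  # Poll every 5 seconds; report the first failure and stop, so that an
  # administrator can verify the outage and decide whether to fail over.
  url="$1"
  while check_endpoint "$url"; do
    sleep 5
  done
  echo "FAILURE: $url did not respond within 2 seconds" >&2
  return 1
}

# Example (hypothetical address):
# monitor "https://dc1.portal.example.com/health"
```

In practice you would run one such monitor per data center and feed its alert into your existing alerting system; it deliberately checks reachability only, leaving CPU, memory, and disk checks to your regular monitoring stack.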
If a failure is detected by the system administrator, or by any automated monitoring that you are running, it's important to make sure that the cluster or service is definitely offline. For example, if data center one (DC1) appears to have an outage, and the plan is to fail over to data center two (DC2), then it's important to confirm that the pods in DC1 really are offline. If the DC1 pods are still running, they might continue to send data to DC2, which could cause DC2 to conclude that it is not the active deployment, and that in turn can lead to a split-brain condition. A fast way to ensure that the cluster or service really is offline is to drop the network link to those pods in your Kubernetes infrastructure, and then delete them.
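A hedged sketch of that isolation step, assuming a namespace named apiconnect and a CNI plug-in that enforces NetworkPolicy (both are assumptions; adapt them to your installation):

```shell
# 1. Drop the network link: a deny-all NetworkPolicy stops the DC1 pods from
#    sending any further data to DC2 while you confirm the outage.
#    (The namespace is an assumption.)
kubectl apply -n apiconnect -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: isolate-apic-dc1
spec:
  podSelector: {}          # select every pod in the namespace
  policyTypes:             # listing both types with no rules denies all traffic
  - Ingress
  - Egress
EOF

# 2. Once you are certain that no data is flowing to DC2, delete the pods.
kubectl delete pods -n apiconnect --all
```

This prevents split-brain only if your network plug-in actually enforces NetworkPolicy; on clusters where it does not, disable the link at the load balancer or firewall level instead.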
The failure detection endpoint must be location-specific and must be exempt from any traffic routing, so that it's possible to explicitly check the health of each data center regardless of how traffic is being routed.
You can fail over a specific service, for example `portal_service_1`, or all of the services that are running in a particular data center. The following sections describe how failover is achieved.
| Operational state | Description |
| --- | --- |
| progressing to active | Pods are progressing to the active state, but none are capable of serving traffic. |
| active | All of the pods are ready and in the correct disaster recovery state for the active data center. |
| progressing to passive | Pods are progressing to the passive state, but none are capable of serving traffic. |
| passive | All of the pods are ready and in the correct disaster recovery state for the passive data center. |
| Operational state | Description |
| --- | --- |
| progressing to active | Pods are progressing to the active state, but none are capable of serving traffic. |
| progressing to active (ready for traffic) | At least one pod of each type is ready for traffic. The dynamic router can be linked to this service. |
| active | All of the pods are ready and in the correct disaster recovery state for the active data center. |
| progressing to passive | Pods are progressing to the passive state, but none are capable of serving traffic. |
| progressing to passive (ready for traffic) | At least one pod of each type is ready for traffic. |
| passive | All of the pods are ready and in the correct disaster recovery state for the passive data center. |
| progressing to down | Pods are moving from the passive state to down. |
| down | The multi-site HA mode is deleted from the passive data center. |
To check the operational state of a service, run:

kubectl describe ServiceName

where ServiceName is the name of the API Manager or Developer Portal service.

- If the active data center is offline, do not remove the multi-site HA CR section from the passive data center, because this action deletes all the data from the databases. If you want to revert to a single data center topology, you must remove the multi-site HA CR section from the active data center, and then redeploy the passive data center. If the active data center is offline, you must first change the passive data center to be active, and then remove the multi-site HA CR section from the now active data center. The passive site must be uninstalled and redeployed as a clean install before it can be used. For more information, see Removing a two data center deployment.
- If the passive data center is offline for more than 24 hours, there can be disk space issues on the active data center, so you must revert your deployment to a single data center topology. To revert to a single data center topology, you must remove the multi-site HA CR section from the active data center. When the passive site has been redeployed, the multi-site HA CR can be reapplied to the active site. For more information, see Removing a two data center deployment.