Failure handling of a two data center deployment
How to detect failures, and perform service failover, in a two data center disaster recovery deployment on VMware.
Before you begin
Ensure that you understand the concepts of a two data center disaster recovery deployment in API Connect. For more information, see A two data center deployment strategy on VMware.
About this task
Failure detection means detecting that a failure has happened, and then alerting the relevant administrator so that they can take action. Failure detection is different from system monitoring. Failure detection checks run frequently and must complete quickly (for example, every 5 seconds), so they are normally limited in what they check. In API Connect, failure checks detect only whether the database and web server are online, and nothing more. Monitoring checks, by contrast, typically run less frequently but are more likely to check things such as CPU, memory, and disk space usage. The results of monitoring checks can then be recorded for historical trend analysis, for example to spot memory leaks.
If a failure is detected by the system administrator, or by any automated monitoring that you are running, it is important to confirm that the cluster or service is definitely offline. For example, if data center one (DC1) appears to have an outage, and the plan is to fail over to data center two (DC2), then make sure that the pods in DC1 really are offline. If the DC1 pods are still running, they might continue to send data to DC2, which would lead DC2 to think that it is not the active deployment, and that could lead to split-brain. A fast way to ensure that the cluster or service really is offline is to drop the network link to those pods in your VMware infrastructure, and then delete them.
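As a rough illustration of a fast, narrow failure check, the following Python sketch runs a cheap probe on a short interval and raises a single alert after several consecutive failures. The probe and alert callables, the threshold of three failures, and the max_checks limit are assumptions made for this example; they are not part of API Connect.

```python
import time
from typing import Callable, Optional


def watch(probe: Callable[[], bool],
          alert: Callable[[str], None],
          interval_s: float = 5.0,
          threshold: int = 3,
          max_checks: Optional[int] = None,
          sleep: Callable[[float], None] = time.sleep) -> int:
    """Run `probe` every `interval_s` seconds and call `alert` once when
    `threshold` consecutive checks have failed. Returns the number of checks run."""
    failures = 0
    checks = 0
    while max_checks is None or checks < max_checks:
        checks += 1
        try:
            healthy = probe()
        except Exception:
            # A probe that raises is treated the same as a failed check.
            healthy = False
        if healthy:
            failures = 0
        else:
            failures += 1
            if failures == threshold:
                alert(f"{failures} consecutive failed checks")
        sleep(interval_s)
    return checks
```

Requiring several consecutive failures is one way to avoid alerting on a single dropped request; the right threshold depends on how quickly you need to react.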
The failure detection endpoint must be location specific and must be exempt from any traffic routing, so that it's possible to explicitly check the health of each data center regardless of how traffic is being routed.
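A minimal sketch of checking each data center through its own address, rather than through a globally routed hostname, might look like the following. The URLs in HEALTH_URLS are hypothetical placeholders, and the two-second timeout is an assumption; substitute the location-specific endpoints of your own deployment.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict

# Hypothetical location-specific health URLs, one per data center.
# They must resolve directly to each site, not to a traffic router.
HEALTH_URLS = {
    "dc1": "https://health.dc1.example.com/ready",
    "dc2": "https://health.dc2.example.com/ready",
}


def http_ok(url: str, timeout_s: float = 2.0) -> bool:
    """Return True if the URL answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        return False


def check_all(urls: Dict[str, str],
              fetch: Callable[[str], bool] = http_ok) -> Dict[str, bool]:
    """Check every data center independently and report each result."""
    return {dc: fetch(url) for dc, url in urls.items()}
```

Because each site is probed by its own address, the result for DC1 stays meaningful even when all user traffic is currently routed to DC2.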
You can fail over a specific service, for example portal_service_1, or all of the services that are running in a particular data center. The following sections describe how failover is achieved.
Operational state | Description |
---|---|
progressing to active | Pods are progressing to the active state, but none are capable of serving traffic. |
active | All of the pods are ready and in the correct disaster recovery state for the active data center. |
progressing to passive | Pods are progressing to the passive state, but none are capable of serving traffic. |
passive | All of the pods are ready and in the correct disaster recovery state for the passive data center. |
Operational state | Description |
---|---|
progressing to active | Pods are progressing to the active state, but none are capable of serving traffic. |
progressing to active (ready for traffic) | At least one pod of each type is ready for traffic. The dynamic router can be linked to this service. |
active | All of the pods are ready and in the correct disaster recovery state for the active data center. |
progressing to passive | Pods are progressing to the passive state, but none are capable of serving traffic. |
progressing to passive (ready for traffic) | At least one pod of each type is ready for traffic. |
passive | All of the pods are ready and in the correct disaster recovery state for the passive data center. |
progressing to down | Pods are moving from passive state to down. |
down | The multi-site HA mode is deleted from the passive data center. |
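The operational states listed in the tables can be interpreted mechanically. The following Python sketch splits a state string into a role, a settled flag, and a serving flag. The HaState type and the parse_ha_mode function are illustrative names, not part of API Connect, and the serving rule is an interpretation of the table descriptions.

```python
from typing import NamedTuple

PROGRESSING = "progressing to "
PARTIAL = " (ready for traffic)"


class HaState(NamedTuple):
    role: str      # "active", "passive", or "down"
    settled: bool  # False while the state starts with "progressing to"
    serving: bool  # True when pods are ready to serve traffic


def parse_ha_mode(ha_mode: str) -> HaState:
    """Interpret an operational state string from the tables."""
    text = ha_mode.strip().lower()
    settled = not text.startswith(PROGRESSING)
    if not settled:
        text = text[len(PROGRESSING):]
    partial = text.endswith(PARTIAL)
    if partial:
        text = text[:-len(PARTIAL)]
    if text not in ("active", "passive", "down"):
        raise ValueError(f"unrecognized operational state: {ha_mode!r}")
    # Settled active/passive means all pods are ready; a "ready for traffic"
    # suffix means at least one pod of each type is ready.
    serving = text != "down" and (settled or partial)
    return HaState(text, settled, serving)
```

For example, parse_ha_mode("progressing to active (ready for traffic)") reports the role as active and serving as true, while settled stays false until the state becomes active.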
kubectl describe ServiceName
Where ServiceName is the name of the API Manager or Developer Portal service.
- If the active data center is offline, do not change the multi-site-ha-enabled setting to false on the passive data center, as this action deletes all the data from the databases. If you want to revert to a single data center topology, you must change the multi-site-ha-enabled setting to false on the active data center, and then redeploy the passive data center. If the active data center is offline, you must first change the passive data center to be active, and then change the multi-site-ha-enabled setting to false on the now active data center.
- If the passive data center is offline for more than 24 hours, there can be issues with the disk space on the active data center, so you must revert your deployment to a single data center topology. To revert to a single data center topology, you must change the multi-site-ha-enabled setting to false on the active data center. When the passive site has been redeployed, the multi-site-ha-enabled setting can be reapplied to the active site.
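The ordering rules above could be encoded as a pre-flight check in your own automation, so that the multi-site-ha-enabled setting is never disabled on a passive data center by mistake. The following Python sketch is illustrative only: the function name and the returned step descriptions are hypothetical, and the choice to refuse while a state is still progressing is a conservative assumption rather than documented behavior.

```python
from typing import List


def plan_disable_multi_site_ha(local_ha_mode: str, peer_online: bool) -> List[str]:
    """Return the ordered steps for reverting to a single data center,
    as seen from the data center that reports `local_ha_mode`."""
    mode = local_ha_mode.strip().lower()
    if mode == "active":
        return ["set multi-site-ha-enabled to false on this (active) data center"]
    if mode == "passive" and not peer_online:
        # Disabling the setting while still passive deletes the databases,
        # so promotion to active must come first.
        return [
            "change this data center from passive to active",
            "set multi-site-ha-enabled to false on this (now active) data center",
        ]
    if mode == "passive":
        return ["make no change here; set multi-site-ha-enabled to false "
                "on the active data center"]
    # Assumption: wait for a settled state before changing the topology.
    raise ValueError(f"state {local_ha_mode!r} is not settled; wait and retry")
```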
Procedure
Example
PortalCluster CR file for an active data center that you can view by running the kubectl describe PortalService command:
status:
  conditions:
  - lastTransitionTime: "2020-04-27T08:00:25Z"
    message: 4/6
    status: "False"
    type: Ready
  customImages: true
  dbCASecret: portal-db-ca
  encryptionSecret: ptl-encryption-key
  endpoints:
    portal-director: https://api.portal.ha-demo.cloud/
    portal-web: https://portal.ha-demo.cloud/
    replication: https://ptl-replication.cvs.apic-2ha-dallas-144307-xxxx14539d6e75f74e950db7077951d4-0000.us-south.containers.appdomain.cloud/
  localMultiSiteHAPeerConfigHash: UM9VtuWDdp/328n0oCl/GQGFzy34o1V9sH2OyXxKMHw=
  haMode: progressing to active (ready for traffic)
  microServiceSecurity: custom
  multiSiteHA:
    dbPodScale: 3
    remoteSiteDeploymentName: portal
    remoteSiteName: frankfurt
    replicationPeer: https://ptl-replication.cvs.apic-2ha-frankfur-683198-xxxx14539d6e75f74e950db7077951d4-0000.eu-de.containers.appdomain.cloud/
    wwwPodScale: 3
  phase: 0
  serviceCASecret: portal-ca
  serviceClientSecret: portal-client
  serviceServerSecret: portal-server
  services:
    db: portal-dallas-db
    db-remote: portal-frankfurt-db
    nginx: portal-nginx
    tunnel: portal-tunnel
    www: portal-dallas-www
    www-remote: portal-frankfurt-www
  versions:
    available:
      channels:
      - name: "10"
      - name: "10.0"
      - name: 10.0.0
      - name: 10.0.0.0
      versions:
      - name: 10.0.0.0-1038
    reconciled: 10.0.0.0-1038
haMode for the active data center:
- active - this is the active data center, or the only data center, and all pods are ready and accepting traffic.
- progressing to active - this is the active data center, or the only data center, and there are no pods ready yet.
- progressing to active (ready for traffic) - this is the active data center, or the only data center, and there is at least one pod of each type, www, db, and nginx, ready, so the data center is accepting traffic.
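If you want to read these fields from a script rather than by eye, one option is to request the resource as JSON (for example with kubectl get and the -o json option) and pick out the haMode value and the Ready condition. The following Python sketch assumes that the JSON output has the same status fields as the example above; the summarize_status function and the SAMPLE document are illustrative.

```python
import json

# Trimmed sample with the same fields as the active data center example.
SAMPLE = json.dumps({
    "status": {
        "conditions": [
            {"lastTransitionTime": "2020-04-27T08:00:25Z",
             "message": "4/6", "status": "False", "type": "Ready"},
        ],
        "haMode": "progressing to active (ready for traffic)",
    }
})


def summarize_status(resource_json: str) -> dict:
    """Extract haMode and the Ready condition from a resource's JSON."""
    status = json.loads(resource_json).get("status", {})
    ready = next(
        (c for c in status.get("conditions", []) if c.get("type") == "Ready"),
        {},
    )
    return {
        "haMode": status.get("haMode"),
        "ready": ready.get("status") == "True",  # condition status is a string
        "pods": ready.get("message"),            # for example "4/6"
    }


print(summarize_status(SAMPLE))
```

In the sample, the Ready condition is False with 4 of 6 pods ready, yet the haMode value already indicates that the data center is accepting traffic, which matches the third state in the list.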
PortalCluster CR file for a passive data center that you can view by running the kubectl describe PortalService command:
status:
  conditions:
  - lastTransitionTime: "2020-04-21T10:54:54Z"
    message: 6/6
    status: "True"
    type: Ready
  customImages: true
  dbCASecret: portal-db-ca
  encryptionSecret: ptl-encryption-key
  endpoints:
    portal-director: https://api.portal.ha-demo.cloud/
    portal-web: https://portal.ha-demo.cloud/
    replication: https://ptl-replication.cvs.apic-2ha-frankfur-683198-xxxx14539d6e75f74e950db7077951d4-0000.eu-de.containers.appdomain.cloud/
  localMultiSiteHAPeerConfigHash: VguODE74TkiS3LCc5ytQiaF8100PXMHUrVBtb+PbKOg=
  haMode: progressing to passive (ready for traffic)
  microServiceSecurity: custom
  multiSiteHA:
    dbPodScale: 3
    remoteSiteDeploymentName: portal
    remoteSiteName: dallas
    replicationPeer: https://ptl-replication.cvs.apic-2ha-dallas-144307-xxxx14539d6e75f74e950db7077951d4-0000.us-south.containers.appdomain.cloud/
    wwwPodScale: 3
  phase: 0
  serviceCASecret: portal-ca
  serviceClientSecret: portal-client
  serviceServerSecret: portal-server
  services:
    db: portal-frankfurt-db
    db-remote: portal-dallas-db
    nginx: portal-nginx
    tunnel: portal-tunnel
    www: portal-frankfurt-www
    www-remote: portal-dallas-www
  versions:
    available:
      channels:
      - name: "10"
      - name: "10.0"
      - name: 10.0.0
      - name: 10.0.0.0
      versions:
      - name: 10.0.0.0-1038
    reconciled: 10.0.0.0-1038
haMode for the passive data center:
- passive - this is the passive data center, and all pods are ready and accepting traffic.
- progressing to passive - this is the passive data center, and there are no pods ready yet.
- progressing to passive (ready for traffic) - this is the passive data center, and there is at least one pod of each type, www, db, and nginx, ready, so the data center is accepting traffic.