Failure handling of a two data center deployment

How to detect failures and perform service failover in a two data center disaster recovery deployment on VMware.

Before you begin

Ensure that you understand the concepts of a 2DCDR deployment in API Connect. For more information, see A two data center deployment strategy on VMware.

About this task

Failure handling comprises two main concepts: failure detection, and failover of services.

Failure detection is about detecting that a failure has happened, and then alerting the relevant administrator so that they can take action. Failure detection is different from system monitoring. Failure detection checks are run often and must complete quickly (for example, every 5 seconds), so they are normally limited in what they check. In API Connect, failure checks detect only whether the database and web server are online, and nothing more. Monitoring checks, by contrast, typically run less frequently, but check for things like CPU, memory, and disk space usage. The results of these checks can then be recorded to perform historical trend analysis, for example to spot memory leaks.

If a failure is detected, it is important to make sure that the cluster or service is definitely offline. For example, if data center one (DC1) appears to have an outage, and the plan is to fail over to data center two (DC2), then it is important to make sure that the pods in DC1 really are offline. If the DC1 pods are still running, they might continue to send data to DC2, which could lead DC2 to think that it is not the active deployment, and that could lead to split-brain. A fast way to ensure that the cluster or service really is offline is to drop the network link to those pods in your VMware infrastructure, and then delete them.

The failure detection endpoint must be location specific and must be exempt from any traffic routing, so that it's possible to explicitly check the health of each data center regardless of how traffic is being routed.
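For example, a lightweight failure detection check might poll a location-specific health endpoint from outside each data center and raise an alert after a few consecutive failures. The following is a minimal sketch only; the endpoint URL, interval, failure threshold, and alert action are assumptions that you must adapt to your own deployment and routing configuration.

# Minimal failure detection sketch. The endpoint URL and alert action are
# assumptions; adapt them to your own deployment and routing configuration.
ENDPOINT="https://dc1.example.com/health"   # hypothetical location-specific URL
FAILURES=0
while true; do
  CODE=$(curl -k -s -o /dev/null -w "%{http_code}" --max-time 3 "$ENDPOINT")
  if [ "$CODE" = "200" ]; then
    FAILURES=0
  else
    FAILURES=$((FAILURES + 1))
  fi
  if [ "$FAILURES" -ge 3 ]; then
    echo "ALERT: $ENDPOINT failed $FAILURES consecutive checks"   # replace with your alerting mechanism
  fi
  sleep 5
done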

You can fail over a specific service, for example portal_service_1, or all of the services that are running in a particular data center. The following sections describe how failover is achieved.

The operational states of an API Manager service:
  • progressing to active - Pods are progressing to the active state, but none are capable of serving traffic.
  • active - All of the pods are ready and in the correct disaster recovery state for the active data center.
  • progressing to passive - Pods are progressing to the warm-standby state, but none are capable of serving traffic.
  • passive - All of the pods are ready and in the correct disaster recovery state for the warm-standby data center.
The operational states of a Developer Portal service:
  • progressing to active - Pods are progressing to the active state, but none are capable of serving traffic.
  • progressing to active (ready for traffic) - At least one pod of each type is ready for traffic. The dynamic router can be linked to this service.
  • active - All of the pods are ready and in the correct disaster recovery state for the active data center.
  • progressing to passive - Pods are progressing to the warm-standby state, but none are capable of serving traffic.
  • progressing to passive (ready for traffic) - At least one pod of each type is ready for traffic.
  • passive - All of the pods are ready and in the correct disaster recovery state for the warm-standby data center.
  • progressing to down - Pods are moving from the warm-standby state to down.
  • down - The multi-site HA mode is deleted from the warm-standby data center.
You can run the following command to check the operational state of a service:
kubectl describe ServiceName
Where ServiceName is the name of the API Manager or Developer Portal service.
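For example, to show only the disaster recovery state, you can filter the describe output. This is a convenience sketch only; the field appears as Ha mode for the API Manager service and haMode for the Developer Portal service, as shown in the Example section later in this topic.

kubectl describe ServiceName | grep -iE "ha ?mode"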
Important:
  • If the active data center is offline, do not change the multi-site-ha-enabled setting to false on the warm-standby data center, as this action deletes all the data from the databases. If you want to revert to a single data center topology, you must change the multi-site-ha-enabled setting to false on the active data center, and then redeploy the warm-standby data center. If the active data center is offline, you must first change the warm-standby data center to be active, and then change the multi-site-ha-enabled setting to false on the now active data center.
  • If the warm-standby data center is offline for more than 24 hours, there can be disk space issues on the active data center, so you must revert your deployment to a single data center topology. To revert to a single data center topology, you must change the multi-site-ha-enabled setting to false on the active data center (a minimal example is shown after these notes). When the warm-standby site has been redeployed, the multi-site-ha-enabled setting can be reapplied to the active site.
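For reference, reverting the active data center to a single data center topology uses the same apicup pattern as the failover steps in the Procedure section. The following is a minimal sketch, assuming the active management subsystem is named mgmt_dallas as in the examples that follow:

# Disable 2DCDR on the active management subsystem, then apply the change.
apicup subsys set mgmt_dallas multi-site-ha-enabled=false
apicup subsys install mgmt_dallas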

Procedure

  • API Manager service failover

    To initiate a failover from DC1 to DC2, you must first set DC1 to warm-standby before you make the warm-standby data center active. This action is needed to prevent split-brain, as there cannot be two active services or clusters at the same time. If the failover is being done because the current active data center has had a complete outage, ensure that the network links to that data center are dropped in your VMware infrastructure, to avoid the potential for split-brain. The offline data center must be set to warm-standby before the network links are restored. The following instructions show how to fail over DC1 to DC2 for the API Manager service.

    1. Set DC1 (in this example, located in Dallas) to be warm-standby.
      Run apicup to change the multi-site-ha-mode property to passive on the Dallas data center, for example:
      apicup subsys set mgmt_dallas multi-site-ha-mode=passive
      where
      • mgmt_dallas is the name of the management service on DC1.
      Note: If DC1 is completely down, and you cannot set it to be warm-standby at this point, you must ensure that the network links to DC1 are removed before continuing to set DC2 to be active. You must then not restore the network links to DC1 until you can set DC1 to be warm-standby.
    2. Run apicup to update the settings on DC1, for example:
      apicup subsys install mgmt_dallas
    3. Check the status of the DC1 management subsystem:
      Run the following command to get the 2DCDR status:
      apicup subsys get mgmt_dallas --validate | tr -s " " | grep multi-site-ha-mode

      If the returned status is passive (rather than progressing to passive), include the --skip-health-check argument when you set DC2 to be active in step 4.

    4. Set DC2 (in this example, located in Raleigh) to be active.
      Run apicup to change the multi-site-ha-mode property to active on the Raleigh data center:
      apicup subsys set mgmt_raleigh multi-site-ha-mode=active <extra argument>
      where
      • mgmt_raleigh is the name of the management service on DC2.
      • <extra argument> set this according to the status of DC1 that you checked in step 3.
    5. Run apicup to update the settings on the services on DC2.
      For example:
      apicup subsys install mgmt_raleigh
    6. Update your dynamic router to redirect all traffic to DC2 instead of DC1. Note that until the DC2 site becomes active, the UIs might not be operational.
    7. Log in to the virtual machine management subsystem on DC2 by using an SSH tool, and run the following command to check that the API Manager service is ready for traffic:
      kubectl describe ServiceName
      The service is ready for traffic when the Ha mode part of the Status object is set to active. For example:
      Status:
        ...
        Ha mode:                            active
        ...
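    Optionally, you can also confirm the new state from the installer side by reusing the status check from step 3 against DC2; when the failover completes, the returned multi-site-ha-mode value should be active. For example:

      apicup subsys get mgmt_raleigh --validate | tr -s " " | grep multi-site-ha-mode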
  • Developer Portal service failover

    To initiate a failover for the Developer Portal from DC1 to DC2, you must first ensure that DC1 is set to warm-standby before you make the warm-standby data center active, in order to prevent split-brain. The following instructions show how to fail over DC1 to DC2 for the Developer Portal service. You must repeat these steps for each Developer Portal service that you want to fail over.

    Note:
    • Switching a Developer Portal service to warm-standby mode instantly restarts all of its portal-db and portal-www pods at the same time. Therefore, if you don't have an active Developer Portal service that is online, there will be an outage of the Developer Portal web sites until the warm-standby data center is ready to accept traffic.
    • If you want to monitor the failover process, you can run the following command to check the status of the portal pods in the PortalCluster CR:
      kubectl describe ServiceName
      For more details about the status information in a PortalCluster CR file, see the Example section. A simple watch loop is also sketched after these steps.
    • Do not edit the PortalService CR file during the failover process.
    1. Set DC1 (in this example, located in Dallas) to be warm-standby.
      Run apicup to change the multi-site-ha-mode property to passive on the Dallas data center, for example:
      apicup subsys set port_dallas multi-site-ha-mode=passive
      where
      • port_dallas is the name of the Portal service on DC1.
      Note: If DC1 is completely down, and you cannot set it to be warm-standby at this point, you must ensure that the network links to DC1 are removed before continuing to set DC2 to be active. You must then not restore the network links to DC1 until you can set DC1 to be warm-standby.
    2. Run apicup to update the settings on DC1, for example:
      apicup subsys install port_dallas --skip-health-check
      Important: You must not wait for the status of the active DC to become warm-standby before you fail over the warm-standby DC to become active. Change the DC1 multi-site-ha-mode to passive, then run the kubectl describe ServiceName command, and as soon as the haMode is set to progressing to passive, change DC2 to be active.
    3. Set DC2 (in this example, located in Raleigh) to be active.
      Run apicup to change the multi-site-ha-mode property to active on the Raleigh data center, for example:
      apicup subsys set port_raleigh multi-site-ha-mode=active
      where
      • port_raleigh is the name of the Portal service on DC2.
    4. Run apicup to update the settings on the services on DC2.
      For example:
      apicup subsys install port_raleigh --skip-health-check
    5. Update your dynamic router to redirect all traffic to DC2 instead of DC1.
    6. Log in to the virtual machine management subsystem on DC2 by using an SSH tool, and run the following command to check that the Developer Portal service is ready for traffic:
      kubectl describe ServiceName
      The service is ready for traffic when the haMode part of the Status object is set to active. For example:
      Status:
        ...
        haMode:                            active
        ...
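    If you prefer to watch the progress rather than run the describe command repeatedly, you can use a simple loop such as the following, where ServiceName is the name of the Developer Portal service as in the previous steps. This is a sketch only; press Ctrl+C to stop it when the haMode value reaches active.

      # Print the haMode line every 10 seconds so that you can watch the transition.
      while true; do
        date +%T
        kubectl describe ServiceName | grep -iE "ha ?mode"
        sleep 10
      done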
  • Entire data center failover

    To fail over an entire data center, follow the previous steps to fail over the API Manager service, and then follow the steps to fail over each Developer Portal service in that data center.

    How long it takes to complete the failover varies, and depends on hardware speed, network latency, and the size of the databases. However, here are some approximate timings:

    For API Manager:
    • warm-standby to active: approximately 5 minutes
    • active to warm-standby: approximately 15 minutes
    For Developer Portal:
    • warm-standby to active: 15 - 40 minutes
    • active to warm-standby: approximately 10 minutes

Example

Example of the status section of a PortalCluster CR file for an active data center that you can view by running the kubectl describe PortalService command:

status:
  conditions:
  - lastTransitionTime: "2020-04-27T08:00:25Z"
    message: 4/6
    status: "False"
    type: Ready
  customImages: true
  dbCASecret: portal-db-ca
  encryptionSecret: ptl-encryption-key
  endpoints:
    portal-director: https://api.portal.ha-demo.cloud/
    portal-web: https://portal.ha-demo.cloud/
    replication: https://ptl-replication.cvs.apic-2ha-dallas-144307-xxxx14539d6e75f74e950db7077951d4-0000.us-south.containers.appdomain.cloud/
  localMultiSiteHAPeerConfigHash: UM9VtuWDdp/328n0oCl/GQGFzy34o1V9sH2OyXxKMHw= 
  haMode: progressing to active (ready for traffic)
  microServiceSecurity: custom
  multiSiteHA:
    dbPodScale: 3
    remoteSiteDeploymentName: portal
    remoteSiteName: frankfurt
    replicationPeer: https://ptl-replication.cvs.apic-2ha-frankfur-683198-xxxx14539d6e75f74e950db7077951d4-0000.eu-de.containers.appdomain.cloud/
    wwwPodScale: 3
  phase: 0
  serviceCASecret: portal-ca
  serviceClientSecret: portal-client
  serviceServerSecret: portal-server
  services:
    db: portal-dallas-db
    db-remote: portal-frankfurt-db
    nginx: portal-nginx
    tunnel: portal-tunnel
    www: portal-dallas-www
    www-remote: portal-frankfurt-www
  versions:
    available:
      channels:
      - name: "10"
      - name: "10.0"
      - name: 10.0.0
      - name: 10.0.0.0
      versions:
      - name: 10.0.0.0-1038
    reconciled: 10.0.0.0-1038
Meaning of the haMode for the active data center:
  • active - this is the active data center, or the only data center, and all pods are ready and accepting traffic.
  • progressing to active - this is the active data center, or the only data center, and there are no pods ready yet.
  • progressing to active (ready for traffic) - this is the active data center, or the only data center, and there is at least one pod of each type, www, db, and nginx, ready, so the data center is accepting traffic.
Example of the status section of a PortalCluster CR file for a warm-standby data center that you can view by running the kubectl describe PortalService command:

status:
  conditions:
  - lastTransitionTime: "2020-04-21T10:54:54Z"
    message: 6/6
    status: "True"
    type: Ready
  customImages: true
  dbCASecret: portal-db-ca
  encryptionSecret: ptl-encryption-key
  endpoints:
    portal-director: https://api.portal.ha-demo.cloud/
    portal-web: https://portal.ha-demo.cloud/
    replication: https://ptl-replication.cvs.apic-2ha-frankfur-683198-xxxx14539d6e75f74e950db7077951d4-0000.eu-de.containers.appdomain.cloud/
  localMultiSiteHAPeerConfigHash: VguODE74TkiS3LCc5ytQiaF8100PXMHUrVBtb+PbKOg=
  haMode: progressing to passive (ready for traffic)
  microServiceSecurity: custom
  multiSiteHA:
    dbPodScale: 3
    remoteSiteDeploymentName: portal
    remoteSiteName: dallas
    replicationPeer: https://ptl-replication.cvs.apic-2ha-dallas-144307-xxxx14539d6e75f74e950db7077951d4-0000.us-south.containers.appdomain.cloud/
    wwwPodScale: 3
  phase: 0
  serviceCASecret: portal-ca
  serviceClientSecret: portal-client
  serviceServerSecret: portal-server
  services:
    db: portal-frankfurt-db
    db-remote: portal-dallas-db
    nginx: portal-nginx
    tunnel: portal-tunnel
    www: portal-frankfurt-www
    www-remote: portal-dallas-www
  versions:
    available:
      channels:
      - name: "10"
      - name: "10.0"
      - name: 10.0.0
      - name: 10.0.0.0
      versions:
      - name: 10.0.0.0-1038
    reconciled: 10.0.0.0-1038
Meaning of the haMode for the warm-standby data center:
  • passive - this is the warm-standby data center, and all pods are ready and accepting traffic.
  • progressing to passive - this is the warm-standby data center, and there are no pods ready yet.
  • progressing to passive (ready for traffic) - this is the warm-standby data center, and there is at least one pod of each type, www, db, and nginx, ready, so the data center is accepting traffic.

What to do next

As soon as a failure on a data center has been resolved, the failed data center should be brought back online and re-linked to the currently active data center in order to maintain the highest availability; see Recovering from a failover of a two data center deployment for more information.