Failure handling of a two data center deployment

How to detect failures, and perform service failover, in a two data center disaster recovery deployment on Kubernetes and OpenShift.

Before you begin

Ensure that you understand the concepts of two data center disaster recovery in API Connect. For more information, see A two data center deployment strategy on Kubernetes and OpenShift.

About this task

Failure handling consists of two main activities: failure detection, and failover of services.

Failure detection is about detecting that a failure has happened, and then alerting the relevant administrator so that they can take action. Failure detection is different from system monitoring. Failure detection checks run often and fast (for example, every 5 seconds), so they are normally limited in what they check; in API Connect, the failure checks detect only whether the database and web server are online, and nothing more. Monitoring checks typically run less frequently, but are more likely to check for things such as CPU, memory, and disk space usage. The results of these checks can, for example, be recorded to perform historical trend analysis and spot memory leaks.

If a failure is detected by the system administrator, or by any automated monitoring that you are running, it's important to make sure that the cluster or service is definitely offline. For example, if data center one (DC1) appears to have an outage, and the plan is to fail over to data center two (DC2), then it's important to make sure that the pods in DC1 really are offline. If the DC1 pods are still running, they might continue to send data to DC2, which would lead DC2 to think that it isn't the active deployment, and that in turn might lead to a split-brain situation. A fast way to ensure that the cluster or service really is offline is to drop the network link to those pods in your Kubernetes infrastructure, and then delete them.
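One way to drop the network link at the Kubernetes level, before deleting the pods, is to apply a deny-all NetworkPolicy to the namespace in the failed data center. The following is a minimal sketch only: the policy name isolate-failed-dc and the namespace apic-dc1 are placeholders for your own values, it takes effect only if your cluster's network plug-in enforces NetworkPolicy, and your infrastructure team might instead drop the link at the firewall or load balancer.

  apiVersion: networking.k8s.io/v1
  kind: NetworkPolicy
  metadata:
    name: isolate-failed-dc      # hypothetical policy name
    namespace: apic-dc1          # placeholder: namespace of the failed data center
  spec:
    podSelector: {}              # select every pod in the namespace
    policyTypes:
    - Ingress
    - Egress
    # no ingress or egress rules are defined, so all traffic to and from the pods is denied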

The failure detection endpoint must be location-specific and must be exempt from any traffic routing, so that it's possible to explicitly check the health of each data center regardless of how traffic is being routed.
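As an illustration, a failure detection check can be as simple as a short loop that probes a location-specific hostname in each data center every 5 seconds. The hostnames dc1.apic.example.com and dc2.apic.example.com and the /health path in the following sketch are hypothetical placeholders; substitute the location-specific endpoints that you expose for each data center.

  #!/bin/bash
  # Probe each data center directly, bypassing the dynamic router.
  while true; do
    for dc in dc1.apic.example.com dc2.apic.example.com; do
      if curl -ksf -o /dev/null --max-time 3 "https://${dc}/health"; then
        echo "$(date -u +%FT%TZ) ${dc} OK"
      else
        echo "$(date -u +%FT%TZ) ${dc} FAILED - alert the administrator"
      fi
    done
    sleep 5
  done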

You can fail over a specific service, for example portal_service_1, or all of the services that are running in a particular data center. The following sections describe how failover is achieved.

The operational states of an API Manager service:
  • progressing to active: Pods are progressing to the active state, but none are capable of serving traffic.
  • active: All of the pods are ready and in the correct disaster recovery state for the active data center.
  • progressing to passive: Pods are progressing to the passive state, but none are capable of serving traffic.
  • passive: All of the pods are ready and in the correct disaster recovery state for the passive data center.
The operational states of a Developer Portal service:
  • progressing to active: Pods are progressing to the active state, but none are capable of serving traffic.
  • progressing to active (ready for traffic): At least one pod of each type is ready for traffic. The dynamic router can be linked to this service.
  • active: All of the pods are ready and in the correct disaster recovery state for the active data center.
  • progressing to passive: Pods are progressing to the passive state, but none are capable of serving traffic.
  • progressing to passive (ready for traffic): At least one pod of each type is ready for traffic.
  • passive: All of the pods are ready and in the correct disaster recovery state for the passive data center.
  • progressing to down: Pods are moving from the passive state to down.
  • down: The multi-site HA mode is deleted from the passive data center.
You can run one of the following commands to check the operational state of a service:
kubectl describe mgmt <mgmt-cluster-name> -n <namespace>
kubectl describe ptl <ptl-cluster-name> -n <namespace>
Where <mgmt-cluster-name> or <ptl-cluster-name> is the name of the API Manager (ManagementCluster) or Developer Portal (PortalCluster) service, and <namespace> is its installation namespace.
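For example, assuming a ManagementCluster that is named management in a namespace that is named apic (both placeholder values), you can filter the describe output to show just the Ha mode line from the Status section:
kubectl describe mgmt management -n apic | grep 'Ha mode'
The same filter works for a Developer Portal service by using ptl and the portal service name instead.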
Important:
  • If the active data center is offline, do not remove the multi-site HA CR section from the passive data center, as this action deletes all the data from the databases. If you want to revert to a single data center topology, you must remove the multi-site HA CR section from the active data center, and then redeploy the passive data center. If the active data center is offline, you must first change the passive data center to be active, and then remove the multi-site HA CR section from the now active data center. The passive site must be uninstalled and redeployed as a clean install before it can be used. For more information, see Removing a two data center deployment.
  • If the passive data center is offline for more than 24 hours, there can be issues with the disk space on the active data center, so you must revert your deployment to a single data center topology by removing the multi-site HA CR section from the active data center. When the passive site has been redeployed, the multi-site HA CR can be reapplied to the active site. For more information, see Removing a two data center deployment.

Procedure

  • API Manager service failover

    To initiate a failover from DC1 to DC2, you must first set DC1 to passive, before making the passive data center active. This action is needed to prevent split-brain, as there cannot be two active services or clusters at the same time. If the failover is being done because the current active data center has had a complete outage, ensure that the network links to that data center are dropped in your Kubernetes infrastructure, to avoid the potential for split-brain. The offline data center must be set to passive before the network links are restored. The following instructions show how to fail over DC1 to DC2 for the API Manager service.

    1. Set the DC1 API Manager service (in this example, located in Dallas) to be passive.
      Edit the DC1 ManagementCluster custom resource (CR) by running the following command (a non-interactive kubectl patch alternative is sketched after these steps):
      kubectl edit mgmt <mgmt-cluster-name> -n <namespace>
      and change the mode to passive:
      multiSiteHA:
        mode: passive
      
      Where:
      • <mgmt-cluster-name> is the name specified in the subsystem CR at installation time. You can check the name by running kubectl get ManagementCluster -n <namespace>.
      • <namespace> is the target installation namespace in the Kubernetes cluster.
      Note: If DC1 is completely down, and you can’t set it to be passive at this point, you must ensure that the network links to DC1 are removed before continuing to set DC2 to be active. You must then not restore the network links to DC1 until you can set DC1 to be passive.
    2. Run the following command on DC1 to check that the API Manager service is ready for traffic:
      kubectl describe mgmt <mgmt-cluster-name> -n <namespace>
      The service is ready for traffic when the Ha mode part of the Status object is set to passive. For example:
      Status:
        ...
        Ha mode:                            passive
        ...
    3. Change the DC2 API Manager service (in this example, located in Raleigh) from passive to active.
      Edit the DC2 ManagementCluster custom resource (CR) by running the following command:
      kubectl edit mgmt <mgmt-cluster-name> -n <namespace>
      and change the mode to active:
      multiSiteHA:
        mode: active
      
      Where:
      • <mgmt-cluster-name> is the name specified in the subsystem CR at installation time. You can check the name by running kubectl get ManagementCluster -n <namespace>.
      • <namespace> is the target installation namespace in the Kubernetes cluster.
    4. Run the following command on DC2 to check that the API Manager services are ready for traffic:
      kubectl describe mgmt <mgmt-cluster-name> -n <namespace>
      The services are ready for traffic when the Ha mode part of the Status object is set on DC2 to active. For example:
      Status:
        ...
        Ha mode:                            active
        ...
    5. Update your dynamic router to redirect all traffic to DC2 instead of DC1. Note that until the DC2 site becomes active, the UIs might not be operational.
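    If you prefer a non-interactive alternative to kubectl edit in steps 1 and 3 (for example, for scripted failover), the mode can also be switched with kubectl patch. This is a sketch only: it assumes that the multiSiteHA section sits under spec in the ManagementCluster CR, so verify the path against your own CR before using it. The same pattern applies to the PortalCluster (ptl) CR in the following procedure.
      # Set the DC1 API Manager service to passive (run against the DC1 cluster).
      kubectl patch mgmt <mgmt-cluster-name> -n <namespace> --type merge -p '{"spec":{"multiSiteHA":{"mode":"passive"}}}'
      # Set the DC2 API Manager service to active (run against the DC2 cluster).
      kubectl patch mgmt <mgmt-cluster-name> -n <namespace> --type merge -p '{"spec":{"multiSiteHA":{"mode":"active"}}}'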
  • Developer Portal service failover

    To initiate a failover for the Developer Portal from DC1 to DC2, you must first ensure that DC1 is passive, before making the DC2 data center active, in order to prevent split-brain. The following instructions show how to fail over DC1 to DC2 for the Developer Portal service. You must repeat these steps for each Developer Portal service that you want to fail over.

    Note:
    • Switching a Developer Portal service to passive mode instantly restarts all of its portal-db and portal-www pods at the same time. Therefore, if you don't have an active Developer Portal service that is online, there will be an outage of the Developer Portal web sites until the passive data center is ready to accept traffic.
    • If you want to monitor the failover process, you can run the following command to check the status of the portal pods in the PortalCluster CR (a pod watch example is also sketched after the following steps):
      kubectl describe ptl <ptl-cluster-name> -n <namespace>
    • Do not edit the PortalCluster CR during the failover process.
    1. Set the DC1 Developer Portal service (in this example, located in Dallas) to be passive.
      Edit the DC1 PortalCluster custom resource (CR) by running the following command:
      kubectl edit ptl <ptl-cluster-name> -n <namespace>
      and change the mode to passive:
      multiSiteHA:
        mode: passive
      
      Where:
      • <ptl-cluster-name> is the name specified in the subsystem CR at installation time. You can check the name by running kubectl get PortalCluster -n <namespace>.
      • <namespace> is the target installation namespace in the Kubernetes cluster.
      Note: If DC1 is completely down, and you can’t set it to be passive at this point, you must ensure that the network links to DC1 are removed before continuing to set DC2 to be active. You must then not restore the network links to DC1 until you can set DC1 to be passive.
    2. Run the following command on DC1 to check the Developer Portal service status:
      kubectl describe ptl <ptl-cluster-name> -n <namespace>
      You can continue to the next step when the Ha mode part of the Status object is set to progressing to passive, or any of the passive states. For example:
      Status:
        ...
        Ha mode:                            progressing to passive
        ...
      Important: You must not wait for the Ha mode on DC1 to reach passive before failing over DC2 to become active. Change the DC1 mode to passive, run the kubectl describe command from the previous step, and as soon as the Ha mode is set to progressing to passive, change DC2 to be active.
    3. Change the DC2 Developer Portal service (in this example, located in Raleigh) from passive to active.
      Edit the DC2 PortalCluster custom resource (CR) by running the following command:
      kubectl edit ptl <ptl-cluster-name> -n <namespace>
      and change the mode to active:
      multiSiteHA:
        mode: active
      
      Where:
      • <ptl-cluster-name> is the name specified in the subsystem CR at installation time. You can check the name by running kubectl get PortalCluster -n <namespace>.
      • <namespace> is the target installation namespace in the Kubernetes cluster.
    4. Update your dynamic router to redirect all traffic to DC2 instead of DC1.
    5. Run the following command to check that the Developer Portal services on DC1 and DC2 are ready for traffic:
      kubectl describe ptl <ptl-cluster-name> -n <namespace>
      The services are ready for traffic when the Ha mode part of the Status object is set on DC1 to passive, and is set on DC2 to active. For example, on DC2:
      Status:
        ...
        Ha mode:                            active
        ...
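    To monitor the failover while these steps are in progress, you can also watch the portal pods restart, as mentioned in the note before step 1. The namespace is a placeholder for your own installation namespace; the grep filter uses the portal-db and portal-www pod name prefixes mentioned earlier.
      # Watch the portal-db and portal-www pods restart while the service changes mode.
      kubectl get pods -n <namespace> --watch | grep -E 'portal-db|portal-www'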
  • Entire data center failover

    To fail over an entire data center, first follow the previous steps to fail over the API Manager service, and then follow the steps to fail over each Developer Portal service in that data center.

    How long it takes to complete the failover varies, and depends on hardware speed, network latency, and the size of the databases. However, here are some approximate timings:

    For API Manager:
    • passive to active: approximately 5 minutes
    • active to passive: approximately 15 minutes
    For Developer Portal:
    • passive to active: 15 - 40 minutes
    • active to passive: approximately 10 minutes

What to do next

As soon as a failure on a data center has been resolved, the failed data center should be brought back online and relinked to the currently active data center to maintain the highest availability; see Recovering from a failover of a two data center deployment for more information.