Recovering from a failover of a two data center deployment

How to recover a two data center disaster recovery deployment on Kubernetes and OpenShift.

Before you begin

Ensure that you understand the concepts of two data center disaster recovery in API Connect. For more information, see A two data center deployment strategy on Kubernetes and OpenShift.

About this task

After a failure has been resolved, the failed data center can be brought back online and re-linked to the currently active data center. It is important to do this as soon as possible, in order to reinstate disaster recovery. Otherwise, if another failure occurs before the failed data center is brought back online, there could be a complete outage.

The failed data center can be brought back online as the new active primary data center, or it can be brought back as a passive secondary data center while the currently active data center remains the primary one. This decision depends on your own company policies for recovering from a failure.

The amount of data that is sent when recovering from a failover depends on how long the outage lasted and how much activity there was during that period.

The operational states of a Developer Portal service:
  • progressing to active - Pods are progressing to the active state, but none are capable of serving traffic.
  • progressing to active (ready for traffic) - At least one pod of each type is ready for traffic. The dynamic router can be linked to this service.
  • active - All of the pods are ready and in the correct disaster recovery state for the active data center.
  • progressing to passive - Pods are progressing to the passive state, but none are capable of serving traffic.
  • progressing to passive (ready for traffic) - At least one pod of each type is ready for traffic.
  • passive - All of the pods are ready and in the correct disaster recovery state for the passive data center.
  • progressing to down - Pods are moving from the passive state to down.
  • down - The multi-site HA mode is deleted from the passive data center.
You can run the following command to check the operational state of a service:
kubectl describe <CR-type> <service-name> -n <namespace>
Where <CR-type> is mgmt for the API Manager service or ptl for the Developer Portal service, <service-name> is the name of the service, and <namespace> is the target installation namespace in the Kubernetes cluster.
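For a quick check, the following sketch reads the state directly from the CR status instead of scanning the full kubectl describe output. It assumes that the HA mode is surfaced in the CR status as a haMode field; that field name is an assumption, so verify it on your deployment first with kubectl get <CR-type> <service-name> -n <namespace> -o yaml.

# Print only the current HA mode of the API Manager service (assumed status field: haMode)
kubectl get mgmt <mgmt-cluster-name> -n <namespace> -o jsonpath='{.status.haMode}'

# Print only the current HA mode of the Developer Portal service (assumed status field: haMode)
kubectl get ptl <ptl-cluster-name> -n <namespace> -o jsonpath='{.status.haMode}'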
Important:
  • If the active data center is offline, do not remove the multi-site HA CR section from the passive data center, as this action deletes all the data from the databases. If you want to revert to a single data center topology, you must remove the multi-site HA CR section from the active data center, and then redeploy the passive data center. If the active data center is offline, you must first change the passive data center to be active, and then remove the multi-site HA CR section from the now active data center. The passive site must be uninstalled and redeployed as a clean install before it can be used. For more information, see Removing a two data center deployment.
  • If the passive data center is offline for more than 24 hours, there can be issues with the disk space on the active data center, so you must revert your deployment to a single data center topology. To revert to a single data center topology, you must remove the multi-site HA CR section from the active data center. When the passive site has been redeployed, the multi-site HA CR can be reapplied to the active site. For more information, see Removing a two data center deployment.
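For illustration, removing the multi-site HA CR section means deleting the entire multiSiteHA block from the subsystem CR. A minimal sketch, assuming only the mode field that is shown in this topic (your block typically contains additional multi-site HA settings, all of which must be removed):

kubectl edit mgmt <mgmt-cluster-name> -n <namespace>

and delete the whole multiSiteHA block, for example:

multiSiteHA:
  mode: active
  # ... any other multi-site HA settings in your deployment

The same approach applies to the PortalCluster CR (kubectl edit ptl <ptl-cluster-name> -n <namespace>) for the Developer Portal service.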

Procedure

  • Recovering from an API Manager failover

    The following instructions show how to bring data center one (DC1) back as the primary (active) data center, and assume that DC1 is back online and in passive HA mode. If you prefer, you can keep DC1 as the secondary data center, and leave data center two (DC2) as the primary (active) data center.

    1. Change DC2 (in this example, located in Raleigh) to be passive.
      Edit the DC2 ManagementCluster custom resource (CR) by running the following command (see also the kubectl patch sketch after this procedure):
      kubectl edit mgmt <mgmt-cluster-name> -n <namespace>
      and change the mode to passive:
      multiSiteHA:
        mode: passive
      
      Where:
      • <mgmt-cluster-name> is the name specified in the subsystem CR at installation time. You can check the name by running kubectl get ManagementCluster -n <namespace>.
      • <namespace> is the target installation namespace in the Kubernetes cluster.
    2. Run the following command on DC2 to check that the API Manager service is ready for traffic:
      kubectl describe mgmt <mgmt-cluster-name> -n <namespace>
      The service is ready for traffic when the Ha mode part of the Status object is set to passive or progressing to passive (ready for traffic). For example:
      Status:
        ...
        Ha mode:                            passive
        ...
    3. Change DC1 (in this example, located in Dallas) to be active.
      Edit the DC1 ManagementCluster custom resource (CR) by running the following command:
      kubectl edit mgmt <mgmt-cluster-name> -n <namespace>
      and change the mode to active:
      multiSiteHA:
        mode: active
      
      Where:
      • <mgmt-cluster-name> is the name specified in the subsystem CR at installation time. You can check the name by running kubectl get ManagementCluster -n <namespace>.
      • <namespace> is the target installation namespace in the Kubernetes cluster.
    4. Run the following command on DC1 to check that the API Manager service is ready for traffic:
      kubectl describe mgmt <mgmt-cluster-name> -n <namespace>
      The service is ready for traffic when the Ha mode part of the Status object is set to active or progressing to active (ready for traffic). For example:
      Status:
        ...
        Ha mode:                            active
        ...
    5. Update the dynamic router to redirect all traffic to DC1 instead of DC2. Note that until the DC1 site becomes active, the UIs might not be operational.
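
    As referenced in step 1, these mode changes can also be scripted with kubectl patch instead of an interactive kubectl edit. This is a hedged sketch, assuming that the mode field is located at spec.multiSiteHA.mode in the ManagementCluster CR (matching the multiSiteHA snippets above); adjust the path if your CR differs. Each command must be run against the Kubernetes context of the data center that it refers to.
      # Run against DC2 first: demote the API Manager service to passive
      kubectl patch mgmt <mgmt-cluster-name> -n <namespace> --type merge -p '{"spec":{"multiSiteHA":{"mode":"passive"}}}'

      # Then run against DC1: promote the API Manager service to active
      kubectl patch mgmt <mgmt-cluster-name> -n <namespace> --type merge -p '{"spec":{"multiSiteHA":{"mode":"active"}}}'

      # Confirm the Ha mode on each data center before you update the dynamic router
      kubectl describe mgmt <mgmt-cluster-name> -n <namespace>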
  • Recovering from a Developer Portal failover

    When you recover from a Developer Portal failover, the amount of data that is sent from the now passive data center (in this example, DC2) back to the newly recovered active data center (in this example, DC1) depends on how much time has passed and how many changes were made in the meantime.

    If possible, the Portal sends an Incremental State Transfer (IST), which is a replay of the transactions that have happened since the failed data center, DC1, became unreachable. The IST can grow to be up to 3 GB in size. If the IST grows larger than 3 GB, the Portal sends a State Snapshot Transfer (SST) instead, which is a full copy of the whole database. How long this process takes depends on how many Portal sites there are, how much content is in each of those sites, how active they are (the number of transactions), and the network speed between the data centers. However, it is likely to take in the range of 5 - 60 minutes.

    You can check whether the Developer Portal servers are in sync by running the status command in the portal-www pod admin container (an example of running this check is included in the sketch after this procedure). If all servers show as primary, they are in sync and traffic can be directed wherever needed.

    The following instructions show how to bring DC1 back as the active primary data center, and switch DC2 to be the passive secondary data center. Again, if you prefer, you can keep DC2 as the active primary data center and leave DC1 as the passive secondary data center.

    1. Set DC2 (in this example, located in Raleigh) to be passive.
      Edit the DC2 PortalCluster custom resource (CR) by running the following command (see also the kubectl patch sketch after this procedure):
      kubectl edit ptl <ptl-cluster-name> -n <namespace>
      and change the mode to passive:
      multiSiteHA:
        mode: passive
      
      Where:
      • <ptl-cluster-name> is the name specified in the subsystem CR at installation time. You can check the name by running kubectl get PortalCluster -n <namespace>.
      • <namespace> is the target installation namespace in the Kubernetes cluster.
    2. Run the following command on DC2 to check the Developer Portal service status:
      kubectl describe ptl <ptl-cluster-name> -n <namespace>
      You can continue to the next step when the Ha mode part of the Status object is set to progressing to passive, or any of the passive states. For example:
      Status:
        ...
        Ha mode:                            progressing to passive
        ...
      Important: Do not wait for the Status of the active DC to become passive before you change the passive DC to become active. Change the DC2 mode to passive, then run the kubectl describe ptl <ptl-cluster-name> -n <namespace> command, and as soon as the Ha mode is set to progressing to passive, change DC1 to be active.
    3. Set DC1 (in this example, located in Dallas) to be active.
      Edit the DC1 PortalCluster custom resource (CR) by running the following command:
      kubectl edit ptl <ptl-cluster-name> -n <namespace>
      and change the mode to active:
      multiSiteHA:
        mode: active
      
      Where:
      • <ptl-cluster-name> is the name specified in the subsystem CR at installation time. You can check the name by running kubectl get PortalCluster -n <namespace>.
      • <namespace> is the target installation namespace in the Kubernetes cluster.
    4. Run the following command on DC1 to check that the Developer Portal service is ready for traffic:
      kubectl describe ptl <ptl-cluster-name> -n <namespace>
      The service is ready for traffic when the Ha mode part of the Status object is set to active or progressing to active (ready for traffic). For example:
      Status:
        ...
        Ha mode:                            active
        ...
    5. Update the dynamic router to redirect all traffic to DC1 instead of DC2.
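
    As referenced in step 1, the Developer Portal mode changes can also be scripted. The following hedged sketch assumes that the mode field is located at spec.multiSiteHA.mode in the PortalCluster CR and that the status surfaces the HA mode as haMode (both field paths are assumptions; verify them with kubectl get ptl <ptl-cluster-name> -n <namespace> -o yaml). It preserves the ordering rule from step 2: DC1 is made active as soon as DC2 reports progressing to passive, without waiting for the full passive state. The portal-www pod name is illustrative, and running status assumes that the command is on the PATH of the admin container, as described before this procedure.
      # On DC2: demote the Developer Portal service to passive
      kubectl patch ptl <ptl-cluster-name> -n <namespace> --type merge -p '{"spec":{"multiSiteHA":{"mode":"passive"}}}'

      # On DC2: re-run until the HA mode reports progressing to passive (or any passive state)
      kubectl get ptl <ptl-cluster-name> -n <namespace> -o jsonpath='{.status.haMode}'

      # On DC1: promote the Developer Portal service to active
      kubectl patch ptl <ptl-cluster-name> -n <namespace> --type merge -p '{"spec":{"multiSiteHA":{"mode":"active"}}}'

      # On DC1 and DC2: confirm that the Developer Portal servers are in sync (all servers report as primary)
      kubectl get pods -n <namespace> | grep portal-www
      kubectl exec -it <portal-www-pod-name> -n <namespace> -c admin -- status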