A two data center deployment strategy on Kubernetes and OpenShift

An overview of the two data center disaster recovery deployment strategy in API Connect.

High availability versus disaster recovery

High availability is concerned with configuring a system so that it continues to operate, and its users and any consuming systems experience no loss of service, in the event of hardware or software failures. Disaster recovery (DR), by contrast, is concerned with the configurations and procedures that allow a system to be recovered following a catastrophic hardware, software, or operator failure that has led to a loss of service. Two important metrics should be considered for DR solutions:
Recovery Time Objective (RTO)
The RTO is the length of time for which it is acceptable for the system to be unavailable during a disaster.
Recovery Point Objective (RPO)
Because DR solutions are usually based on some form of data copy or backup, a system might be recovered to a state prior to the disaster, rather than to its state at the actual instant of the disaster. The RPO measures how far back in time the recovery point will be, and therefore how much data might be lost. An RPO of zero asserts that no data will be lost, but such a solution is often a compromise against the cost and performance of the system.

To achieve high availability in your API Connect deployment, a minimum of three data centers is required. This configuration creates a quorum of data centers, allowing automated failover in any direction, and enabling all three data centers to be active. The quorum majority voting algorithm allows a single data center to be offline while data consistency and availability are maintained across the remaining two data centers, because they still represent a majority in the deployment (avoiding split-brain syndrome). However, when three data centers are not possible, a two data center deployment provides a disaster recovery solution that has both a low Recovery Time Objective (RTO) and a low Recovery Point Objective (RPO). The following information provides an overview of the concepts and architecture required for configuring a two data center deployment in API Connect.

High-level notes about the two data center disaster recovery (DR) solution

  • Two data center DR is strictly an Active/Passive deployment for the API Manager and Developer Portal services, and must use manual failover.
  • Two DataPower® gateway subsystems must be deployed to provide high availability for the gateway service. However, this scenario doesn't provide high availability for the analytics service.
  • If high availability is required for the analytics service, two analytics subsystems must be configured, one per gateway subsystem, but with this configuration Developer Portal analytics isn't possible.
  • Data consistency is prioritized over data availability.
  • Each data center must be set up within its own Kubernetes cluster (the OpenShift and VMware form factors are both based on Kubernetes).
  • Latency between the two data centers must be less than 80ms.
  • Replication of the API Manager is asynchronous, so, depending on the amount of latency between the two data centers, there is a small possibility that recent updates are not transferred in the event of a failure.
  • Replication of the Developer Portal is synchronous, and therefore the latency is limited to 80ms or less.
  • The API Manager and Developer Portal services in the two data centers must either all be highly available, or none of them highly available, and the number of nodes in each data center must match.
  • The deployment in each data center is effectively an instance of the same API Connect deployment; therefore, the endpoints, certificates, and Kubernetes secrets must all be the same.

Deployment architecture

A two data center deployment model is optimized for data consistency ahead of data availability, and must use manual failover when a fault occurs. If data availability is also a priority, three data centers are required, and a quorum can be used for automatic failover. It is not possible to have automatic failover with only two data centers.

To have a single Kubernetes cluster span the two data centers, the latency must be less than 6ms. However, because the two data centers are likely to be geographically separate, a single Kubernetes cluster cannot be used to span both; instead, each data center must be set up with its own Kubernetes cluster. To achieve data replication within a reasonable time frame in this scenario, the latency between the two data centers must still be less than 80ms, and database replication is handled outside of Kubernetes.

To achieve high availability for the DataPower Gateway, you must deploy two gateway subsystems: one subsystem in the active data center, and a separate subsystem in the passive (standby) data center, and then publish all Products and APIs to both gateway subsystems. The gateway subsystems are independent, so each is insulated from an issue that occurs in the other. A global dynamic router can then be used to route traffic to one gateway subsystem or the other. If high availability is also required for the analytics service, two analytics subsystems must be configured, one per gateway subsystem, but with this configuration Developer Portal analytics isn't possible.
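The two gateway subsystems are defined independently, each in its own cluster and each with its own endpoints, unlike the API Manager and Developer Portal services, which share endpoints across the data centers. As a sketch only (the CR kind, API version, and field names are indicative of the v10 gateway subsystem CR and can differ by release; all host and secret names are placeholders), the gateway in each data center might be declared along these lines:

  # Data center 1 gateway subsystem (illustrative values only)
  apiVersion: gateway.apiconnect.ibm.com/v1beta1
  kind: GatewayCluster
  metadata:
    name: gwv6-dc1
  spec:
    gatewayEndpoint:
      hosts:
      - name: gw.dc1.example.com        # API traffic endpoint for data center 1
        secretName: dc1-gw-endpoint
    gatewayManagerEndpoint:
      hosts:
      - name: gwd.dc1.example.com       # endpoint registered in Cloud Manager
        secretName: dc1-gwd-endpoint
  ---
  # Data center 2 gateway subsystem: an independent service with its own endpoints
  apiVersion: gateway.apiconnect.ibm.com/v1beta1
  kind: GatewayCluster
  metadata:
    name: gwv6-dc2
  spec:
    gatewayEndpoint:
      hosts:
      - name: gw.dc2.example.com
        secretName: dc2-gw-endpoint
    gatewayManagerEndpoint:
      hosts:
      - name: gwd.dc2.example.com
        secretName: dc2-gwd-endpoint

Both gateway services are then registered in the Cloud Manager, and every Product and API is published to both, so that the global dynamic router can direct API traffic to either data center.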

A Catalog can have only a single Developer Portal service associated with it, and so this service needs to be highly available across both data centers. The requirement for a highly available service across both data centers also applies to the API Manager service, as it forms the core of API Connect, and is used for all user authentication and application/subscription operations in the Developer Portal.

There can be multiple Developer Portal services in an API Connect deployment (although still only one Developer Portal per Catalog), and those Developer Portal services can be in different data centers from API Manager (for example, in a hybrid cloud), so there must be the ability to monitor and fail over individual services.

The deployment in each data center is effectively an instance of the same API Connect deployment; because the database is shared, all of the configuration is also shared. Therefore, the endpoints, certificates, and Kubernetes secrets all need to be the same.
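For example, the management subsystem CR in each data center carries identical endpoint definitions, and the certificates and secrets behind those endpoints are copied between the two clusters. The fragment below is a sketch only; the property names follow the general shape of the v10 ManagementCluster CR, and all host and secret names are placeholders:

  # Identical in both data centers (illustrative host and secret names)
  spec:
    cloudManagerEndpoint:
      hosts:
      - name: admin.example.com
        secretName: cm-endpoint
    apiManagerEndpoint:
      hosts:
      - name: manager.example.com
        secretName: apim-endpoint
    platformAPIEndpoint:
      hosts:
      - name: api.example.com
        secretName: api-endpoint
    consumerAPIEndpoint:
      hosts:
      - name: consumer.example.com
        secretName: consumer-endpoint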

Active/Passive configuration

The data centers are run with an Active/Passive failover configuration.
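In the subsystem CRs this is typically expressed through a multi-site HA section that marks one data center as active and the other as passive, and points each site at its peer for replication. The following fragment is a sketch only; the property names follow the general shape of the v10 multi-site HA configuration and can differ by release, and the replication host and secret names are placeholders:

  # Data center 1 (active) - management subsystem fragment, illustrative only
  spec:
    multiSiteHA:
      mode: active
      replicationEndpoint:
        hosts:
        - name: mgrreplication.dc1.example.com
          secretName: dc1-mgr-replication
      replicationPeerFQDN: mgrreplication.dc2.example.com
      tlsClient:
        secretName: mgr-replication-client
  ---
  # Data center 2 (passive) - the mirror image of the active site
  spec:
    multiSiteHA:
      mode: passive
      replicationEndpoint:
        hosts:
        - name: mgrreplication.dc2.example.com
          secretName: dc2-mgr-replication
      replicationPeerFQDN: mgrreplication.dc1.example.com
      tlsClient:
        secretName: mgr-replication-client

A manual failover then amounts to changing the mode values, so that the previously passive data center becomes the active one.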

Dynamic routing
Note: A dynamic routing device, such as a load balancer, is required to route traffic to either data center. However, neither this device nor its configuration is part of the API Connect offering. Contact IBM Services if you require assistance with configuring a dynamic router.

The dynamic router configuration must handle the traffic between the subsystems, as well as between the data centers. For example, the consumer API calls from the Developer Portal to API Manager must pass through the dynamic router so that the Developer Portal can use a single endpoint, regardless of which data center API Manager is active in. The same is needed for calls to the Platform and Admin APIs from the other subsystems, as well as for incoming UI traffic for the Developer Portal UI, Cloud Manager UI, and API Manager UI.

The dynamic router must support SSL passthrough, so that it also routes the Mutual TLS (mTLS) connections between API Manager and the Developer Portal, and between API Manager and the DataPower Gateway. There should be no need to do TLS termination on the router, because it can do layer 4 based routing by using Server Name Indication (SNI).
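The dynamic router itself is outside the API Connect offering, but the behavior it needs is the same as that of an OpenShift passthrough route, which selects a backend from the SNI host name and forwards the TLS connection without terminating it. The following Route is offered only as an analogy for that behavior, with placeholder host and service names, not as configuration for the external router:

  apiVersion: route.openshift.io/v1
  kind: Route
  metadata:
    name: portal-passthrough
  spec:
    host: portal.example.com        # matched against the SNI host name presented by the client
    to:
      kind: Service
      name: portal-web              # placeholder backend service
    tls:
      termination: passthrough      # TLS is not terminated at the router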

Service status

When a failure occurs, it is common practice to display an interstitial system outage web page. The dynamic router can be configured to display this web page when there is a failure, while the manual failover to the passive data center is taking place.

Deployment profiles: Dev and Prod

Two data center DR deployment architecture is possible whether the Kubernetes clusters in each data center have a Dev or a Prod deployment-profile. In other words, you can still configure a two data center DR topology when there is only one pod of each type, as well as when there are three pods of each type. For more information about the deployment-profiles, see Installing the API Connect subsystems on Kubernetes, and Requirements for initial deployment on VMware.
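The profile is chosen per subsystem at installation time, typically through a profile setting in the subsystem CR. As an illustration only (the exact profile names differ by subsystem and release, and the values below are placeholders), a one-replica development profile and a three-replica production profile might look like this:

  # Development: one pod of each type (placeholder profile name)
  spec:
    profile: n1xc4.m16
  ---
  # Production: three pods of each type (placeholder profile name)
  spec:
    profile: n3xc4.m16

Whichever profile you choose, both data centers must use the same one, because the number of nodes in each data center must match.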