Two data center warm-standby deployment on Kubernetes and OpenShift
An overview of the two data center disaster recovery deployment strategy with a warm-standby data center.
Key points of the two data center disaster recovery (DR) solution
- Two data center DR is an active/warm-standby deployment for the management and portal subsystems, and requires manual failover.
- Two DataPower® Gateway subsystems must be deployed to provide high availability for the gateway service. However, this scenario doesn't provide high availability for the analytics service.
- If high availability is required for the analytics service, two analytics subsystems must be configured, one per gateway subsystem. However, with this configuration Developer Portal analytics isn't possible.
- Data consistency is prioritized over data availability.
- Each data center must be set up within its own Kubernetes or OpenShift cluster.
- Latency between the two data centers must be less than 80 ms.
- Port 443 must be open between data centers for API Connect database replication.
- Replication of the management database is asynchronous, so it is possible that the most recent updates do not reach the warm-standby data center if the active data center fails (see the configuration sketch after this list).
- Replication of the portal database is synchronous, which is why the latency between the two data centers must be 80 ms or less.
- The management and portal subsystems in the two data centers must use the same deployment profile.
- The deployment in each data center is effectively an instance of the same API Connect deployment; therefore, the endpoints, certificates, and Kubernetes secrets must all be the same.
- It is not possible to use the Automated API behavior testing application in a two data center disaster recovery configuration. For more information, see Installing the Automated API behavior testing application.
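The following minimal sketch shows how these requirements typically surface in configuration: the multiSiteHA section of a management subsystem custom resource on the active data center. All host names and secret names are placeholders, and the exact field names can vary between API Connect versions, so treat this as an illustration rather than a working configuration.

multiSiteHA:
  mode: active                                 # the warm-standby data center uses: passive
  replicationEndpoint:
    annotations:
      cert-manager.io/issuer: ingress-issuer   # issuer that signs the replication certificate
    hosts:
      - name: mgr-replication.dc1.example.com  # replication ingress, reached over port 443
        secretName: mgr-replication-server
  replicationPeerFQDN: mgr-replication.dc2.example.com   # replication endpoint of the other data center
  tlsClient:
    secretName: mgr-replication-client         # client secret; must be the same in both data centers

The equivalent section in the warm-standby data center points back at the active data center's replication endpoint and sets mode: passive.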
Deployment architecture
A two data center deployment model is optimized for data consistency ahead of data availability, and requires manual failover when a fault occurs. For high availability of the management and portal subsystems, ensure that you use a three replica deployment profile. For more information on deployment profiles, see: Deployment and component profiles.
For a single Kubernetes cluster to span multiple data centers, the network latency must be no more than a few milliseconds, typically less than 10 ms. This low latency is often unachievable between geographically separated data centers. For this reason, the two data center disaster recovery solution requires that each data center is set up with its own Kubernetes cluster. The management and portal subsystem databases are continually replicated from the active data center to the warm-standby data center, which requires a network latency between the two data centers of less than 80 ms.
To achieve high availability for the DataPower Gateway, you must deploy two gateway subsystems: one in the active data center, and a separate subsystem in the warm-standby data center. Publish all Products and APIs to both gateway subsystems. The gateway subsystems are independent, and so are insulated if an issue occurs in one of them. A global dynamic router can then be used to route traffic to one gateway subsystem or the other. If high availability is also required for the analytics service, two analytics subsystems must be configured, one per gateway subsystem, but with this configuration portal analytics isn't possible.
There can be multiple portal services in an API Connect deployment (although still only one portal site per Catalog).
The deployment in each data center is effectively an instance of the same API Connect deployment; because the database is replicated, all of the configuration is also shared. Therefore, the endpoints, certificates, and secrets all need to be the same.
Dynamic routing
Note: A dynamic routing device, such as a load balancer, is required to route traffic to either data center. However, neither this device nor its configuration is part of the API Connect offering. Contact IBM Services if you require assistance with configuring a dynamic router.
The dynamic router configuration must handle the traffic between the subsystems and between the data centers. For example, the consumer API calls from the portal to the management subsystem must pass through a dynamic router so that the portal can use a single endpoint regardless of which data center the management subsystem is active in. The same applies to calls to the Platform and Admin APIs from the other subsystems, and to incoming traffic for the Developer Portal UI, Cloud Manager UI, and API Manager UI.
The dynamic router must support SSL passthrough, so that it routes the Mutual TLS (mTLS) connections between the management subsystem and the portal, and between the management subsystem and the DataPower Gateway. The router must not terminate TLS; it must use layer 4 routing that is based on SNI.
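As an in-cluster illustration of passthrough routing, an OpenShift route can forward TLS traffic without terminating it, leaving the mTLS handshake to the subsystem itself. This is a minimal sketch with placeholder host and service names; the global dynamic router in front of both data centers is a separate device and is not shown here.

apiVersion: route.openshift.io/v1
kind: Route
metadata:
  name: mgmt-platform-api          # placeholder name
spec:
  host: api.mgmt.example.com       # placeholder endpoint; must be identical in both data centers
  to:
    kind: Service
    name: management-juhu          # placeholder service name for the platform API service
  port:
    targetPort: https
  tls:
    termination: passthrough       # no TLS termination; routing is layer 4, based on SNI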
If you do not want to use mTLS between API Connect subsystems, you can enable JWT security instead; see Enable JWT security instead of mTLS.
Service status
When a failure occurs, it is common practice to display an interstitial system outage web page. The dynamic router can be configured to display this web page when there is a failure, while the manual failover to the warm-standby data center is taking place.
Deployment profiles
Both one replica and three replica deployment profiles can be used with two data center DR, but the profile must be the same in each data center. For more information about deployment profiles, see Deployment and component profiles on Kubernetes.
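For illustration, the deployment profile is selected with the profile field of each subsystem custom resource. The profile name n3xc4.m16 shown here is an assumption based on common three replica management profiles and is version-dependent; check which profiles are available for your release.

apiVersion: management.apiconnect.ibm.com/v1beta1
kind: ManagementCluster
metadata:
  name: management
spec:
  profile: n3xc4.m16   # three replica profile; must match the profile used in the other data center
  # other required fields omitted from this sketch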