Two data center deployment strategy on Kubernetes and OpenShift

An overview of the two data center disaster recovery deployment strategy in API Connect.

The two data center disaster recovery (2DCDR) deployment configuration provides continuous replication of the management and portal subsystem databases to a warm-standby deployment in another data center.
Important: The 2DCDR deployment adds complexity to the API Connect installation procedures, and to all subsequent maintenance procedures such as backup, restore, and upgrade.

2DCDR is recommended only if you are concerned about a complete site outage and want to be able to switch to standby management and portal subsystems in a different data center as quickly as possible.

The following list covers some scenarios where 2DCDR might appear to be the best strategy, but simpler alternatives are available:

  • If you are concerned about a gateway outage due to data center failure.

    In this case, you can create a gateway service in a different data center and register it with the management subsystem in your main data center. If your main data center fails, your remote gateway continues to process calls to your published APIs.

  • If you are concerned about a management or portal subsystem failure (rather than a complete data center failure).

    In this case, you can use the disaster recovery procedures to restore the management and portal subsystems.

  • If you are concerned about your users not being able to access the management and portal subsystems during a data center outage.

    In this case, you can maintain another data center as a cold-standby, where you have the API Connect CR YAML files, custom secrets, and database backups ready for deployment and restoration.

    The cold-standby data center must have the same network configuration as your main data center so that all the endpoints in your management and portal subsystem backups are valid there.

    Copy your management and portal CRs, custom secrets, and database backups to your cold-standby data center each time you take a backup; one way to keep those backups reachable from both sites is sketched after this list.
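
One way to keep the standby ready is to schedule management and portal backups to S3-compatible object storage that both data centers can reach, so that the standby site can restore without a separate copy step. The following fragment is a minimal sketch of the backup settings in a ManagementCluster CR, assuming the v10 databaseBackup schema; the host, path, and secret names are placeholders:

    # Illustrative only: scheduled management backups to object storage.
    spec:
      databaseBackup:
        protocol: objstore                  # write backups to S3-compatible object storage
        s3provider: ibm                     # object storage provider type
        host: s3.example.cloud/us-standard  # endpoint reachable from both data centers
        path: apic-backups/mgmt             # bucket and folder for the backup archives
        credentials: mgmt-backup-secret     # Kubernetes secret that holds the access keys
        schedule: "0 3 * * *"               # daily backup at 03:00 UTC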

Key points of the two data center disaster recovery (DR) solution

  • Two data center DR is an active/warm-standby deployment for the API Manager and Developer Portal services, and must use manual failover.
  • Two DataPower® Gateway subsystems must be deployed to provide high availability for the gateway service. However, this scenario doesn't provide high availability for the analytics service.
  • If high availability is required for the analytics service, two analytics subsystems must be configured, one per gateway subsystem, but with this configuration Developer Portal analytics isn't possible.
  • Data consistency is prioritized over data availability.
  • Each data center must be set up within its own Kubernetes or OpenShift cluster.
  • Latency between the two data centers must be less than 80 ms.
  • Replication of the API Manager is asynchronous, so it is possible that the most recent updates do not transfer to the warm-standby data center if the active data center fails.
  • Replication of the Developer Portal is synchronous, which is why the network latency between the two data centers must be 80 ms or less.
  • The API Manager and the Developer Portal services in the two data centers must use the same deployment profile.
  • The deployment in each data center is effectively an instance of the same API Connect deployment; therefore, the endpoints, certificates, and Kubernetes secrets must all be the same.
  • It is not possible to use the Automated API behavior testing application (see Installing the Automated API behavior testing application) in a two data center disaster recovery configuration.

Deployment architecture

A two data center deployment model is optimized for data consistency ahead of data availability, and must use manual failover when a fault occurs. For high availability of the management and portal subsystems, ensure that you use a three replica deployment profile. For more information about deployment profiles, see Planning your deployment topology.

For a single Kubernetes cluster to span multiple data centers, the network latency must be no more than a few milliseconds, typically less than 10 ms. This low latency is often unachievable between geographically separated data centers. For this reason, the two data center disaster recovery solution requires that each data center is set up with its own Kubernetes cluster. The management and portal subsystem databases are continually replicated from the active data center to the warm-standby data center, which requires a network latency of less than 80 ms between the two data centers.
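
The replication is configured in the subsystem CRs. The following fragment is a minimal sketch of the 2DCDR settings in a ManagementCluster CR on the active data center, assuming the v10 multiSiteHA schema; all hostnames, issuer names, and secret names are placeholders:

    # Illustrative only: 2DCDR settings on the active data center.
    # Hostnames, issuer, and secret names are placeholders.
    spec:
      multiSiteHA:
        mode: active                                  # the warm-standby CR sets 'passive'
        replicationEndpoint:
          annotations:
            cert-manager.io/issuer: ingress-issuer    # issues the replication endpoint certificate
          hosts:
          - name: mgmt-replication.dc1.example.com    # replication endpoint that this data center exposes
            secretName: dc1-mgmt-replication
        replicationPeerFQDN: mgmt-replication.dc2.example.com   # replication endpoint of the other data center
        tlsClient:
          secretName: dc1-mgmt-replication-client     # client certificate for connecting to the peer

The PortalCluster CR takes an equivalent multiSiteHA section, and the documented failover procedures work by switching the mode values between active and passive.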

To achieve high availability for the DataPower Gateway, you must deploy two gateway subsystems: one in the active data center, and a separate subsystem in the warm-standby data center. Publish all Products and APIs to both gateway subsystems. The gateway subsystems are independent of each other, so an issue in one of them does not affect the other. A global dynamic router can then be used to route traffic to one gateway subsystem or the other. If high availability is also required for the analytics service, two analytics subsystems must be configured, one per gateway subsystem, but with this configuration Developer Portal analytics isn't possible.

There can be multiple Developer Portal services in an API Connect deployment (although still only one Developer Portal site per Catalog).

The deployment in each data center is effectively an instance of the same API Connect deployment: because the database is replicated, all of the configuration is also shared. Therefore, the endpoints, certificates, and Kubernetes secrets all need to be the same.
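
For example, the ManagementCluster CRs in the two data centers declare the same endpoint hosts and reference secrets with the same names. A minimal sketch, assuming the v10 endpoint schema, with placeholder values:

    # Illustrative only: the same endpoint host and secret name appear in the
    # CRs of both data centers, so published URLs remain valid after failover.
    spec:
      apiManagerEndpoint:
        hosts:
        - name: api-manager.example.com      # identical in both data centers
          secretName: api-manager-endpoint   # same certificate secret name in both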

Dynamic routing
Note: A dynamic routing device, such as a load balancer, is required to route traffic to either data center. However, neither this device nor its configuration is part of the API Connect offering. Contact IBM Services if you require assistance with configuring a dynamic router.

The dynamic router configuration must handle the traffic between the subsystems and between the data centers. For example, consumer API calls from the Developer Portal to API Manager must pass through the dynamic router so that the Developer Portal can use a single endpoint, regardless of which data center API Manager is active in. The same applies to calls to the Platform and Admin APIs from the other subsystems, and to incoming UI traffic for the Developer Portal UI, Cloud Manager UI, and API Manager UI.

The dynamic router must support SSL passthrough so that it routes the mutual TLS (mTLS) connections between API Manager and the Developer Portal, and between API Manager and the DataPower Gateway. The router must not terminate TLS; instead, it must perform layer 4 routing by using SNI.
Note: From v10.0.5.3, it is possible to disable mTLS and use JWT instead, which allows the load-balancers to do TLS termination. For more information, see Enable JWT security instead of mTLS.
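
As an illustration of layer 4 SNI routing with TLS passthrough, the following is a minimal sketch of an HAProxy configuration wrapped in a Kubernetes ConfigMap. HAProxy is used here only as an example of a capable router, and all hostnames are placeholders:

    # Illustrative only: SNI-based layer 4 routing with TLS passthrough.
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: apic-router-haproxy
    data:
      haproxy.cfg: |
        frontend apic_tls
            bind *:443
            mode tcp                                    # layer 4 only: no TLS termination
            tcp-request inspect-delay 5s                # wait for the TLS ClientHello
            tcp-request content accept if { req_ssl_hello_type 1 }
            # Send every API Connect hostname to the currently active data center
            use_backend active_dc if { req_ssl_sni -m end .example.com }
        backend active_dc
            mode tcp
            server dc1 ingress.dc1.example.com:443 check   # repoint to dc2 on manual failover
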
Service status

When a failure occurs, it is common practice to display an interstitial system outage web page. The dynamic router can be configured to serve this page while the manual failover to the warm-standby data center takes place.
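
Continuing the illustrative HAProxy sketch above, one way to achieve this is a backup server that receives traffic only while the active backend is down; the outage host is a placeholder:

    backend active_dc
        mode tcp
        server dc1 ingress.dc1.example.com:443 check
        server sorry outage.example.com:443 backup   # serves the outage page until failover completes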

Deployment profiles

Both one replica and three replica deployment profiles can be used with two data center DR, but the profile must be the same in each data center. For more information about deployment profiles, see Planning your deployment topology on Kubernetes, and Requirements for initial deployment on VMware.
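
As a sketch, the profile is declared in each subsystem CR and must match across the two data centers; the value below is only an example of a three replica management profile name, so check the documented names for your version:

    # Illustrative only: the same profile value must be used in both data centers.
    spec:
      profile: n3xc4.m16   # example three replica management profile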