Considerations for high availability

Deploy a highly available OpenShift cluster and deploy instances in a highly available configuration.

High Availability (HA) systems are designed to tolerate service interruption to a portion of the infrastructure while maintaining service availability. Service interruption can come from the failure of infrastructure such as compute, networking, or storage, or from maintenance, such as upgrading or replacing infrastructure.

When performing capacity planning for a high availability deployment, consider the level of service required during service interruptions. For example, if you need to maintain full service capacity during a cloud availability zone outage, you need to ensure that the nodes available within the remaining availability zones can maintain that service.

Note: High Availability (HA) vs. Disaster Recovery (DR): Both HA and DR are aimed at keeping a system in an operational state. The principal difference is that HA is intended to handle problems while a system is running, while DR is intended to handle problems after a system fails. Production systems should consider both.
There are other key differences:
  • Unlike DR, HA can't protect against corrupted data or a corrupted install.
  • HA is usually near-instantaneous, while DR may have a long recovery time, anywhere from hours to months.
  • HA is usually autonomous. DR may require a human decision to be invoked.

To learn more about disaster recovery, see Disaster recovery.

Failure domains

In an on-premises environment, depending on the level of fault tolerance required, failure domains can be:
  • A set of virtual machines on a single host, which can tolerate the failure of a virtual machine.
  • A set of virtual machine hosts, which can tolerate the failure of a virtual machine host.
  • A set of virtual machine hosts on different racks, which can tolerate the failure of a rack.
  • Infrastructure spread across data centers, which can tolerate the failure of a data center.

In each case, the nodes within the OpenShift cluster must be spread across failure domains.

In a cloud environment, cloud regions are typically divided into zones, which allow the nodes within an OpenShift cluster to be spread across failure domains to ensure high availability. Not all cloud regions support multiple zones, which limits the level of high availability you can implement. Different cloud providers also offer additional mechanisms within zones to ensure that virtual compute resources are spread across physical host machines.

Implementing highly available OpenShift clusters

To implement Cloud Pak for Integration in a highly available configuration, you must run Cloud Pak for Integration on an OpenShift cluster that is highly available. The OpenShift control plane uses a quorum-based approach for providing high availability, which requires your cluster to have a minimum of three control plane nodes distributed across failure domains.
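
You can verify how nodes are spread across failure domains from the cluster itself. The following sketch is a minimal illustration, assuming the Python kubernetes client and a kubeconfig with access to the cluster; the role classification based on node-role.kubernetes.io/* labels is an assumption about how your nodes are labeled, not a Cloud Pak for Integration requirement.

```python
from collections import Counter
from kubernetes import client, config

# Assumes a kubeconfig with access to the cluster (for example, ~/.kube/config).
config.load_kube_config()
nodes = client.CoreV1Api().list_node().items

# Tally nodes per failure domain using the well-known zone label.
tally = Counter()
for node in nodes:
    labels = node.metadata.labels or {}
    zone = labels.get("topology.kubernetes.io/zone", "<no zone label>")
    role = "control-plane" if any(
        key.startswith("node-role.kubernetes.io/")
        and key.endswith(("master", "control-plane"))
        for key in labels
    ) else "worker"
    tally[(zone, role)] += 1

for (zone, role), count in sorted(tally.items()):
    print(f"{zone:20} {role:13} {count} node(s)")
```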

Deploying Cloud Pak for Integration instances in a highly available configuration

Note: Cloud Pak foundational services is not highly available by default. For more information, see Keycloak configuration.

Each component of Cloud Pak for Integration uses one of these approaches to high availability:

Active-active replication

Active-active replication involves running multiple instances of a service that all process workloads concurrently, with a load balancer spreading the workload across the instances. See Node configurations on Wikipedia.

Active-active replication requires compute nodes in at least 2 failure domains.
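
As a generic illustration of the pattern (not the manifests that Cloud Pak for Integration generates), the following sketch builds a Kubernetes Deployment with several concurrently active replicas and a Service that load-balances across them; the names, image, and port are hypothetical.

```python
import json

# Hypothetical workload: three concurrently active replicas behind one Service.
deployment = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "example-active-active"},
    "spec": {
        "replicas": 3,
        "selector": {"matchLabels": {"app": "example"}},
        "template": {
            "metadata": {"labels": {"app": "example"}},
            "spec": {"containers": [{"name": "app", "image": "example/app:latest"}]},
        },
    },
}

# The Service spreads incoming traffic across all ready replicas.
service = {
    "apiVersion": "v1",
    "kind": "Service",
    "metadata": {"name": "example-active-active"},
    "spec": {
        "selector": {"app": "example"},
        "ports": [{"port": 8080, "targetPort": 8080}],
    },
}

print(json.dumps([deployment, service], indent=2))
```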

Active-passive replication

Active-passive replication involves running one or more instances of a service that process workloads (active), plus one or more additional instances of the service on standby (passive) that can take over if an active instance fails. See Node configurations on Wikipedia.

Active-passive replication requires compute nodes in at least 2 failure domains.
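
The following sketch is only a generic illustration of the active-passive pattern, not how any Cloud Pak for Integration component implements failover. A standby process probes a hypothetical health endpoint on the active instance and promotes itself after several consecutive failures.

```python
import time
import urllib.error
import urllib.request

ACTIVE_HEALTH_URL = "http://active.example.internal:8080/health"  # hypothetical endpoint

def active_is_healthy() -> bool:
    """Return True if the active instance answers its health probe."""
    try:
        with urllib.request.urlopen(ACTIVE_HEALTH_URL, timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, TimeoutError):
        return False

def standby_loop() -> None:
    failures = 0
    while True:
        failures = 0 if active_is_healthy() else failures + 1
        if failures >= 3:  # require consecutive failures before failing over
            print("Active instance unreachable; standby taking over")
            break
        time.sleep(5)

if __name__ == "__main__":
    standby_loop()
```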

Quorum

Quorum-based HA relies on a majority of nodes being available to vote. It requires a minimum of 3 nodes to be effective. See Quorum (distributed computing) on Wikipedia.

Quorum requires compute nodes in at least 3 failure domains.
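
A quick way to see why at least three members (and three failure domains) are needed: a quorum is the smallest majority of the voting group, and the members beyond that majority are the failures the group can tolerate. The sketch below is plain Python arithmetic, with no cluster access.

```python
def quorum(members: int) -> int:
    """Smallest majority of a voting group."""
    return members // 2 + 1

for members in (1, 2, 3, 5, 7):
    print(f"{members} members: quorum={quorum(members)}, "
          f"tolerated failures={members - quorum(members)}")
# 1 or 2 members tolerate 0 failures; 3 members tolerate 1; 5 tolerate 2; 7 tolerate 3.
```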

OpenShift clusters that are configured to spread nodes across multiple failure domains label their nodes with labels that are reserved ("well-known") by Kubernetes; these labels are described in the Kubernetes documentation. When you deploy Cloud Pak for Integration instances, these labels are used with Kubernetes anti-affinity rules to ensure that, by default, pods are spread across failure domains to provide high availability.
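
For illustration, the following sketch shows the general shape of a pod anti-affinity rule keyed on the well-known zone label; the app label and weighting are hypothetical and are not the exact rules that the Cloud Pak for Integration operators generate.

```python
import json

# Pod template fragment: prefer scheduling replicas of the same app
# into different zones (failure domains).
anti_affinity = {
    "affinity": {
        "podAntiAffinity": {
            "preferredDuringSchedulingIgnoredDuringExecution": [
                {
                    "weight": 100,
                    "podAffinityTerm": {
                        "labelSelector": {"matchLabels": {"app": "example"}},
                        "topologyKey": "topology.kubernetes.io/zone",
                    },
                }
            ]
        }
    }
}

print(json.dumps(anti_affinity, indent=2))
```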

When implementing high availability deployments, ensure the following:
  • All nodes are appropriately labeled to represent the failure domains within the architecture.
  • A sufficient number of pods are scheduled for the Cloud Pak for Integration instance so that it can run in a highly available configuration.

Resource requirements

The following tables show the minimum values required to install Cloud Pak for Integration in a production environment. Table 2 breaks the requirements down to the level of individual instances.

Table 1. Production resources for Cloud Pak for Integration

Usage | Nodes | CPU request (cores) | Memory request (GiB)
Small environment with multiple workers for failover | 3 | 8 | 32
Large environment with multiple instances installed | 3 | 32 | 64

Note: Some deployment environments may have restrictions on resource usage, which could require more worker nodes in order to deploy particular instances. For example, public cloud providers have limits on the number of persistent volumes (see Limits on number of persistent volumes for public cloud providers) that may affect OpenShift cluster sizing requirements.
Table 2. Production resources for instances

Instance type | High availability approach | CPU request (cores) | Shared worker nodes for distributing workload, request
Platform UI | Active/active | 2 | 2
Automation assets | Service availability - Failover | 2.1 | 3
API Connect cluster | Quorum | 42 | 3
Queue manager | Message availability - Quorum (Native HA queue manager) or Active/standby (multi-instance queue manager); Service availability - Active/active (MQ cluster). For additional guidance, see MQ documentation. | 1, 2 or 3 | 3 / 2
Kafka cluster | Quorum. For additional guidance, see Planning for resilience in the Event Streams documentation. | 2.8 | 3
Integration dashboard or Integration design | Stateful - Failover; Stateless - Active/active | 1.1 or 2.1 | 2
Integration runtime | Active/active | 3 | 3
High speed transfer server | Quorum | 12 | 3
Enterprise gateway | Quorum | 12 | 3
IBM Cloud Pak® foundational services | Active/active. For more information, see Configuring the Cloud Pak foundational services. | 3 | 2

Storage

Storage in a Kubernetes cluster can be accessed using multiple Persistent Volume Access modes. Cloud Pak for Integration instances take different approaches to storage use: ReadWriteOnce persistent volumes, ReadWriteMany persistent volumes, and Object Storage, which are described in the following sections. For more information, see Storage considerations.

ReadWriteOnce approach

This approach uses a stateful set, which creates a persistent volume dedicated to each pod in the set using the ReadWriteOnce (RWO) access mode.

Because only one pod ever requires access to any single persistent volume, the persistent volumes do not need to be accessible across failure domains. The system relies on the application layer replicating data across failure domains to provide high availability for the stored data.

The following instances use this approach:

  • API Connect cluster
  • Kafka cluster
  • Automation assets
  • Integration design (for state storage only, using CouchDB)
  • Queue manager
  • Cloud Pak foundational services (using MongoDB)

Most storage classes provide the ReadWriteOnce access mode. For performance reasons, using block storage-based volumes is preferred.
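
As a generic sketch of this pattern (hypothetical names, image, and storage class, not a Cloud Pak for Integration manifest), a StatefulSet with a volumeClaimTemplate gives each replica its own RWO volume, and the application layer is responsible for replicating data between replicas.

```python
import json

# Each replica gets its own ReadWriteOnce volume from the claim template;
# replication between replicas happens at the application layer.
statefulset = {
    "apiVersion": "apps/v1",
    "kind": "StatefulSet",
    "metadata": {"name": "example-store"},
    "spec": {
        "serviceName": "example-store",
        "replicas": 3,
        "selector": {"matchLabels": {"app": "example-store"}},
        "template": {
            "metadata": {"labels": {"app": "example-store"}},
            "spec": {
                "containers": [{
                    "name": "store",
                    "image": "example/store:latest",
                    "volumeMounts": [{"name": "data", "mountPath": "/var/data"}],
                }]
            },
        },
        "volumeClaimTemplates": [{
            "metadata": {"name": "data"},
            "spec": {
                "accessModes": ["ReadWriteOnce"],
                "storageClassName": "example-block-storage",  # hypothetical block storage class
                "resources": {"requests": {"storage": "10Gi"}},
            },
        }],
    },
}

print(json.dumps(statefulset, indent=2))
```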

Automation assets can be deployed in a "fixed single replica" mode that does not allow the number of replicas to be increased later, but that can be provisioned to use RWO storage. Typically, RWO volumes (such as AWS Elastic Block Storage) are not replicated across availability zones (AZs), so Automation assets is pinned to a single AZ and is therefore not available if that AZ goes offline. Some RWO storage types, such as OpenShift Data Foundation Block Storage, do support replication across AZs, so they can be used to improve resilience. When you use RWO storage that is not replicated across AZs with Automation assets, ensure that you have backup and restore configured in case the AZ goes offline. For more information on backup and restore for Cloud Pak for Integration, see Backing up and restoring IBM Cloud Pak for Integration.

ReadWriteMany approach

This approach uses a set of pods, typically managed by a replica set, that share a persistent volume using the ReadWriteMany (RWX) access mode.

Because multiple pods require access to the same persistent volume, the persistent volume must be accessible across failure domains to provide high availability, and the system relies on the storage layer to provide high availability for the stored data.

The following instances use this approach:

  • Automation assets
  • Integration dashboard (for storage of uploaded BAR files)

In Automation assets, the best availability is typically provided by deploying with multiple replicas that are backed by ReadWriteMany (RWX) storage. This configuration ensures that the service remains available if one of the replicas fails within an availability zone (AZ). Further resilience can be achieved by running the service across multiple AZs with replicated RWX storage, which ensures that the service continues if an AZ goes down. For more information on storage options and RWO support, see Storage considerations.
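
As a generic sketch of this pattern (hypothetical names and storage class), a single RWX claim is mounted by every replica of a Deployment, so the storage layer, rather than the application, is responsible for keeping the shared data available across failure domains.

```python
import json

# One shared ReadWriteMany claim mounted by all replicas.
pvc = {
    "apiVersion": "v1",
    "kind": "PersistentVolumeClaim",
    "metadata": {"name": "shared-data"},
    "spec": {
        "accessModes": ["ReadWriteMany"],
        "storageClassName": "example-file-storage",  # hypothetical RWX-capable class
        "resources": {"requests": {"storage": "10Gi"}},
    },
}

deployment = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "example-shared"},
    "spec": {
        "replicas": 2,
        "selector": {"matchLabels": {"app": "example-shared"}},
        "template": {
            "metadata": {"labels": {"app": "example-shared"}},
            "spec": {
                "containers": [{
                    "name": "app",
                    "image": "example/app:latest",
                    "volumeMounts": [{"name": "shared", "mountPath": "/var/shared"}],
                }],
                "volumes": [{
                    "name": "shared",
                    "persistentVolumeClaim": {"claimName": "shared-data"},
                }],
            },
        },
    },
}

print(json.dumps([pvc, deployment], indent=2))
```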

Note: Not all storage providers support using RWX volumes across failure domains:
  • Portworx and OpenShift Data Foundation (formerly OpenShift Container Storage) ensure that storage volumes are accessible across failure domains by performing replication.
  • The IBM Cloud File Service does not provide RWX volumes across multiple zones, so it is not suitable for use with HA clusters spread across multiple zones.

Object Storage approach

This approach uses an object storage endpoint that is managed separately from the lifecycle of the integration deployment. This can simplify storage management, especially when the alternative would require an RWX storage approach.
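
For illustration, an S3-compatible object storage endpoint can be reached from any S3 client. The following sketch assumes the boto3 library and uses hypothetical endpoint, credential, bucket, and file names.

```python
import boto3

# Hypothetical S3-compatible endpoint, credentials, and bucket.
s3 = boto3.client(
    "s3",
    endpoint_url="https://s3.example.com",
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

# Upload a BAR file to the bucket used by the integration deployment.
s3.upload_file("myflow.bar", "integration-bars", "myflow.bar")
print(s3.list_objects_v2(Bucket="integration-bars").get("KeyCount", 0), "object(s) in bucket")
```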

The following instances use this approach:

  • Integration dashboard (for storage of uploaded BAR files)