Resource requirements for production environments (high availability)

To deploy a production system, you must have a highly available OpenShift cluster and deploy Cloud Pak for Integration capabilities in a highly available configuration.

High Availability (HA) systems are designed to tolerate service interruption to a portion of the infrastructure while maintaining service availability. Service interruption can come from the failure of infrastructure such as compute, networking, or storage, or from maintenance, such as upgrading or replacing infrastructure.

When performing capacity planning for a high availability deployment, consider the level of service required during service interruptions. For example, if you need to maintain full service capacity during a cloud availability zone outage, you need to ensure that the nodes available within the remaining availability zones can maintain that service.
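A minimal sketch of this sizing arithmetic is shown below; the workload size and zone count are hypothetical, not values from this documentation.

```python
# Sketch: estimate the capacity each zone must be able to supply on its own
# so that full service capacity survives the loss of one availability zone.
# All numbers are hypothetical.

def capacity_per_zone(required_cpu_cores: float, zones: int) -> float:
    """CPU cores each zone must be able to provide if one zone is lost."""
    if zones < 2:
        raise ValueError("At least 2 zones are needed to tolerate a zone outage")
    return required_cpu_cores / (zones - 1)

# A workload that needs 32 cores in total, spread across 3 zones, needs each
# zone to be able to carry 16 cores on its remaining nodes.
print(capacity_per_zone(32, 3))  # 16.0
```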

Note: High Availability (HA) vs. Disaster Recovery (DR): Both HA and DR are aimed at keeping a system in an operational state. The principal difference is that HA is intended to handle problems while a system is running, while DR is intended to handle problems after a system fails. Production systems should consider both.
There are other key differences:
  • Unlike DR, HA can't protect against corrupted data or a corrupted install.
  • HA is usually near-instantaneous, while DR may have a long recovery time, anywhere from hours to months.
  • HA is usually autonomous. DR may require a human decision to be invoked.

To learn more about disaster recovery, see Disaster recovery.

Failure domains

In an on-premises environment, depending on the level of fault tolerance required, failure domains can be:
  • A set of virtual machines on a single host, which can tolerate the failure of a virtual machine.
  • A set of virtual machine hosts, which can tolerate the failure of a virtual machine host.
  • A set of virtual machine hosts on different racks, which can tolerate the failure of a rack.
  • Infrastructure spread across data centers, which can tolerate the failure of a data center.

In each case, the nodes within the cluster must be spread across failure domains.

In a cloud environment, cloud regions are typically divided into zones, which allow the nodes in a cluster to be spread across failure domains to ensure high availability. Not all cloud regions support multiple zones, which limits the level of high availability you can implement. Different cloud providers also provide additional mechanisms within zones to ensure that virtual compute resources are spread across physical host machines.
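As an illustration, the sketch below groups the nodes of a cluster by the well-known Kubernetes zone label to show how failure domains surface on a cluster. It assumes the kubernetes Python client is installed and that a local kubeconfig grants access to the cluster.

```python
# Sketch: group nodes by the well-known Kubernetes zone label to see how they
# are spread across failure domains. Assumes a kubeconfig is available.
from collections import defaultdict
from kubernetes import client, config

config.load_kube_config()
nodes_by_zone = defaultdict(list)

for node in client.CoreV1Api().list_node().items:
    labels = node.metadata.labels or {}
    zone = labels.get("topology.kubernetes.io/zone", "<unlabeled>")
    nodes_by_zone[zone].append(node.metadata.name)

for zone, names in sorted(nodes_by_zone.items()):
    print(f"{zone}: {len(names)} node(s) -> {', '.join(names)}")
```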

Implementing highly available OpenShift clusters

To implement Cloud Pak for Integration in a highly available configuration, you must run Cloud Pak for Integration on an OpenShift cluster that is highly available. The OpenShift control plane uses a quorum-based approach for providing high availability, which requires your cluster to have a minimum of three control plane nodes distributed across failure domains.

Deploying Cloud Pak for Integration instances in a highly available configuration

Note: Cloud Pak foundational services is not highly available by default. See Hardware requirements and recommendations for foundational services for more information.

Each component of Cloud Pak for Integration uses one of these approaches to high availability:

Active-active replication

Active-active replication involves running multiple instances of a service that can all process workloads concurrently, with a load balancer that spreads the workload across the instances. See Node configurations on Wikipedia.

Active-active replication requires compute nodes in at least 2 failure domains.
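As a minimal sketch (not a Cloud Pak for Integration manifest), the Python snippet below renders a hypothetical Deployment and Service in which three identical replicas sit behind a single load-balanced endpoint; the names and image are placeholders.

```python
# Sketch of an active-active topology: several identical replicas behind one
# Service that load-balances across them. Names and image are hypothetical.
import yaml

deployment = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "example-active-active"},
    "spec": {
        "replicas": 3,  # every replica processes live traffic
        "selector": {"matchLabels": {"app": "example"}},
        "template": {
            "metadata": {"labels": {"app": "example"}},
            "spec": {"containers": [{"name": "app", "image": "example/app:1.0"}]},
        },
    },
}

service = {
    "apiVersion": "v1",
    "kind": "Service",
    "metadata": {"name": "example-active-active"},
    "spec": {"selector": {"app": "example"}, "ports": [{"port": 8080}]},
}

print(yaml.dump_all([deployment, service], sort_keys=False))
```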

Active-passive replication

Active-passive replication involves running one or more instances of a service that process workloads (active), plus one or more additional instances of the service on standby (passive) that can take over if an active instance fails. See Node configurations on Wikipedia.

Active-passive replication requires compute nodes in at least 2 failure domains.

Quorum

Quorum-based HA relies on a majority of nodes being available to vote. It requires a minimum of 3 nodes to be effective. See Quorum (distributed computing) on Wikipedia.

Quorum requires compute nodes in at least 3 failure domains.
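The arithmetic behind this is simple: a quorum is a strict majority of voting members, so a cluster of n members tolerates the loss of floor((n-1)/2) of them, which is why at least 3 members in 3 failure domains are needed to survive a single failure. A small illustration:

```python
# Quorum arithmetic: a strict majority of voting members must stay reachable.

def quorum_size(members: int) -> int:
    return members // 2 + 1

def tolerated_failures(members: int) -> int:
    return members - quorum_size(members)

for members in (1, 2, 3, 4, 5):
    print(f"{members} members: quorum={quorum_size(members)}, "
          f"tolerates {tolerated_failures(members)} failure(s)")
# 3 members are the minimum that can tolerate the loss of 1 member;
# 5 members are needed to tolerate the loss of 2.
```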

OpenShift clusters that are configured to spread nodes across multiple failure domains apply labels to the nodes that are reserved ("well-known") by Kubernetes; these labels are listed in the Kubernetes documentation. When Cloud Pak for Integration instances are deployed, these labels are used with Kubernetes anti-affinity rules to ensure that, by default, pods are spread across failure domains to provide high availability.
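As an illustration (not the exact rules that the Cloud Pak operators generate), an anti-affinity stanza of the following shape spreads pods across the zones identified by the well-known topology.kubernetes.io/zone label; the app label is a placeholder.

```python
# Sketch of a pod anti-affinity rule keyed on the well-known zone label, so
# the scheduler prefers to place replicas in different failure domains.
import yaml

anti_affinity = {
    "podAntiAffinity": {
        "preferredDuringSchedulingIgnoredDuringExecution": [
            {
                "weight": 100,
                "podAffinityTerm": {
                    "labelSelector": {"matchLabels": {"app": "example"}},
                    "topologyKey": "topology.kubernetes.io/zone",
                },
            }
        ]
    }
}

# This dictionary would sit under spec.template.spec.affinity in a
# Deployment or StatefulSet manifest.
print(yaml.dump(anti_affinity, sort_keys=False))
```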

Administrators implementing high availability deployments must ensure their nodes are appropriately labeled to represent the failure domains within the architecture, and that enough pods are scheduled for the Cloud Pak for Integration instance to allow it to run in a highly available configuration.

Resource requirements

The following tables indicate the minimum values that are required to install Cloud Pak for Integration in a production environment. Table 2 breaks these requirements down by individual capability.

Table 1. Production resources for Cloud Pak for Integration
Usage | Nodes | CPU request (cores) | Memory request (GiB)
Small environment with multiple workers for failover | 3 | 8 | 32
Large environment with multiple integration capabilities installed | 3 | 32 | 64
Note: Some deployment environments may have restrictions on resource usage, which could require you to have more worker nodes in order to deploy particular Cloud Pak for Integration capabilities. For example, public cloud providers have limits on the number of persistent volumes that may affect cluster sizing requirements.
Table 2. Production resources for integration capabilities
Capability (operator name) | High availability approach | CPU request (cores) | Shared worker nodes request
Platform UI (IBM Cloud Pak for Integration) | Active/active | 2.7 | 3
Automation assets (IBM Automation Foundation assets) | Service availability - Failover | 2.1 | 3
Integration tracing (Operations Dashboard) | Store - Quorum; Scheduling and configuration database - Failover; Front end and task processing - Active/active | 21 | 3
API management (IBM API Connect) | Quorum | 42 | 3
Messaging (IBM MQ) | Message availability - Quorum (Native HA queue manager) or Active/standby (multi-instance queue manager); Service availability - Active/active (MQ cluster). For additional guidance, see the MQ documentation. | 1, 2, or 3 | 3 / 2
Event Streams (IBM Event Streams) | Quorum. For additional guidance, see the Event Streams documentation. | 20.8 | 3
Application integration (IBM App Connect) | Stateful - Failover; Stateless - Active/active | 1.1 or 2.1 | 2
High speed transfer server (IBM Aspera HSTS) | Quorum | 12 | 3
Gateway (IBM DataPower Gateway) | Quorum | 12 | 3
Monitoring, licensing, and related services (IBM Cloud Pak® foundational services) | Varies by service | 6 | 3
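To turn these figures into a node count, it can help to total the CPU requests of the capabilities you plan to deploy and compare the result with the capacity of your worker nodes. A hedged sketch follows; the requests are taken from Table 2, but the selection of capabilities and the worker node size are hypothetical.

```python
# Rough sizing check: sum the CPU requests of the planned capabilities and
# estimate how many worker nodes of a given size they need.
import math

cpu_requests = {            # values from Table 2; selection is hypothetical
    "Platform UI": 2.7,
    "API management": 42,
    "Messaging": 3,
    "Foundational services": 6,
}

cores_per_worker = 16       # hypothetical worker node size
total = sum(cpu_requests.values())
workers = max(3, math.ceil(total / cores_per_worker))  # never fewer than 3 for HA

print(f"Total CPU request: {total} cores -> at least {workers} workers "
      f"of {cores_per_worker} cores each")
```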

Storage

Storage in a Kubernetes cluster can be accessed by using multiple persistent volume access modes. Cloud Pak for Integration instances take different approaches to storage use: ReadWriteOnce persistent volumes, ReadWriteMany persistent volumes, and object storage, which are described in the following sections. For more information, see Storage considerations.

ReadWriteOnce approach

This approach uses a stateful set, which creates a persistent volume dedicated to each pod in the set using the ReadWriteOnce (RWO) access mode.

Because only one pod ever requires access to any single persistent volume, the persistent volumes do not need to be accessible across failure domains. The system relies on the application layer replicating data across failure domains to provide high availability for the stored data.
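A hedged sketch of this pattern is shown below: a StatefulSet whose volumeClaimTemplates give each pod its own ReadWriteOnce volume. The names, image, and storage class are hypothetical and are not taken from any Cloud Pak for Integration operator.

```python
# Sketch of the ReadWriteOnce pattern: a StatefulSet whose volumeClaimTemplates
# create a dedicated RWO persistent volume for every pod in the set.
import yaml

statefulset = {
    "apiVersion": "apps/v1",
    "kind": "StatefulSet",
    "metadata": {"name": "example-store"},
    "spec": {
        "serviceName": "example-store",
        "replicas": 3,  # data is replicated at the application layer
        "selector": {"matchLabels": {"app": "example-store"}},
        "template": {
            "metadata": {"labels": {"app": "example-store"}},
            "spec": {"containers": [{"name": "store", "image": "example/store:1.0"}]},
        },
        "volumeClaimTemplates": [
            {
                "metadata": {"name": "data"},
                "spec": {
                    "accessModes": ["ReadWriteOnce"],
                    "storageClassName": "block-storage",  # hypothetical class
                    "resources": {"requests": {"storage": "10Gi"}},
                },
            }
        ],
    },
}

print(yaml.dump(statefulset, sort_keys=False))
```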

Cloud Pak for Integration components that use this approach are:

  • API management
  • Event Streams
  • Automation assets (for metadata storage only, using CouchDB)
  • Integration tracing (partial usage)
  • App Connect Designer (for state storage only, using CouchDB)
  • Messaging
  • Cloud Pak foundational services (using MongoDB)

Most storage classes provide the ReadWriteOnce access mode. For performance reasons, using block storage-based volumes is preferred.

Considerations for high availability with the Cloud Pak foundational services small profile

With the small profile, Cloud Pak foundational services runs multiple replicas of stateful pods, so no data is lost on failure. The small profile runs only a single replica of Cloud Pak foundational services workload pods. Therefore, when a workload pod is interrupted, the resulting loss of service could impact authentication while the pod is rescheduled:

  • You might not be able to establish new UI connections until the IAM services are rescheduled.
  • You might not be able to run command-line tools when platform API services are rescheduled.
  • You might not be able to access the Admin Hub or establish a new login session while the common UI services are rescheduled.
  • You might see UI interruption when a new capability is added to the solutions (for example, installing a new capability in a new namespace).

Although UI, CLI, and API authentication could be affected, these authentication flows are not used to process production workloads through Cloud Pak for Integration instances.

If this is not suitable for your deployment, run Cloud Pak foundational services in a medium or large profile.

ReadWriteMany approach

This approach uses a set of pods, typically managed by a replica set, that share a persistent volume with the ReadWriteMany (RWX) access mode.

Because multiple pods require access to the same persistent volume, the persistent volume must be accessible across failure domains to provide high availability, and the system relies on the storage layer to provide high availability for the stored data.
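A hedged sketch of this pattern follows: a single PersistentVolumeClaim that several pods mount at the same time. The claim name and storage class are hypothetical placeholders.

```python
# Sketch of the ReadWriteMany pattern: one PersistentVolumeClaim shared by
# several pods. The storage class must come from a provider that supports
# RWX volumes across failure domains.
import yaml

shared_claim = {
    "apiVersion": "v1",
    "kind": "PersistentVolumeClaim",
    "metadata": {"name": "example-shared-data"},
    "spec": {
        "accessModes": ["ReadWriteMany"],
        "storageClassName": "file-storage",  # hypothetical RWX-capable class
        "resources": {"requests": {"storage": "20Gi"}},
    },
}

print(yaml.dump(shared_claim, sort_keys=False))
```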

Cloud Pak for Integration capabilities that use this approach are:

  • IBM Cloud Pak for Integration (for Platform UI)
  • Automation assets, formerly Asset Repository (for file storage only)
  • Integration tracing, formerly Operations Dashboard (partial usage)
  • App Connect Dashboard (for storage of uploaded BAR files)
Note: Not all storage providers support using RWX volumes across failure domains:
  • Portworx and OpenShift Data Foundation (formerly OpenShift Container Storage) ensure that storage volumes are accessible across failure domains by replicating the data.
  • The IBM Cloud File Service does not provide RWX volumes across multiple zones, so it is not suitable for use with HA clusters that are spread across multiple zones.

Object Storage approach

This approach uses an object storage endpoint that is managed separately from the lifecycle of the integration deployment. This can simplify storage management, especially when the alternative would require RWX storage.
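A hedged sketch of this pattern is shown below, using the boto3 client against an S3-compatible object store; the endpoint, bucket, credentials, and file name are hypothetical placeholders.

```python
# Sketch: store an artifact in an S3-compatible object store whose lifecycle
# is independent of the integration deployment. All identifiers are
# hypothetical placeholders.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://objectstore.example.com",  # hypothetical endpoint
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

with open("integration.bar", "rb") as artifact:
    s3.put_object(Bucket="integration-artifacts", Key="integration.bar", Body=artifact)
```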

Cloud Pak for Integration capabilities that use this approach are:

  • App Connect Dashboard (for storage of uploaded BAR files)