Disaster recovery

Disaster Recovery (DR) is the process of restoring service following an unrecoverable failure that impacts your main environment. DR planning typically involves creating a copy of your IBM Cloud Pak® for Integration environment in a different cluster (and possibly in a different physical location) to facilitate restoring service.

This process is one of a number of techniques, including high availability, that are used to provide resilience and service availability or recovery in the event of failure. Failures can range from physical failures at the node, data center, availability zone (AZ), or region scope, to accidental administrator actions or misconfigurations that cause a loss of service.

The DR approaches described here should be used together with high availability (HA) to provide the level of resilience that the system requires. You should consider:

  • Relative costs, overhead, and benefits for each option.

  • Business resilience requirements of the solution.

To compare DR with HA strategies, see Resource requirements for high availability.

Considerations

Recovery options

With each approach to DR, you should consider the recovery characteristics of the solution that you create, such as:

  • Failure scenarios: For each type of failure scenario that could occur, ranging from small-scale to very large-scale issues, how do you intend your system to respond?

  • Recovery time objective (RTO): In the event of a failure, how quickly do you need to be able to recover to a functioning service? This influences whether you take a cold (infrequent backups), warm (periodic backups), or hot (fully redundant) standby approach to your deployment.

  • Recovery point objective (RPO): What is the window of acceptable data loss in the event of a disaster? This calculation helps determine whether you need to implement live replication of data to a DR site (RPO=seconds/minutes), or if a backup approach is suitable (typically RPO=hours/day).
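
The RTO and RPO questions above can be turned into a simple decision aid. The following Python sketch is illustrative only: the threshold values and the function names are assumptions made for this example, not values defined by Cloud Pak for Integration, and the real cut-offs depend on your business requirements.

```python
from datetime import timedelta

# Hypothetical thresholds chosen only for illustration; real cut-offs depend
# on your business requirements and are not defined by the product.
REPLICATION_RPO_LIMIT = timedelta(hours=1)
HOT_RTO_LIMIT = timedelta(minutes=15)
WARM_RTO_LIMIT = timedelta(hours=4)


def data_protection_for(rpo: timedelta) -> str:
    """Suggest live replication for tight RPOs, and backups otherwise."""
    return "live replication" if rpo < REPLICATION_RPO_LIMIT else "backup"


def standby_for(rto: timedelta) -> str:
    """Map an RTO to a hot, warm, or cold standby approach."""
    if rto <= HOT_RTO_LIMIT:
        return "hot"
    if rto <= WARM_RTO_LIMIT:
        return "warm"
    return "cold"


def backup_interval_meets_rpo(interval: timedelta, rpo: timedelta) -> bool:
    """A backup taken every `interval` can lose up to `interval` of data."""
    return interval <= rpo
```

For example, under these assumed thresholds an RPO of five minutes points to live replication, while a six-hour backup interval satisfies an RPO of one day.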

Systems with multiple capabilities

In systems that involve multiple capabilities, the best recovery option can differ for each capability, depending on how you choose to balance cost and effort against the specific business needs of that capability.

Cloud Pak for Integration capabilities are typically middleware that provides integration between different systems of record, each of which has its own separate approach to disaster recovery. You must consider how your entire business environment is recovered in the event of a failure, and how disparate capabilities react when snapshots of state taken at different times are restored.

Types of data

There are two key types of persistent data to consider when implementing your DR process for Cloud Pak for Integration:

Configuration data

Configuration data tends to come in two main types:

  • Static or infrequently updated configurations that are owned or controlled by the administrator. These are stored in OpenShift objects such as operator subscriptions, custom resources (CRs), ConfigMaps, and secrets.

  • Other infrequently changing persistent configuration state applied to a capability after its initial deployment, such as deployed API definitions and application subscriptions in IBM API Connect.

Dynamic runtime data

This data is state that changes very frequently (such as every second or every minute) and is outside the control of the administrator. This type of data results from the use of the system by an application or external entity. Examples include messages being sent to and received from an IBM MQ queue, events sent to a topic in IBM Event Streams, and consumer applications subscribing to invoke APIs in API Connect.

You may also wish to consider the role and impact of transient or ephemeral data—which is not recovered in the event of a failure—in your solution. For example, in-flight data processing state during invocations of App Connect flows, and API Connect API invocations, will be lost in the event of a DR scenario. In that case, the calling client has to be ready to retry the failed operation as part of the overall recovery process.
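
A minimal sketch of such a client-side retry, in Python, is shown here. The function name `call_with_retry`, its parameters, and the choice of `ConnectionError` as the retryable failure are assumptions made for illustration; they are not part of any Cloud Pak for Integration API. The sketch uses exponential backoff with jitter so that many clients do not all retry at the same moment while the service is being recovered.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation`, retrying with exponential backoff and jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    attempt = 1
    while True:
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
            # Double the delay ceiling on each attempt, capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Pick a random delay up to the ceiling to spread out retries.
            sleep(random.uniform(0, ceiling))
            attempt += 1
```

Retrying in this way is only safe when the operation is idempotent, or when the target system can detect and discard duplicates.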

Automation-based approach to recovering data

Using automation techniques in your DR strategy ensures that, at any point, you can rerun the deployment pipeline against a new cluster to create an instance with the same configuration state as the original. These techniques include infrastructure as code, CI/CD, and GitOps, which keep the configuration versioned and automate the deployment of capability instances from version control. Cloud Pak for Integration can be configured in this way by using your preferred tool, such as OpenShift Pipelines, Argo CD, Tekton, or an equivalent.
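
As a rough illustration of rerunning a deployment against a new cluster, the following Python sketch builds one `oc apply` command per directory of manifests held in version control. The repository layout (the `subscriptions`, `config`, and `instances` directories), the function names, and the context name used in the usage note are hypothetical. A real pipeline would normally be built with one of the tools named above, and would also wait for each operator to become ready before applying the custom resources that depend on it.

```python
import subprocess
from pathlib import Path

# Hypothetical repository layout: one directory per deployment stage, applied
# in this order so that operators are installed before the CRs that need them.
STAGES = ["subscriptions", "config", "instances"]


def build_apply_commands(repo_root: Path, context: str) -> list[list[str]]:
    """Return one `oc apply` command per stage directory that exists."""
    commands = []
    for stage in STAGES:
        stage_dir = repo_root / stage
        if stage_dir.is_dir():
            # -R applies manifests in subdirectories of the stage as well.
            commands.append(
                ["oc", "--context", context, "apply", "-R", "-f", str(stage_dir)]
            )
    return commands


def redeploy(repo_root: Path, context: str) -> None:
    """Apply every stage to the target cluster, stopping on the first failure."""
    for command in build_apply_commands(repo_root, context):
        subprocess.run(command, check=True)
```

Calling `redeploy(Path("cp4i-config"), "dr-cluster")` would then apply the same versioned configuration to the cluster that the assumed `dr-cluster` context points to.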

The method you use for restoring dynamic runtime data will vary for each capability to reflect:

  • The functional and non-functional characteristics of the runtime environment.

  • The business domain in which those capabilities are used.

The following table provides guidance for each capability in Cloud Pak for Integration.

Capability | Configuration data approach | Runtime data approach
MQ | All configuration data, such as the definitions of queues and channels, is stored in the custom resource (CR) and related ConfigMap or secrets. This data should be recreated by using CI/CD. Administrators should avoid applying interactive changes, which can cause configuration drift from the source of truth stored in source control. | Because MQ message data is highly transient and consumed within seconds of being sent, it is not useful to take backups of it. Furthermore, restoring a backup of old message data can cause significant problems for business applications that are forced to reprocess old, duplicate data. Instead, customers typically choose to recreate new empty instances from configuration data only. Important: Following the recreation of a queue manager, it is necessary to reset the channel sequence numbers on any other queue managers to which it was connected, in order to re-establish communication flow.
Event Streams | Configuration data, such as Event Streams CRs and topics, and Kafka user objects and related secrets, should be recreated from source control by using CI/CD. | To provide a standalone copy of the event history in the event of a failure, you can asynchronously replicate event data to a second active cluster by using the geo-replication feature. See About geo-replication for more information.
API Connect | Configuration data stored in CRs, secrets, and similar elements should be recreated from source control by using CI/CD. Configuration of provider orgs, catalogs, spaces, and deployed APIs or Products can be automated by using CI/CD, or alternatively by using the built-in Management subsystem backup capability described in API Connect Backups on OpenShift. Developer Portal configuration and customization is typically saved by using the built-in Portal subsystem backup capability described in the same topic. A related capability is the API Connect two data center deployment strategy, which provides an active/warm-standby deployment of Management and Portal services across two clusters in different locations, and offers real-time replication of the Management and Portal subsystem data. | Analytics data generated as the result of runtime API calls through the Gateway can be stored by using the built-in Analytics subsystem backup capability described in API Connect Backups on OpenShift. Depending on the business context and data usage, some customers might choose not to back up Analytics data and instead start from a clean analytics state in the event of a disaster.
Cloud Pak foundational services | See IBM Cloud Pak foundational services backup and restore for details on how to handle both OpenShift objects and other persisted configuration data. | Not applicable (Cloud Pak foundational services data is all configuration rather than dynamic runtime data).
App Connect Designer | Configuration data stored in CRs and similar elements should be recreated from source control by using CI/CD. Flow definitions should be saved as described in Exporting and importing flows. Account definitions should be recreated from the source information that was originally used to populate them, taking account of any subsequent rotation of passwords or credentials. | Not applicable (no dynamic runtime data in this case).
App Connect Dashboard | Configuration data stored in CRs should be recreated from source control by using CI/CD. Other persisted configuration data should be handled as described in Backing up and restoring configuration objects in the App Connect Dashboard and Managing BAR files. | Not applicable (no dynamic runtime data in this case).
Automation assets | Configuration data stored in CRs and other files should be recreated from source control by using CI/CD. Asset data is handled as described in Backing up and restoring your data. | Not applicable (no dynamic runtime data in this case).
Aspera | Configuration data stored in CRs and other files should be recreated from source control by using CI/CD. | The Redis database can be backed up by using OADP as documented in the Redis operator documentation. Users should separately implement an appropriate backup mechanism for their chosen external file store, such as NFS or cloud storage.
DataPower | Configuration data stored in CRs, ConfigMaps, secrets, and other resources should be recreated from source control by using CI/CD. | Not applicable (no dynamic runtime data in this case).
App Connect Integration Server and Switch server | Configuration data stored in CRs, ConfigMaps, secrets, and other resources should be recreated from source control by using CI/CD. | Not applicable (no dynamic runtime data in this case).
Integration assembly | Assembly CRs should be recreated from source control by using CI/CD. Individual deployed instances within an Assembly should also be backed up as described in the previous rows for those capabilities. | Not applicable.
Platform UI | The PlatformNavigator custom resource should be recreated from source control by using CI/CD. It is also important to have an appropriate backup of Cloud Pak foundational services, which is where the user authentication state is stored. Refer to the previous row describing Cloud Pak foundational services. | Not applicable (no dynamic runtime data in this case).
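
The MQ guidance in the table warns that interactive changes can cause configuration drift from the source of truth in source control. One way to spot such drift is to compare the definition held in source control with the object read back from the cluster. The following Python sketch shows the comparison step only, on objects that have already been loaded as dictionaries. The function name `find_drift` and the field names in the usage note are hypothetical, and GitOps tools typically provide their own drift detection.

```python
from typing import Any


def find_drift(desired: Any, live: Any, path: str = "") -> list[str]:
    """Return the paths where `live` differs from `desired`.

    Only fields present in `desired` are checked, so fields that the
    cluster adds on its own (such as status or defaults) are not reported.
    """
    if isinstance(desired, dict) and isinstance(live, dict):
        drift = []
        for key, value in desired.items():
            child = f"{path}.{key}" if path else key
            if key not in live:
                drift.append(child)  # field was removed from the live object
            else:
                drift.extend(find_drift(value, live[key], child))
        return drift
    # Leaf values (or mismatched types) are compared directly.
    return [] if desired == live else [path]
```

For example, comparing a desired object `{"spec": {"replicas": 3}}` with a live object `{"spec": {"replicas": 1}, "status": {"phase": "Ready"}}` reports `spec.replicas` as drifted and ignores the `status` field.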