Disaster recovery

Disaster Recovery (DR) is the process of restoring service following an unrecoverable failure that impacts your main environment. DR planning typically involves creating a copy of your IBM Cloud Pak® for Integration environment on a different OpenShift cluster (and possibly in a different physical location) to facilitate restoring service.

This process is one of a number of techniques, including high availability, that are used to provide resilience and service availability or recovery in the event of failure. Failures can range from physical failures at the node, data center, availability zone (AZ), or region scope, to accidental administrator actions or misconfigurations that cause a loss of service.

The DR approaches described here should be used in conjunction with high availability (HA) in order to provide the appropriate level of resilience required for the system. You should consider:

  • Relative costs, overhead, and benefits for each option.

  • Business resilience requirements of the solution.

To compare DR with HA strategies, see Considerations for high availability.

Considerations

Recovery options

With each approach to DR, you should consider the recovery characteristics of the solution that you create, such as:

  • Failure scenarios: For each type of failure scenario that could occur, ranging from small-scale to very large-scale issues, how do you intend your system to respond?

  • Recovery time objective (RTO): In the event of a failure, how quickly do you need to be able to recover to a functioning service? This influences whether you take a cold (infrequent backups), warm (periodic backups), or hot (fully redundant) standby approach to your deployment.

  • Recovery point objective (RPO): What is the window of acceptable data loss in the event of a disaster? This calculation helps determine whether you need to implement live replication of data to a DR site (RPO=seconds/minutes), or if a backup approach is suitable (typically RPO=hours/day).

Systems with multiple instances

In systems with multiple Cloud Pak for Integration operators or instance types, the best recovery option can differ for each operator or instance type, depending on how you choose to balance cost and effort against the specific business needs of each.

Cloud Pak for Integration instance types are typically middleware that provide integration between different systems of record, and each has its own separate approach to disaster recovery. You must consider how your entire business environment is recovered in the event of a failure, and how disparate instances react when differently-timed snapshots of state are restored.

Types of data

There are two key types of persistent data to consider when implementing your DR process for Cloud Pak for Integration:

  1. Configuration data, which tends to come in two main types:

  • Static or infrequently updated configurations that are owned or controlled by the administrator. These are stored in OpenShift objects such as operator subscriptions, custom resources (CRs), ConfigMaps, and secrets.

  • Other infrequently changing persistent configuration state that is applied to an instance after its initial deployment, such as deployed API definitions and application subscriptions in IBM API Connect.

  2. Dynamic runtime data. This is state that changes very frequently (such as every second or minute) and is outside the control of the administrator. This type of data results from an application or external entity's use of the system. Examples include messages being sent to and received from an IBM MQ queue, events sent to a topic in IBM Event Streams, and consumer applications subscribing to invoke APIs in API Connect.
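To illustrate the first category, static configuration data is typically a custom resource kept in source control and applied by a pipeline. The following minimal IBM MQ QueueManager CR is only a sketch; all names, the license identifier, and the version shown are placeholders, not values from this document:

```yaml
# Illustrative sketch of "static" configuration data: a minimal
# QueueManager custom resource held in source control and applied
# by a CI/CD pipeline. All names and values here are examples.
apiVersion: mq.ibm.com/v1beta1
kind: QueueManager
metadata:
  name: example-qm
  namespace: cp4i          # example namespace
spec:
  license:
    accept: true
    license: L-EXAMPLE     # placeholder license ID
    use: NonProduction
  queueManager:
    name: EXAMPLEQM
  version: 9.3.0.0         # example version
```

Because resources like this are declarative, recreating the instance on a new cluster is a matter of reapplying the same file, which is what makes the CI/CD recovery approaches described later in this topic possible.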

You may also want to consider the role and impact of transient or ephemeral data, which is not recovered in the event of a failure. For example, in-flight processing state during invocations of App Connect flows and API Connect API calls is lost in a DR scenario. In that case, the calling client must be ready to retry the failed operation as part of the overall recovery process.

OADP and automation

You can use Red Hat OpenShift API for Data Protection (OADP) to back up and restore some of the Cloud Pak for Integration operators and instances. To use OADP, you label the instances to back up or restore, then create OADP custom resources that define what to back up or restore and that run the backup and restore processes whenever needed. For more information about how to take and restore backups by using OADP, see Backing up and restoring IBM Cloud Pak for Integration.
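As an illustrative sketch of this flow (the namespaces, instance name, and label key are assumptions, not values from this document), you might label an instance with a command such as `oc label eventstreams my-es backup.example.com/include=true`, then create an OADP (Velero) Backup custom resource that selects that label:

```yaml
# Illustrative OADP (Velero) Backup custom resource. It backs up
# labeled resources in an example namespace; the label key, names,
# and namespaces are all placeholders.
apiVersion: velero.io/v1
kind: Backup
metadata:
  name: cp4i-backup
  namespace: openshift-adp       # default OADP operator namespace
spec:
  includedNamespaces:
    - cp4i                       # example namespace to back up
  labelSelector:
    matchLabels:
      backup.example.com/include: "true"
  ttl: 720h0m0s                  # retain the backup for 30 days
```

A corresponding Restore custom resource, referencing this backup by name, would be created on the recovery cluster to run the restore.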

For operators and instances that are not yet supported by OADP, you can use automation to recover data. If you prefer, you can use the automation approach for all operators and instances, but it is more complex than using OADP and, in some cases, backs up less data.

Using automation techniques in your DR strategy means that, at any point, you can rerun your deployment pipeline against a new OpenShift cluster to create an instance with the same configuration state as the original. These techniques include infrastructure as code, CI/CD, and GitOps, which manage versioning and automate the deployment of instances from version control. You can configure Cloud Pak for Integration in this way by using your preferred tool, such as OpenShift Pipelines (Tekton), Argo CD, or an equivalent.
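For instance, a GitOps sketch might use an Argo CD Application that continuously applies CR definitions from version control, so the same repository can be pointed at a new cluster after a disaster. The repository URL, path, and namespaces below are hypothetical:

```yaml
# Illustrative Argo CD Application that deploys Cloud Pak for
# Integration custom resources from a Git repository. Rerunning
# this against a new cluster recreates the same configuration state.
# The repository URL, path, and namespaces are placeholders.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: cp4i-instances
  namespace: openshift-gitops
spec:
  project: default
  source:
    repoURL: https://git.example.com/org/cp4i-config.git
    targetRevision: main
    path: environments/dr-cluster
  destination:
    server: https://kubernetes.default.svc
    namespace: cp4i
  syncPolicy:
    automated:
      prune: true     # remove resources deleted from Git
      selfHeal: true  # revert interactive drift back to Git state
```

The `selfHeal` setting also addresses the configuration-drift concern noted for MQ below, by reverting interactive changes back to the state held in source control.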

The method you use for restoring dynamic runtime data will vary for each operator or instance to reflect the following items:

  • The functional and non-functional characteristics of the runtime environment.

  • The business domain in which those instances are used.

The following guidance covers the configuration data and runtime data approaches for each Cloud Pak for Integration operator or instance.

MQ

  • Configuration data: All configuration data, such as the definitions of queues and channels, is stored in the custom resource (CR) and related ConfigMaps or secrets. Recreate this data by using CI/CD. Administrators should avoid applying interactive changes, which can cause configuration drift from the source of truth stored in source control.

  • Runtime data: Because MQ message data is highly transient and is typically consumed within seconds of being sent, it is not useful to take backups of it. Furthermore, restoring a backup of old message data can cause significant problems for business applications that are forced to reprocess old, duplicate data. Instead, customers typically choose to recreate new, empty instances from configuration data only. Important: After a queue manager is recreated, you must reset the channel sequence numbers on any other queue managers to which it was connected in order to re-establish communication.

Event Streams

  • Configuration data: Configuration data, such as Event Streams CRs, topics, Kafka user objects, and related secrets, can be backed up and restored by using OADP. Alternatively, you can recreate this data from source control by using CI/CD.

  • Runtime data: To provide a standalone copy of the event history in the event of a failure, you can asynchronously replicate event data to a second active Kafka cluster by using the geo-replication feature. See About geo-replication for more information.

Event Gateway

  • Configuration data: Configuration data, such as the Event Gateway resource, can be backed up and restored by using OADP, or recreated from source control by using CI/CD.

  • Runtime data: Not applicable, because there is no dynamic runtime data.

API Connect

  • Configuration data: Configuration data stored in CRs, secrets, and similar elements should be recreated from source control by using CI/CD. Configuration of provider organizations, catalogs, spaces, and deployed APIs or Products can be automated by using CI/CD, or alternatively saved by using the built-in management subsystem backup functionality described in Backing up and restoring the management subsystem. Developer Portal configuration and customization is typically saved by using the built-in Portal subsystem functionality described in Backing up and restoring the developer portal. A related capability is the multiple data center deployment strategy, which provides an active/warm-standby deployment of the Management and Portal services across two OpenShift clusters in different locations and offers real-time replication of the Management and Portal subsystem data.

  • Runtime data: Analytics data generated as a result of runtime API calls through the Gateway can be stored by using the built-in Analytics subsystem backup functionality described in Configuring analytics database backups. Depending on the business context and data usage, some customers might choose not to back up Analytics data and instead start from a clean analytics state in the event of a disaster.

App Connect Designer

  • Configuration data: Configuration data stored in CRs and similar elements can be backed up and restored by using OADP, which also backs up data in persistent volumes. You can recreate configuration data from source control by using CI/CD instead, but this approach does not recover data in persistent volumes.

  • Runtime data: Not applicable (no dynamic runtime data in this case).

App Connect Dashboard

  • Configuration data: Configuration data stored in CRs can be backed up and restored by using OADP, which also backs up data in persistent volumes. You can recreate configuration data from source control by using CI/CD instead, but this approach does not recover data in persistent volumes. If you use the CI/CD approach, handle other persisted configuration data as described in Backing up and restoring configuration objects in the App Connect Dashboard in Red Hat OpenShift and Managing BAR files.

  • Runtime data: Not applicable (no dynamic runtime data in this case).

Automation assets

  • Configuration data: Configuration data stored in CRs and other files can be backed up and restored by using OADP. For more information, see Backing up and restoring IBM Cloud Pak for Integration. Alternatively, you can recreate this data from source control by using CI/CD.

  • Runtime data: Not applicable (no dynamic runtime data in this case).

Aspera

  • Configuration data: Configuration data stored in CRs and other files should be recreated from source control by using CI/CD.

  • Runtime data: The Redis database can be backed up by using OADP, as documented in the Redis operator documentation. Users should separately implement an appropriate backup mechanism for their chosen external file store, such as NFS or cloud storage.

DataPower

  • Configuration data: Configuration data stored in CRs, ConfigMaps, secrets, and other resources should be recreated from source control by using CI/CD.

  • Runtime data: Not applicable (no dynamic runtime data in this case).

App Connect Integration server, Integration runtime, and Switch server

  • Configuration data: Configuration data stored in CRs, ConfigMaps, secrets, and other resources can be backed up and restored by using OADP. Alternatively, you can recreate this data from source control by using CI/CD.

  • Runtime data: Not applicable (no dynamic runtime data in this case).

Assembly

  • Configuration data: Assemblies should be recreated from source control by using CI/CD. Individual deployed instances within an assembly should also be backed up as described in the preceding entries for those instance types.

  • Runtime data: Not applicable.

Cloud Pak foundational services

  • Configuration data: You can use OADP to back up and restore the operators and the configuration data that is stored in the CommonService Kubernetes resource; OADP also backs up the data that is stored in persistent volumes. Alternatively, you can recreate the operators and configuration data from source control by using CI/CD. OADP is the only option for backing up identity and access management.

Platform UI

  • Configuration data: The PlatformNavigator custom resource should be recreated from source control by using CI/CD. It is also important to have an appropriate backup of Cloud Pak foundational services, which is where the user authentication state is stored; see the preceding entry for Cloud Pak foundational services.

  • Runtime data: Not applicable (no dynamic runtime data in this case).
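As a sketch of the channel sequence reset mentioned for MQ above, the reset can be performed with an MQSC command on each partner queue manager. The queue manager and channel names here are illustrative placeholders:

```
# Illustrative only: after queue manager QM1 is recreated, reset the
# channel sequence number on its partner queue manager QM2 so the
# channel can restart. QM1.TO.QM2 is a placeholder channel name.
echo "RESET CHANNEL(QM1.TO.QM2) SEQNUM(1)" | runmqsc QM2
```

Repeat the reset for each channel that connected the recreated queue manager to a partner, then restart the affected channels.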