Disaster Recovery for IBM Runbook Automation
Configure Runbook Automation (RBA) to provide Disaster Recovery (DR) and High Availability (HA) capabilities between two sites in the context of IBM Netcool Operations Insight®. RBA does not support deployments with more than two sites.
Overview
The RBA deployment itself is stateless: all persistent state is stored in Apache CouchDB, and non-persistent data is held in Apache Kafka. As a result, the RBA DR solution consists of filtered, asynchronous, bidirectional data replication between the Apache CouchDB instances installed on two independent sites.
Quality of service properties
The Apache CouchDB replication setup described in this document enables an active-active configuration between two RBA installations. This deployment can serve users on both sites simultaneously, including write operations.
Because CouchDB replication is asynchronous, a change is committed first on the site where it is written and appears on the other site only after a delay. In an outage scenario, data that was committed on one site but not yet replicated to the other site is therefore unavailable until the failed site comes back online. The maximum amount of such unreplicated data, measured in time, is often referred to as the Recovery Point Objective (RPO). For the given setup, the RPO is the amount of time it takes to replicate the data, which depends on the amount of data, available hardware resources, network latency, and bandwidth.
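The dependency of the RPO on data volume, latency, and bandwidth can be illustrated with a rough back-of-the-envelope calculation. The following is a minimal sketch, not a measurement tool: the function name, the linear model, and the sample numbers are assumptions made for illustration, and the model ignores hardware resources and CouchDB processing overhead.

```python
def estimate_replication_lag_seconds(pending_bytes, bandwidth_bytes_per_s, latency_s):
    """Rough lower bound for the time needed to replicate pending changes.

    Assumed model: one network latency plus pure transfer time.
    Real replication adds processing overhead on both CouchDB instances.
    """
    if bandwidth_bytes_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return latency_s + pending_bytes / bandwidth_bytes_per_s


# Hypothetical example: 50 MiB of unreplicated changes,
# a 10 MiB/s link between the sites, and 80 ms of latency.
lag = estimate_replication_lag_seconds(
    pending_bytes=50 * 1024 * 1024,
    bandwidth_bytes_per_s=10 * 1024 * 1024,
    latency_s=0.08,
)
print(f"Estimated replication lag: {lag:.2f} s")
```

Under these assumed numbers, changes written in the last few seconds before an outage would not yet be present on the other site.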
For many disaster recovery scenarios, another important metric is the Recovery Time Objective (RTO). It measures the time between an outage on one site and the point at which the second site is available to take over operations. Because the setup in this documentation is an active-active scenario, the RTO is zero seconds: a user can immediately connect to the other site if one site fails. If users of IBM® Runbook Automation access it through a load balancer or DNS entry that has to switch from one site to the other, the RTO is equal to the time it takes to make this switch (not part of this documentation).
Technical details
Data replication is implemented with the CouchDB replication mechanism. Refer to the official CouchDB documentation for more information. For Runbook Automation, a continuous filtered replication is set up in both directions. These replications continuously replicate changes from one site to the other and vice versa. A filtered replication is used to exclude specific documents for which replication would provide no benefit, or would even be harmful.
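To make the idea of a continuous, filtered, bidirectional replication more concrete, the following sketch builds the kind of documents that CouchDB accepts in its `_replicator` database, one per direction. The fields `source`, `target`, `continuous`, and `filter` are standard CouchDB replication document fields. Everything else is an assumption for illustration only: the host names, the database name `rba`, the document IDs, and the filter `replication/exclude_local` are hypothetical and are not the names used by the RBA product.

```python
import json


def replication_doc(doc_id, source_url, target_url, filter_name):
    """Build a document for CouchDB's _replicator database."""
    return {
        "_id": doc_id,
        "source": source_url,
        "target": target_url,
        "continuous": True,      # keep replicating as new changes arrive
        "filter": filter_name,   # "<design doc name>/<filter function name>"
    }


# Hypothetical endpoints and names, chosen only for this example.
SITE_A = "https://couchdb.site-a.example.com:6984/rba"
SITE_B = "https://couchdb.site-b.example.com:6984/rba"
FILTER = "replication/exclude_local"

# Hypothetical design document holding the filter function.
# CouchDB filter functions are JavaScript and return true for
# every document that should be replicated.
FILTER_DESIGN_DOC = {
    "_id": "_design/replication",
    "filters": {
        "exclude_local": "function (doc, req) { return doc.site_local !== true; }"
    },
}

# One replication document per direction gives a bidirectional setup.
docs = [
    replication_doc("rba-a-to-b", SITE_A, SITE_B, FILTER),
    replication_doc("rba-b-to-a", SITE_B, SITE_A, FILTER),
]

for doc in docs:
    print(json.dumps(doc, indent=2))
```

The sketch only prints the documents and sends nothing to a server. In a real deployment, follow the RBA installation instructions for the actual replication and filter definitions.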
Limitations
The IBM Runbook Automation Disaster Recovery solution relies on asynchronous, continuous Apache CouchDB replication. For that reason, there might be a small delay before changes are propagated between the two sites, depending on the amount of data, available hardware resources, and network bandwidth. Because data is replicated asynchronously and Apache CouchDB provides a conflict resolution procedure, changes can be saved even when only one site is online. The replication resumes once both sites are available again.
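The conflict handling mentioned above can be illustrated with a simplified model of how CouchDB chooses a winning revision when the same document was changed on both sites. This sketch is an approximation of the behavior described in the CouchDB documentation (non-deleted revisions are preferred, then the longer revision history, then the higher revision ID), not the CouchDB implementation itself. The function name and the sample revision IDs are hypothetical.

```python
def pick_winner(revisions):
    """Return the winning revision ID from a list of (rev, deleted) tuples.

    A rev looks like "<position>-<hash>", for example "2-c7d".
    Simplified ordering: non-deleted first, then higher position,
    then the lexicographically higher hash.
    """
    def sort_key(item):
        rev, deleted = item
        position, _, rev_hash = rev.partition("-")
        return (not deleted, int(position), rev_hash)

    return max(revisions, key=sort_key)[0]


# The same document was updated independently on each site while
# the sites were disconnected. Both revisions are at position 2.
conflict = [("2-a1f", False), ("2-c7d", False)]
winner = pick_winner(conflict)
print(winner)
```

Because the choice depends only on the revisions themselves, both sites arrive at the same winner without coordinating. CouchDB keeps the losing revision as a conflict, which can be inspected by requesting the document with `conflicts=true`.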
The RBA and Netcool®/Impact integration does not work in a geo-redundant deployment with Netcool Operations Insight. However, RBA can still be used without the Netcool/Impact integration.