Using VM snapshots for infrastructure backup and disaster recovery
You can use VM snapshots to back up and restore API Connect on VMware, and also for infrastructure disaster recovery.
Backup and restore procedures for API Connect on VMware are described in Backing up and restoring on VMware. The standard backup procedures back up all the state that API Connect requires but do not include the underlying infrastructure software.
In some scenarios, you might want to take backups at the infrastructure layer. You can do this by taking snapshots of the virtual machines or underlying storage volumes, provided that you follow the constraints that are described here.
Ideal times to take VM snapshots are after initial installation and configuration, and before and after upgrades. If your API Connect deployment is lost in a disaster, reverting to the last snapshot and then restoring your subsystem database backups can be faster than the standard Disaster recovery procedures.
- Requirements for taking a consistent backup of an API Connect system at the infrastructure level
- An API Connect VMware deployment is shown in the following diagram:
- The API Connect deployment consists of multiple virtual machine images that are running Kubernetes, subsystem databases, and the API Connect microservices. Transactional communication runs between the virtual machines at all levels and at all times, even if the system appears idle.
- Within the API Connect deployment, multiple stateful or database layers exist that maintain a consistency protocol. For example, etcd (the key-value store that Kubernetes relies on) uses the Raft consensus algorithm. These algorithms depend for consistency on the basic assertion that time flows forwards on all systems together. If the state of one virtual machine were to move backwards relative to the others by even the smallest quantum of time, then consistency is lost and the system might fail.
If VM snapshots are taken across multiple virtual machines, the snapshots cannot all be taken at precisely the same instant. The restored VMs might appear to work correctly, yet contain undetected inconsistencies that cause unpredictable errors. These issues can be difficult to diagnose.
It is also possible that you can test this procedure successfully multiple times and observe no problems. Irrespective of the apparent success of tests, corruption might be present in the system state. For this reason, taking VM snapshots or clones of running API Connect VMs is not supported.
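The timing problem can be illustrated with a toy simulation. This is a deliberately simplified sketch, not API Connect or etcd code: the `Node` class, the `run_cluster` function, and the VM names are hypothetical, and real consensus protocols are far more involved. It only shows that snapshots captured at different instants freeze replicas in states that never coexisted.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A toy replica that stores an append-only log of entry numbers."""
    name: str
    log: list[int] = field(default_factory=list)

    def snapshot(self) -> list[int]:
        # A snapshot freezes the log exactly as it is at this instant.
        return list(self.log)


def run_cluster(ticks: int, snapshot_at: dict[str, int]) -> dict[str, list[int]]:
    """Replicate one entry per tick to every node, and snapshot each node
    at its own tick (snapshot_at maps node name -> tick)."""
    nodes = [Node(name) for name in snapshot_at]
    snapshots: dict[str, list[int]] = {}
    for tick in range(1, ticks + 1):
        for node in nodes:
            node.log.append(tick)  # the entry reaches every replica
            if snapshot_at[node.name] == tick:
                snapshots[node.name] = node.snapshot()
    return snapshots


def is_consistent(snapshots: dict[str, list[int]]) -> bool:
    """In this toy model, restored replicas agree only if all logs match."""
    logs = list(snapshots.values())
    return all(log == logs[0] for log in logs)


# vm-a is captured two ticks earlier than vm-b, so after a restore
# vm-a has "moved backwards" relative to vm-b.
skewed = run_cluster(ticks=5, snapshot_at={"vm-a": 3, "vm-b": 5})
aligned = run_cluster(ticks=5, snapshot_at={"vm-a": 5, "vm-b": 5})
print(is_consistent(skewed))   # False
print(is_consistent(aligned))  # True
```

In the skewed case, `vm-b` remembers entries 4 and 5 while `vm-a` has no record of them, which is the kind of hidden disagreement that a restored system can carry without any immediate symptom.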
- How to take a consistent backup of an API Connect system at the infrastructure level
- The only way to take a consistent backup of an API Connect system with VM snapshots and clones is to shut down ALL of the VMs before you take the snapshots or clones.
- When the VMs are stopped, snapshots can be taken. Clones can be made from the snapshots. The set of clones represents a valid, consistent state that the API Connect system was in when it was shut down.
- When all the snapshots and clones are taken, the API Connect VMs can be restarted.
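The rule that every VM must be stopped before any snapshot is taken can be enforced as a guard in automation. The following is a minimal sketch under stated assumptions: `PowerState` and `snapshot_all` are hypothetical names, the power states would come from your own inventory tooling, and the function only plans snapshot names rather than calling any VMware API.

```python
from enum import Enum


class PowerState(Enum):
    ON = "on"
    OFF = "off"


def snapshot_all(vms: dict[str, PowerState]) -> list[str]:
    """Return the snapshot names to create, refusing if any VM is running.

    vms maps a VM name to its current power state (hypothetical input).
    """
    running = sorted(name for name, state in vms.items() if state is PowerState.ON)
    if running:
        # A single running VM is enough to make the whole set inconsistent.
        raise RuntimeError(f"Shut down these VMs before taking snapshots: {running}")
    return [f"{name}-snapshot" for name in sorted(vms)]


# Example: one VM is still running, so the guard refuses to proceed.
vms = {"mgmt-1": PowerState.OFF, "portal-1": PowerState.ON}
try:
    snapshot_all(vms)
except RuntimeError as err:
    print(err)
```

The guard is all-or-nothing on purpose: a partial set of snapshots has no value as a consistent backup.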
- Restoring an API Connect deployment from VM clones:
- The cloned VMs represent a valid state of the original API Connect deployment. They restart in the same way as the original system restarted.
- If the objective is to stand up another instance of the cloned API Connect system for testing or DR purposes, the following must be true:
- The original system and the cloned system must be isolated from one another. The clone uses the same hostnames and IP addresses as the original, so it must be stood up in a separate VM hosting environment, with Network Address Translation between the clone and any network to which the original system is connected.
- The cloned system must be stood up in an identical environment in terms of its hardware, software, and network.
- Disaster Recovery by using Disk Replication:
- A common DR approach is to use disk-based replication. In this case, a storage controller such as IBM’s SAN Volume Controller provides Copy Services. You can use Copy Services to create a consistent copy of a set of disks. IBM calls this feature Synchronous Remote Copy or Metro Mirror. Other storage vendors have similar features.
- If you use Remote Copy to synchronize disk images to a DR site, then IBM supports API Connect used with these systems if the copy at the DR site is consistent. Ensuring this consistency is the responsibility of the storage controller and its configuration. It is critically important to ensure that ALL of the disk volumes for all the VMs or nodes in the entire API Connect deployment are in the same remote copy consistency group as shown in the following image. This ensures that DR scenarios do not create inconsistent volumes. If the disks are not in the same consistency group, then the copy is not managed consistently, and the system is not supported.
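The consistency group requirement lends itself to an automated check before you rely on a DR copy. The sketch below assumes that you can export a mapping of volume name to consistency group name from your storage controller; the function name `shared_consistency_group` and the sample volume and group names are hypothetical, and no storage vendor API is called.

```python
from typing import Optional


def shared_consistency_group(volume_to_group: dict[str, Optional[str]]) -> str:
    """Return the one consistency group that holds every volume, or raise.

    volume_to_group maps each disk volume of the deployment to its
    consistency group name, or to None if it is in no group.
    """
    if not volume_to_group:
        raise ValueError("No volumes were supplied")
    ungrouped = sorted(v for v, g in volume_to_group.items() if g is None)
    if ungrouped:
        raise ValueError(f"Volumes not in any consistency group: {ungrouped}")
    groups = set(volume_to_group.values())
    if len(groups) != 1:
        raise ValueError(f"Volumes are spread across groups: {sorted(groups)}")
    return groups.pop()


# Example with made-up names: every volume is in the same group.
volumes = {"mgmt-1-disk": "apic-dr", "portal-1-disk": "apic-dr", "gw-1-disk": "apic-dr"}
print(shared_consistency_group(volumes))  # apic-dr
```

A check like this covers only the grouping; whether the remote copy itself is consistent remains the responsibility of the storage controller and its configuration.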
- Infrastructure-based backup of API Connect using vSphere
- The following procedure outlines steps to clone a VMware instance of API Connect using the following topology:
- 3 Management VMs
- 3 Portal VMs
- 3 Analytics VMs
- 3 Gateways
- Each of these components has a load balancer with its own IP address. The load balancers all sit on a single VM, so the cluster is installed and configured on 13 VMs. In this example, the system was populated with data and APIs. Before you begin, test the system to ensure that all published APIs respond as expected. When this is done, use the vSphere UI to complete the following steps:
- Power off VMs.
To power off the VMs in sync, schedule a “shutdown guest OS” task such that all the VMs shut down at the same time. Although the shutdowns do not occur at exactly the same instant, scheduling keeps them as close together in time as possible.
- Take snapshots.
Take a snapshot of each VM after shutdown. Using the scheduling feature in the UI, schedule a snapshot of all the VMs at the same instant. If cloning adversely affects the system in any way, the original VMs can be reverted to these snapshots.
- Clone VMs.
Use the clone feature of the UI to create a clone of each VM. Each clone must sit in the same resource pool as the original cluster. Each clone is an exact copy of the disks of the original VM. The clones and originals are in no way coupled.
- Power on clones.
Schedule the power-on for each clone VM such that all the clone VMs power on simultaneously. When this is done, test the cluster to ensure that the APIs respond as expected. For the example cluster topology shown earlier on this page, it can take a couple of hours after power-on for all the APIs to respond as expected.
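Because the APIs can take a long time to respond after the clones are powered on, the final test is easier to run as a polling loop with a timeout. The following is a minimal sketch: `wait_until_ready` is a hypothetical helper, and the `check` callable stands in for whatever test you use to confirm that the published APIs respond as expected. The clock and sleep functions are injectable so that the loop can be exercised without real waiting.

```python
import time
from typing import Callable


def wait_until_ready(
    check: Callable[[], bool],
    timeout_s: float,
    interval_s: float,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call check() repeatedly until it returns True or the timeout expires.

    Returns True as soon as check() succeeds, or False once timeout_s
    seconds have elapsed without success.
    """
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)  # wait before probing the cluster again
```

For the example topology, a call such as `wait_until_ready(check, timeout_s=3 * 3600, interval_s=300)` would allow for the couple of hours that the cluster can need, where `check` is your own function and the specific timeout and interval are illustrative values.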