Recovering the Management subsystem on VMware

Recover the management subsystem from backups after a disaster event.

Before you begin

To successfully recover the management subsystem, you must have previously completed the steps in Preparing the management subsystem for disaster recovery on VMware.

You must use the same project directory that you used for your original deployment, or a restore of your project directory backup, to ensure that configuration and secret information is transferred to the replacement deployment.

In a clustered deployment, if any one VM is corrupted then all of the VMs in the cluster must be redeployed. You cannot replace just a single corrupted VM in a cluster.

Important: Successful disaster recovery depends on recovery of both the Management subsystem and the Developer Portal subsystem. You must complete the preparation steps for both subsystems in order to achieve disaster recovery. If you have to perform a restore, you must complete the restoration of the Management subsystem first, and then immediately restore the Developer Portal. Therefore, the backups of the Management and Portal subsystems must be taken at the same time, to ensure that the Portal sites are consistent with the Management database.

Procedure

  1. Determine which backup to restore from.
    1. Obtain a list of the available backups for your backup type:
      • s3 backups
        • For IBM Cloud® Object Storage, you can check currently stored backups in the COS console. Select Buckets > Objects. See the Object Names displayed on the <cos_name>/backup/db panel. For example:
          20200605-105429F
          20200605-100008F
          20200605-11040F
        • For AWS, you can check currently stored backups in the S3 console. For example, under Amazon S3 > cluster_name > old cluster > backup > db:

          20200606-144315F
          20200606-145011F
          
      • SFTP backups

        View the SFTP backups available on your remote storage site:

        -rw-r--r--    1 root     root     13092333 Aug 26 08:56 20200826-154646F.tgz
        -rw-r--r--    1 root     root     18703758 Aug 26 09:10 20200826-160010F.tgz
        -rw-r--r--    1 root     root     24318561 Aug 26 09:21 20200826-161301F.tgz
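If the listing is long, the backup IDs can be pulled out of the file names programmatically. The following is a minimal sketch, assuming the listing looks like the sample above (one file per line, named <backup-id>.tgz). The function name backup_ids_from_listing is hypothetical and is not part of apicup.

```python
import re
from typing import List

# A backup file name is assumed to be "<backup-id>.tgz", where the ID is
# YYYYMMDD-HHMMSS followed by F (full) or I (incremental).
_TGZ_ID = re.compile(r"(\d{8}-\d{6}[FI])\.tgz\s*$")


def backup_ids_from_listing(listing: str) -> List[str]:
    """Return the backup IDs found in an `ls -l` style listing, newest first."""
    ids = []
    for line in listing.splitlines():
        match = _TGZ_ID.search(line)
        if match:
            ids.append(match.group(1))
    # The timestamp is fixed width, so a reverse string sort is newest first.
    return sorted(ids, reverse=True)


if __name__ == "__main__":
    sample = (
        "-rw-r--r--    1 root     root     13092333 Aug 26 08:56 20200826-154646F.tgz\n"
        "-rw-r--r--    1 root     root     18703758 Aug 26 09:10 20200826-160010F.tgz\n"
        "-rw-r--r--    1 root     root     24318561 Aug 26 09:21 20200826-161301F.tgz\n"
    )
    for backup_id in backup_ids_from_listing(sample):
        print(backup_id)
```

Lines that do not end in a recognizable <backup-id>.tgz name are ignored rather than reported, which keeps the sketch tolerant of other files in the same directory.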
        
    2. Select the backup ID of the backup you want to restore. For example, in the sample SFTP backup list, for the Aug 26 09:21 backup, the backup ID is 20200826-161301F.

      Each filename contains the date, time, and type of the stored backup. The format of the backup ID is YYYYMMDD-HHMMSS<F|I>. Full backups are denoted by the suffix F on the ID, and incremental backups by the suffix I. For each incremental backup, ensure that its prior full backup is also present in storage. You can check this by examining the ID, which has the form <prior-backup-id>_<backup-id>.
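The ID format described above can be checked mechanically before you commit to a backup. The following is a minimal sketch, assuming that the prior backup named in an incremental ID is a full backup (suffix F). The names BackupId, parse_backup_id, and has_prior_full are hypothetical, and the incremental ID used in the example is made up for illustration.

```python
import re
from datetime import datetime
from typing import Iterable, NamedTuple, Optional

# Optional "<prior-backup-id>_" prefix, then YYYYMMDD-HHMMSS and F or I.
_BACKUP_ID = re.compile(
    r"^(?:(?P<prior>\d{8}-\d{6}F)_)?(?P<stamp>\d{8}-\d{6})(?P<kind>[FI])$"
)


class BackupId(NamedTuple):
    taken_at: datetime
    kind: str             # "full" or "incremental"
    prior: Optional[str]  # prior full backup ID, when the ID encodes one


def parse_backup_id(backup_id: str) -> BackupId:
    """Split a backup ID into its timestamp, type, and prior full backup ID."""
    match = _BACKUP_ID.match(backup_id)
    if match is None:
        raise ValueError(f"not a valid backup ID: {backup_id!r}")
    taken_at = datetime.strptime(match.group("stamp"), "%Y%m%d-%H%M%S")
    kind = "full" if match.group("kind") == "F" else "incremental"
    return BackupId(taken_at, kind, match.group("prior"))


def has_prior_full(backup_id: str, stored_ids: Iterable[str]) -> bool:
    """Return True if the backup is full, or its prior full backup is stored."""
    parsed = parse_backup_id(backup_id)
    if parsed.kind == "full":
        return True
    return parsed.prior is not None and parsed.prior in set(stored_ids)


if __name__ == "__main__":
    print(parse_backup_id("20200826-161301F"))
    # Hypothetical incremental ID built on a full backup from the sample list.
    incremental = "20200605-105429F_20200605-110400I"
    print(has_prior_full(incremental, ["20200605-105429F", "20200605-100008F"]))
```

An ID that does not match the documented format raises ValueError, so a malformed entry is caught before it is used in a restore.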

    Note:

    During the disaster recovery process, the S3 configuration details of the old Management subsystem are used, but the old subsystem must be offline. Two Management subsystems cannot simultaneously use the same S3 bucket name in their database backup configurations.

  2. Make sure you know the management database cluster name. Use the following steps applicable to your backup type:
    Backup type: s3
    1. Open your IBM Cloud® Object Storage or AWS S3 console and go to the bucket where the old Management subsystem backups are stored.
    2. Download backup/db/<backup-id>/pg_data/postgresql.conf.gz. Decompress it and open postgresql.conf to view the database cluster name:
      • IBM Cloud Object Storage
        cluster_name = 'm1-a6287572-postgres'
        • m1 - Management subsystem name
        • a6287572 - site name
      • AWS S3
        cluster_name = 'm1-c91dc0b9-postgres'
        • m1 - Management subsystem name
        • c91dc0b9 - site name
    Backup type: SFTP
    1. Recover the Management database cluster name and siteName by examining the SFTP backup tar file. Download or move the file and extract (untar) it.
    2. Open <management-subsystem-name>-<siteName>-postgres-backrest-shared-repo/backup/db/<backup-id>/pg_data/postgresql.conf.gz, which contains the Management subsystem name and siteName. For example:
      # Do not edit this file manually!
      # It will be overwritten by Patroni!
      include 'postgresql.base.conf'
      
      archive_command = 'source /opt/cpm/bin/pgbackrest/pgbackrest-set-env.sh && pgbackrest archive-push "%p"'
      archive_mode = 'True'
      archive_timeout = '60'
      autovacuum_vacuum_cost_limit = '1000'
      autovacuum_vacuum_scale_factor = '0.01'
      cluster_name = 'm1-f785a3e3-postgres'

      In this example:

      • cluster_name has both the management subsystem name and siteName
      • m1 - Management subsystem name
      • f785a3e3 - site name
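Reading the cluster name out of postgresql.conf and splitting it into its parts can also be scripted. The following is a minimal sketch, assuming the cluster name always has the form <subsystem-name>-<siteName>-postgres as in the examples above, and that the site name itself contains no hyphen. The function names read_cluster_name and split_cluster_name are hypothetical.

```python
import re
from typing import Tuple

# Matches an uncommented line such as: cluster_name = 'm1-f785a3e3-postgres'
_CLUSTER_NAME = re.compile(r"^[ \t]*cluster_name[ \t]*=[ \t]*'([^']+)'", re.MULTILINE)


def read_cluster_name(conf_text: str) -> str:
    """Return the cluster_name value from the text of postgresql.conf."""
    match = _CLUSTER_NAME.search(conf_text)
    if match is None:
        raise ValueError("no cluster_name setting found")
    return match.group(1)


def split_cluster_name(cluster_name: str) -> Tuple[str, str]:
    """Split '<subsystem>-<site>-postgres' into (subsystem name, site name)."""
    suffix = "-postgres"
    if not cluster_name.endswith(suffix):
        raise ValueError(f"unexpected cluster name: {cluster_name!r}")
    stem = cluster_name[: -len(suffix)]
    # Split on the last hyphen so that a subsystem name may contain hyphens.
    subsystem, separator, site = stem.rpartition("-")
    if not separator or not subsystem or not site:
        raise ValueError(f"unexpected cluster name: {cluster_name!r}")
    return subsystem, site


if __name__ == "__main__":
    conf = (
        "# Do not edit this file manually!\n"
        "include 'postgresql.base.conf'\n"
        "archive_mode = 'True'\n"
        "cluster_name = 'm1-f785a3e3-postgres'\n"
    )
    name = read_cluster_name(conf)
    print(name, split_cluster_name(name))
```

To read the compressed file directly, the text can be obtained with Python's standard gzip module, for example gzip.open(path, "rt").read(), and then passed to read_cluster_name.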
  3. Use your existing project directory, or a restore of your project directory backup, to install the Management subsystem:
    1. Create your ISO files:
      apicup subsys install mgmt --out mgmtplan-out

      The --out parameter and value are required.

      In this example, the ISO files are created in the myProject/mgmtplan-out directory.
      Note: If your original ISO files are still available and you haven't upgraded from the original installation, you can reuse them. However, if you have upgraded your original deployment, you must create new ISO files using the version of apicup that corresponds to the version your API Connect installation was on at the time of the disaster. For example, do not attempt to deploy v10.0.5.1 OVAs with ISO files that were created with apicup v10.0.4.0.
    2. Deploy the files into the replacement VMs. See Deploying the Management subsystem OVA file.
    3. Verify the deployment. See Verify installation of the Management subsystem.
      Important:

      For S3, the recovery remains in an intermediate state until the restore is complete, and Postgres WAL (write-ahead log) files might accumulate and fill the disk. To avoid this possibility, continue immediately with the next step.

      Note that if you delay completion of the restore:

      • The health check might fail. In this case, you can still proceed to the next step and perform a restore.
      • Postgres WAL files might consume all disk space. In this case, you must either:
        • Re-install the system, prepare again for disaster recovery, and perform the restore.
        • Or increase disk space so that the system returns to a stable state, and then proceed with the restore.
  4. Once your Management subsystem is ready, confirm that the backup ID noted in Step 1 is present on the SFTP or S3 server.
  5. After a few moments, confirm that a ManagementBackup with CR type record exists and that its backup ID matches the backup ID noted in Step 1.
    You can list the management backups using:
    apicup subsys list-backups <subsystem_name>

    For example:

    NAME                STATUS   ID                 CLUSTER                        SUBSYSTEM   TYPE   CR TYPE   AGE
    mgmt-backup-8hqqg   Ready    20200826-161301F   management-82b290a2-postgres   management  full   record    40s
    
  6. Perform a Management Restore using the name of the backup that has the ID you want to restore.

    For example, for ID 20200826-161301F the backup name is mgmt-backup-8hqqg.

    For instructions on how to restore, see Restoring the management subsystem.

    Once the Management Restore has completed and the database is running again, the data of the old Management subsystem is restored onto the new Management subsystem. Manual and scheduled backups should work as normal again.
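The lookup from backup ID to backup name in steps 5 and 6 can be scripted against the list-backups output. The following is a minimal sketch, assuming the output is a whitespace-separated table with the same columns as the sample in step 5 and no empty cells. The function names parse_backup_table and backup_name_for_id are hypothetical and are not part of apicup.

```python
from typing import Dict, List


def parse_backup_table(output: str) -> List[Dict[str, str]]:
    """Parse list-backups output into one dict per row, keyed by column name."""
    lines = [line for line in output.splitlines() if line.strip()]
    if not lines:
        return []
    # "CR TYPE" contains a space, so rename it before splitting on whitespace.
    header = lines[0].replace("CR TYPE", "CR_TYPE").split()
    rows = []
    for line in lines[1:]:
        cells = line.split()
        if len(cells) != len(header):
            continue  # skip rows that do not match the expected layout
        rows.append(dict(zip(header, cells)))
    return rows


def backup_name_for_id(output: str, backup_id: str) -> str:
    """Return the name of the single CR type 'record' backup with this ID."""
    names = [
        row["NAME"]
        for row in parse_backup_table(output)
        if row.get("ID") == backup_id and row.get("CR_TYPE") == "record"
    ]
    if len(names) != 1:
        raise LookupError(f"expected one record backup for {backup_id}, found {len(names)}")
    return names[0]


if __name__ == "__main__":
    sample = (
        "NAME                STATUS   ID                 CLUSTER                        SUBSYSTEM   TYPE   CR TYPE   AGE\n"
        "mgmt-backup-8hqqg   Ready    20200826-161301F   management-82b290a2-postgres   management  full   record    40s\n"
    )
    print(backup_name_for_id(sample, "20200826-161301F"))
```

The sketch raises LookupError when no matching backup, or more than one, is listed, so an ambiguous result is never passed on to the restore.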

What to do next

Complete the recovery steps for the Developer Portal subsystem on VMware. See Recovering the Developer Portal subsystem on VMware.