Recovering the management subsystem from S3 backups

You can recover the management subsystem from S3 backups after a disaster event.

Before you begin

To successfully recover the management subsystem, you must have previously completed the steps in Preparing the management subsystem for disaster recovery.
Important: Successful disaster recovery depends on recovery of both the Management subsystem and the Developer Portal subsystem. You must complete the preparation steps for both subsystems in order to achieve disaster recovery. If you have to perform a restore, you must complete the restoration of the Management subsystem first, and then immediately restore the Developer Portal. Therefore, the backups of the Management and Portal subsystems must be taken at the same time, to ensure that the Portal sites are consistent with the Management database.

About this task

To recover from a disaster event, you must create a new IBM API Connect® installation with a running IBM API Connect Operator.

Note: Limitation for backups created on Version 10.0.2
  • If restoring a Version 10.0.2 backup onto a new Version 10.0.2 deployment, performing a restore may not work if the subsystem CR name exceeds 15 characters. This limitation applies only to restoring onto Version 10.0.2.
  • Restoration is supported for Version 10.0.2 backups onto a Version 10.0.3.0 or later deployment when subsystem name exceeds 15 characters, as long as the correct spec.originalUID is specified upon restore. See Step 3.f.

Procedure

  1. Determine which backup to restore from.
    • For IBM®'s Cloud Object Storage, you can check currently stored backups in the COS console. Select Buckets > Objects. See the Object Names displayed on the <cos_name>/backup/db panel. For example:
      20200605-105429F
      20200605-100008F
      20200605-11040F
    • For Amazon's AWS, you can check currently stored backups in the S3 console. For example, under Amazon S3 > cluster_name > old cluster > backup > db:

      20200606-144315F
      20200606-145011F
      

    The format of the backup ID is YYYYMMDD-HHMMSS<F|I>

    Take note of the ID of the backup you wish to restore to. Each ID contains the date, time, and type of the backup stored. This will be used later in the procedure.

    1. Incremental backups are denoted by the suffix I on the ID. Ensure that the prior backup of each incremental backup is also present in storage; you can check this because an incremental backup ID takes the form <prior-backup-id>_<backup-id>.
    2. Full backups are denoted by the suffix F on the ID.
    Note:

    During the disaster recovery process, the S3 configuration details of the older management system are used, but the older management system must be in offline mode. The old subsystem must be offline because two management systems cannot simultaneously use the same S3 bucket name in their database backup configurations.
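
    The suffix rules can be sketched in shell. This is a minimal illustration, not part of the product tooling; the incremental ID in the example is hypothetical, constructed only to show the <prior-backup-id>_<backup-id> form:

```shell
# classify_backup: report whether a backup ID is full or incremental.
# Suffix F = full; suffix I = incremental, in which case the portion
# before the underscore is the prior backup ID that must also be in storage.
classify_backup() {
  case "$1" in
    *F) echo "full" ;;
    *I) echo "incremental (requires prior backup ${1%%_*})" ;;
    *)  echo "unknown" ;;
  esac
}

classify_backup 20200606-145011F
classify_backup 20200606-144315_20200606-150500I
```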

  2. Make sure you know the management database cluster name.

    You can get this name from the original management subsystem CR. You made note of this name in Step 2.d in Preparing the management subsystem for disaster recovery.

    If you are not able to recover the original management subsystem CR, use the following steps to recover the Management database cluster name:

    1. Open your IBM Cloud® Object Storage or AWS S3 console and proceed to the bucket location where the old Management subsystem backups are located.
    2. Download backup/db/<backup-id>/pg_data/postgresql.conf.gz. Open postgresql.conf to view the database cluster name:
      • IBM Cloud Object Storage
        .
        .
        cluster_name = 'm1-a6287572-postgres'
        .
        .
        • m1 - Management subsystem name
        • a6287572 - site name
      • AWS S3
        .
        .
        cluster_name = 'm1-c91dc0b9-postgres'
        .
        .
        • m1 - Management subsystem name
        • c91dc0b9 - site name
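
    The parsing of cluster_name can be scripted once postgresql.conf.gz has been downloaded and unzipped. This is a minimal sketch that assumes the subsystem name itself contains no hyphens; the sample line is the IBM Cloud Object Storage example above:

```shell
# Simulate the downloaded, unzipped postgresql.conf with the example line.
printf "cluster_name = 'm1-a6287572-postgres'\n" > postgresql.conf

# Extract the cluster name, then split it into subsystem and site names.
cluster=$(sed -n "s/^cluster_name = '\(.*\)'$/\1/p" postgresql.conf)
subsystem=${cluster%%-*}                     # m1 (assumes no hyphen in the name)
site=${cluster#*-}; site=${site%-postgres}   # a6287572

echo "subsystem=$subsystem site=$site"
```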
  3. Before installing the replacement management subsystem CR:
    1. Apply the YAML file that contains the Management Database Encryption Secret into the cluster. For example, where encryption-bin-secret.yaml is the local YAML file containing the backed-up encryption secret:
      kubectl apply -f encryption-bin-secret.yaml -n <namespace>

      Replace <namespace> with the namespace being used for the management subsystem installation.

      This command re-creates the original Management Database encryption secret on the cluster. It will have the same name as the original secret.

    2. Add the following encryptionSecret subsection to the spec of the Management CR. For example, if management-enc-key is the name of the newly created secret on the cluster containing the original Management Database encryption secret from the previous step:
      encryptionSecret:
        secretName: management-enc-key
    3. For each of the saved YAML Files that contain the Management Client Application Credential Secrets, apply each file into the cluster using the following command:
      kubectl create -f <secret_name>.yaml -n <namespace>

      where <secret_name> is the local YAML file containing one of the backed-up Credential Secrets.

      Repeat this for each of the backed-up Credential Secrets. These are the secrets you saved in Step 2.b in Preparing the management subsystem for disaster recovery.

      These commands re-create the original Management Client Application Credential Secrets on the cluster. Each will have the same name as the original secret.
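
      If the saved secrets use the file-per-secret naming shown in Step 3.d, the commands can be generated in a loop. This is a sketch; the namespace apic and the file names are assumptions, and echo is used so that the commands can be reviewed before running them:

```shell
# Generate one kubectl create command per backed-up credential secret.
# The secret names are those listed in Step 3.d; the namespace is a placeholder.
ns="apic"
secrets="management-atm-cred management-ccli-cred management-cli-cred \
management-cui-cred management-dsgr-cred management-juhu-cred management-ui-cred"
for s in $secrets; do
  echo "kubectl create -f ${s}.yaml -n ${ns}"
done
```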

    4. Add the following customApplicationCredentials subsection to the spec subsection of the Management CR:
      customApplicationCredentials:
      - name: atm-cred
        secretName: management-atm-cred
      - name: ccli-cred
        secretName: management-ccli-cred
      - name: cli-cred
        secretName: management-cli-cred
      - name: cui-cred
        secretName: management-cui-cred
      - name: dsgr-cred
        secretName: management-dsgr-cred
      - name: juhu-cred
        secretName: management-juhu-cred
      - name: ui-cred
        secretName: management-ui-cred
      

      For each named credential above, the secretName is the name of the corresponding secret that was re-created in Step 3.c.

    5. Add the siteName property to the spec of the Management CR.

      For example, if a2a5e6e2 is the original siteName that was noted after the installation of the original Management Subsystem:

      siteName: a2a5e6e2
    6. Version 10.0.3.0 or later: Add the originalUID: property to the spec of the Management CR.

      When recreating a system in order to restore a backup into it, you must specify the same spec.originalUID in the new CR as was present in the system that was backed up. If the spec.originalUID values do not match, the restore will fail.

      
      spec:
        originalUID: "fa0f6f49-b931-4472-b84d-0922a9a92dfd"
      
      Note:
      • For Version 10.0.3.0 or later, if you do not specify spec.originalUID in the new CR, the operator automatically sets the Management CR value of spec.originalUID to match the new CR value metadata.uid. In this case, the restore will fail because the spec.originalUID in the saved (backed-up) CR does not match spec.originalUID in the new CR.
      • The originalUID is essential only when the subsystem CR name exceeds 15 characters in length, or when the API Connect Cluster CR name exceeds 10 characters. Recommended practice is that all backups should include the originalUID for Management.
      • See also Step 2.d in Preparing the management subsystem for disaster recovery.
    7. Verify that the name of the management subsystem in the CR matches with the old management subsystem name, as described in Step 2.d in Preparing the management subsystem for disaster recovery.
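
    Taken together, Steps 3.b, 3.d, 3.e, and 3.f add entries like the following to the spec of the Management CR. All values shown are the examples used in those steps; substitute your own secret names, site name, and UID:

```yaml
spec:
  encryptionSecret:
    secretName: management-enc-key            # Step 3.b
  customApplicationCredentials:               # Step 3.d
  - name: atm-cred
    secretName: management-atm-cred
  - name: ccli-cred
    secretName: management-ccli-cred
  # ...remaining credentials exactly as listed in Step 3.d...
  siteName: a2a5e6e2                          # Step 3.e
  originalUID: "fa0f6f49-b931-4472-b84d-0922a9a92dfd"   # Step 3.f (10.0.3.0 or later)
```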
  4. Install the Management subsystem CR with the values obtained in Step 2.d in Preparing the management subsystem for disaster recovery.
    Important: The hostnames of the endpoints cannot be changed, and must remain the same in the Management CR YAML file used for installation now as they were for the original installation.

    To review installation of the management subsystem, see Installing the Management subsystem cluster.

    Once the Management subsystem is installed, you may notice backup job pods and stanza-create job pods in the Error state.

    m1-82b290a2-postgres-stanza-create-4zcgz                    0/1     Error       0          35m
    m1-82b290a2-postgres-full-sch-backup-2g9hm                  0/1     Error       0          20m
    

    This is expected behavior.

    • The stanza-create job normally expects buckets, or subdirectories within buckets, to be empty. However, because the Management subsystem has been configured with a pre-populated bucket (that is, one where the backups already exist), the job goes into the Error state.
    • Any scheduled or manual backups will go into the Error state. Although the Management subsystem has been configured with the already-populated S3 bucket, the new database is not yet configured to write backups into remote storage.
    Important:

    For S3, the recovery remains in an intermediate state until the restore is complete, and Postgres wal files might cause serious disk issues. To avoid this possibility, continue immediately with the next step.

    Note that if you delay completion of the restore:

    • Health check might fail. In this case, you can still proceed to the next step and perform a restore.
    • Postgres wal files might cause problems by consuming all disk space. In this case, you must either:
      • Re-install the system, prepare again for disaster recovery, and perform the restore.
      • Or increase disk space so that the system returns to a stable state, and then proceed with the restore.
  5. Get a list of available backups and confirm that the backup ID noted in Step 1 is in the backup list.

    The API Connect Operator automatically reads backups from the configured remote storage and populates the list of available backups that you can restore from.

    $ kubectl get mgmtb
    NAME                STATUS     ID                 CLUSTER                SUBSYSTEM   TYPE   CR TYPE   AGE
    mgmt-backup-4z87f   Complete   20200606-145011F   m1-82b290a2-postgres   m1          full   record    11m
    mgmt-backup-6bms2   Complete   20200606-144315F   m1-82b290a2-postgres   m1          full   record    11m
    
  6. Update the Management subsystem CR databaseBackup.schedule to what was noted in Step 2.d in Preparing the management subsystem for disaster recovery.
  7. Perform a Management Restore using the name of the backup that has the ID you want to restore. For example, as shown in Step 5, for ID 20200606-145011F the backup name is mgmt-backup-4z87f.

    For more information on restoring the management subsystem, see Restoring the management subsystem (v10.0.1.1 or later).
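
    A Management Restore is typically initiated by creating a restore resource that names the backup. The following is a hedged sketch; verify the apiVersion and kind against the CRDs installed in your cluster before use:

```yaml
apiVersion: management.apiconnect.ibm.com/v1beta1   # confirm against your deployed CRDs
kind: ManagementRestore
metadata:
  name: mgmt-restore                                # illustrative name
spec:
  backupName: mgmt-backup-4z87f                     # backup name from Step 5 for ID 20200606-145011F
```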

  8. Use the following command to check the status of the restore:
    kubectl get mgmtr -n <namespace>

    Once the Management Restore has completed and the database is running again, the data of the old Management subsystem will be successfully restored onto the new Management subsystem. Note:

    • Manual and scheduled backups should perform as normal once again.
    • stanza-create jobs will continue to report the Error state, as stated in Step 4.
  9. Verify that you can log in to the Cloud Manager UI as before, and that the Provider Orgs exist as before.

    The restore from the disaster event is now complete.

What to do next

You should now complete the recovery steps for the Developer Portal subsystem on Kubernetes. See Recovering the Developer Portal after a disaster.