Restoring Infrastructure Automation data

Learn how to restore data for Infrastructure Automation components to a cluster, such as for disaster recovery.

The following procedure restores all backed up data that exists in the specified backup for Infrastructure Automation components. The steps in the following procedure restore Infrastructure Automation in a new cluster.

Before you begin

  • All required storage classes must be created before you run the restore process. The storage classes must have the same names as the storage classes on the backed-up cluster.
  • Custom configuration settings for the Infrastructure Automation - Managed services component, such as the Ansible replica count or extra variables, might not be backed up and restored. If you need this data or these resources in the restored cluster, add them directly to the restored cluster.
  • You can restore a backup only within an environment that has the same version of Infrastructure Automation as the environment where the backup was created. For example, a backup of an Infrastructure Automation 4.8.1 environment must be restored within a cluster that has Infrastructure Automation 4.8.1 installed. If you need to upgrade as well as restore data, complete the restore process before you upgrade.
  • If you are also restoring IBM Cloud Pak for AIOps data, the overall procedure is mostly the same, with the following additional steps required:
    • After you install IBM Cloud Pak for AIOps on the cluster where you are restoring data, you need to install Infrastructure Automation.

    • After you restore your IBM Cloud Pak for AIOps data, you need to restore your Infrastructure Automation data.

      For more information, see Restoring IBM Cloud Pak for AIOps.

Restore procedure

Follow the steps to restore Infrastructure Automation from backup.

  1. Set up your new cluster for backup and restore.
  2. Prepare the backup data for restoring.
  3. Restore the cluster namespaces and install Infrastructure Automation.
  4. Optional. Restore the IBM Cloud Pak for AIOps data.
  5. Restore the Infrastructure Automation data.

If you encounter any issues with the restore process, see Troubleshooting.

1. Set up your new cluster for backup and restore

  1. Install Red Hat OpenShift by using the instructions in the Red Hat OpenShift documentation.

    IBM Cloud Pak for AIOps requires OpenShift to be installed and running. You must have administrative access to your OpenShift cluster.

    Important: Ensure that the version of Red Hat OpenShift Container Platform that you install is the same as the version that was installed in the backed-up cluster.

    For information about the supported versions of OpenShift, see Supported Red Hat OpenShift Container Platform versions.

    Note: Infrastructure Automation uses the OpenShift image registry when it builds images in real time. If the OpenShift image registry is not persistent and the registry respawns, then workloads can temporarily fail until respawning is complete. A persistent OpenShift image registry is recommended to avoid this issue. For more information, see Setting up and configuring the registry in the Red Hat OpenShift Container Platform documentation.

  2. Install the OpenShift command-line interface (oc) on your cluster's boot node and run oc login, using the instructions in Getting started with the Red Hat OpenShift CLI.

  3. Configure storage

    The storage configuration must satisfy your sizing requirements. For more information about the storage classes that are needed for installing IBM Cloud Pak for AIOps, see Storage.

    Important: All required storage classes must be created before you run the restore process. The storage classes must have the same names as the storage classes on the backed-up cluster.

  4. Install the backup and restore tools

    Install the Red Hat OpenShift APIs for Data Protection (OADP) in the Red Hat OpenShift Container Platform cluster. For more information, see Installing the backup and restore tools.

    Important: Ensure that OADP is configured to point to the same object storage (S3 bucket) that contains the backup that you plan to use.

  5. Export the environment variables that you will need for the restore procedure.

    If you are restoring to an online deployment, set the following:

    export PATH=<path>
    export OADP_NAMESPACE=<oadpNamespace>
    

    If you are restoring to an offline deployment, set the following:

    export TARGET_REGISTRY_HOST=<target_registry_host>
    export TARGET_REGISTRY_PORT=<port>
    export TARGET_REGISTRY=$TARGET_REGISTRY_HOST:$TARGET_REGISTRY_PORT
    export TARGET_REGISTRY_USER=<username>
    export TARGET_REGISTRY_PASSWORD=<password>
    export EMAIL=<email>
    export PATH=<path>
    export OADP_NAMESPACE=<oadpNamespace>
    

    Where:

    • <target_registry_host> is the IP address or FQDN of the target registry that holds the backup and restore images, from Offline deployments only: Mirror the backup and restore images.
    • <port> is the port number of the target registry.
    • <username> is the username for the target registry.
    • <password> is the password for the target registry.
    • <email> is the email address for the target registry.
    • <path> is the path to where you downloaded and extracted the IBM Cloud Pak for AIOps backup and restore files.
    • <oadpNamespace> is the namespace where OADP is installed.
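
Before you continue, it can help to confirm that the variables are set in the current shell. The following is a minimal sketch, not part of the product tooling. The function name missing_vars is hypothetical, and it only checks that each named variable is non-empty.

```shell
# Sketch: print the names of any listed variables that are unset or empty.
# Returns 0 when every variable has a value, and 1 otherwise.
# The function body runs in a subshell so that its helper variables do not leak.
missing_vars() (
  missing=""
  for name in "$@"; do
    # Look up the value of the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      missing="${missing:+$missing }$name"
    fi
  done
  if [ -n "$missing" ]; then
    echo "$missing"
    return 1
  fi
)
```

For an online deployment you might run missing_vars OADP_NAMESPACE. For an offline deployment you might also pass TARGET_REGISTRY_HOST, TARGET_REGISTRY_PORT, TARGET_REGISTRY, TARGET_REGISTRY_USER, TARGET_REGISTRY_PASSWORD, and EMAIL. Any name that is printed still needs to be exported.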

2. Prepare the backup data for restoring

Verify your backed up data and prepare the data for restoring.

  1. Check the backup status

    Check the backup status to ensure that the backup that you want to restore in your cluster is complete. Run the following command to check the contents of the backup:

    velero describe backup <backup-name> --details
    

    The output should list the backed-up data for Infrastructure Automation and for IBM Cloud Pak foundational services. If you also backed up IBM Cloud Pak for AIOps data, this data (cp4aiops/*) should also be listed.

  2. Package and install the Helm Chart

    1. Change to the restore directory where you need to package the Helm Chart:

      cd ${PATH}/bcdr/4.8.1/restore
      
    2. Update the following parameters in the values.yaml file. The file is located in the ./helm directory:

      • backupName - The name of the backup that you are restoring.
      • aiopsNamespace - The namespace where IBM Cloud Pak for AIOps is installed.
      • csNamespace - The namespace where IBM Cloud Pak foundational services is installed. In IBM Cloud Pak for AIOps v4.8.1 this is the same as the namespace where IBM Cloud Pak for AIOps is installed.
      • oadpNamespace - The namespace where OADP is installed.
    3. Package the Helm Chart.

      helm package ./helm
      
    4. Install the Helm Chart for restoring data by running the following command:

      helm install restore-job clusterrestore-0.1.0.tgz
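
The two checks in this section can be scripted before you install the chart. The following is a sketch under stated assumptions: the function names are hypothetical, the velero describe backup output is assumed to contain a line that starts with Phase:, and the four parameters are assumed to appear in values.yaml as top-level "key: value" lines.

```shell
# Sketch: read `velero describe backup` output on stdin and print the value
# of the first line whose first field is "Phase:".
backup_phase() {
  awk '$1 == "Phase:" { print $2; exit }'
}

# Sketch: print each required parameter that is missing or has no value in
# the given values file. Returns 1 if any parameter is reported.
# Assumption: parameters are top-level "key: value" lines.
unset_values_keys() (
  file=$1
  status=0
  for key in backupName aiopsNamespace csNamespace oadpNamespace; do
    # Require "key:" followed by at least one character that is not
    # whitespace and does not start a comment.
    if ! grep -Eq "^${key}:[[:space:]]*[^[:space:]#]" "$file"; then
      echo "$key"
      status=1
    fi
  done
  return $status
)
```

For example, velero describe backup <backup-name> --details | backup_phase prints the phase of the backup, and unset_values_keys ./helm/values.yaml lists any of the four parameters that still need a value.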
      

3. Restore the cluster namespaces and install Infrastructure Automation

Because the restore job does not install Infrastructure Automation, you must install Infrastructure Automation before you can run the restore jobs that restore database and component data. A separate restore job restores the cluster namespaces, and you must run it before you install Infrastructure Automation.

  1. Restore the cluster namespaces

    You need to restore the projects (namespaces) of the backed-up cluster so that your new cluster includes the namespace metadata. This metadata contains the SELinux settings, which must match the settings for the backup data that you plan to restore.

    These steps only restore the namespaces and namespace metadata. The commands do not restore the contents of the namespaces.

    Note: In Infrastructure Automation v4.8.1, the IBM Cloud Pak foundational services namespace is the same as the Infrastructure Automation namespace.

    1. Change to the restore directory where the restore script is located:

      cd ${PATH}/bcdr/4.8.1/restore
      
    2. Optional. Delete any existing namespace restore jobs:

      oc delete -f ns-restore-job.yaml
      
    3. Create a job to restore the cluster namespaces:

      oc create -f ns-restore-job.yaml
      
    4. Check the restore job logs by running the following command:

      oc logs -f <ns-restore-job-***>
      
    5. Check the velero-restore status for the namespace by running the following command:

      velero get restore <RESTORE_NAME>
      

      Where <RESTORE_NAME> is the name of the namespace restore.

      You can view the restore name after the restore job is completed. For example, you might see the restore name aiops-namespace-restore-20221006054710 within the restore job log as follows:

      Restore request "aiops-namespace-restore-20221006054710" submitted successfully.
      

      Ensure that the projects (namespaces) are restored before you proceed.

  2. Optional. Install IBM Cloud Pak for AIOps

    If you also need to restore IBM Cloud Pak for AIOps data, you need to install IBM Cloud Pak for AIOps before you install Infrastructure Automation. For more information, see Installing IBM Cloud Pak for AIOps.

    Important: Ensure that the version of IBM Cloud Pak for AIOps that you are installing is the same as the version that was installed in the backed-up cluster.

  3. Install Infrastructure Automation.

    For more information, see Installing Infrastructure Automation.

    Note: When you install Infrastructure Automation, the Infrastructure Automation operator and the IAConfig CR are created.

    • If you install the Managed services component, the Managed Services operator is also installed and the corresponding custom resource (CR) is automatically created.
    • If you install the Infrastructure management component, only the Infrastructure management operator is installed. The corresponding CR is not created. You do not need to create this CR as it is created during the restore process in the following steps.

    Important:

    • Ensure that the version of IBM Cloud Pak for AIOps that you are installing is the same as the version that was installed in the backed-up cluster.
    • The backup includes keys and certificates from the backed-up cluster. Ensure that your new cluster is configured to support the use of these keys and certificates so that the restored data can be accessed.
    • Wait until the installation is complete and all pods in the IBM Cloud Pak for AIOps project (namespace) are running before you proceed with the following restore steps.

4. Optional. Restore the IBM Cloud Pak for AIOps data

If you are also restoring IBM Cloud Pak for AIOps data, run the commands to restore the IBM Cloud Pak for AIOps data before you restore your Infrastructure Automation data. For more information, see Restoring IBM Cloud Pak for AIOps.

5. Restore the Infrastructure Automation data

This procedure restores the data for all backed up Infrastructure Automation databases and components. Optionally, you can choose to restore only individual components. For details, see Restoring individual components.

  1. Change to the restore directory where the restore script is located:

    cd ${PATH}/bcdr/4.8.1/restore
    
  2. Optional. Delete any existing Infrastructure Automation restore jobs:

    oc delete -f ia-restore-job.yaml
    
  3. Create a job to restore Infrastructure Automation:

    oc create -f ia-restore-job.yaml
    
  4. Check the restore job logs by running the following command:

    oc logs -f <ia-restore-job-***>
    
  5. Check the velero-restore status by running the following command:

    velero get restore <RESTORE_NAME>
    

    Where <RESTORE_NAME> is the name of the restore for Infrastructure Automation.

    You can view the restore name after the restore job is completed. For example, you might see the restore name cam-restore-20221006054710 for the Managed services restore in the restore job log as follows:

    Restore request "cam-restore-20221006054710" submitted successfully.
    

    Similarly, the Velero restore names for the other components are displayed in the restore job log.

  6. When the restore is completed and all infrastructure-management pods are in a running state, restart the zen-watcher pod. Run the following command to restart the pod:

    oc delete pod -l app.kubernetes.io/component=zen-watcher -n <aiopsNamespace>
    

    Where <aiopsNamespace> is the namespace where IBM Cloud Pak for AIOps is installed.
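
The restore job logs in this procedure report each Velero restore name in a line of the form Restore request "<name>" submitted successfully. The following sketch extracts those names so that they can be passed to velero get restore. The function name restore_names_from_log is hypothetical.

```shell
# Sketch: read restore job log text on stdin and print every restore name
# that appears in a line of the form:
#   Restore request "<name>" submitted successfully.
restore_names_from_log() {
  sed -n 's/.*Restore request "\([^"]*\)" submitted successfully.*/\1/p'
}
```

For example, oc logs <ia-restore-job-***> | restore_names_from_log prints one restore name per line.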

Restoring individual components

If needed, you can choose to restore data for specific individual databases and components instead of for all databases and components at the same time.

  1. Change to the restore directory where the restore script is located:

    cd ${PATH}/bcdr/4.8.1/restore
    
  2. Copy the ia-restore-job.yaml file. Rename your new file <component>-restore-job.yaml, where <component> is the name of the component that you are restoring.

    For example, if you are restoring Managed services, name the file cam-restore-job.yaml. If you are restoring Infrastructure management, name the file im-restore-job.yaml.

  3. Open the new <component>-restore-job.yaml file for editing. Update the name and command sections to match the values for the component that you are restoring:

    • Update the name of the restore job in the metadata section to be the individual component job, such as cam-restore-job.
    • Update the command section command: ["/bin/bash", "restore.sh","-ia"] to replace -ia with the respective argument for the component that you are restoring. For the list of component arguments, see the table that follows this procedure.
  4. Create a job to restore the individual component.

    oc create -f <component>-restore-job.yaml
    
  5. Check the restore job logs by running the following command:

    oc logs -f <component-restore-job-***>
    
  6. Check the velero-restore status by running the following command:

    velero get restore <RESTORE_NAME>
    

    Where <RESTORE_NAME> is one of the restore names for the component.

    You can view the restore names for the component in the restore job log after the restore job is completed. For example, the restore name for a Managed services restore might be cam-restore-20221006054710, which is displayed in a log entry similar to the following example:

    Restore request "cam-restore-20221006054710" submitted successfully.
    

Component or database arguments for restore job command configuration

Table. Component or database arguments for restore job commands

Component or database                  Argument
-------------------------------------  ----------
Cassandra                              -cassandra
CouchDB                                -couchdb
Elasticsearch                          -es
Metastore                              -metastore
Minio                                  -minio
Postgres                               -postgres
IBM Cloud Pak foundational services    -cs
Managed services                       -cam
Infrastructure management              -im
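
The copy-and-edit steps for an individual component can be sketched as a single text substitution. This sketch assumes that the job name in ia-restore-job.yaml is ia-restore-job and that the command line contains the quoted argument "-ia". Check both against your copy of the file before you rely on it. The function name make_component_job is hypothetical.

```shell
# Sketch: read the content of ia-restore-job.yaml on stdin and write the
# content of a component restore job to stdout.
#   $1 = file name prefix for the component, for example cam
#   $2 = restore.sh argument from the table above, for example -cam
# Assumptions: the source job is named "ia-restore-job" and the command
# section contains the quoted argument "-ia".
make_component_job() {
  sed -e "s/name: ia-restore-job/name: $1-restore-job/" \
      -e "s/\"-ia\"/\"$2\"/"
}
```

For example, make_component_job cam -cam < ia-restore-job.yaml > cam-restore-job.yaml produces a job file for Managed services. Review the generated file before you run oc create.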

Troubleshooting

LDAP user login is not working after a restore

Follow the steps to solve the problem:

  1. Log in to the console as the default admin user.
  2. From the main navigation menu, click Administer > Identity and access.
  3. Select the LDAP connection, and click Edit connection. Edit the LDAP connection with the correct information.
  4. Click Test connection.
  5. Click Save after the connection test succeeds.
  6. Log in to the console with the LDAP user's credentials.

Restore process is stuck in an In progress state

If your restore remains stuck in an In progress state for an unexpectedly long time, complete the following steps. This procedure stops the restore process so that you can try the restore again.

  1. Delete the Velero pod by running the following command:

    oc delete pod <pod> -n <oadpNamespace>
    

    Where <oadpNamespace> is the namespace where OADP is installed, and <pod> is the name of the Velero pod.

  2. Delete the restore that is stuck in the In progress state:

    velero delete restore <restore>
    

    Where <restore> is the name of the restore that you want to delete. The restore process should begin again.

  3. Wait for the process to complete and verify that the restored data is available.
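
To find the value for <pod> in step 1, you can filter the pod list of the OADP namespace. The following sketch assumes that the Velero pod name starts with velero-, which might not hold in every installation. The function name velero_pod_names is hypothetical.

```shell
# Sketch: read `oc get pods -o name` output on stdin (lines such as
# "pod/<name>") and print the names of pods whose name starts with "velero-".
# Assumption: the Velero pod follows the velero-<suffix> naming pattern.
velero_pod_names() {
  sed -n 's|^pod/\(velero-.*\)$|\1|p'
}
```

For example, oc get pods -n <oadpNamespace> -o name | velero_pod_names prints the candidate pod names.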

Helm install restore job command failed

When you run the helm install restore-job clusterrestore-0.1.0.tgz command, the command might fail with an error that is similar to the following error:

Error: admission webhook "trust.hooks.securityenforcement.admission.cloud.ibm.com" denied the request:
Deny "icr.io/cpopen/cp4waiops/cp4aiops-bcdr@sha256:294a42a851a2717ebbc68528ab3c6bcb1ba48114ff058f1c1b537dc6aa167355", no matching repositories in ClusterImagePolicy and no ImagePolicies in the "velero" namespace

If you encounter this error, complete the following steps to resolve the issue:

  1. Uninstall the restore-job job by running the following command:

    helm uninstall restore-job -n $OADP_NAMESPACE
    
  2. Export an environment variable for the image.

    For an online deployment:

    export REGISTRY=icr.io/cpopen/cp4waiops
    export BCDR_IMAGE=${REGISTRY}/<bcdr_image>
    

    For an offline deployment:

    export REGISTRY=$TARGET_REGISTRY
    export BCDR_IMAGE=${REGISTRY}/<bcdr_image>
    

    Where <bcdr_image> is the name of the backup and restore image, as given in the backup helm chart values.yaml file, in the form cp4waiops-bcdr@{digest}. An example value for BCDR_IMAGE is icr.io/cpopen/cp4waiops/cp4waiops-bcdr@sha256:294a42a851a2717ebbc68528ab3c6bcb1ba48114ff058f1c1b537dc6aa167355.

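  3. Create a restore-image-policy.yaml file and add the following content to the file. The oc apply command does not expand shell variables inside the file, so replace ${BCDR_IMAGE} with the value that you exported in the previous step: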

    apiVersion: securityenforcement.admission.cloud.ibm.com/v1beta1
    kind: ClusterImagePolicy
    metadata:
      name: restore-image-policy
    spec:
      repositories:
        - name: ${BCDR_IMAGE}
          policy:
    
  4. Apply the policy by running the following command:

    oc apply -f restore-image-policy.yaml
    
  5. Deploy the restore job by running the following command:

    helm install restore-job clusterrestore-0.1.0.tgz
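
The policy content in step 3 refers to ${BCDR_IMAGE}, but oc apply reads the file as is and does not expand shell variables. One way to produce the file with the exported value filled in is a here-document, as in the following sketch. The function name write_restore_image_policy is hypothetical, and the policy content is copied from step 3 without additions.

```shell
# Sketch: write the ClusterImagePolicy from step 3 to the file named by $1.
# The unquoted here-document delimiter lets the shell expand ${BCDR_IMAGE}
# when the file is written.
write_restore_image_policy() {
  if [ -z "${BCDR_IMAGE:-}" ]; then
    echo "BCDR_IMAGE must be exported first" >&2
    return 1
  fi
  cat > "$1" <<EOF
apiVersion: securityenforcement.admission.cloud.ibm.com/v1beta1
kind: ClusterImagePolicy
metadata:
  name: restore-image-policy
spec:
  repositories:
    - name: ${BCDR_IMAGE}
      policy:
EOF
}
```

For example, run write_restore_image_policy restore-image-policy.yaml and then apply the file as described in step 4.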
    

Restore process terminated mid-process with partial data available

If the restore process does not complete as expected for a data store, for example because it is aborted or terminated during its run, data might not be restored correctly. This incompletely restored data needs to be removed before you run the restore process again. To remove the data, run a post-restore cleanup script to clean up the data store.

To run the script, complete the following steps:

  1. Define the following environment variable on your workstation:

    export WORKDIR="<Path>/bcdr/4.8.1/"
    

    Where <Path> is the path to where you downloaded and extracted the IBM Cloud Pak® for AIOps backup and restore files.

  2. Change to the restore directory where the post-restore cleanup script is located:

    cd <Path>/bcdr/4.8.1/restore/<data_store>/
    

    Where:

    • <Path> is the path to where you downloaded and extracted the IBM Cloud Pak® for AIOps backup and restore files.
    • <data_store> is the directory for the data store or custom resource that needs to be cleaned up. For example, couchdb is the directory for CouchDB.
  3. Run the post-restore cleanup script:

    nohup ./<data_store>-post-restore.sh > <data_store>-post-restore.log &
    

    Where <data_store> is the data store or custom resource that needs to be cleaned up.

  4. Run the restore process for that data store or resource again.

For example, if the restore job for restoring the Cassandra data store aborted, you need to run the cassandra-post-restore.sh post-restore script that is stored in the bcdr/4.8.1/restore/cassandra directory to clean the data. Then, run the cassandra-native-post-restore.sh post-restore script.

The following table lists the cleanup script to run for each data store:

Table. Cleanup script for components

Component or database                  Cleanup script
-------------------------------------  --------------
Cassandra                              Run first: bcdr/4.8.1/restore/cassandra/cassandra-post-restore.sh
                                       Then, run: bcdr/4.8.1/restore/cassandra/cassandra-native-post-restore.sh
CouchDB                                bcdr/4.8.1/restore/couchdb/couchdb-post-restore.sh
Elasticsearch                          bcdr/4.8.1/restore/elasticsearch/es-post-restore.sh
Metastore                              bcdr/4.8.1/restore/metastore/metastore-post-restore.sh
Minio                                  bcdr/4.8.1/restore/minio/minio-post-restore.sh
Postgres                               bcdr/4.8.1/restore/postgres/postgres-post-restore.sh
IBM Cloud Pak foundational services    bcdr/4.8.1/restore/common-services/cs-post-restore.sh
Managed services                       bcdr/4.8.1/restore/cam/cam-post-restore.sh
Infrastructure management              bcdr/4.8.1/restore/infrastructure-management/im-cleanup-restore.sh
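
Because some script names do not match their directory names, a small lookup can help when you script the cleanup. The following sketch encodes the table above. The function name cleanup_scripts is hypothetical, and the lookup keys are the directory names from the table.

```shell
# Sketch: print the cleanup script path or paths for a data store directory
# name, in the order in which the table says to run them. Paths are relative
# to the directory where the backup and restore files were extracted.
cleanup_scripts() (
  base="bcdr/4.8.1/restore"
  case "$1" in
    cassandra)
      echo "$base/cassandra/cassandra-post-restore.sh"
      echo "$base/cassandra/cassandra-native-post-restore.sh" ;;
    couchdb) echo "$base/couchdb/couchdb-post-restore.sh" ;;
    elasticsearch) echo "$base/elasticsearch/es-post-restore.sh" ;;
    metastore) echo "$base/metastore/metastore-post-restore.sh" ;;
    minio) echo "$base/minio/minio-post-restore.sh" ;;
    postgres) echo "$base/postgres/postgres-post-restore.sh" ;;
    common-services) echo "$base/common-services/cs-post-restore.sh" ;;
    cam) echo "$base/cam/cam-post-restore.sh" ;;
    infrastructure-management)
      echo "$base/infrastructure-management/im-cleanup-restore.sh" ;;
    *)
      echo "unknown data store: $1" >&2
      return 1 ;;
  esac
)
```

For example, cleanup_scripts cassandra prints the two Cassandra scripts in the order in which they need to be run.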

Managed services instance deployment fails due to a socket hang up

After you restore a Managed services (cam) backup and Managed services is deployed, the Managed services instance can fail due to a socket hang up issue. If this issue occurs, restart the cam-iaas pod:

oc delete pod <cam-iaas-xxxx> -n <aiopsNamespace>

Where <aiopsNamespace> is the namespace where IBM Cloud Pak for AIOps is installed.

Infrastructure management pods are not running after a restore

Following a restore, you might notice that the Infrastructure management pods (prefixed with "1-") are not running.

For example,

oc get pod |grep postgresql
postgresql-6cff46dcdc-g5cn8                                       0/1     Running     0               33m

This issue can occur when Postgres is not fully initialized. To resolve this issue, if the postgresql pod does not show a ready state (1/1), restart the pod manually. This restart enables the Infrastructure management pods to start.
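
To spot pods in this condition without reading the whole pod list, you can compare the two numbers in the READY column. The following is a minimal sketch. The function name not_ready_pods is hypothetical, and it assumes the default oc get pod column layout (NAME, READY, STATUS).

```shell
# Sketch: read `oc get pod` output on stdin and print the name of every pod
# whose READY column (ready/total) shows fewer ready containers than the
# total. Pods in the Completed status are skipped because they are finished.
not_ready_pods() {
  awk '$2 ~ /^[0-9]+\/[0-9]+$/ && $3 != "Completed" {
         split($2, ready, "/")
         if (ready[1] != ready[2]) print $1
       }'
}
```

For example, oc get pod | not_ready_pods lists pods such as the postgresql pod in the preceding output.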

Infrastructure management URL in the navigation panel points to the cluster where the backup was taken

Following a restore, the Infrastructure management URL in the navigation panel points to the cluster where the backup was taken. To resolve this issue, restart the zen-watcher pod on the restored cluster:

oc delete pod -l app.kubernetes.io/component=zen-watcher -n <aiopsNamespace>

Where <aiopsNamespace> is the namespace where IBM Cloud Pak for AIOps is installed.

Troubleshooting the IBM Cloud Pak for AIOps restore

If you are also restoring IBM Cloud Pak for AIOps data and encounter an issue with that restore process, or find that data is not available or not processed after the restore, see Troubleshooting IBM Cloud Pak for AIOps restore.