Restoring Infrastructure Automation data
Learn how to restore data for Infrastructure Automation components to a cluster, such as for disaster recovery.
The following procedure restores all backed-up data in the specified backup for Infrastructure Automation components into a new cluster.
Before you begin
- All required storage classes must be created before you run the restore process. The storage classes must have the same names as the storage classes in the backed-up cluster.
- Custom configuration settings for the Infrastructure Automation - Managed services component, such as the Ansible replica count and extra variables, might not be backed up and restored. If you need these settings in the restored cluster, add them directly to the restored cluster.
- You can restore a backup only within an environment that has the same version of Infrastructure Automation as the environment where the backup was created. For example, a backup of an Infrastructure Automation 4.8.0 environment must be restored within a cluster that has Infrastructure Automation 4.8.0 installed. If you need to upgrade as well as restore data, complete the restore process before you upgrade.
- If you are also restoring IBM Cloud Pak for AIOps data, the overall procedure is mostly the same, with the following additional steps:
  - After you install IBM Cloud Pak for AIOps on the cluster where you are restoring data, install Infrastructure Automation.
  - After you restore your IBM Cloud Pak for AIOps data, restore your Infrastructure Automation data.
  For more information, see Restoring IBM Cloud Pak for AIOps.
Restore procedure
Follow the steps to restore Infrastructure Automation from backup.
- Set up your new cluster for backup and restore.
- Prepare the backup data for restoring.
- Restore the cluster namespaces and install Infrastructure Automation.
- Optional. Restore IBM Cloud Pak for AIOps data.
- Restore the Infrastructure Automation data.
If you encounter any issues with the restore process, see Troubleshooting.
1. Set up your new cluster for backup and restore
- Install Red Hat OpenShift by using the instructions in the Red Hat OpenShift documentation.
IBM Cloud Pak for AIOps requires OpenShift to be installed and running. You must have administrative access to your OpenShift cluster.
Important: Ensure that the version of Red Hat OpenShift Container Platform that you install is the same as the version that was installed in the backed-up cluster.
For information about the supported versions of OpenShift, see Supported Red Hat OpenShift Container Platform versions.
Note: Infrastructure Automation uses the OpenShift image registry when it builds images in real time. If the OpenShift image registry is not persistent and the registry respawns, workloads can temporarily fail until respawning is complete. To avoid this issue, use a persistent OpenShift image registry. For more information, see Setting up and configuring the registry in the Red Hat OpenShift Container Platform documentation.
- Install the OpenShift command-line interface (oc) on your cluster's boot node and run oc login, using the instructions in Getting started with the Red Hat OpenShift CLI.
- Configure storage
The storage configuration must satisfy your sizing requirements. For more information about the storage classes that are needed for installing IBM Cloud Pak for AIOps, see Storage.
Important: All required storage classes must be created before you run the restore process. The storage classes must have the same names as the storage classes in the backed-up cluster.
- Install the backup and restore tools
Install the Red Hat OpenShift APIs for Data Protection (OADP) in the Red Hat OpenShift Container Platform cluster. For more information, see Installing the backup and restore tools.
Important: Ensure that OADP is configured to point to the same object storage (S3 bucket) that contains the backup that you plan to use.
- Export the environment variables that you need for the restore procedure.
If you are restoring to an online deployment, set the following:
export PATH=<path>
export OADP_NAMESPACE=<oadpNamespace>
If you are restoring to an offline deployment, set the following:
export TARGET_REGISTRY_HOST=<target_registry_host>
export TARGET_REGISTRY_PORT=<port>
export TARGET_REGISTRY=$TARGET_REGISTRY_HOST:$TARGET_REGISTRY_PORT
export TARGET_REGISTRY_USER=<username>
export TARGET_REGISTRY_PASSWORD=<password>
export EMAIL=<email>
export PATH=<path>
export OADP_NAMESPACE=<oadpNamespace>
Where:
<target_registry_host> is the IP address or FQDN of the target registry that holds the backup and restore images that you mirrored in Offline deployments only: Mirror the backup and restore images.
<port> is the port number of the target registry.
<username> is the username for the target registry.
<password> is the password for the target registry.
<email> is the email address for the target registry.
<path> is the path to where you downloaded and extracted the IBM Cloud Pak for AIOps backup and restore files.
<oadpNamespace> is the OADP namespace.
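Before you continue, you can confirm that the variables from the previous step are set. The following is a minimal POSIX shell sketch; require_vars and check_restore_dir are hypothetical helpers, not part of the backup and restore files. require_vars reads the variables with shell built-ins only, because after export PATH=<path> the shell's command search path is replaced and external tools might no longer be found.

```shell
#!/bin/sh
# Hypothetical helpers (not shipped with the bcdr files) that check the
# restore environment before you start.

# Fail if any named variable is unset or empty. Uses only shell built-ins,
# because after "export PATH=<path>" external tools may not be found.
require_vars() {
  missing=0
  for name in "$@"; do
    case "$name" in
      '' | [0-9]* | *[!A-Za-z0-9_]*)
        echo "Not a valid variable name: $name" >&2
        missing=1
        continue
        ;;
    esac
    # Indirect lookup that works in POSIX sh: value=${NAME:-}
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "Missing or empty variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Fail if <base>/bcdr/4.8.0/restore does not exist (base is the <path> value).
check_restore_dir() {
  if [ ! -d "$1/bcdr/4.8.0/restore" ]; then
    echo "Restore directory not found under: $1" >&2
    return 1
  fi
}

# Example usage for an online deployment:
#   require_vars PATH OADP_NAMESPACE && check_restore_dir "$PATH"
# Example usage for an offline deployment:
#   require_vars TARGET_REGISTRY_HOST TARGET_REGISTRY_PORT TARGET_REGISTRY \
#     TARGET_REGISTRY_USER TARGET_REGISTRY_PASSWORD EMAIL PATH OADP_NAMESPACE
```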
2. Prepare the backup data for restoring
Verify your backed up data and prepare the data for restoring.
- Check the backup status
Check the backup status to ensure that the backup that you want to restore in your cluster is complete. Run the following command to check the contents of the backup:
velero describe backup <backup-name> --details
The output should list the backed-up data for Infrastructure Automation and for IBM Cloud Pak foundational services. If you also backed up IBM Cloud Pak for AIOps data, this data (cp4aiops/*) should also be listed.
- Package and install the Helm Chart
  - Change to the restore directory where you package the Helm Chart:
    cd ${PATH}/bcdr/4.8.0/restore
  - Update the following parameters in the values.yaml file, which is located in the ./helm directory:
    - backupName: The name of the backup that you are restoring.
    - aiopsNamespace: The namespace where IBM Cloud Pak for AIOps is installed.
    - csNamespace: The namespace where IBM Cloud Pak foundational services is installed. In IBM Cloud Pak for AIOps v4.8.0, this is the same as the namespace where IBM Cloud Pak for AIOps is installed.
    - oadpNamespace: The namespace where OADP is installed.
  - Package the Helm Chart:
    helm package ./helm
  - Install the Helm Chart for restoring data by running the following command:
    helm install restore-job clusterrestore-0.1.0.tgz
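The backup check and the values.yaml edits in this section can also be scripted. The following POSIX shell sketch is illustrative only: backup_phase assumes that velero describe backup prints a line such as "Phase:  Completed", and set_value assumes that each parameter in ./helm/values.yaml is a top-level "key: value" line. Both function names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers for "Prepare the backup data for restoring".

# Print the backup phase from 'velero describe backup' output on stdin.
# Assumes the output contains a line such as "Phase:  Completed".
backup_phase() {
  awk -F':' '$1 == "Phase" {
    sub(/^[ \t]+/, "", $2)
    sub(/[ \t]+$/, "", $2)
    print $2
    exit
  }'
}

# Set a top-level "key: value" line in a YAML file such as ./helm/values.yaml.
# Assumes the key starts at column 0; nested keys are not handled.
set_value() {
  file=$1 key=$2 value=$3
  if ! grep -q "^${key}:" "$file"; then
    echo "Key not found in $file: $key" >&2
    return 1
  fi
  tmp=$(mktemp) || return 1
  # '|' is the sed delimiter; Kubernetes names cannot contain it.
  sed "s|^${key}:.*|${key}: ${value}|" "$file" > "$tmp" && mv "$tmp" "$file"
}

# Example usage (placeholders as in the procedure):
#   velero describe backup <backup-name> --details | backup_phase
#   set_value ./helm/values.yaml backupName <backup-name>
#   set_value ./helm/values.yaml oadpNamespace "$OADP_NAMESPACE"
```

Checking that the phase is Completed before you package the chart avoids starting a restore from an incomplete backup.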
3. Restore the cluster namespaces and install Infrastructure Automation
The restore jobs do not install Infrastructure Automation, so you must install Infrastructure Automation before you run the jobs that restore database and component data. However, you must run the restore job for the cluster namespaces before you install Infrastructure Automation.
- Restore the cluster namespaces
Restore the projects (namespaces) of the backed-up cluster so that your new cluster includes the namespace metadata, including the SELinux settings, which must match the settings of the backup data that you plan to restore.
These steps restore only the namespaces and namespace metadata. The commands do not restore the contents of the namespaces.
Note: In Infrastructure Automation v4.8.0, the IBM Cloud Pak foundational services namespace is the same as the Infrastructure Automation namespace.
  - Change to the restore directory where the restore script is located:
    cd ${PATH}/bcdr/4.8.0/restore
  - Optional. Delete any existing namespace restore jobs:
    oc delete -f ns-restore-job.yaml
  - Create a job to restore the cluster namespaces:
    oc create -f ns-restore-job.yaml
  - Check the restore job logs by running the following command:
    oc logs -f <ns-restore-job-***>
  - Check the Velero restore status for the namespaces by running the following command:
    velero get restore <RESTORE_NAME>
    Where <RESTORE_NAME> is the name of the namespace restore. You can view the restore name in the restore job log after the job completes. For example, the following log entry shows the restore name aiops-namespace-restore-20221006054710:
    Restore request "aiops-namespace-restore-20221006054710" submitted successfully.
    Ensure that the projects (namespaces) are restored before you proceed.
- Optional. Install IBM Cloud Pak for AIOps
If you also need to restore IBM Cloud Pak for AIOps data, install IBM Cloud Pak for AIOps before you install Infrastructure Automation. For more information, see Installing IBM Cloud Pak for AIOps. For more information about restoring IBM Cloud Pak for AIOps, see Restoring IBM Cloud Pak for AIOps.
Important: Ensure that the version of IBM Cloud Pak for AIOps that you are installing is the same as the version that was installed in the backed-up cluster.
- Install Infrastructure Automation.
For more information, see Installing Infrastructure Automation.
Note: When you install Infrastructure Automation, the Infrastructure Automation operator and the IAConfig CR are created.
- If you install the Managed services component, the Managed services operator is also installed and the corresponding custom resource (CR) is automatically created.
- If you install the Infrastructure management component, only the Infrastructure management operator is installed. The corresponding CR is not created. You do not need to create this CR, because the restore process creates it in the following steps.
Important:
- Ensure that the version of Infrastructure Automation that you are installing is the same as the version that was installed in the backed-up cluster.
- The backup includes keys and certificates from the backed-up cluster. Ensure that your new cluster is configured to support the use of these keys and certificates so that the restored data can be accessed.
- Wait until the installation is complete and all pods in the IBM Cloud Pak for AIOps project (namespace) are running before you proceed with the following restore steps.
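To find the restore names without reading the whole job log, you can filter the log for lines of the form Restore request "NAME" submitted successfully., as in the namespace restore example above, and pass each name to velero get restore. This is a minimal sketch under that assumption; restore_names and show_restores are hypothetical helpers, and the same filter applies to the Infrastructure Automation restore job log in a later step.

```shell
#!/bin/sh
# Hypothetical helpers: list the Velero restore names that a restore job
# reports in its log (read on stdin) and show the status of each one.

# Print NAME for every log line of the form:
#   Restore request "NAME" submitted successfully.
restore_names() {
  sed -n 's/.*Restore request "\([^"]*\)" submitted successfully.*/\1/p'
}

# Run 'velero get restore <name>' for every restore name in the log.
show_restores() {
  restore_names | while IFS= read -r name; do
    velero get restore "$name"
  done
}

# Example usage (the pod names are placeholders, as in the procedure):
#   oc logs <ns-restore-job-***> | show_restores
#   oc logs <ia-restore-job-***> | show_restores
```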
4. Optional. Restore the IBM Cloud Pak for AIOps data
If you are also restoring IBM Cloud Pak for AIOps data, run the commands to restore the IBM Cloud Pak for AIOps data before you restore your Infrastructure Automation data. For more information, see Restoring IBM Cloud Pak for AIOps.
5. Restore the Infrastructure Automation data
This procedure restores the data for all backed up Infrastructure Automation databases and components. Optionally, you can choose to restore only individual components. For details, see Restoring individual components.
- Change to the restore directory where the restore script is located:
  cd ${PATH}/bcdr/4.8.0/restore
- Optional. Delete any existing Infrastructure Automation restore jobs:
  oc delete -f ia-restore-job.yaml
- Create a job to restore Infrastructure Automation:
  oc create -f ia-restore-job.yaml
- Check the restore job logs by running the following command:
  oc logs -f <ia-restore-job-***>
- Check the Velero restore status by running the following command:
  velero get restore <RESTORE_NAME>
  Where <RESTORE_NAME> is the name of the restore for Infrastructure Automation. You can view the restore name in the restore job log after the job completes. For example, the following log entry shows the restore name cam-restore-20221006054710 for the Managed services restore:
  Restore request "cam-restore-20221006054710" submitted successfully.
  The restore job log similarly shows the Velero restore names for the other components.
- When the restore is complete and all infrastructure-management pods are running, restart the zen-watcher pod by running the following command:
  oc delete pod -l app.kubernetes.io/component=zen-watcher -n <aiopsNamespace>
  Where <aiopsNamespace> is the namespace where IBM Cloud Pak for AIOps is installed.
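Before you restart the zen-watcher pod, you can poll until the relevant pods are ready. The following sketch reads oc get pods --no-headers output and assumes the default NAME READY STATUS RESTARTS AGE column order. The function names and the pod-name filter in the usage comment are hypothetical, so adjust the filter to match the infrastructure-management pods in your cluster.

```shell
#!/bin/sh
# Hypothetical helpers: wait until pods are Running and fully ready before
# restarting the zen-watcher pod.

# Read 'oc get pods --no-headers' lines on stdin (columns: NAME READY STATUS
# RESTARTS AGE). Succeed only if there is at least one pod and every pod is
# Running with READY x/y where x = y.
all_pods_ready() {
  awk '
    {
      n++
      split($2, r, "/")
      if ($3 != "Running" || r[1] != r[2]) {
        print "Not ready: " $1 > "/dev/stderr"
        bad = 1
      }
    }
    END { if (n == 0 || bad) exit 1; exit 0 }
  '
}

# Re-run a pod-listing command until all_pods_ready succeeds.
# Usage: wait_for_pods <attempts> <seconds-between-attempts> <command> [args...]
wait_for_pods() {
  attempts=$1 delay=$2
  shift 2
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if "$@" | all_pods_ready 2>/dev/null; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Example usage; '<pod-name-pattern>' is a placeholder for your pod names:
#   wait_for_pods 60 10 sh -c \
#     "oc get pods -n <aiopsNamespace> --no-headers | grep '<pod-name-pattern>'" &&
#   oc delete pod -l app.kubernetes.io/component=zen-watcher -n <aiopsNamespace>
```

A pod that shows 0/1 with status Running, such as the postgresql pod in the Troubleshooting section, counts as not ready here.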
Restoring individual components
If needed, you can choose to restore data for specific individual databases and components instead of for all databases and components at the same time.
- Change to the restore directory where the restore script is located:
  cd ${PATH}/bcdr/4.8.0/restore
- Copy the ia-restore-job.yaml file and name the new file <component>-restore-job.yaml, where <component> is the name of the component that you are restoring. For example, if you are restoring Managed services, name the file cam-restore-job.yaml. If you are restoring Infrastructure management, name the file im-restore-job.yaml.
- Open the new <component>-restore-job.yaml file for editing. Update the name and command sections to match the values for the component that you are restoring:
  - Update the name of the restore job in the metadata section to be the individual component job, such as cam-restore-job.
  - Update the command section command: ["/bin/bash", "restore.sh","-ia"] to replace -ia with the respective argument for the component that you are restoring. For the list of component arguments, see the table that follows this procedure.
- Create a job to restore the individual component:
  oc create -f <component>-restore-job.yaml
- Check the restore job logs by running the following command:
  oc logs -f <component-restore-job-***>
- Check the Velero restore status by running the following command:
  velero get restore <RESTORE_NAME>
  Where <RESTORE_NAME> is one of the restore names for the component. You can view the restore names for the component in the restore job log after the job completes. For example, the restore name for a Managed services restore can be cam-restore-20221006054710, which appears in a log entry similar to the following:
  Restore request "cam-restore-20221006054710" submitted successfully.
Component or database arguments for restore job command configuration
Component or Database | Argument |
---|---|
Cassandra | -cassandra |
CouchDB | -couchdb |
Elasticsearch | -es |
Metastore | -metastore |
Minio | -minio |
Postgres | -postgres |
IBM Cloud Pak foundational services | -cs |
Managed services | -cam |
Infrastructure management | -im |
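The copy-and-edit steps in the preceding procedure can be sketched with sed, using the arguments from the table above. This sketch assumes that ia-restore-job.yaml contains the metadata line name: ia-restore-job (a hypothetical job name; check your copy of the file) and the command line exactly as shown in the procedure. The function names and the short component keys (for example, es and cs) are illustrative.

```shell
#!/bin/sh
# Hypothetical helpers: create <component>-restore-job.yaml from
# ia-restore-job.yaml in the current directory.
# Assumptions: the Job metadata contains "name: ia-restore-job", and the
# command is: command: ["/bin/bash", "restore.sh","-ia"]

# Map a component or database key to its restore.sh argument (see the table).
restore_arg() {
  case "$1" in
    cassandra) printf '%s\n' -cassandra ;;
    couchdb) printf '%s\n' -couchdb ;;
    es) printf '%s\n' -es ;;
    metastore) printf '%s\n' -metastore ;;
    minio) printf '%s\n' -minio ;;
    postgres) printf '%s\n' -postgres ;;
    cs) printf '%s\n' -cs ;;
    cam) printf '%s\n' -cam ;;
    im) printf '%s\n' -im ;;
    *) echo "Unknown component: $1" >&2; return 1 ;;
  esac
}

make_component_job() {
  component=$1
  arg=$(restore_arg "$component") || return 1
  out="${component}-restore-job.yaml"
  # Rename the job and swap the -ia argument for the component argument.
  sed -e "s|name: ia-restore-job|name: ${component}-restore-job|" \
      -e "s|\"-ia\"]|\"${arg}\"]|" \
      ia-restore-job.yaml > "$out" || return 1
  # Confirm that both edits were applied; otherwise the file did not match.
  if ! grep -q "name: ${component}-restore-job" "$out" ||
     ! grep -qF "\"${arg}\"]" "$out"; then
    echo "ia-restore-job.yaml did not match the expected layout" >&2
    return 1
  fi
}

# Example usage:
#   make_component_job cam && oc create -f cam-restore-job.yaml
```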
Troubleshooting
- LDAP user login is not working after a restore
- Restore process is stuck in an In progress state
- Helm install restore job command failed
- Restore process terminated mid-process with partial data available
- Managed services instance deployment fails due to a socket hang up
- Infrastructure management pods are not running after a restore
- Infrastructure management URL in the navigation panel points to the cluster where the backup was taken
- Troubleshooting the IBM Cloud Pak for AIOps restore
LDAP user login is not working after a restore
Follow the steps to solve the problem:
- Log in to the console as the default admin user.
- From the main navigation menu, click Administer > Identity and access.
- Select the LDAP connection, and click Edit connection. Edit the LDAP connection with the correct information.
- Click Test connection.
- When the connection test succeeds, click Save.
- Log in to the console with the LDAP user's credentials.
Restore process is stuck in an In progress state
If a restore remains stuck in an In progress state for an unexpectedly long time, complete the following steps. This procedure stops the restore process so that you can try the restore again.
- Delete the Velero pod by running the following command:
  oc delete pod <pod> -n <oadpNamespace>
  Where <oadpNamespace> is the namespace where OADP is installed, and <pod> is the name of the Velero pod.
- Delete the restore that is stuck in the In progress state:
  velero delete restore <restore>
  Where <restore> is the restore that you want to delete. The process should begin again.
- Wait for the process to complete and verify that the restored data is available.
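To identify a restore that is stuck, you can filter the velero get restores table for the InProgress status. A minimal sketch, assuming the default table layout in which NAME, BACKUP, and STATUS are the first three columns and the first line is a header; in_progress_restores is a hypothetical helper.

```shell
#!/bin/sh
# Hypothetical helper: print the names of restores whose STATUS is InProgress
# in 'velero get restores' output read on stdin. Assumes the first three
# columns are NAME, BACKUP, and STATUS, and that the first line is a header.
in_progress_restores() {
  awk 'NR > 1 && $3 == "InProgress" { print $1 }'
}

# Example usage:
#   velero get restores | in_progress_restores
#   velero delete restore <restore>
```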
Helm install restore job command failed
When you run the helm install restore-job clusterrestore-0.1.0.tgz command, the command might fail with an error similar to the following:
Error: admission webhook "trust.hooks.securityenforcement.admission.cloud.ibm.com" denied the request:
Deny "icr.io/cpopen/cp4waiops/cp4aiops-bcdr@sha256:294a42a851a2717ebbc68528ab3c6bcb1ba48114ff058f1c1b537dc6aa167355", no matching repositories in ClusterImagePolicy and no ImagePolicies in the "velero" namespace
If you encounter this error, complete the following steps to resolve the issue:
- Uninstall the restore-job Helm release by running the following command:
  helm uninstall restore-job -n $OADP_NAMESPACE
- Export an environment variable for the image.
  For an online deployment:
  export REGISTRY=icr.io/cpopen/cp4waiops
  export BCDR_IMAGE=${REGISTRY}/<bcdr_image>
  For an offline deployment:
  export REGISTRY=$TARGET_REGISTRY
  export BCDR_IMAGE=${REGISTRY}/<bcdr_image>
  Where <bcdr_image> is the name of the backup and restore image, as given in the backup Helm chart values.yaml file, in the form cp4waiops-bcdr@{digest}. An example value for BCDR_IMAGE is icr.io/cpopen/cp4waiops/cp4waiops-bcdr@sha256:294a42a851a2717ebbc68528ab3c6bcb1ba48114ff058f1c1b537dc6aa167355.
- Create a restore-image-policy.yaml file with the following content, replacing ${BCDR_IMAGE} with the value of the BCDR_IMAGE environment variable:
apiVersion: securityenforcement.admission.cloud.ibm.com/v1beta1
kind: ClusterImagePolicy
metadata:
  name: restore-image-policy
spec:
  repositories:
    - name: ${BCDR_IMAGE}
      policy:
- Apply the policy by running the following command:
  oc apply -f restore-image-policy.yaml
- Deploy the restore job by running the following command:
  helm install restore-job clusterrestore-0.1.0.tgz
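oc apply does not expand shell variables such as ${BCDR_IMAGE} inside a file, so the image value must already be in restore-image-policy.yaml when you apply it. One way to do that is to let the shell write the file with a here-document, as in the following sketch. write_image_policy is a hypothetical helper, and the YAML content is the same as in the procedure above.

```shell
#!/bin/sh
# Hypothetical helper: write restore-image-policy.yaml with the image value
# already expanded, because oc apply does not substitute ${...} in files.
write_image_policy() {
  image=$1
  file=${2:-restore-image-policy.yaml}
  if [ -z "$image" ]; then
    echo "usage: write_image_policy <image> [file]" >&2
    return 1
  fi
  # Unquoted EOF: the shell expands ${image} while it writes the file.
  cat > "$file" <<EOF
apiVersion: securityenforcement.admission.cloud.ibm.com/v1beta1
kind: ClusterImagePolicy
metadata:
  name: restore-image-policy
spec:
  repositories:
    - name: ${image}
      policy:
EOF
}

# Example usage:
#   write_image_policy "$BCDR_IMAGE"
#   oc apply -f restore-image-policy.yaml
```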
Restore process terminated mid-process with partial data available
If the restore process for a data store does not complete as expected, for example because it is aborted or terminated during its run, data might not be restored correctly. You must remove this incompletely restored data before you run the restore process again. To remove the data, run the post-restore cleanup script for the data store.
To run the script, complete the following steps:
- Define the following environment variable on your workstation:
  export WORKDIR="<Path>/bcdr/4.8.0/"
  Where <Path> is the path to where you downloaded and extracted the IBM Cloud Pak® for AIOps backup and restore files.
- Change to the directory where the post-restore cleanup script is located:
  cd <Path>/bcdr/4.8.0/restore/<data_store>/
  Where <Path> is the path to where you downloaded and extracted the IBM Cloud Pak® for AIOps backup and restore files, and <data_store> is the directory for the data store or custom resource that needs to be cleaned up. For example, couchdb is the directory for CouchDB.
- Run the post-restore cleanup script:
  nohup ./<data_store>-post-restore.sh > <data_store>-post-restore.log &
  Where <data_store> is the data store or custom resource that needs to be cleaned up. Some script names differ from this pattern; for the exact script name for each data store, see the table that follows.
- Run the restore process for that data store or resource again.
For example, if the restore job for restoring the Cassandra data store aborted, run the cassandra-post-restore.sh post-restore script that is stored in the bcdr/4.8.0/restore/cassandra directory to clean the data. Then, run the cassandra-native-post-restore.sh post-restore script.
The following table lists the cleanup script to run for each data store:
Component or Database | Cleanup script |
---|---|
Cassandra | Run first: bcdr/4.8.0/restore/cassandra/cassandra-post-restore.sh. Then, run: bcdr/4.8.0/restore/cassandra/cassandra-native-post-restore.sh |
CouchDB | bcdr/4.8.0/restore/couchdb/couchdb-post-restore.sh |
Elasticsearch | bcdr/4.8.0/restore/elasticsearch/es-post-restore.sh |
Metastore | bcdr/4.8.0/restore/metastore/metastore-post-restore.sh |
Minio | bcdr/4.8.0/restore/minio/minio-post-restore.sh |
Postgres | bcdr/4.8.0/restore/postgres/postgres-post-restore.sh |
IBM Cloud Pak foundational services | bcdr/4.8.0/restore/common-services/cs-post-restore.sh |
Managed services | bcdr/4.8.0/restore/cam/cam-post-restore.sh |
Infrastructure Management | bcdr/4.8.0/restore/infrastructure-management/im-cleanup-restore.sh |
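The table can also be expressed as a small lookup so that the scripts run in the documented order. In this sketch, cleanup_scripts and run_cleanup are hypothetical helpers, and the data store keys are the directory names from the table. As a design choice, run_cleanup runs each script in the foreground instead of with nohup ... &, so that the second Cassandra script starts only after the first one finishes.

```shell
#!/bin/sh
# Hypothetical helpers: look up and run the post-restore cleanup script(s)
# for a data store, using the directory names and order from the table.

# Print the script path(s), relative to <Path>, in the order to run them.
cleanup_scripts() {
  base=bcdr/4.8.0/restore
  case "$1" in
    cassandra)
      printf '%s\n' "$base/cassandra/cassandra-post-restore.sh" \
        "$base/cassandra/cassandra-native-post-restore.sh" ;;
    couchdb) printf '%s\n' "$base/couchdb/couchdb-post-restore.sh" ;;
    elasticsearch) printf '%s\n' "$base/elasticsearch/es-post-restore.sh" ;;
    metastore) printf '%s\n' "$base/metastore/metastore-post-restore.sh" ;;
    minio) printf '%s\n' "$base/minio/minio-post-restore.sh" ;;
    postgres) printf '%s\n' "$base/postgres/postgres-post-restore.sh" ;;
    common-services) printf '%s\n' "$base/common-services/cs-post-restore.sh" ;;
    cam) printf '%s\n' "$base/cam/cam-post-restore.sh" ;;
    infrastructure-management)
      printf '%s\n' "$base/infrastructure-management/im-cleanup-restore.sh" ;;
    *) echo "Unknown data store: $1" >&2; return 1 ;;
  esac
}

# Run each script from its own directory, one after another, writing
# <script>.log next to it. Usage: run_cleanup <Path> <data_store>
run_cleanup() {
  root=$1 store=$2
  scripts=$(cleanup_scripts "$store") || return 1
  for s in $scripts; do
    dir=$(dirname "$s")
    name=$(basename "$s" .sh)
    (cd "$root/$dir" && nohup "./$name.sh" > "$name.log" 2>&1) || return 1
  done
}

# Example usage:
#   run_cleanup <Path> cassandra
```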
Managed services instance deployment fails due to a socket hang up
After you restore a Managed services (cam) backup and Managed services is deployed, the Managed services instance deployment can fail due to a socket hang up issue. If this issue occurs, restart the cam-iaas pod:
oc delete pod <cam-iaas-xxxx> -n <aiopsNamespace>
Where <aiopsNamespace> is the namespace where IBM Cloud Pak for AIOps is installed.
Infrastructure management pods are not running after a restore
Following a restore, you might notice that the Infrastructure Management pods (prefixed with "1-") are not running.
For example:
oc get pod | grep postgresql
postgresql-6cff46dcdc-g5cn8 0/1 Running 0 33m
This issue can occur when Postgres is not fully initialized. To resolve this issue, if the postgresql pod does not show a ready state (1/1), restart the pod manually. This restart enables the Infrastructure Management pods to start.
Infrastructure management URL in the navigation panel points to the cluster where the backup was taken
Following a restore, the Infrastructure Management URL in the navigation panel points to the cluster where the backup was taken. To resolve this issue, restart the zen-watcher pod on the restored cluster:
oc delete pod -l app.kubernetes.io/component=zen-watcher -n <aiopsNamespace>
Where <aiopsNamespace> is the namespace where IBM Cloud Pak for AIOps is installed.
Troubleshooting the IBM Cloud Pak for AIOps restore
If you are also restoring IBM Cloud Pak for AIOps data and encounter an issue with the restore process for IBM Cloud Pak for AIOps or encounter an issue with data not being available or processed after the restore, see Troubleshooting IBM Cloud Pak for AIOps restore.