Restoring IBM Cloud Pak for AIOps data

Learn how to restore data for IBM Cloud Pak for AIOps components to a cluster, such as for disaster recovery.

The following procedure restores all of the backed-up data that exists in the specified backup for IBM Cloud Pak for AIOps components, and restores IBM Cloud Pak for AIOps into a new cluster.

Before you begin

  • All required storage classes must be created before you run the restore process. The storage classes must have the same names as the storage classes on the backup cluster.
  • You can restore a backup only within an environment that has the same version of IBM Cloud Pak for AIOps as the environment where the backup was created. For example, a backup of an IBM Cloud Pak for AIOps 4.9.0 environment must be restored within a cluster that has IBM Cloud Pak for AIOps 4.9.0 installed. If you need to upgrade as well as restore data, complete the restore process before you upgrade.
  • If you are also restoring Infrastructure Automation data, the overall procedure is the same with the following additional steps required:
    • After you install IBM Cloud Pak for AIOps on the cluster where you are restoring data, you need to install Infrastructure Automation.
    • After you restore your IBM Cloud Pak for AIOps data, you need to restore your Infrastructure Automation data. For more information, see Restoring Infrastructure Automation.

Restore procedure

Follow the steps to restore IBM Cloud Pak for AIOps from backup.

  1. Set up your new cluster for restore
  2. Prepare the backup data for restoring
  3. Restore the cluster namespaces and install IBM Cloud Pak for AIOps
  4. Restore the IBM Cloud Pak for AIOps data
  5. (Optional) Restore Infrastructure Automation data
  6. Post-restore tasks

If you encounter any issues with the restore process, see Troubleshooting.

1. Set up your new cluster for restore

  1. Install Red Hat OpenShift by using the instructions in the Red Hat OpenShift documentation.

    IBM Cloud Pak for AIOps requires OpenShift to be installed and running. You must have administrative access to your OpenShift cluster.

    Important: Ensure that the version of Red Hat OpenShift Container Platform that you install is the same as the version that was installed in the backed up cluster.

    For information on supported versions of OpenShift, see Supported Red Hat OpenShift Container Platform versions.

    Note: IBM Cloud Pak for AIOps uses the OpenShift image registry when it builds images in real time. If the OpenShift image registry is not persistent and the registry respawns, then workloads can temporarily fail until respawning is complete. A persistent OpenShift image registry is recommended to avoid this. For more information, see Setting up and configuring the registry in the Red Hat OpenShift Container Platform documentation.

  2. Install the OpenShift command line interface (oc) on your cluster's boot node and run oc login, using the instructions in Getting started with the OpenShift CLI.

  3. Configure storage

    The storage configuration must satisfy your sizing requirements. For more information on the storage classes that are needed for installing IBM Cloud Pak for AIOps, see Storage.

    Important: All required storage classes must be created before you run the restore process. The storage classes must have the same names as the storage classes on the backup cluster.

  4. Install the backup and restore tools

    Install the Red Hat OpenShift APIs for Data Protection (OADP) in the Red Hat OpenShift Container Platform cluster. For more information, see Installing the backup and restore tools.

    Important: Ensure that the OpenShift APIs for Data Protection (OADP) is configured to point to the same object storage (S3 bucket) that includes the backup that you plan to use.

  5. Export the environment variables that you will need for the restore procedure.

    If you are restoring to an online deployment, set the following (a hypothetical example follows these steps):

    export CP4AIOPS_PATH=<path>
    export OADP_NAMESPACE=<oadpNamespace>
    export AIOPS_NAMESPACE=<aiops_namespace>
    

    If you are restoring to an offline deployment, set the following:

    export TARGET_REGISTRY_HOST=<target_registry_host>
    export TARGET_REGISTRY_PORT=<port>
    export TARGET_REGISTRY=$TARGET_REGISTRY_HOST:$TARGET_REGISTRY_PORT
    export TARGET_REGISTRY_USER=<username>
    export TARGET_REGISTRY_PASSWORD=<password>
    export EMAIL=<email>
    export CP4AIOPS_PATH=<path>
    export OADP_NAMESPACE=<oadpNamespace>
    export AIOPS_NAMESPACE=<aiops_namespace>
    

    Where:

    • <target_registry_host> is the IP address or FQDN of the target registry that holds the backup and restore images, as described in Offline deployments only: Mirror the backup and restore images
    • <port> is the port number of the target registry
    • <username> is the username for the target registry
    • <password> is the password for the target registry
    • <email> is the email for the target registry
    • <path> is the path to where you downloaded and extracted the IBM Cloud Pak for AIOps backup and restore files. Export this value as CP4AIOPS_PATH, not as PATH, because overriding PATH would break your shell's command search path and the oc, helm, and velero commands.
    • <oadpNamespace> is the OADP namespace
    • <aiops_namespace> is the namespace for the IBM Cloud Pak for AIOps deployment.
  6. Run the following command to uninstall any Helm charts that were previously used for the restore of another instance or version of IBM Cloud Pak for AIOps:

    helm uninstall restore-job -n ${OADP_NAMESPACE}
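
    For reference, the following is a hypothetical example of the online-deployment variables from step 5, followed by a check that no leftover restore-job release remains. All values shown are placeholders, not defaults:

    export CP4AIOPS_PATH=/opt/cp4aiops-bcdr    # hypothetical extraction path
    export OADP_NAMESPACE=openshift-adp        # hypothetical OADP project
    export AIOPS_NAMESPACE=cp4aiops            # hypothetical AIOps project
    helm list -n ${OADP_NAMESPACE} | grep restore-job || echo "no leftover restore-job release"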
    

2. Prepare the backup data for restoring

Verify your backed up data and prepare the data for restoring.

  1. Check the backup status

    Check the backup status to ensure that the backup that you want to restore in your cluster is complete. Run the following command to check the contents of the backup (a quick overall status check is also sketched at the end of this section):

    velero describe backup <backup-name> --details
    

    The output should list the backed-up data for IBM Cloud Pak for AIOps (cp4aiops/*) and for IBM Cloud Pak foundational services. If you also backed up Infrastructure Automation data, this data (cp4aiops/*) should also be listed.

    The output should resemble the following sample output:

    v1/PersistentVolumeClaim:
      - cp4aiops/back-aiops-topology-cassandra-0
      - cp4aiops/data-c-example-couchdbcluster-m-0
      - cp4aiops/export-aimanager-ibm-minio-0
      - cp4aiops/aiops-ibm-elasticsearch-es-server-snap
      - cp4aiops/postgres-backup-data
      - cp4aiops/metastore-backup-data
    v1/Pod:
      - cp4aiops/backup-back-aiops-topology-cassandra-0
      - cp4aiops/backup-data-c-example-couchdbcluster-m-0
      - cp4aiops/backup-export-aimanager-ibm-minio-0
      - cp4aiops/backup-metastore
      - cp4aiops/es-backup
      - cp4aiops/backup-postgres
      - cp4aiops/dummy-db
    v1/Secret:
      - cp4aiops/aimanager-ibm-minio-access-secret
      - cp4aiops/aiops-ir-core-model-secret
      - cp4aiops/icp-serviceid-apikey-secret
    
    Velero-Native Snapshots: <none included>
    
    Restic Backups:
     Completed:
      cp4aiops/backup-back-aiops-topology-cassandra-0: backup
      cp4aiops/backup-data-c-example-couchdbcluster-m-0: backup
      cp4aiops/backup-export-aimanager-ibm-minio-0: backup
      cp4aiops/backup-metastore: data
      cp4aiops/es-backup: elasticsearch-backups
      cp4aiops/backup-postgres: backup
    

  2. Package and install the Helm Chart

    1. Change to the restore directory where you need to package the Helm Chart:

      cd ${CP4AIOPS_PATH}/bcdr/4.9.0/restore
      
    2. Update the following parameters in the values.yaml file. The file is located in the ./helm directory:

      • backupName - The name of the backup that you are restoring.
      • aiopsNamespace - The namespace where IBM Cloud Pak for AIOps is installed.
      • csNamespace - The namespace where IBM Cloud Pak foundational services is installed. In IBM Cloud Pak for AIOps v4.9.0 this is the same as the namespace where IBM Cloud Pak for AIOps is installed.
      • oadpNamespace - The namespace where OADP is installed.
    3. Package the Helm Chart.

      helm package ./helm
      
    4. Install the Helm Chart for restoring data by running the following job:

      helm install restore-job clusterrestore-0.1.0.tgz
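
Before you continue, you can confirm that the backup that you specified in values.yaml completed successfully. A minimal sketch, where <backupName> is the backup name that you configured:

velero get backup <backupName>

The STATUS column should show Completed before you proceed to the restore jobs.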
      

3. Restore the cluster namespaces and install IBM Cloud Pak for AIOps

The restore job does not install IBM Cloud Pak for AIOps itself, so you must install IBM Cloud Pak for AIOps before you can run the restore jobs that restore database and component data. A separate restore job restores the cluster namespaces, and it must be run before you install IBM Cloud Pak for AIOps.

  1. Restore the cluster namespaces

    You need to restore the projects (namespaces) of the backed-up cluster so that your new cluster includes the namespace metadata, including the SELinux settings, which must match the settings of the backup data that you plan to restore.

    These steps only restore the namespaces and namespace metadata. The commands do not restore the contents of the namespaces.

    Note: In IBM Cloud Pak for AIOps v4.9.0, the IBM Cloud Pak foundational services namespace is the same as the IBM Cloud Pak for AIOps namespace.

    1. Change to the restore directory where the restore script is located:

      cd ${CP4AIOPS_PATH}/bcdr/4.9.0/restore
      
    2. Optional. Delete any existing namespace restore jobs:

      oc delete -f ns-restore-job.yaml
      
    3. Create a job to restore the cluster namespaces:

      oc create -f ns-restore-job.yaml
      
    4. Check the restore job logs by running the following command:

      oc logs -f <ns-restore-job-***>
      
    5. Check the velero-restore status for the namespace by running the following command:

      velero get restore <RESTORE_NAME>
      

      Where <RESTORE_NAME> is the name of the namespace restore. You can see the restore name after the restore job completes. For example, you might see the restore name aiops-namespace-restore-20221006054710 in the restore job log as follows:

      Restore request "aiops-namespace-restore-20221006054710" submitted successfully.
      

      Ensure that the projects (namespaces) are restored before you proceed.

  2. Create a network policy.

    For more information about creating a NetworkPolicy, see Creating a network policy.

    1. Create a network policy file called policy-bcdr.yaml with the following contents:

      apiVersion: networking.k8s.io/v1
      kind: NetworkPolicy
      metadata:
        name: bcdr-np
        namespace: <aiopsNamespace>
      spec:
        podSelector:
          matchLabels:
            ibm-es-server: aiops-ibm-elasticsearch-es-server
        ingress:
          - from:
              - namespaceSelector:
                  matchLabels:
                    kubernetes.io/metadata.name: <oadpNamespace>
        policyTypes:
          - Ingress
      

      Where:

      • <aiopsNamespace> is the value of $AIOPS_NAMESPACE
      • <oadpNamespace> is the value of $OADP_NAMESPACE
    2. Apply the network policy to your cluster to allow the required ingress traffic.

      oc apply -f policy-bcdr.yaml -n ${AIOPS_NAMESPACE}
      
  3. Install IBM Cloud Pak for AIOps

    For more information, see Installing IBM Cloud Pak for AIOps.

    Important:

    • Ensure that the version of IBM Cloud Pak for AIOps that you are installing is the same as the version that was installed in the backed up cluster.
    • The backup includes keys and certificates from the backed up cluster. Ensure that your new cluster is configured to support the use of these keys and certificates so that the restored data can be accessed.
    • Wait until the installation is complete and all pods in the IBM Cloud Pak for AIOps project (namespace) are running before you proceed (a quick check is sketched after these steps).
  4. Optional. If you also need to restore Infrastructure Automation data, you need to install the Infrastructure Automation operators and create the required custom resources.
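
Before you proceed to restore data, you can roughly verify that the installation has settled. A minimal sketch that lists any pods that are not yet Running or Completed:

oc get pods -n ${AIOPS_NAMESPACE} --no-headers | grep -Ev 'Running|Completed' || echo "all pods Running or Completed"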

4. Restore the IBM Cloud Pak for AIOps data

This procedure restores the data for all backed-up IBM Cloud Pak for AIOps databases and components. Optionally, you can choose to restore only individual components. For details, see Restoring individual components.

  1. Change to the restore directory where the restore script is located:

    cd ${CP4AIOPS_PATH}/bcdr/4.9.0/restore
    
  2. Optional. Delete any existing IBM Cloud Pak for AIOps restore jobs:

    oc delete -f aiops-restore-job.yaml
    
  3. Create a job to restore IBM Cloud Pak for AIOps:

    oc create -f aiops-restore-job.yaml
    
  4. Check the restore job logs by running the following command:

    oc logs -f <aiops-restore-job-***>
    
  5. Check the velero-restore status by running the following command:

    velero get restore <RESTORE_NAME>
    

    Where <RESTORE_NAME> is the name of the restore for IBM Cloud Pak for AIOps. You can see this restore name after the restore job completes. For example, you might see the restore name cassandra-restore-20221006054710 for the Cassandra restore in the restore job log as follows:

    Restore request "cassandra-restore-20221006054710" submitted successfully.
    

    Similarly, the OADP restore names for other IBM Cloud Pak for AIOps components can be displayed in the restore job log.

    Note: You might notice that the aiops-configmaps resource group shows as failed under the Restore sequence section of the IBM Cloud Pak for AIOps restore recipe but the overall restore is successful and works as intended.
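
To track the component restores, you can list all Velero restores and watch for each one to reach a Completed status; a minimal sketch:

velero get restore

Each component restore, such as cassandra-restore-20221006054710, should show a STATUS of Completed.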

5. (Optional) Restore the Infrastructure Automation data

If you are also restoring Infrastructure Automation data, run the commands to restore the Infrastructure Automation data. For more information, see Restoring Infrastructure Automation.

6. Post-restore tasks

  1. Update your integrations.

    If you restored any integrations, their status can be in error after the restore process completes. To resolve this status, edit and save your integrations with the Integrations feature in the IBM Cloud Pak for AIOps console. Editing an integration regenerates the associated Flink job, which updates the status. For more information about editing these integrations, see Defining integrations. An integration can also fail to gather data without showing an error status; to resolve this, edit and save the integration.

  2. If you are restoring data for any integrations that are remotely deployed, complete the following steps to redeploy the integrations:

    1. On the remote cluster, delete any existing integrations.

    2. For each remotely deployed integration that you need to redeploy, copy or download a new bootstrap command for the deployment script from the cluster where the restore process ran.

      You can click the Copy to clipboard button to copy the command, or click Download as a sh-file to download the command as a bootstrap.sh file.

      Note: If you do not obtain the command now, you can copy or download the command later by completing the following steps:

      1. Log in to IBM Cloud Pak for AIOps console.

      2. Expand the navigation menu (four horizontal bars), then click Define > Integrations.

      3. For each integration, click the integration type on the Manage integrations tab of the Integrations page.

      4. On the page for the integration type, click the Download link in the Remote deployment script column for the integrations.

      5. Either copy or download the bootstrap command for the deployment script.

    3. Run the remote deployment scripts on the remote cluster to redeploy the integrations.

      1. Get the OpenShift CLI oc login command from the remote cluster.

      2. From a command line, log in to the remote cluster with the oc login command.

      3. Switch to the target project (namespace).

      4. Run the deployment script as a sh (script file), such as the downloadable bootstrap.sh file.

  3. If you are restoring secure tunnels data, install the secure tunnel connections for any restored secure tunnel data.

    1. Optional. If the tunnel exists on your cluster, uninstall the existing tunnel connection.

    2. Install the Secure Tunnel connector on your restored cluster.

    3. Create the secure tunnel connection on your restored cluster.

  4. Enable authentication with the restored JWT certificate.

    1. Retrieve the new initial_admin_password from the admin-user-details secret.

      export PROJECT=<project>
      PASS=$(oc get secret admin-user-details -n ${PROJECT} -o jsonpath='{.data.initial_admin_password}' | base64 -d)
      
    2. Disable the admin user.

      oc rsh -n ${PROJECT} $(oc get pod -l component=usermgmt -n ${PROJECT} | tail -1 | cut -f1 -d\ ) /usr/src/server-src/scripts/manage-user.sh --disable-user admin
      
    3. Enable the admin user with the new password.

      echo "$PASS" | oc rsh -n ${PROJECT} $(oc get pod -l component=usermgmt -n ${PROJECT} | tail -1 | cut -f1 -d\ ) /usr/src/server-src/scripts/manage-user.sh --enable-user admin
      
    4. Delete the secret app-api-user-jwt. It is recreated automatically after a few seconds. (A verification sketch follows the command.)

      oc delete secret app-api-user-jwt -n ${PROJECT}
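
      To confirm that the secret is recreated, a minimal polling sketch:

      until oc get secret app-api-user-jwt -n ${PROJECT} >/dev/null 2>&1; do sleep 5; done
      echo "app-api-user-jwt recreated"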
      

Restoring individual components

If needed, you can choose to restore data for specific individual databases and components instead of for all databases and components at once.

  1. Change to the restore directory where the restore script is located:

    cd ${CP4AIOPS_PATH}/bcdr/4.9.0/restore
    
  2. Copy the aiops-restore-job.yaml file. Rename your new file <component>-restore-job.yaml, where <component> is the name of the component that you are restoring. (A scripted sketch of steps 2-4 follows these steps.)

    For example, if you are restoring Cassandra, rename the file to be cassandra-restore-job.yaml.

  3. Open the new <component>-restore-job.yaml file for editing. Update the name and command sections to match the values for the component that you are restoring:

    • Update the name of the restore job in the metadata section to be the individual component job, such as cassandra-restore-job.
    • Update the command section command: ["/bin/bash", "restore.sh","-aiops"] to replace -aiops with the respective argument for the component that you are restoring. For the list of component arguments, see the table that follows this procedure.
  4. Create a job to restore the individual component.

    oc create -f <component>-restore-job.yaml
    
  5. Check the restore job logs by running the following command:

    oc logs -f <component-restore-job-***>
    
  6. Check the velero-restore status by running the following command:

    velero get restore <RESTORE_NAME>
    

    Where <RESTORE_NAME> is one of the restore names for the component. You can view the names for the component in the restore job log when the restore job is completed. For example, the restore name for a Cassandra restore can be cassandra-restore-20221006054710, which can display in an entry similar to the following example log entry:

    Restore request "cassandra-restore-20221006054710" submitted successfully.
    

Component or database arguments for restore job command configuration

Table. Component or database arguments for restore job commands
Component or Database Argument
Cassandra -cassandra
CouchDB -couchdb
Elasticsearch -es
Metastore -metastore
Minio -minio
Postgres -postgres
IBM Cloud Pak foundational services -cs
Integration CR -connectioncr
Secure Tunnel CR -tunnelcr

Troubleshooting

Stale alert data exists for topology resources after restoration

After you complete a restore, you might notice that stale alert data exists for topology resources. For instance, if you search for a resource that had alert data, you might see outdated alert data. To clear the historical alert data, you can run the statusClear crawler.

You can run the crawler to clear stale events by using the topology service Swagger pages:

  1. Open the Swagger API for the topology service. For more information, see Accessing Topology service Swagger UI.
  2. Go to the Crawlers section.
  3. Open the statusClear crawler.
  4. Run the crawler with the default message body.

This crawler runs asynchronously over the topology data. The crawler POST response returns an EntityId header, which you can use to check the progress of the crawler. To check the progress, use the EntityId value in a GET /mgmt_artifacts/{id} call. The response shows the status of the asynchronous crawl.

For more information about the topology service Swagger, see Application and topology APIs.
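
If you prefer the command line, the following is a hypothetical sketch of the progress check, run from inside the topology service container (the ASM_USER and ASM_PASS environment variables are available there, as noted elsewhere on this page). <EntityId> is the value returned in the crawler POST response header:

curl -X GET "https://localhost:8080/1.0/topology/mgmt_artifacts/<EntityId>" -H "accept: application/json" -H "X-TenantID: cfd95b7e-3bc7-4006-a4a8-a73a79c71255" --insecure -u ${ASM_USER}:${ASM_PASS}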

Restore process for Elasticsearch failed

If the restore of a particular data store, such as Elasticsearch, or of a custom resource fails, complete the following steps before you attempt the restore again.

  1. Optional. If this cluster was previously configured to enable a backup of Elasticsearch, or if you need to rerun the restore, complete the following steps:

    1. Remove the backup path and Snapshot location configurations from the AutomationBase CR.

    2. Delete the Elasticsearch backup snapshot PVC by running the following commands. This deletion is needed because this PVC is replaced with the Elasticsearch backup data by the es-restore1 restore.

      elasticsearch_cluster=$(kubectl get elasticsearchclusters.elasticsearch.opencontent.ibm.com -n ${AIOPS_NAMESPACE} -o jsonpath='{.items[0].metadata.name}')
      esbackupPVC="$elasticsearch_cluster-ibm-elasticsearch-es-server-snap"
      
      oc delete pvc -n ${AIOPS_NAMESPACE} $esbackupPVC
      
    3. Run the following command to delete the es-restore1 restore, if it exists:

      oc delete restore es-restore1 -n $OADP_NAMESPACE
      
  2. Repeat the process to restore the data store or custom resource to reattempt the failed restore.

Restore fails with CouchDB pod in CrashLoopBackOff

This problem can occur if you restore an IBM Cloud Pak for AIOps backup to a different cluster or namespace and the database files in the persistent storage are not readable or writeable by the root user group.

When this issue occurs, the pod c-example-couchdbcluster-m has a status of CrashLoopBackOff, as in the following example.

oc get pods | grep couchdb
c-example-couchdbcluster-m-0                                      1/2     CrashLoopBackOff        533 (4m13s ago)   2d5h
c-example-couchdbcluster-m-0-debug                                1/2     Running                 0                 2m39s

The logs from the database container in the failing pods contain the message Could not open file /data/db/_nodes.couch: permission denied, as in the following example:

oc logs c-example-couchdbcluster-m-0 -c db --tail=100
[error] 2022-11-10T10:39:34.788855Z couchdb@c-example-couchdbcluster-m-0.c-example-couchdbcluster-m <0.322.0> -------- Could not open file /data/db/_nodes.couch: permission denied

To resolve this issue, grant the root group write permissions to the folders in the persistent storage that contain the database files.

  1. Get the name of the persistent volume that contains the CouchDB database files.

    oc get pv | grep couch
    

    Example output:

    oc get pv | grep couch
    pvc-d6599cfb-aa38-47be-857e-a2efa9ee0407   5Gi        RWO            Delete           Bound    katamari/data-c-example-couchdbcluster-m-0                             rook-cephfs              2d5h
    
  2. Find the worker node for the failing CouchDB pod.

    oc get pod -o wide c-example-couchdbcluster-m-0
    

    Example output:

    oc get pod -o wide c-example-couchdbcluster-m-0
    NAME                           READY   STATUS             RESTARTS        AGE    IP             NODE                                        NOMINATED NODE   READINESS GATES
    c-example-couchdbcluster-m-0   1/2     CrashLoopBackOff   537 (76s ago)   2d5h   10.254.15.54   worker1.bcdr-test-12345678.mysite.com   <none>           <none>
    
  3. Debug the worker node that you identified in the previous step.

    oc debug node/<worker_node>
    

    Where <worker_node> is the value returned for NODE in the previous step.

    Example output:

    oc debug node/worker1.bcdr-test-12345678.mysite.com
    Starting pod/worker1bcdr-rtp-26101516cpmyclustercom-debug ...
    To use host binaries, run `chroot /host`
    Pod IP: 10.22.1.1
    If you don't see a command prompt, try pressing enter.
    
  4. Find the mount point of the persistent volume.

    mount | grep <pv>
    

    Where <pv> is the persistent volume name returned in step 1.

    Example output:

    sh-4.4# mount | grep pvc-d6599cfb-aa38-47be-857e-a2efa9ee0407
    172.30.2.127:6789,172.30.43.40:6789,172.30.11.17:6789:/volumes/csi/csi-vol-0bb3b8e3-5f26-11ed-97a6-0a580afe2807/af21d8e8-f29e-4111-8fcd-c4366d604c99 on /host/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-d6599cfb-aa38-47be-857e-a2efa9ee0407/globalmount type ceph (rw,relatime,seclabel,name=csi-cephfs-node,secret=<hidden>,acl,mds_namespace=rook-cephfilesystem)
    172.30.2.127:6789,172.30.43.40:6789,172.30.11.17:6789:/volumes/csi/csi-vol-0bb3b8e3-5f26-11ed-97a6-0a580afe2807/af21d8e8-f29e-4111-8fcd-c4366d604c99 on /host/var/lib/kubelet/pods/26fc2b39-aaf0-4add-8649-6cfd56fc4a3b/volumes/kubernetes.io~csi/pvc-d6599cfb-aa38-47be-857e-a2efa9ee0407/mount type ceph (rw,relatime,seclabel,name=csi-cephfs-node,secret=<hidden>,acl,mds_namespace=rook-cephfilesystem)
    

    The mount point is the directory ending with mount. In the above example output this is /host/var/lib/kubelet/pods/26fc2b39-aaf0-4add-8649-6cfd56fc4a3b/volumes/kubernetes.io~csi/pvc-d6599cfb-aa38-47be-857e-a2efa9ee0407/mount

  5. Change the permissions on the mount directory by changing to the directory and running chmod.

    cd <mount_dir>
    chmod -R g+w .
    ls -lah db/
    exit
    

    Where <mount_dir> is the directory you identified in the previous step.

    Example output:

    sh-4.4# cd /host/var/lib/kubelet/pods/26fc2b39-aaf0-4add-8649-6cfd56fc4a3b/volumes/kubernetes.io~csi/pvc-d6599cfb-aa38-47be-857e-a2efa9ee0407/mount
    sh-4.4# ls
    db  view_index
    sh-4.4# ls -lah db/
    total 37K
    drwxr-xr-x. 4 1000650000 root    4 Nov  7 04:55 .
    drwxrwxrwx. 5 root       root    3 Nov 10 10:52 ..
    drwxr-xr-x. 2 1000650000 root    0 Nov  7 04:54 .delete
    -rw-r--r--. 1 1000650000 root  29K Nov  7 05:31 _dbs.couch
    -rw-r--r--. 1 1000650000 root 8.3K Nov  7 04:54 _nodes.couch
    drwxr-xr-x. 4 1000650000 root    2 Nov  7 04:55 shards
    sh-4.4# chmod -R g+w .
    sh-4.4# ls -lah db/
    total 37K
    drwxrwxr-x. 4 1000650000 root    4 Nov  7 04:55 .
    drwxrwxrwx. 5 root       root    3 Nov 10 10:52 ..
    drwxrwxr-x. 2 1000650000 root    0 Nov  7 04:54 .delete
    -rw-rw-r--. 1 1000650000 root  29K Nov  7 05:31 _dbs.couch
    -rw-rw-r--. 1 1000650000 root 8.3K Nov  7 04:54 _nodes.couch
    drwxrwxr-x. 4 1000650000 root    2 Nov  7 04:55 shards
    sh-4.4# exit
    
  6. Run the following command to verify that the CouchDB pod has a status of Running.

    oc get pods | grep couchdb
    

    Example output:

    c-example-couchdbcluster-m-0                                      2/2     Running                 539 (87s ago)     2d5h
    

LDAP user login is not working after a restore

Follow these steps to resolve the problem:

  1. Log in to the console as the default admin user.
  2. From the main navigation menu, click Administer > Identity and access.
  3. Select the LDAP connection, and click Edit connection. Edit the LDAP connection with the correct information.
  4. Click Test connection.
  5. Click Save when the connection test is successful.
  6. Log in to the console with the LDAP user's credentials.

Data is not being processed after restoration

After you complete a restore, you might notice that data for some AI modeling features, such as log anomaly detection, change risk, and similar tickets, is not being processed.

This problem can occur when the Kafka topics that are expected for AI modeling are not present after the restore completes. To resolve this problem, restart the aimanager-aio-controller pod. This restart can result in the creation of the expected Kafka topics.
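
A minimal sketch of the restart, assuming the controller pod name begins with aimanager-aio-controller:

oc get pods -n <namespace> | grep aimanager-aio-controller
oc delete pod <controller_pod> -n <namespace>

Where <controller_pod> is the pod name returned by the first command, and <namespace> is the project (namespace) where IBM Cloud Pak for AIOps is installed. The deployment re-creates the pod automatically.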

Applications still show non-existent active incidents and alerts notations

After you complete a restore, you might notice that active incident and alert notations are displayed for applications even though the incidents and notations no longer exist. When the topology data store is backed up and restored, the restored data can be out of sync with the rest of the system because of the way that the data is modeled against groups, applications, and resources within the topology.

If your data is outdated, you need to manually delete the topology association between the incidents and applications. Use the topology API to clear out any groups or events. Once removed, the changes are incorporated into the topology view.

To remove groups (incident groups of entity waiopsStory), use the following API calls:

  • GET /topology/groups?_type=waiopsStory
  • DELETE /topology/groups?_type=waiopsStory

To remove events, remove the incorrect status from the topology with the following API:

  • GET /topology/status
  • DELETE /topology/status
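
The following is a hypothetical curl sketch of the group cleanup, run from inside the topology service container, following the same pattern as the rebroadcast call later in this page:

curl -X DELETE "https://localhost:8080/1.0/topology/groups?_type=waiopsStory" -H "X-TenantID: cfd95b7e-3bc7-4006-a4a8-a73a79c71255" --insecure -u ${ASM_USER}:${ASM_PASS}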

Alternatively, you can use the Resource management tool in the UI to remove non-existent alert indications from a topology. When you are viewing a topology in the UI, open the Settings menu and select Topology configuration. From the Data administration routines page, run the Status clear routine to remove the alert indications.

Topology observer jobs do not run after restore

After running the restore procedure, some observer jobs are offline, have a Status of Error, or have a Status of Running even though job runs are being skipped.

For jobs with a job type of Repeating Schedule, this problem can occur when a restore ran from a backup that was taken while the observer job was waiting for data processing to finish. After the restore completes, the observer then starts with the job still in a FINISHING state. The observer waits for its job to finish, but the job cannot finish because the referenced data was not backed up, and so the job becomes stuck. Messages similar to the following might be seen in the observer log:

execution skipped because current job  had state: FINISHING

For other jobs that do not have a job type of Repeating Schedule, the observer log shows that it is looking for a vertex that no longer exists:

WARN   [2022-09-13 16:06:24,265] [pool-5-thread-1] c.i.i.t.o.t.ObserverVertex -  Failed to poll observer vertex cp4waiops-cartridge.github-observer.  Response : InboundJaxrsResponse{context=ClientResponse{method=POST, uri=https://aiops-topology-topology:8080/1.0/topology/mgmt_artifacts/MKSrcRpkRC6WmWVGGdSukA, status=404, reason=Not Found}}

To resolve these problems, complete the following steps:

  1. Locate jobs that are stuck in the FINISHING state.

    For jobs with a Job type of Repeating schedule, check if the job is stuck by waiting for the normal job duration or by checking the job history. If the job has been stuck for a long time, subsequent runs are skipped. Use the following steps to navigate to the Observer configuration user interface (UI), view the observer job history, and look for job runs that have been continuously skipped since the restore completed.

    1. Log in to IBM Cloud Pak for AIOps console.
    2. Expand the navigation menu (four horizontal bars), then click Define > Integrations.
    3. On the Integrations page, click Add integration.
    4. On the Add integrations page, click Topology in the Category list that is next to the list of all integrations.
    5. Click the configure, schedule, and manage other observer jobs link in the description for the topology integrations.
    6. Click the three dots icon at the end of the line for the observer job, and select View history.
    7. Click All, and look for runs with Skipped status and a Details entry of execution skipped because current job had state: FINISHING.

    For more information on viewing observer jobs, see the section To access the Observer Configuration UI in Defining observer jobs for application and topology data.

  2. Attempt to restart the observer job.

    1. For each observer job with a Job type of Repeating schedule that you identified as stuck in the FINISHING state, edit the job in the Observer configuration UI by making a nominal change to the Optional description field, and then save the job.

    2. For other observer jobs that do not have a Job type of Repeating schedule and have a missing vertex issue, delete the related observer pod.

      oc delete pod <pod_name> -n <namespace>
      

      Where

      • <pod_name> is the observer pod to be deleted.
      • <namespace> is the project (namespace) where IBM Cloud Pak for AIOps is installed.

      For example, for an Instana observer job that has a missing vertex issue, delete the Instana observer pod:

      oc get pods -n <cp4aiops-namespace> | grep instana
      aiopsedge-instana-topology-integrator-6b777bd6b4-6nvqx 1/1 Running 0 8d
      
      oc delete pod aiopsedge-instana-topology-integrator-6b777bd6b4-6nvqx -n <cp4aiops-namespace>
      

      This causes the job to stop and restart, and to be shown as Running, Scheduled, or Ready.

      A message similar to the following is seen in the observer log:

      WARN   [2022-08-22 16:19:34,706] [pool-10-thread-1] c.i.i.t.o.a.JobManager -  stop - Stopped job for tenantId:cfd95b7e-3bc7-4006-a4a8-a73a79c71255, uniqueId:1100
      INFO   [2022-08-22 16:19:34,712] [cfd95b7e-3bc7-4006-a4a8-a73a79c71255:1100] c.i.i.t.o.f.j.LoadFileJob -  cfd95b7e-3bc7-4006-a4a8-a73a79c71255:5eb27ee0-772c-4d65-bc0f-4e56b6b7913b Interrupted while waiting for state JobState [state=FINISHED, reason=Job finished after restart, date=Mon Aug 22 16:09:45 GMT 2022]
      INFO   [2022-08-22 16:19:34,944] [cfd95b7e-3bc7-4006-a4a8-a73a79c71255:1100] c.i.i.t.o.f.j.LoadFileJob -  cfd95b7e-3bc7-4006-a4a8-a73a79c71255:1100 Observer job finished with state STOPPED
      

    Some jobs might still be shown with Error if an attempt was made to run the stuck job before the previous remediation steps were completed. This attempt causes the observer to fail the job, because a job with the same unique ID already exists. In this scenario, a warning similar to the following is seen in the observer log:

    WARN   [2022-08-23 15:58:50,431] [pool-10-thread-1] c.i.i.t.o.a.o.RESTApiException -  pool-10-thread-1 - APIMessage: { httpCode=422, _error={ message=Job Creation Failure, level=warning, description=Cannot create observation job, causes=[{ message=Duplicate unique id, level=error, description=A job with the same unique id '1100' is already being processed., field=unique_id }] } }
    

    If there are no longer any observer jobs with a Status of Error, then exit this troubleshooting.

  3. Complete the following steps only if one or more of your observer jobs are still shown with a Status of Error.

    1. Find the name of the topology pod.

      oc get pods | grep topology-topology
      
    2. Remote shell into the topology container.

      oc rsh <topology_pod> -c <release_name>-topology-topology
      

      Where

      • <topology_pod> is the pod that was returned in the previous step.
      • <release_name> is the name of your IBM Cloud Pak for AIOps instance.
  4. Find observers that still have a job that is stuck in the FINISHING state.

    Run the following command, and note down the returned observerName and _id.

    curl -X GET "https://localhost:8080/1.0/topology/mgmt_artifacts?_filter=hasState%3DFINISHING&_field=keyIndexName&_field=hasState&_field=observerName&_type=ASM_OBSERVER_JOB" -H  "accept: application/json" -H  "X-TenantID: cfd95b7e-3bc7-4006-a4a8-a73a79c71255" --insecure -u ${ASM_USER}:${ASM_PASS} | jq
    

    Note: ASM_USER and ASM_PASS are environment variables available within the topology container.

    Example output:

    {
     "_executionTime": 13,
     "_offset": 0,
     "_limit": 50,
     "_items": [
       {
         "keyIndexName": "dns-observer:bbc",
         "_id": "z7ovGeNrSKSpMlWY8n7Mfw",
         "observerName": "dns-observer",
         "hasState": "FINISHING"
       }
     ]
    }
    
  5. Run the following steps for each observer job returned by the previous step.

    1. Change the state of the job by running the following command.

      curl -X POST "https://localhost:8080/1.0/topology/mgmt_artifacts/<id>" -H  "accept: application/json" -H  "X-TenantID: cfd95b7e-3bc7-4006-a4a8-a73a79c71255" -H  "Content-Type: application/json" -d "{  \"hasState\": \"FINISHED\"}" --insecure -u ${ASM_USER}:${ASM_PASS}
      

      Where <id> is the _id of the stuck job that you identified in step 4.

    2. Find the pods for the observer that you need to restart.

      oc get pods -n <namespace> | grep <stuck_observer_name>
      

      Where

      • <stuck_observer_name> is the name of the stuck observer that you identified in step 4.
      • <namespace> is the project (namespace) where IBM Cloud Pak for AIOps is installed.
    3. Delete the pods for the observer.

      oc delete po <observername_pods> -n <namespace>
      

      Where

      • <observername_pods> are the pods returned in the previous step.
      • <namespace> is the project (namespace) where IBM Cloud Pak for AIOps is installed.
    4. Rerun the observer job.

      On the observer configuration UI, click Run on any job for the observer that is in the Error state. The observer job now runs successfully, and has a status of Scheduled or Ready.

Restore process terminated mid-process with partial data available

If the restore process does not complete as expected for a data store, for example if it is aborted or terminated during its run, data might not be restored correctly. This incompletely restored data needs to be removed before you run the restore process again. To remove the data, run a post-restore cleanup script to clean up the data store.

To run a script, complete the following steps:

  1. Define the following environment variable on your workstation:

    export WORKDIR="${CP4AIOPS_PATH}/bcdr/4.9.0"
    
  2. Change to the restore directory where the post-restore cleanup script is located:

    cd ${WORKDIR}/restore/<data_store>/
    

    Where <data_store> is the directory for the data store or custom resource that needs to be cleaned up. For example, couchdb is the directory for CouchDB.
  3. Run the post-restore cleanup script:

    nohup ./<data_store>-post-restore.sh > <data_store>-post-restore.log &
    

    Where <data_store> is the data store or custom resource that needs to be cleaned up.

  4. Run the restore process for that data store or resource again.

For example, if the restore job for restoring the Cassandra data store aborted, you need to run the cassandra-post-restore.sh post-restore script that is stored in the bcdr/4.9.0/restore/cassandra directory to clean the data. Then, run the cassandra-native-post-restore.sh post-restore script. A sketch of this sequence follows.
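
A minimal sketch of that Cassandra sequence, assuming ${WORKDIR} is set as in step 1:

cd ${WORKDIR}/restore/cassandra
nohup ./cassandra-post-restore.sh > cassandra-post-restore.log &
wait   # let the first cleanup finish before starting the native cleanup
nohup ./cassandra-native-post-restore.sh > cassandra-native-post-restore.log &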

The following table lists the cleanup script to run for each data store:

Table. Cleanup script for components
Component or Database Cleanup script
Cassandra Run first: bcdr/4.9.0/restore/cassandra/cassandra-post-restore.sh
Then, run: bcdr/4.9.0/restore/cassandra/cassandra-native-post-restore.sh
CouchDB bcdr/4.9.0/restore/couchdb/couchdb-post-restore.sh
Elasticsearch bcdr/4.9.0/restore/elasticsearch/es-post-restore.sh
Metastore bcdr/4.9.0/restore/metastore/metastore-post-restore.sh
Minio bcdr/4.9.0/restore/minio/minio-post-restore.sh
Postgres bcdr/4.9.0/restore/postgres/postgres-post-restore.sh
IBM Cloud Pak foundational services bcdr/4.9.0/restore/common-services/cs-post-restore.sh
Connection CR N/A
Secure Tunnel CR bcdr/4.9.0/restore/other-resources/tunnel-cr-post-restore.sh

OADP restore is stuck in an In progress state

If you notice that a Velero restore is stuck in an In progress state, complete the following steps to stop the restore process:
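
To identify the stuck restore and the Velero pod, a minimal sketch:

velero get restore
oc get pods -n <OADP installed namespace> | grep velero

Note the name of any restore with a STATUS of InProgress, and the name of the Velero pod.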

  1. Delete the velero pod by running the following command:

    oc delete pod <velero pod name> -n <OADP installed namespace>
    
  2. Delete the restore that is stuck in progress by running the following command:

    velero delete restore <restore name>
    

    Wait for the completion of the restore job.

Helm install restore job command failed

When you are running the helm install restore-job clusterrestore-0.1.0.tgz command, you might encounter the command failing with an error that is similar to the following error:

Error: admission webhook "trust.hooks.securityenforcement.admission.cloud.ibm.com" denied the request:
Deny "icr.io/cpopen/cp4waiops/cp4aiops-bcdr@sha256:294a42a851a2717ebbc68528ab3c6bcb1ba48114ff058f1c1b537dc6aa167355", no matching repositories in ClusterImagePolicy and no ImagePolicies in the "velero" namespace

If you encounter this error, complete the following steps to resolve the issue:

  1. Uninstall the restore-job job by running the following command:

    helm uninstall restore-job -n $OADP_NAMESPACE
    
  2. Export an environment variable for the image.

    For an online deployment:

    export REGISTRY=icr.io/cpopen/cp4waiops
    export BCDR_IMAGE=${REGISTRY}/<bcdr_image>
    

    For an offline deployment:

    export REGISTRY=$TARGET_REGISTRY
    export BCDR_IMAGE=${REGISTRY}/<bcdr_image>
    

    Where <bcdr_image> is the name of the backup and restore image, as given in the backup helm chart values.yaml file, in the form cp4waiops-bcdr@{digest}. An example value for BCDR_IMAGE is icr.io/cpopen/cp4waiops/cp4waiops-bcdr@sha256:294a42a851a2717ebbc68528ab3c6bcb1ba48114ff058f1c1b537dc6aa167355.

  3. Create a restore-image-policy.yaml file and add the following content within the file, replacing ${BCDR_IMAGE} with the value of the BCDR_IMAGE environment variable that you exported in the previous step:

    apiVersion: securityenforcement.admission.cloud.ibm.com/v1beta1
    kind: ClusterImagePolicy
    metadata:
      name: restore-image-policy
    spec:
      repositories:
        - name: ${BCDR_IMAGE}
          policy:
    
  4. Apply the policy by running the following command:

    oc apply -f restore-image-policy.yaml
    
  5. Deploy the restore job by running the following command:

    helm install restore-job clusterrestore-0.1.0.tgz
    

Cassandra restore fails with org.apache.cassandra.io.FSReadError exception

In this scenario, policies are not restored after the restore completes. Although the log indicates that the Cassandra restore completed, apart from a couple of key repair failures, the policies from the backup cluster are not restored.

You can encounter an issue where the Cassandra restore fails. If this issue occurs, the aiops-topology-cassandra-2 pod has a READY status of 0/1 and the pod log contains an error message that is similar to the following error message:

ERROR [CompactionExecutor:8] 2022-11-29 21:39:17,324 CassandraDaemon.java:244 - Exception in thread Thread[CompactionExecutor:8,1,main]
org.apache.cassandra.io.FSReadError: java.io.IOException: Channel not open for writing - cannot extend file to required size

If this error occurs, complete the following steps for the Cassandra statefulset:

  1. Scale down the statefulset aiops-topology-cassandra to 0.

    Important: Make a note of the current scaling of the Cassandra statefulset before you scale down.

    oc scale statefulsets aiops-topology-cassandra --replicas 0 -n <namespace>
    

    Where: <namespace> is the namespace where IBM Cloud Pak for AIOps is installed.

  2. Increase the statefulset memory limit to 32 GB.

    1. Access the resource for editing:

      oc edit statefulset aiops-topology-cassandra -n <namespace>
      
    2. Increase the memory limit to 32 GB, and then save and exit the editor. (A non-interactive patch sketch is at the end of this section.)

  3. Scale the statefulset back to the initial number of replicas.

    oc scale statefulsets aiops-topology-cassandra --replicas=<number_of_replicas> -n <namespace>
    

    Where:

    • <namespace> is the namespace where IBM Cloud Pak for AIOps is installed.
    • <number_of_replicas> is the number of replicas that the StatefulSet is to be scaled up to.
  4. Run the restore again.

  5. Scale down the statefulset aiops-topology-cassandra to 0 again.

    oc scale statefulsets aiops-topology-cassandra --replicas 0 -n <namespace>
    
  6. Change the statefulset memory limit back to 16 GB.

    1. Access the resource for editing:

      oc edit statefulset aiops-topology-cassandra -n <namespace>
      
    2. Decrease the memory limit to 16 GB.

  7. Scale the statefulset back to the initial number of replicas.

    oc scale statefulsets aiops-topology-cassandra --replicas=<number_of_replicas> -n <namespace>
    

    Where:

    • <namespace> is the namespace where IBM Cloud Pak for AIOps is installed.
    • <number_of_replicas> is the number of replicas that the StatefulSet is to be scaled up to.
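
If you prefer a non-interactive edit for the memory limit steps, the following is a minimal oc patch sketch, assuming that the Cassandra container is the first container in the StatefulSet pod template:

oc -n <namespace> patch statefulset aiops-topology-cassandra --type=json -p '[{"op":"replace","path":"/spec/template/spec/containers/0/resources/limits/memory","value":"32Gi"}]'

Use the value 16Gi in the same command when you set the limit back in step 6.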

Integrations missing after a restore

After you complete a full restore, if you do not see all of the data integrations that you expect, run the command to restore data for the Metastore component to restore the integrations. For instructions, see Restoring individual components.

Netcool integration is missing details after restore

If you had a Netcool integration defined, then the Netcool integration might be missing some details after restore, such as the ObjectServer details. If you try to edit these details then the test still fails. To resolve this problem, delete the Netcool integration, and then create a new Netcool integration.

Restore fails with lifecycletrigger: Not Ready

If lifecycletrigger is showing Not Ready after the restore, then run the following command:

oc patch lifecycletrigger aiops --type=json --patch="$(oc get lifecycletrigger aiops -o jsonpath='{"[{"}"op":"add","path":"/spec/cancelJobs","value":[{range .status.jobs[*]}"{.jid}",{end}]}]' | sed 's/,]/]/g')"

Cannot log in to the IBM Cloud Pak for AIOps console on the restore cluster

After the restore, the IBM Cloud Pak for AIOps console is inaccessible, and an error similar to the following error is displayed:

CWOAU0061E: The OAuth service provider could not find the client because the client name is not valid

Solution: Run the following steps on the restore cluster:

  1. Export an environment variable containing your project's name.

    export PROJECT=<project>
    

    Where <project> is the namespace (project) that IBM Cloud Pak for AIOps is deployed in on the restore cluster.

  2. Delete the platform pods.

    oc delete pod -n ${PROJECT} -l component=platform-auth-service
    oc delete pod -n ${PROJECT} -l component=platform-identity-management
    oc delete pod -n ${PROJECT} -l component=platform-identity-provider
    
  3. Save the iam-config-job definition as YAML, delete the job, and strip the server-generated fields from the saved file.

    oc get  -n ${PROJECT} job iam-config-job  -o json > /tmp/iam-config-job.json
    oc -n ${PROJECT} delete job iam-config-job
    jq 'del(.metadata.creationTimestamp) | del(.metadata.managedFields) | del(.metadata.resourceVersion) | del(.metadata.uid) | del(.spec.selector) | del(.spec.template.metadata.labels) | del(.status)' /tmp/iam-config-job.json > /tmp/updated-iam-config-job.json
    
  4. Wait for a few seconds and then apply the modified YAML to rerun iam-config-job.

    oc -n ${PROJECT} apply -f /tmp/updated-iam-config-job.json
    oc -n ${PROJECT} get pods | grep iam-config-job
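
    To confirm that the job completed, a minimal sketch:

    oc -n ${PROJECT} wait --for=condition=complete job/iam-config-job --timeout=300s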
    

Policy configuration is not retained on the restored deployment

Incident policies that previously had a ticket integration are missing the ticket integration on the restored deployment. If GitHub or ServiceNow are configured, then external tickets are not created in them. The policy page has a Warning similar to the following example:

The previously selected ticket connection Github -mygit is no longer available. Please check the status of your connector.

Solution: From the IBM Cloud Pak for AIOps console, edit the policy and add the connection that is shown in the warning.

Troubleshooting the Infrastructure Automation restore

If you are also restoring Infrastructure Automation data and encounter any issues with the restore process for Infrastructure Automation, or encounter an issue with data not being available or processed after the restore, see Troubleshooting the Infrastructure Automation restore.

Rebroadcast data to Elasticsearch fails for Cassandra restore

You might notice that the rebroadcast of data to Elasticsearch fails when you try to restore Cassandra. The error might be shown in the aiops-topology-topology-xxx pod:

===== EXECUTING COMMAND in pod: aiops-topology-topology-55f5dfc5db-xjdxt   =====
Defaulted container "aiops-topology-topology" out of: aiops-topology-topology, wait-for-cassandra (init)
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0*   Trying ::1...

TCP_NODELAY set connect to ::1 port 8080 failed: Connection refused   Trying 127.0.0.1... TCP_NODELAY set connect to 127.0.0.1 port 8080 failed: Connection refused Failed to connect to localhost port 8080: Connection refused Closing connection 0
curl: (7) Failed to connect to localhost port 8080: Connection refused
command terminated with exit code 7

Use the following steps to resolve the restore issue with Cassandra. Apply this workaround on the restore cluster after the restore operation:

  1. Run the following command to log a timestamp:

    echo "[INFO] $(date) Rebroadcasting data to ElasticSearch"
    
  2. Set the project (namespace) where Cloud Pak for AIOps is installed:

    namespace=<namespace>
    
  3. Save the username and password in environment variables:

    ASM_USER=$(kubectl -n $namespace get secret aiops-topology-asm-credentials -o jsonpath="{.data['username']}" | base64 -d)
    
    ASM_PASS=$(kubectl -n $namespace get secret aiops-topology-asm-credentials -o jsonpath="{.data['password']}" | base64 -d)
    
  4. Rebroadcast a data point in the Cassandra cluster:

    kubectl exec <cassandra_pod> -n $namespace -- bash -c "curl -vX POST 'https://localhost:8080/1.0/topology/crawlers/rebroadcast' -H 'X-TenantID: cfd95b7e-3bc7-4006-a4a8-a73a79c71255' --insecure -u $ASM_USER:$ASM_PASS"
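
    Where <cassandra_pod> is the name of a Cassandra pod. A minimal lookup sketch, assuming that the pod names begin with aiops-topology-cassandra, as elsewhere on this page:

    kubectl get pods -n $namespace | grep aiops-topology-cassandra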
    

In addition, review the Known issues with the backup and restore process.

Known issues with the backup and restore process

Elasticsearch health status yellow after restore

When you are restoring an Elasticsearch backup to a new single node Elasticsearch cluster, the Elasticsearch database might not work as expected. Instead, the Elasticsearch cluster health shows a yellow status after the restore completes.