Restoring IBM Cloud Pak for AIOps data
Learn how to restore data for IBM Cloud Pak for AIOps components to a cluster, such as for disaster recovery.
The following procedure restores all backed up data that exists in the specified backup for IBM Cloud Pak for AIOps components. The steps in the following procedure restore IBM Cloud Pak for AIOps in a new cluster.
Before you begin
- All required storage classes must be created prior to running the restore process. The storage classes must have the same name as the backup cluster.
- You can restore a backup only within an environment that has the same version of IBM Cloud Pak for AIOps as the environment where the backup was created. For example, a backup of an IBM Cloud Pak for AIOps 4.9.0 environment must be restored within a cluster that has IBM Cloud Pak for AIOps 4.9.0 installed. If you need to upgrade as well as restore data, complete the restore process before you upgrade.
- If you are also restoring Infrastructure Automation data, the overall procedure is the same with the following additional steps required:
- After you install IBM Cloud Pak for AIOps on the cluster where you are restoring data, you need to install Infrastructure Automation.
- After you restore your IBM Cloud Pak for AIOps data, you need to restore your Infrastructure Automation data. For more information, see Restoring Infrastructure Automation.
Restore procedure
Follow the steps to restore IBM Cloud Pak for AIOps from backup.
- Set up your new cluster for restore
- Prepare the backup data for restoring
- Restore the cluster namespaces and install IBM Cloud Pak for AIOps
- Restore the IBM Cloud Pak for AIOps data
- (Optional) Restore Infrastructure Automation data
- Post-restore tasks
If you encounter any issues with the restore process, see Troubleshooting.
1. Set up your new cluster for restore
-
Install Red Hat OpenShift by using the instructions in the Red Hat OpenShift documentation.
IBM Cloud Pak for AIOps requires OpenShift to be installed and running. You must have administrative access to your OpenShift cluster.
Important: Ensure that the version of Red Hat OpenShift Container Platform that you install is the same as the version that was installed in the backed up cluster.
For information on supported versions of OpenShift, see Supported Red Hat OpenShift Container Platform versions.
Note: IBM Cloud Pak for AIOps uses the OpenShift image registry when it builds images in real time. If the OpenShift image registry is not persistent and the registry respawns, then workloads can temporarily fail until respawning is complete. A persistent OpenShift image registry is recommended to avoid this. For more information, see Setting up and configuring the registry
in the Red Hat OpenShift Container Platform documentation.
-
Install the OpenShift command line interface (oc) on your cluster's boot node and run oc login, using the instructions in Getting started with the OpenShift CLI.
-
Configure storage
The storage configuration must satisfy your sizing requirements. For more information on the storage classes that are needed for installing IBM Cloud Pak for AIOps, see Storage.
Important: All required storage classes must be created prior to running the restore process. The storage classes must have the same name as the backup cluster.
-
Install the backup and restore tools
Install the Red Hat OpenShift APIs for Data Protection (OADP) in the Red Hat OpenShift Container Platform cluster. For more information, see Installing the backup and restore tools.
Important: Ensure that the OpenShift APIs for Data Protection (OADP) is configured to point to the same object storage (S3 bucket) that includes the backup that you plan to use.
-
Export the environment variables that you will need for the restore procedure.
If you are restoring to an online deployment, set the following:
export PATH=<path>
export OADP_NAMESPACE=<oadpNamespace>
export AIOPS_NAMESPACE=<aiops_namespace>
If you are restoring to an offline deployment, set the following:
export TARGET_REGISTRY_HOST=<target_registry_host>
export TARGET_REGISTRY_PORT=<port>
export TARGET_REGISTRY=$TARGET_REGISTRY_HOST:$TARGET_REGISTRY_PORT
export TARGET_REGISTRY_USER=<username>
export TARGET_REGISTRY_PASSWORD=<password>
export EMAIL=<email>
export PATH=<path>
export OADP_NAMESPACE=<oadpNamespace>
export AIOPS_NAMESPACE=<aiops_namespace>
Where:
- <target_registry_host> is the IP address or FQDN of the target registry that holds the backup and restore images, from Offline deployments only: Mirror the backup and restore images.
- <port> is the port number of the target registry.
- <username> is the username for the target registry.
- <password> is the password for the target registry.
- <email> is the email for the target registry.
- <path> is the path to where you downloaded and extracted the IBM Cloud Pak for AIOps backup and restore files.
- <oadpNamespace> is the OADP namespace.
- <aiops_namespace> is the namespace for the IBM Cloud Pak for AIOps deployment.
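Before you continue, you can sanity-check that the variables are set. The following helper is not part of the IBM tooling; it is a minimal sketch that fails fast when a listed variable is empty or unset:

```shell
# Hypothetical pre-flight helper (not part of the product tooling): verify that
# the environment variables used by later restore commands are set and non-empty.
require_vars() {
  local missing=0 v
  for v in "$@"; do
    # ${!v} is bash indirect expansion: the value of the variable named by $v.
    if [ -z "${!v:-}" ]; then
      echo "Required variable $v is not set" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage: require_vars OADP_NAMESPACE AIOPS_NAMESPACE || exit 1
```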
-
Run the following command to uninstall any helm charts that were previously used for the restore of another instance or version of IBM Cloud Pak for AIOps.
helm uninstall restore-job -n ${OADP_NAMESPACE}
2. Prepare the backup data for restoring
Verify your backed up data and prepare the data for restoring.
-
Check the backup status
Check the backup status to ensure that the backup that you want to restore in your cluster is complete. Run the following command to check the contents of the backup:
velero describe backup <backup-name> --details
The output should list the backed-up data for IBM Cloud Pak for AIOps (cp4aiops/*) and for IBM Cloud Pak foundational services. If you also backed up Infrastructure Automation data, this data (cp4aiops/*) should also be listed. The output should resemble the following sample output:
v1/PersistentVolumeClaim:
  - cp4aiops/back-aiops-topology-cassandra-0
  - cp4aiops/data-c-example-couchdbcluster-m-0
  - cp4aiops/export-aimanager-ibm-minio-0
  - cp4aiops/aiops-ibm-elasticsearch-es-server-snap
  - cp4aiops/postgres-backup-data
  - cp4aiops/metastore-backup-data
v1/Pod:
  - cp4aiops/backup-back-aiops-topology-cassandra-0
  - cp4aiops/backup-data-c-example-couchdbcluster-m-0
  - cp4aiops/backup-export-aimanager-ibm-minio-0
  - cp4aiops/backup-metastore
  - cp4aiops/es-backup
  - cp4aiops/backup-metastore
  - cp4aiops/dummy-db
v1/Secret:
  - cp4aiops/aimanager-ibm-minio-access-secret
  - cp4aiops/aiops-ir-core-model-secret
  - cp4aiops/icp-serviceid-apikey-secret
Velero-Native Snapshots: <none included>
Restic Backups:
  Completed:
    cp4aiops/backup-back-aiops-topology-cassandra-0: backup
    cp4aiops/backup-data-c-example-couchdbcluster-m-0: backup
    cp4aiops/backup-export-aimanager-ibm-minio-0: backup
    cp4aiops/backup-metastore: data
    cp4aiops/es-backup: elasticsearch-backups
    cp4aiops/backup-postgres: backup
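The completeness check can also be scripted. The following is a minimal sketch, not part of the product tooling; it assumes oc access to the OADP namespace and that the Velero Backup custom resource has the same name as your backup:

```shell
# Hypothetical helper (assumed names; adapt to your environment): read the
# Velero Backup CR's .status.phase and only report success when it is Completed.
check_backup_complete() {
  local backup="$1"
  local phase
  phase=$(oc get backup "$backup" -n "${OADP_NAMESPACE}" -o jsonpath='{.status.phase}' 2>/dev/null)
  if [ "$phase" = "Completed" ]; then
    echo "Backup $backup completed; safe to restore"
    return 0
  fi
  echo "Backup $backup is in phase '${phase:-unknown}'; do not restore yet" >&2
  return 1
}

# Usage: check_backup_complete <backup-name>
```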
-
Package and install the Helm Chart
-
Change to the restore directory where you need to package the Helm Chart:
cd ${PATH}/bcdr/4.9.0/restore
-
Update the following parameters in the values.yaml file. The file is located in the ./helm directory:
- backupName - The name of the backup that you are restoring.
- aiopsNamespace - The namespace where IBM Cloud Pak for AIOps is installed.
- csNamespace - The namespace where IBM Cloud Pak foundational services is installed. In IBM Cloud Pak for AIOps v4.9.0 this is the same as the namespace where IBM Cloud Pak for AIOps is installed.
- oadpNamespace - The namespace where OADP is installed.
-
Package the Helm Chart.
helm package ./helm
-
Install the Helm Chart for restoring data by running the following job:
helm install restore-job clusterrestore-0.1.0.tgz
-
3. Restore the cluster namespaces and install IBM Cloud Pak for AIOps
Since the restore job does not install IBM Cloud Pak for AIOps, you need to first install IBM Cloud Pak for AIOps before you can run the restore jobs for restoring database and component data. A restore job for restoring the cluster namespaces is available and must be run before you install IBM Cloud Pak for AIOps.
-
Restore the cluster namespaces
You need to restore the projects (namespaces) of the backed up cluster so that your new cluster includes the metadata with the SELinux settings that need to match the settings for the backup data that you plan to restore.
These steps only restore the namespaces and namespace metadata. The commands do not restore the contents of the namespaces.
Note: In IBM Cloud Pak for AIOps v4.9.0, the IBM Cloud Pak foundational services namespace is the same as the IBM Cloud Pak for AIOps namespace.
-
Change to the restore directory where the restore script is located:
cd ${PATH}/bcdr/4.9.0/restore
-
Optional. Delete any existing namespace restore jobs:
oc delete -f ns-restore-job.yaml
-
Create a job to restore the cluster namespaces:
oc create -f ns-restore-job.yaml
-
Check the restore job logs by running the following command:
oc logs -f <ns-restore-job-***>
-
Check the velero-restore status for the namespace by running the following command:
velero get restore <RESTORE_NAME>
Where <RESTORE_NAME> is the name of the namespace restore. You can see the restore name after the restore job is completed. For example, you might see the restore name aiops-namespace-restore-20221006054710 within the restore job log as follows:
Restore request "aiops-namespace-restore-20221006054710" submitted successfully.
Ensure that the projects (namespaces) are restored before you proceed.
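The wait for the restore to finish can be scripted as a small polling loop. This is a sketch, not part of the product tooling; it assumes the Velero Restore custom resource is visible through oc in the OADP namespace:

```shell
# Hypothetical polling helper (assumed resource names): wait until a Velero
# Restore CR reports a terminal phase, checking periodically up to a timeout.
wait_for_restore() {
  local name="$1" timeout="${2:-1800}" interval="${3:-30}" elapsed=0 phase
  while [ "$elapsed" -lt "$timeout" ]; do
    phase=$(oc get restore "$name" -n "${OADP_NAMESPACE}" -o jsonpath='{.status.phase}' 2>/dev/null)
    case "$phase" in
      Completed) echo "Restore $name completed"; return 0 ;;
      Failed|PartiallyFailed) echo "Restore $name ended in phase $phase" >&2; return 1 ;;
    esac
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  echo "Timed out waiting for restore $name" >&2
  return 1
}

# Usage: wait_for_restore aiops-namespace-restore-20221006054710
```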
-
-
Create a network policy.
For more information about creating a NetworkPolicy, see Creating a network policy.
-
Create a network policy file called policy-bcdr.yaml with the following contents:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: bcdr-np
  namespace: <aiopsNamespace>
spec:
  podSelector:
    matchLabels:
      ibm-es-server: aiops-ibm-elasticsearch-es-server
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: <oadpNamespace>
  policyTypes:
  - Ingress
Where:
- <aiopsNamespace> is the value of $AIOPS_NAMESPACE.
- <oadpNamespace> is the value of $OADP_NAMESPACE.
-
Apply the network policy to your cluster to allow the required ingress traffic.
oc apply -f policy-bcdr.yaml -n ${AIOPS_NAMESPACE}
-
-
Install IBM Cloud Pak for AIOps
For more information, see Installing IBM Cloud Pak for AIOps.
Important:
- Ensure that the version of IBM Cloud Pak for AIOps that you are installing is the same as the version that was installed in the backed up cluster.
- The backup includes keys and certificates from the backed up cluster. Ensure that your new cluster is configured to support the use of these keys and certificates so that the restored data can be accessed.
- Wait until the installation is complete and all pods in the IBM Cloud Pak for AIOps project (namespace) are running before you proceed.
-
Optional. If you also need to restore Infrastructure Automation data, you need to install the Infrastructure Automation operators and create the required custom resources.
- For more information about installing Infrastructure Automation, see Installing Infrastructure Automation.
- For more information about restoring Infrastructure Automation, see Restoring Infrastructure Automation.
4. Restore the IBM Cloud Pak for AIOps data
This procedure restores the data for all backed up IBM Cloud Pak for AIOps databases and components. Optionally, you can choose to restore only individual components. For details, see Restoring individual components.
-
Change to the restore directory where the restore script is located:
cd ${PATH}/bcdr/4.9.0/restore
-
Optional. Delete any existing IBM Cloud Pak for AIOps restore jobs:
oc delete -f aiops-restore-job.yaml
-
Create a job to restore IBM Cloud Pak for AIOps:
oc create -f aiops-restore-job.yaml
-
Check the restore job logs by running the following command:
oc logs -f <aiops-restore-job-***>
-
Check the velero-restore status by running the following command:
velero get restore <RESTORE_NAME>
Where <RESTORE_NAME> is the name of the restore for IBM Cloud Pak for AIOps. You can see this restore name after the restore job is completed. For example, you might see the restore name cassandra-restore-20221006054710 for the Cassandra restore in the restore job log as follows:
Restore request "cassandra-restore-20221006054710" submitted successfully.
Similarly, the OADP restore name for other IBM Cloud Pak for AIOps components can display in the restore job log.
Note: You might notice that the aiops-configmaps resource group shows as failed under the Restore sequence section of the IBM Cloud Pak for AIOps restore recipe, but the overall restore is successful and works as intended.
5. (Optional) Restore the Infrastructure Automation data
If you are also restoring Infrastructure Automation data, run the commands to restore the Infrastructure Automation data. For more information, see Restoring Infrastructure Automation.
6. Post-restore tasks
-
Update your integrations.
If you restored any integrations, the status for these integrations can be in error after the restore process completes. To resolve this status, edit and save your integrations with the Integrations feature in the IBM Cloud Pak for AIOps console. Editing an integration regenerates its associated Flink job, which updates the status. For more information about editing these integrations, see Defining integrations. Sometimes an integration can fail to gather data but not show an error status. To resolve this, edit and save the integration.
-
If you are restoring data for any integrations that are remotely deployed, complete the following steps to redeploy the integrations:
-
On the remote cluster, delete any existing integrations.
-
For each remotely deployed integration that you need to redeploy, copy or download a new bootstrap command for the deployment script from the cluster where the restore process ran.
You can click the Copy to clipboard button to copy the command, or click Download as a sh-file to download the command as a bootstrap.sh file.
Note: If you do not obtain the command now, you can copy or download the command later by completing the following steps:
-
Log in to IBM Cloud Pak for AIOps console.
-
Expand the navigation menu (four horizontal bars), then click Define > Integrations.
-
For each integration, click the integration type on the Manage integrations tab of the Integrations page.
-
On the page for the integration type, click the Download link in the Remote deployment script column for the integrations.
-
Either copy or download the bootstrap command for the deployment script.
-
-
Run the remote deployment scripts on the remote cluster to redeploy the integrations.
-
Get the OpenShift CLI oc login command from the remote cluster.
-
From a command line, log in to the remote cluster with the oc login command.
-
Switch to the target project (namespace).
-
Run the deployment script as a sh (script) file, such as the downloadable bootstrap.sh file.
-
-
-
If you are restoring secure tunnels data, install the secure tunnel connections for any restored secure tunnel data.
-
Enable authentication with the restored JWT certificate.
-
Retrieve the new initial_admin_password from the admin-user-details secret.
export PROJECT=<project>
PASS=$(oc get secret admin-user-details -n ${PROJECT} -o jsonpath='{.data.initial_admin_password}' | base64 -d)
-
Disable the admin user.
oc rsh -n ${PROJECT} $(oc get pod -l component=usermgmt -n ${PROJECT} | tail -1 | cut -f1 -d\ ) /usr/src/server-src/scripts/manage-user.sh --disable-user admin
-
Enable the admin user with the new password.
echo "$PASS" | oc rsh -n ${PROJECT} $(oc get pod -l component=usermgmt -n ${PROJECT} | tail -1 | cut -f1 -d\ ) /usr/src/server-src/scripts/manage-user.sh --enable-user admin
-
Delete the app-api-user-jwt secret. It is recreated automatically after a few seconds.
oc delete secret app-api-user-jwt -n ${PROJECT}
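If you want to script the wait for the secret to reappear, the following is a minimal sketch; the 120-second window and 5-second interval are assumptions, not product-documented timings:

```shell
# Hypothetical check: poll until the app-api-user-jwt secret exists again,
# for up to two minutes (an assumed, not documented, upper bound).
wait_for_jwt_secret() {
  local elapsed=0
  while [ "$elapsed" -lt 120 ]; do
    if oc get secret app-api-user-jwt -n "${PROJECT}" >/dev/null 2>&1; then
      echo "Secret app-api-user-jwt recreated"
      return 0
    fi
    sleep 5
    elapsed=$((elapsed + 5))
  done
  echo "Secret app-api-user-jwt was not recreated within 120s" >&2
  return 1
}

# Usage: wait_for_jwt_secret
```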
-
Restoring individual components
If needed, you can choose to restore data for specific individual databases and components instead of for all databases and components at once.
-
Change to the restore directory where the restore script is located:
cd ${PATH}/bcdr/4.9.0/restore
-
Copy the aiops-restore-job.yaml file. Rename your new file <component>-restore-job.yaml, where <component> is the name of the component that you are restoring. For example, if you are restoring Cassandra, rename the file to cassandra-restore-job.yaml.
-
Open the new <component>-restore-job.yaml file for editing. Update the name and command sections to match the values for the component that you are restoring:
- Update the name of the restore job in the metadata section to be the individual component job, such as cassandra-restore-job.
- Update the command section command: ["/bin/bash", "restore.sh","-aiops"] to replace -aiops with the respective argument for the component that you are restoring. For the list of component arguments, see the table that follows this procedure.
-
Create a job to restore the individual component.
oc create -f <component>-restore-job.yaml
-
Check the restore job logs by running the following command:
oc logs -f <cp4waiops-component-restore-job-***>
-
Check the velero-restore status by running the following command:
velero get restore <RESTORE_NAME>
Where <RESTORE_NAME> is one of the restore names for the component. You can view the names for the component in the restore job log when the restore job is completed. For example, the restore name for a Cassandra restore can be cassandra-restore-20221006054710, which can display in an entry similar to the following example log entry:
Restore request "cassandra-restore-20221006054710" submitted successfully.
Component or database arguments for restore job command configuration

| Component or Database | Argument |
|---|---|
| Cassandra | -cassandra |
| CouchDB | -couchdb |
| Elasticsearch | -es |
| Metastore | -metastore |
| Minio | -minio |
| Postgres | -postgres |
| IBM Cloud Pak foundational services | -cs |
| Integration CR | -connectioncr |
| Secure Tunnel CR | -tunnelcr |
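The copy-and-edit steps for an individual component can be sketched as a small script. This is a hypothetical convenience, not part of the product tooling; it assumes the strings aiops-restore-job and -aiops appear literally in aiops-restore-job.yaml, so review the generated file before creating the job:

```shell
# Hypothetical sketch: derive an individual component restore job file from
# aiops-restore-job.yaml. Review the output before use, because the plain
# text substitutions below could also match unrelated occurrences.
gen_component_job() {
  local component="$1"   # an argument from the table, without the leading dash
  sed -e "s/aiops-restore-job/${component}-restore-job/g" \
      -e "s/-aiops/-${component}/g" \
      aiops-restore-job.yaml > "${component}-restore-job.yaml"
  echo "Wrote ${component}-restore-job.yaml"
}

# Usage: gen_component_job cassandra && oc create -f cassandra-restore-job.yaml
```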
Troubleshooting
- Stale alert data exists after restoration
- Restore process for ElasticSearch failed
- Restore fails with CouchDB pod in CrashLoopBackOff
- LDAP user login is not working after a restore
- Data is not being processed after restoration
- Applications still show non-existent active incidents and alerts notations
- Topology observer jobs do not run after restore
- Restore process terminated mid-process with partial data available
- OADP restore is stuck in an In progress state
- Helm install restore job command failed
- Cassandra restore fails with org.apache.cassandra.io.FSReadError exception
- Integrations missing after a restore
- Netcool integration is missing details after restore
- Restore fails with lifecycletrigger: Not Ready
- Cannot log in to the Cloud Pak for AIOps console
- Policy configuration is not retained on the restored deployment
- Troubleshooting the Infrastructure Automation restore
- Rebroadcast data to ElasticSearch fails for Cassandra restore
Stale alert data exists for topology resources after restoration
After you complete a restore, you might notice that stale alert data exists for topology resources. For instance, if you search for a resource that had alert data, you might see outdated alert data. To clear the historical alert data, you can run the statusClear crawler.
You can run the crawler to clear stale events by using the topology service Swagger pages:
- Open the Swagger API for the topology service. For more information, see Accessing Topology service Swagger UI.
- Go to the Crawlers section.
- Open the statusClear crawler.
- Run the crawler with the default message body.
This crawler runs asynchronously over the topology data. The crawler POST response returns an EntityId header, which you can use to check the progress of the crawler. To check the progress, use the returned EntityId in a GET /mgmt_artifacts/{id} call. The response shows the status of the asynchronous crawl.
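The progress check can be issued from a command line instead of the Swagger UI. The following is a hypothetical sketch, intended to be run inside the topology service container, where the ASM_USER and ASM_PASS variables are available; the localhost URL and tenant id follow the pattern used in the troubleshooting examples later on this page:

```shell
# Hypothetical sketch (assumed URL and tenant id; run inside the topology
# container): fetch the management artifact for the EntityId returned by the
# crawler POST response, to read the status of the asynchronous crawl.
check_crawler_progress() {
  local entity_id="$1"
  curl -k -s -u "${ASM_USER}:${ASM_PASS}" \
    -H "accept: application/json" \
    -H "X-TenantID: cfd95b7e-3bc7-4006-a4a8-a73a79c71255" \
    "https://localhost:8080/1.0/topology/mgmt_artifacts/${entity_id}"
}

# Usage: check_crawler_progress <EntityId-from-the-POST-response>
```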
For more information about the topology service Swagger, see Application and topology APIs.
Restore process for ElasticSearch failed
If the restore of a particular data store, such as ElasticSearch, or custom resource fails, complete the following steps before you can attempt the restore again.
-
Optional. If this cluster was previously configured to enable a backup of ElasticSearch, or you need to rerun restore, complete the following steps:
-
Remove the backup path and Snapshot location configurations from the AutomationBase CR.
-
Delete the ElasticSearch backup snapshot PVC by running the following commands. This deletion is needed because this PVC is replaced with the ElasticSearch backup data by the es-restore1 restore.
elasticsearch_cluster=$(kubectl get elasticsearchclusters.elasticsearch.opencontent.ibm.com -n ${AIOPS_NAMESPACE} -o jsonpath='{.items[0].metadata.name}')
esbackupPVC="$elasticsearch_cluster-ibm-elasticsearch-es-server-snap"
oc delete pvc -n ${AIOPS_NAMESPACE} $esbackupPVC
-
Run the following command to delete the snapshot restore es-restore1 if it exists:
oc delete restore es-restore1 -n $OADP_NAMESPACE
-
-
Repeat the process to restore the data store or custom resource to reattempt the failed restore.
Restore fails with CouchDB pod in CrashLoopBackOff
This problem can occur if you restore an IBM Cloud Pak for AIOps backup to a different cluster or namespace and the database files in the persistent storage are not readable or writeable by the root user group.
When this issue occurs, the pod c-example-couchdbcluster-m has a status of CrashLoopBackOff, as in the following example.
oc get pods | grep couchdb
c-example-couchdbcluster-m-0 1/2 CrashLoopBackOff 533 (4m13s ago) 2d5h
c-example-couchdbcluster-m-0-debug 1/2 Running 0 2m39s
The logs from the database container in the failing pods contain the message Could not open file /data/db/_nodes.couch: permission denied, as in the following example:
oc logs c-example-couchdbcluster-m-0 -c db --tail=100
[error] 2022-11-10T10:39:34.788855Z couchdb@c-example-couchdbcluster-m-0.c-example-couchdbcluster-m <0.322.0> -------- Could not open file /data/db/_nodes.couch: permission denied
To resolve this issue, grant the root group write permissions to the folders in the persistent storage that contain the database files.
-
Get the name of the persistent volume that contains the CouchDB database files.
oc get pv | grep couch
Example output:
oc get pv | grep couch
pvc-d6599cfb-aa38-47be-857e-a2efa9ee0407   5Gi   RWO   Delete   Bound   katamari/data-c-example-couchdbcluster-m-0   rook-cephfs   2d5h
-
Find the worker node for the failing CouchDB pod.
oc get pod -o wide c-example-couchdbcluster-m-0
Example output:
oc get pod -o wide c-example-couchdbcluster-m-0
NAME                           READY   STATUS             RESTARTS        AGE    IP             NODE                                    NOMINATED NODE   READINESS GATES
c-example-couchdbcluster-m-0   1/2     CrashLoopBackOff   537 (76s ago)   2d5h   10.254.15.54   worker1.bcdr-test-12345678.mysite.com   <none>           <none>
-
Debug the worker node that you identified in the previous step.
oc debug node/<worker_node>
Where <worker_node> is the value returned for NODE in the previous step.
Example output:
oc debug node/worker1.bcdr-test-12345678.mysite.com
Starting pod/worker1bcdr-rtp-26101516cpmyclustercom-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.22.1.1
If you don't see a command prompt, try pressing enter.
-
Find the mount point of the persistent volume.
mount | grep <pv>
Where <pv> is the persistent volume name returned in step 1.
Example output:
sh-4.4# mount | grep pvc-d6599cfb-aa38-47be-857e-a2efa9ee0407
172.30.2.127:6789,172.30.43.40:6789,172.30.11.17:6789:/volumes/csi/csi-vol-0bb3b8e3-5f26-11ed-97a6-0a580afe2807/af21d8e8-f29e-4111-8fcd-c4366d604c99 on /host/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-d6599cfb-aa38-47be-857e-a2efa9ee0407/globalmount type ceph (rw,relatime,seclabel,name=csi-cephfs-node,secret=<hidden>,acl,mds_namespace=rook-cephfilesystem)
172.30.2.127:6789,172.30.43.40:6789,172.30.11.17:6789:/volumes/csi/csi-vol-0bb3b8e3-5f26-11ed-97a6-0a580afe2807/af21d8e8-f29e-4111-8fcd-c4366d604c99 on /host/var/lib/kubelet/pods/26fc2b39-aaf0-4add-8649-6cfd56fc4a3b/volumes/kubernetes.io~csi/pvc-d6599cfb-aa38-47be-857e-a2efa9ee0407/mount type ceph (rw,relatime,seclabel,name=csi-cephfs-node,secret=<hidden>,acl,mds_namespace=rook-cephfilesystem)
The mount point is the directory ending with mount. In the above example output this is /host/var/lib/kubelet/pods/26fc2b39-aaf0-4add-8649-6cfd56fc4a3b/volumes/kubernetes.io~csi/pvc-d6599cfb-aa38-47be-857e-a2efa9ee0407/mount
-
Change the permissions on the mount directory by changing to the directory and running chmod.
cd <mount_dir>
chmod -R g+w .
ls -lah db/
exit
Where <mount_dir> is the directory that you identified in the previous step.
Example output:
sh-4.4# cd /host/var/lib/kubelet/pods/26fc2b39-aaf0-4add-8649-6cfd56fc4a3b/volumes/kubernetes.io~csi/pvc-d6599cfb-aa38-47be-857e-a2efa9ee0407/mount
sh-4.4# ls
db  view_index
sh-4.4# ls -lah db/
total 37K
drwxr-xr-x. 4 1000650000 root    4 Nov  7 04:55 .
drwxrwxrwx. 5 root       root    3 Nov 10 10:52 ..
drwxr-xr-x. 2 1000650000 root    0 Nov  7 04:54 .delete
-rw-r--r--. 1 1000650000 root  29K Nov  7 05:31 _dbs.couch
-rw-r--r--. 1 1000650000 root 8.3K Nov  7 04:54 _nodes.couch
drwxr-xr-x. 4 1000650000 root    2 Nov  7 04:55 shards
sh-4.4# chmod -R g+w .
sh-4.4# ls -lah db/
total 37K
drwxrwxr-x. 4 1000650000 root    4 Nov  7 04:55 .
drwxrwxrwx. 5 root       root    3 Nov 10 10:52 ..
drwxrwxr-x. 2 1000650000 root    0 Nov  7 04:54 .delete
-rw-rw-r--. 1 1000650000 root  29K Nov  7 05:31 _dbs.couch
-rw-rw-r--. 1 1000650000 root 8.3K Nov  7 04:54 _nodes.couch
drwxrwxr-x. 4 1000650000 root    2 Nov  7 04:55 shards
sh-4.4# exit
-
Run the following command to verify that the CouchDB pod has a status of Running.
oc get pods | grep couchdb
Example output:
c-example-couchdbcluster-m-0 2/2 Running 539 (87s ago) 2d5h
LDAP user login is not working after a restore
Follow the steps to solve the problem:
- Log in to the console as the default admin user.
- From the main navigation menu, click Administer > Identity and access.
- Select the LDAP connection, and click Edit connection. Edit the LDAP connection with the correct information.
- Click Test connection.
- Click Save once the connection is successful.
- Log in to the console with the LDAP user's credentials.
Data is not being processed after restoration
After you complete a restore, you might notice that data for some AI modeling features, such as log anomaly detection, change risk, and similar tickets, is not processed.
This can occur when the expected Kafka topics for the AI modeling are not present after the restore completes. To resolve this problem, restart the aimanager-aio-controller pod. This restart can result in the creation of the expected Kafka topics.
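The restart can be done by deleting the pod so that its controller recreates it. The following is a minimal sketch; the pod name match is an assumption, so verify the pod name with oc get pods first:

```shell
# Hypothetical sketch (pod name pattern is an assumption; verify with
# `oc get pods`): delete the aimanager-aio-controller pod so that it is
# recreated, which can trigger creation of the expected Kafka topics.
restart_aio_controller() {
  local pod
  pod=$(oc get pods -n "${AIOPS_NAMESPACE}" --no-headers 2>/dev/null | awk '/aimanager-aio-controller/ {print $1; exit}')
  if [ -z "$pod" ]; then
    echo "No aimanager-aio-controller pod found in ${AIOPS_NAMESPACE}" >&2
    return 1
  fi
  oc delete pod "$pod" -n "${AIOPS_NAMESPACE}"
}

# Usage: restart_aio_controller
```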
Applications still show non-existent active incidents and alerts notations
After you complete a restore, you might notice that active incidents and alerts notations display for applications when the incidents and notations no longer exist. When the topology data store is backed up and restored, it contains restored data that is out of sync with the rest of the system because of the way the data is modeled against groups, applications, and resources within the topology.
If your data is outdated, you need to manually delete the topology association between the incidents and applications. Use the topology API to clear out any groups or events. Once removed, the changes are incorporated into the topology view.
To remove groups (incident groups of entity waiopsStory), use the API:
GET /topology/groups?_type=waiopsStory
DELETE /topology/groups?_type=waiopsStory
To remove events, remove the incorrect status from the topology with the following API:
GET /topology/status
DELETE /topology/status
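The GET and DELETE calls above can be issued with curl. The following is a hypothetical sketch, intended to be run inside the topology service container, where ASM_USER and ASM_PASS are available; the localhost URL and tenant id follow the pattern used in the troubleshooting examples later on this page. Review the GET output before issuing the DELETE:

```shell
# Hypothetical sketch (assumed URL and tenant id; run inside the topology
# container): list, then delete, the waiopsStory incident groups.
list_waiops_story_groups() {
  curl -k -s -u "${ASM_USER}:${ASM_PASS}" \
    -H "X-TenantID: cfd95b7e-3bc7-4006-a4a8-a73a79c71255" \
    "https://localhost:8080/1.0/topology/groups?_type=waiopsStory"
}

delete_waiops_story_groups() {
  curl -k -s -u "${ASM_USER}:${ASM_PASS}" \
    -H "X-TenantID: cfd95b7e-3bc7-4006-a4a8-a73a79c71255" \
    -X DELETE "https://localhost:8080/1.0/topology/groups?_type=waiopsStory"
}

# Usage: list_waiops_story_groups   # inspect first, then delete_waiops_story_groups
```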
Alternatively, you can use the Resource management tool in the UI to remove non-existent alert indications from a topology. When you are viewing a topology in the UI, open the Settings menu and select Topology configuration. From the Data administration routines page, run the Status clear routine to remove the alert indications.
Topology observer jobs do not run after restore
After running the restore procedure, some observer jobs are offline, have a Status of Error, or have a Status of Running even though job runs are being skipped.
For jobs with a job type of Repeating Schedule, this problem can occur when a restore ran from a backup that was taken while the observer job was waiting for data processing to finish. After the restore completes, the observer then starts with the job still in a FINISHING state. The observer waits for its job to finish, but the job cannot finish because the referenced data was not backed up, and so the job becomes stuck. Messages similar to the following might be seen in the observer log:
execution skipped because current job had state: FINISHING
For other jobs that do not have a job type of Repeating Schedule, the observer log shows that it is looking for a vertex that no longer exists:
WARN [2022-09-13 16:06:24,265] [pool-5-thread-1] c.i.i.t.o.t.ObserverVertex - Failed to poll observer vertex cp4waiops-cartridge.github-observer. Response : InboundJaxrsResponse{context=ClientResponse{method=POST, uri=https://aiops-topology-topology:8080/1.0/topology/mgmt_artifacts/MKSrcRpkRC6WmWVGGdSukA, status=404, reason=Not Found}}
To resolve these problems, complete the following steps:
-
Locate jobs that are stuck in the FINISHING state.
For jobs with a Job type of Repeating schedule, check if the job is stuck by waiting for the normal job duration or by checking the job history. If the job has been stuck for a long time, subsequent runs will be skipped. Use the following steps to navigate to the Observer configuration user interface (UI), view the observer job history, and look for job runs that have been continuously skipped since the restore completed.
- Log in to IBM Cloud Pak for AIOps console.
- Expand the navigation menu (four horizontal bars), then click Define > Integrations.
- On the Integrations page, click Add integration.
- On the Add integrations page, click Topology in the Category list that is next to the list of all integrations.
- Click the configure, schedule, and manage other observer jobs link in the description for the topology integrations.
- Click the three dots icon at the end of the line for the observer job, and select View history.
- Click All, and look for runs with Skipped status and a Details entry of execution skipped because current job had state: FINISHING.
For more information on viewing observer jobs, see the section To access the Observer Configuration UI in Defining observer jobs for application and topology data.
Attempt to restart the observer job.
-
For each observer job with a Job type of Repeating schedule that you identified in FINISHING state, edit the job in the Observer configuration UI by nominally changing the Optional description field, and saving it.
-
For other observer jobs that do not have a Job type of Repeating schedule and have a missing vertex issue, delete the related observer pod.
oc delete pod <pod_name> -n <namespace>
Where:
- <pod_name> is the observer pod to be deleted.
- <namespace> is the project (namespace) where IBM Cloud Pak for AIOps is installed.
For example, for an Instana observer job that has a missing vertex issue, delete the Instana observer pod:
oc get pods -n <cp4aiops-namespace> | grep instana
aiopsedge-instana-topology-integrator-6b777bd6b4-6nvqx   1/1   Running   0   8d
oc delete pod aiopsedge-instana-topology-integrator-6b777bd6b4-6nvqx -n <cp4aiops-namespace>
This causes the job to stop and restart, and to be shown as Running, Scheduled, or Ready.
A message similar to the following is seen in the observer log:
WARN [2022-08-22 16:19:34,706] [pool-10-thread-1] c.i.i.t.o.a.JobManager - stop - Stopped job for tenantId:cfd95b7e-3bc7-4006-a4a8-a73a79c71255, uniqueId:1100
INFO [2022-08-22 16:19:34,712] [cfd95b7e-3bc7-4006-a4a8-a73a79c71255:1100] c.i.i.t.o.f.j.LoadFileJob - cfd95b7e-3bc7-4006-a4a8-a73a79c71255:5eb27ee0-772c-4d65-bc0f-4e56b6b7913b Interrupted while waiting for state JobState [state=FINISHED, reason=Job finished after restart, date=Mon Aug 22 16:09:45 GMT 2022]
INFO [2022-08-22 16:19:34,944] [cfd95b7e-3bc7-4006-a4a8-a73a79c71255:1100] c.i.i.t.o.f.j.LoadFileJob - cfd95b7e-3bc7-4006-a4a8-a73a79c71255:1100 Observer job finished with state STOPPED
    Some jobs might still be shown with an `Error` status if an attempt was made to run the stuck job before the previous remediating steps were attempted. This causes the observer to fail the job, as a job already exists with the same name. In this scenario, a warning similar to the following is seen in the observer log:

    ```
    WARN [2022-08-23 15:58:50,431] [pool-10-thread-1] c.i.i.t.o.a.o.RESTApiException - pool-10-thread-1 - APIMessage: { httpCode=422, _error={ message=Job Creation Failure, level=warning, description=Cannot create observation job, causes=[{ message=Duplicate unique id, level=error, description=A job with the same unique id '1100' is already being processed., field=unique_id }] } }
    ```

    If there are no longer any observer jobs with a Status of `Error`, then exit this troubleshooting.
- Complete the following steps only if one or more of your observer jobs are still shown with a Status of `Error`.
  - Find the name of the topology pod:

    ```
    oc get pods | grep topology-topology
    ```

  - Remote shell into the topology container:

    ```
    oc rsh <topology_pod> -c <release_name>-topology-topology
    ```

    Where:
    - `<topology_pod>` is the pod that was returned in the previous step.
    - `<release_name>` is the name of your IBM Cloud Pak for AIOps instance.
- Find observers that still have a job that is stuck in the `FINISHING` state. Run the following command, and note down the returned `observerName` and `_id` values:

  ```
  curl -X GET "https://localhost:8080/1.0/topology/mgmt_artifacts?_filter=hasState%3DFINISHING&_field=keyIndexName&_field=hasState&_field=observerName&_type=ASM_OBSERVER_JOB" -H "accept: application/json" -H "X-TenantID: cfd95b7e-3bc7-4006-a4a8-a73a79c71255" --insecure -u ${ASM_USER}:${ASM_PASS} | jq
  ```

  Note: `ASM_USER` and `ASM_PASS` are environment variables available within the topology container.

  Example output:

  ```
  {
    "_executionTime": 13,
    "_offset": 0,
    "_limit": 50,
    "_items": [
      {
        "keyIndexName": "dns-observer:bbc",
        "_id": "z7ovGeNrSKSpMlWY8n7Mfw",
        "observerName": "dns-observer",
        "hasState": "FINISHING"
      }
    ]
  }
  ```
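If several jobs are returned, the `observerName` and `_id` pairs can be pulled out of the response with a short `jq` filter. A minimal sketch, using the example payload above (the variable name `response` is illustrative):

```shell
# Sample response body, as returned by the mgmt_artifacts query above
response='{"_items":[{"keyIndexName":"dns-observer:bbc","_id":"z7ovGeNrSKSpMlWY8n7Mfw","observerName":"dns-observer","hasState":"FINISHING"}]}'

# Print one "<observerName> <_id>" pair per stuck job
echo "$response" | jq -r '._items[] | "\(.observerName) \(._id)"'
# Prints: dns-observer z7ovGeNrSKSpMlWY8n7Mfw
```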
- Run the following steps for each observer job returned by the previous step.
  - Change the state of the job by running the following command:

    ```
    curl -X POST "https://localhost:8080/1.0/topology/mgmt_artifacts/<id>" -H "accept: application/json" -H "X-TenantID: cfd95b7e-3bc7-4006-a4a8-a73a79c71255" -H "Content-Type: application/json" -d "{ \"hasState\": \"FINISHED\"}" --insecure -u ${ASM_USER}:${ASM_PASS}
    ```

    Where `<id>` is the `_id` of the stuck job that you identified in step 4.

  - Find the pods for the observer that you need to restart:

    ```
    oc get pods -n <namespace> | grep <stuck_observer_name>
    ```

    Where:
    - `<stuck_observer_name>` is the name of the stuck observer that you identified in step 4.
    - `<namespace>` is the project (namespace) where IBM Cloud Pak for AIOps is installed.

  - Delete the pods for the observer:

    ```
    oc delete po <observername_pods> -n <namespace>
    ```

    Where:
    - `<observername_pods>` are the pods returned in the previous step.
    - `<namespace>` is the project (namespace) where IBM Cloud Pak for AIOps is installed.

  - Rerun the observer job. On the observer configuration UI, click Run on any job for the observer that is in the `Error` state. The observer job now runs successfully, and has a status of `Scheduled` or `Ready`.
Restore process terminated mid-process with partial data available
If the restore process does not complete as expected for a data store, such as if it is aborted or terminated during its run, data might not be restored correctly. This incomplete restored data needs to be removed before you run the restore process again. To remove the data, run a post-restore cleanup script to clean up the data store.
To run a script, complete the following steps:
- Define the following environment variable on your workstation:

  ```
  export WORKDIR="<Path>/bcdr/4.9.0/"
  ```

  Where `<Path>` is the path to where you downloaded and extracted the IBM Cloud Pak for AIOps backup and restore files.
- Change to the restore directory where the post-restore cleanup script is located:

  ```
  cd <Path>/bcdr/4.9.0/restore/<data_store>/
  ```

  Where:
  - `<Path>` is the path to where you downloaded and extracted the IBM Cloud Pak® for AIOps backup and restore files.
  - `<data_store>` is the directory for the data store or custom resource that needs to be cleaned up. For example, `couchdb` is the directory for CouchDB.
- Run the post-restore cleanup script:

  ```
  nohup ./<data_store>-post-restore.sh > <data_store>-post-restore.log &
  ```

  Where `<data_store>` is the data store or custom resource that needs to be cleaned up.

- Run the restore process for that data store or resource again.
For example, if the restore job for restoring the Cassandra data store aborted, you need to run the `cassandra-post-restore.sh` post-restore script that is stored in the `bcdr/restore/cassandra` directory to clean the data. Then, run the `cassandra-native-post-restore.sh` post-restore script.
The following table lists the cleanup script to run for each data store:

| Component or Database | Cleanup script |
|---|---|
| Cassandra | Run first: `bcdr/4.9.0/restore/cassandra/cassandra-post-restore.sh`<br>Then, run: `bcdr/4.9.0/restore/cassandra/cassandra-native-post-restore.sh` |
| CouchDB | `bcdr/4.9.0/restore/couchdb/couchdb-post-restore.sh` |
| Elasticsearch | `bcdr/4.9.0/restore/elasticsearch/es-post-restore.sh` |
| Metastore | `bcdr/4.9.0/restore/metastore/metastore-post-restore.sh` |
| Minio | `bcdr/4.9.0/restore/minio/minio-post-restore.sh` |
| Postgres | `bcdr/4.9.0/restore/postgres/postgres-post-restore.sh` |
| IBM Cloud Pak foundational services | `bcdr/4.9.0/restore/common-services/cs-post-restore.sh` |
| Connection CR | N/A |
| Secure Tunnel CR | `bcdr/4.9.0/restore/other-resources/tunnel-cr-post-restore.sh` |
OADP restore is stuck in an In progress state

If you notice that the restore is stuck in an `In progress` state, complete the following steps to stop the restore process:

- Delete the Velero pod by running the following command:

  ```
  oc delete pod <velero pod name> -n <OADP installed namespace>
  ```

- Delete the restore that is stuck in progress by running the following command:

  ```
  velero delete restore <restore name>
  ```

Wait for the completion of the restore job.
Helm install restore job command failed

When you are running the `helm install restore-job clusterrestore-0.1.0.tgz` command, you might encounter the command failing with an error that is similar to the following error:

```
Error: admission webhook "trust.hooks.securityenforcement.admission.cloud.ibm.com" denied the request:
Deny "icr.io/cpopen/cp4waiops/cp4aiops-bcdr@sha256:294a42a851a2717ebbc68528ab3c6bcb1ba48114ff058f1c1b537dc6aa167355", no matching repositories in ClusterImagePolicy and no ImagePolicies in the "velero" namespace
```
If you encounter this error, complete the following steps to resolve the issue:
- Uninstall the restore-job job by running the following command:

  ```
  helm uninstall restore-job -n $OADP_NAMESPACE
  ```

- Export an environment variable for the image.

  For an online deployment:

  ```
  export REGISTRY=icr.io/cpopen/cp4waiops
  export BCDR_IMAGE=${REGISTRY}/<bcdr_image>
  ```

  For an offline deployment:

  ```
  export REGISTRY=$TARGET_REGISTRY
  export BCDR_IMAGE=${REGISTRY}/<bcdr_image>
  ```

  Where `<bcdr_image>` is the name of the backup and restore image, as given in the backup helm chart `values.yaml` file, in the form `cp4waiops-bcdr@{digest}`. An example value for BCDR_IMAGE is `icr.io/cpopen/cp4waiops/cp4waiops-bcdr@sha256:294a42a851a2717ebbc68528ab3c6bcb1ba48114ff058f1c1b537dc6aa167355`.
Create a
restore-image-policy.yaml
file and add the following content within the file:apiVersion: securityenforcement.admission.cloud.ibm.com/v1beta1 kind: ClusterImagePolicy metadata: name: restore-image-policy spec: repositories: - name: ${BCDR_IMAGE} policy:
- Apply the policy by running the following command:

  ```
  oc apply -f restore-image-policy.yaml
  ```

- Deploy the restore job by running the following command:

  ```
  helm install restore-job clusterrestore-0.1.0.tgz
  ```
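If you prefer to script the policy creation, the image reference can be substituted into a template before applying it. The following is a sketch under the assumption that the `restore-image-policy.yaml.tpl` file name is arbitrary and `BCDR_IMAGE` was exported as in the earlier step:

```shell
# Example digest; use the BCDR_IMAGE value that you exported earlier
BCDR_IMAGE="icr.io/cpopen/cp4waiops/cp4waiops-bcdr@sha256:294a42a851a2717ebbc68528ab3c6bcb1ba48114ff058f1c1b537dc6aa167355"

# Write the policy template with a literal ${BCDR_IMAGE} placeholder
cat > restore-image-policy.yaml.tpl <<'EOF'
apiVersion: securityenforcement.admission.cloud.ibm.com/v1beta1
kind: ClusterImagePolicy
metadata:
  name: restore-image-policy
spec:
  repositories:
    - name: ${BCDR_IMAGE}
      policy:
EOF

# Substitute the real image reference into the final file
sed "s|\${BCDR_IMAGE}|${BCDR_IMAGE}|" restore-image-policy.yaml.tpl > restore-image-policy.yaml
grep "name:" restore-image-policy.yaml
```

The file can then be applied with `oc apply -f restore-image-policy.yaml` as in the step above.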
Cassandra restore fails with org.apache.cassandra.io.FSReadError exception

You might encounter an issue where the Cassandra restore fails and policies are not restored. Although there were a couple of key repair failures, the log indicates that the Cassandra restore completed; however, the policies from the backup cluster were not restored. If this issue occurs, the `aiops-topology-cassandra-2` pod is in `0/1` status and the pod log contains an error message that is similar to the following error message:

```
ERROR [CompactionExecutor:8] 2022-11-29 21:39:17,324 CassandraDaemon.java:244 - Exception in thread Thread[CompactionExecutor:8,1,main]
org.apache.cassandra.io.FSReadError: java.io.IOException: Channel not open for writing - cannot extend file to required size
```
If this error occurs, complete the following steps for the Cassandra statefulset:

- Scale down the statefulset aiops-topology-cassandra to 0.

  Important: Make a note of the current scaling of the Cassandra statefulset before you scale down.

  ```
  oc scale statefulsets aiops-topology-cassandra --replicas 0 -n <namespace>
  ```

  Where `<namespace>` is the namespace where IBM Cloud Pak for AIOps is installed.

- Increase the statefulset memory limit to 32 GB.
  - Access the resource for editing:

    ```
    oc edit statefulset aiops-topology-cassandra -n <namespace>
    ```

  - Increase the memory limit.
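When the editor opens, the value to change is the container memory limit inside the pod template. A sketch of the relevant fragment follows; the container name and surrounding structure are assumptions based on a typical statefulset layout, and all other fields are omitted:

```yaml
spec:
  template:
    spec:
      containers:
        - name: aiops-topology-cassandra   # assumed container name
          resources:
            limits:
              memory: 32Gi                 # raised from the current value
```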
- Scale the statefulset back to the initial number of replicas.

  ```
  oc scale statefulsets aiops-topology-cassandra --replicas=<number_of_replicas> -n <namespace>
  ```

  Where:
  - `<namespace>` is the namespace where IBM Cloud Pak for AIOps is installed.
  - `<number_of_replicas>` is the number of replicas that the statefulset is to be scaled up to.

- Run the restore again.
- Scale down the statefulset aiops-topology-cassandra to 0 again.

  ```
  oc scale statefulsets aiops-topology-cassandra --replicas 0 -n <namespace>
  ```

- Change the statefulset memory limit back to 16 GB.
  - Access the resource for editing:

    ```
    oc edit statefulset aiops-topology-cassandra -n <namespace>
    ```

  - Decrease the memory limit to 16 GB.

- Scale the statefulset back to the initial number of replicas.

  ```
  oc scale statefulsets aiops-topology-cassandra --replicas=<number_of_replicas> -n <namespace>
  ```

  Where:
  - `<namespace>` is the namespace where IBM Cloud Pak for AIOps is installed.
  - `<number_of_replicas>` is the number of replicas that the statefulset is to be scaled up to.
Integrations missing after a restore

After you complete a full restore, if you do not see all of the data integrations that you expect, run the command to restore data for the Metastore component to restore the integrations. For instructions, see Restoring individual components.
Netcool integration is missing details after restore
If you had a Netcool integration defined, then the Netcool integration might be missing some details after the restore, such as the ObjectServer details. If you try to edit these details, the connection test still fails. To resolve this problem, delete the Netcool integration, and then create a new Netcool integration.
Restore fails with lifecycletrigger: Not Ready

If `lifecycletrigger` is showing Not Ready after the restore, then run the following command:

```
oc patch lifecycletrigger aiops --type=json --patch="$(oc get lifecycletrigger aiops -o jsonpath='{"[{"}"op":"add","path":"/spec/cancelJobs","value":[{range .status.jobs[*]}"{.jid}",{end}]}]' | sed 's/,]/]/g')"
```
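The jsonpath template in this command emits the job IDs followed by a trailing comma, so the `sed 's/,]/]/g'` step is what turns the output into valid JSON before it is passed to `oc patch`. A small illustration with hypothetical job IDs:

```shell
# Raw patch as produced by the jsonpath template for two stuck jobs
raw='[{"op":"add","path":"/spec/cancelJobs","value":["101","102",]}]'

# Strip the trailing comma before the closing bracket
echo "$raw" | sed 's/,]/]/g'
# Prints: [{"op":"add","path":"/spec/cancelJobs","value":["101","102"]}]
```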
Cannot log in to the Cloud Pak for AIOps console on the restore cluster
After the restore, the IBM Cloud Pak for AIOps console is inaccessible and there is an error similar to the following:

```
CWOAU0061E: The OAuth service provider could not find the client because the client name is not valid
```
Solution: Run the following steps on the restore cluster:

- Export an environment variable containing your project's name:

  ```
  export PROJECT=<project>
  ```

  Where `<project>` is the namespace (project) that IBM Cloud Pak for AIOps is deployed in on the restore cluster.

- Delete the platform pods:

  ```
  oc delete pod -n ${PROJECT} -l component=platform-auth-service
  oc delete pod -n ${PROJECT} -l component=platform-identity-management
  oc delete pod -n ${PROJECT} -l component=platform-identity-provider
  ```

- Update and delete the iam-config-job YAML:

  ```
  oc get -n ${PROJECT} job iam-config-job -o json > /tmp/iam-config-job.json
  oc -n ${PROJECT} delete job iam-config-job
  jq 'del(.metadata.creationTimestamp) | del(.metadata.managedFields) | del(.metadata.resourceVersion) | del(.metadata.uid) | del(.spec.selector) | del(.spec.template.metadata.labels) | del(.status)' /tmp/iam-config-job.json > /tmp/updated-iam-config-job.json
  ```

- Wait for a few seconds and then apply the modified YAML to rerun iam-config-job:

  ```
  oc -n ${PROJECT} apply -f /tmp/updated-iam-config-job.json
  oc -n ${PROJECT} get pods | grep iam-config-job
  ```
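The `jq` filter in the step above strips the server-generated fields from the saved job so that it can be re-created cleanly. A minimal sketch of its effect on a trimmed-down job object (all field values are illustrative):

```shell
# Trimmed-down job JSON containing the fields that the filter removes
job='{"metadata":{"name":"iam-config-job","uid":"abc","resourceVersion":"42","creationTimestamp":"2024-01-01T00:00:00Z","managedFields":[]},"spec":{"selector":{"matchLabels":{"job":"x"}},"template":{"metadata":{"labels":{"job":"x"}}}},"status":{"succeeded":1}}'

echo "$job" | jq -c 'del(.metadata.creationTimestamp) | del(.metadata.managedFields) | del(.metadata.resourceVersion) | del(.metadata.uid) | del(.spec.selector) | del(.spec.template.metadata.labels) | del(.status)'
# Prints: {"metadata":{"name":"iam-config-job"},"spec":{"template":{"metadata":{}}}}
```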
Policy configuration is not retained on the restored deployment
Incident policies that previously had a ticket integration are missing the ticket integration on the restored deployment. If GitHub or ServiceNow are configured, then external tickets are not created in them. The policy page has a Warning similar to the following example:
The previously selected ticket connection Github -mygit is no longer available. Please check the status of your connector.
Solution: From the IBM Cloud Pak for AIOps console, edit the policy and add the connection that is shown in the warning.
Troubleshooting the Infrastructure Automation restore

If you are also restoring Infrastructure Automation data and encounter any issues with the restore process for Infrastructure Automation, or encounter an issue with data not being available or processed after the restore, see Troubleshooting the Infrastructure Automation restore.
Rebroadcast data to ElasticSearch fails for Cassandra restore

You might notice that the rebroadcast of data to ElasticSearch fails when you try to restore Cassandra. The error might be shown in the `aiops-topology-topology-xxx` pod:

```
===== EXECUTING COMMAND in pod: aiops-topology-topology-55f5dfc5db-xjdxt =====
Defaulted container "aiops-topology-topology" out of: aiops-topology-topology, wait-for-cassandra (init)
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0* Trying ::1...
* TCP_NODELAY set
* connect to ::1 port 8080 failed: Connection refused
* Trying 127.0.0.1...
* TCP_NODELAY set
* connect to 127.0.0.1 port 8080 failed: Connection refused
* Failed to connect to localhost port 8080: Connection refused
* Closing connection 0
curl: (7) Failed to connect to localhost port 8080: Connection refused
command terminated with exit code 7
```
Use the following steps to resolve the restore issue with Cassandra. Apply this workaround on the restore cluster after the restore operation:

- Run the following command for a timestamp:

  ```
  echo "[INFO] $(date) Rebroadcasting data to ElasticSearch"
  ```

- Set the project (namespace) where Cloud Pak for AIOps is installed:

  ```
  namespace=<namespace>
  ```

- Save the username and password in environment variables:

  ```
  ASM_USER=$(kubectl -n $namespace get secret aiops-topology-asm-credentials -o jsonpath="{.data['username']}" | base64 -d)
  ASM_PASS=$(kubectl -n $namespace get secret aiops-topology-asm-credentials -o jsonpath="{.data['password']}" | base64 -d)
  ```

- Rebroadcast a data point in the Cassandra cluster:

  ```
  kubectl exec <cassandra_pod> -n $namespace -- bash -c "curl -vX POST 'https://localhost:8080/1.0/topology/crawlers/rebroadcast' -H 'X-TenantID: cfd95b7e-3bc7-4006-a4a8-a73a79c71255' --insecure -u $ASM_USER:$ASM_PASS"
  ```

  Where `<cassandra_pod>` is the name of a Cassandra pod.
In addition, review the Known issues with the backup and restore process.
Known issues with the backup and restore process
Elasticsearch health status yellow after restore
When you are restoring an Elasticsearch backup to a new single-node Elasticsearch cluster, the Elasticsearch database might not work as expected. Instead, the Elasticsearch cluster health shows a yellow status after the restore completes.