Restoring database data (OCP)

You can restore previously backed-up topology data to your Agile Service Manager OCP installation. This can be helpful when updating your system, or for maintenance reasons.

Before you begin

Before restoring a Cassandra database, you must first back it up, as described in the following topic: Backing up database data (OCP)

About this task

Assumption: The backup and restore procedures in these topics assume a standard Agile Service Manager NOI deployment (and not a standalone deployment), shared use of Cassandra, and that the release name used is 'noi'. Adjust the samples provided to your circumstances.
Backup
The backup procedure documented here performs a backup of all the keyspaces in the Cassandra database, including those not specific to Agile Service Manager.
Restore
The restore procedures focus on restoring only the keyspace that is relevant to Agile Service Manager (that is, 'janusgraph').
Secrets and system_auth keyspace:

During Agile Service Manager deployment, a secret called {release}-topology-cassandra-auth-secret is generated, if none already exists with that name. Cassandra is protected with the user and password of that secret, which will be used by the Agile Service Manager services to connect to the database.

In the restore scenarios described, it is assumed that Agile Service Manager is deployed in a standard way, meaning that the connection to Cassandra is set with the described secret. If you were to restore the system_auth keyspace (instead of just the janusgraph keyspace), you would have to make sure the user and password in the mentioned secret matches the credentials contained in the keyspace for the version being restored.

The following command 'exports' the secret from your Kubernetes cluster. However, the process of backing up and restoring secrets is not described here, as it is contingent on your company security policies, which you should follow.
kubectl get secret noi-cassandra-auth-secret -o yaml
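The following is a minimal sketch of preserving the secret and re-creating it on a target cluster. The file name is illustrative only, and how you store the exported YAML must comply with your security policies.
kubectl get secret noi-cassandra-auth-secret -o yaml > cassandra-auth-secret.yaml
# Remove cluster-specific metadata (such as resourceVersion, uid, and creationTimestamp) from the file,
# then re-create the secret on the target cluster:
kubectl apply -f cassandra-auth-secret.yaml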
Restore scenarios: You may encounter a variety of typical data restore scenarios while administering Agile Service Manager on OCP.
Same cluster data restore (rollback)
This scenario covers the restore of a database backup to the same cluster from where it was taken, as documented in the procedure; essentially rolling back your deployment to a previous state.
You typically perform such a data 'rollback' due to data corruption, or the need to revert some changes made to your Agile Service Manager deployment.
Restoring to a different cluster
In this scenario you recreate your Agile Service Manager deployment data in a different cluster from the one from which you have taken a backup.
You must have Agile Service Manager successfully deployed on your new target cluster.
Typically, you would restore data to a different cluster in a disaster recovery situation, where your primary cluster is not accessible; or in a situation where you want to clone a production system to a test environment.
The backup procedure stores the generated backup files in the /opt/ibm/cassandra/data/backup_tar/ directory inside the Agile Service Manager Cassandra pods. Ensure that these backup files are present in the target cluster Cassandra pods before attempting this scenario: either copy them to the new location, or mount the external storage to that location. Once the backup files are in the correct target location, restore the backed-up data as documented in the procedure.
Tip: The backup_tar directory may not exist if you did not set a storageClass for cassandrabak during the Agile Service Manager installation.
Losing a Cassandra node in your OCP cluster
This scenario describes the steps to perform should you lose a worker node in your OCP cluster where one of the Agile Service Manager Cassandra pods is running, thereby effectively losing one of your Cassandra replica nodes.
This might happen for a variety of reasons and will leave your Cassandra cluster with two remaining functioning nodes, to which you then add a new node to restore your three-node configuration.
See the Replacing a Cassandra node in your OCP cluster topic for more details.

Procedure

Preparing your system for data restoration

Restoring to a different cluster: The backup procedure stores the generated backup files in the /opt/ibm/cassandra/data/backup_tar/ directory inside the Agile Service Manager Cassandra pods. If you are restoring your data to a different cluster, ensure that these backup files are present in the target cluster Cassandra pods: either copy them to the new location, or mount the external storage to that location.
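If you are copying the files manually, a minimal kubectl cp sketch might look like the following. The archive name is the example from the backup listing later in this procedure, and it is assumed that the archives have been downloaded to your current working directory; repeat the copy for each Cassandra pod with its corresponding archive.
kubectl cp cassandra_noi-topology-cassandra-0_KS_system_schema_KS_system_KS_system_distributed_KS_system_auth_KS_janusgraph_KS_system_traces_date_2019-06-11-0936-04.tar noi-cassandra-0:/opt/ibm/cassandra/data/backup_tar/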

  1. Authenticate into the Agile Service Manager Kubernetes namespace.
  2. Deploy the kPodLoop bash shell function.
    kPodLoop is a bash shell function that allows a command to be run against matching Kubernetes containers. You can copy it into the shell.
    kPodLoop() {
     __podPattern=$1
     __podCommand=$2
     __podList=$( kubectl get pods --field-selector=status.phase=Running --no-headers=true --output=custom-columns=NAME:.metadata.name | grep ${__podPattern} )
     printf "Pods found: $(echo -n ${__podList})\n"
     for pod in ${__podList}; do
        printf "\n===== EXECUTING COMMAND in pod: %-42s =====\n" ${pod}
        kubectl exec ${pod} -- bash -c "${__podCommand}"
        printf '_%.0s' {1..80}
        printf "\n"
     done;
    }
    
    The kPodLoop bash shell function runs the supplied command only against pods that are in the 'Running' phase. This filter ensures that configuration pods that run only during installation, such as the secret generator pod, are skipped.
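    For example, once the function is defined, you can run a quick, read-only smoke test against the Cassandra pods before continuing (nodetool version simply reports the Cassandra version on each node):
    kPodLoop noi-cassandra "nodetool version"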
  3. Make a note of the scaling of Agile Service Manager pods.
    kubectl get pods --field-selector=status.phase=Running --no-headers=true --output=custom-columns=CNAME:.metadata.ownerReferences[0].name | grep noi-topology | uniq --count
    Example output:
    1 noi-topology-dns-observer
    1 noi-topology-docker-observer
    3 noi-topology-elasticsearch
    1 noi-topology-file-observer
    1 noi-topology-kubernetes-observer
    1 noi-topology-layout
    1 noi-topology-merge
    1 noi-topology-noi-gateway
    1 noi-topology-noi-probe
    1 noi-topology-observer-service
    1 noi-topology-search
    1 noi-topology-status
    1 noi-topology-topology
    1 noi-topology-ui-api
    
  4. Verify access to each Cassandra database (this command will return a list of keyspaces from each Cassandra node).
    kPodLoop noi-cassandra "CASSANDRA_USER=\$(cat \$CASSANDRA_AUTH_USERNAME_FILE); CASSANDRA_PASS=\$(cat \$CASSANDRA_AUTH_PASSWORD_FILE); cqlsh -u \${CASSANDRA_USER} -p \${CASSANDRA_PASS} -e \"DESC KEYSPACES;\""
    
  5. Scale down Agile Service Manager pods.
    kubectl scale deployment --replicas=0 noi-topology-dns-observer
    kubectl scale deployment --replicas=0 noi-topology-file-observer
    kubectl scale deployment --replicas=0 noi-topology-kubernetes-observer
    kubectl scale deployment --replicas=0 noi-topology-observer-service
    kubectl scale deployment --replicas=0 noi-topology-noi-gateway
    kubectl scale deployment --replicas=0 noi-topology-noi-probe
    kubectl scale deployment --replicas=0 noi-topology-layout
    kubectl scale deployment --replicas=0 noi-topology-merge
    kubectl scale deployment --replicas=0 noi-topology-status
    kubectl scale deployment --replicas=0 noi-topology-search
    kubectl scale deployment --replicas=0 noi-topology-ui-api
    kubectl scale deployment --replicas=0 noi-topology-topology
    
    The Cassandra and Elasticsearch pods (noi-cassandra and noi-topology-elasticsearch) are left active. The Cassandra pods must be running so that the restore can be run against their data, whereas the Elasticsearch pods neither affect nor are affected by the Cassandra contents, and so can be kept running.
    Important: Include in this scale down any additional observers you have installed in your deployment.
  6. Verify that scaling down was successful.
    kubectl get pods --field-selector=status.phase=Running  | grep noi-topology
    The Agile Service Manager services have now been scaled down, and the Cassandra database contents will not be modified anymore.
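    To list only the names of the pods that remain, you can reuse the custom-columns form of the command from step 3. Assuming all observers were included in the scale down, only the Elasticsearch pods should still match the filter (the Cassandra pods are named noi-cassandra and are therefore not matched):
    kubectl get pods --field-selector=status.phase=Running --no-headers=true --output=custom-columns=NAME:.metadata.name | grep noi-topology
    
    noi-topology-elasticsearch-0
    noi-topology-elasticsearch-1
    noi-topology-elasticsearch-2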

Restore data

  1. Update the Cassandra restore script to suppress the truncation of restored data.
    Note: The restore_cassandra.sh tool truncates all data in the target table each time it is used, and despite the restore being targeted at one Cassandra node only, the truncate is propagated to all nodes. In order to suppress the truncate step, you must update the restore script on all but the first node.
    1. Copy cassandra_functions.sh out of one of the noi-cassandra nodes.
      kubectl cp noi-cassandra-0:/opt/ibm/backup_scripts/cassandra_functions.sh /tmp/cassandra_functions.sh
    2. Edit cassandra_functions.sh
      vi /tmp/cassandra_functions.sh
      Locate the call to truncate_all_tables within the restore() function and comment out the appropriate lines, as in the following example:
       printf "`date` Starting Restore \n"
      
      #### truncate_all_tables
      #### testResult $? "truncate tables"
      
      repair_keyspace
      
    3. Save the file, then copy the file back to all nodes, except the first Cassandra node.
      kubectl cp /tmp/cassandra_functions.sh noi-cassandra-1:/opt/ibm/backup_scripts/cassandra_functions.sh
      
      kubectl cp /tmp/cassandra_functions.sh noi-cassandra-2:/opt/ibm/backup_scripts/cassandra_functions.sh
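      Optionally, verify that the truncate step is now commented out in the copied files, for example:
      kubectl exec noi-cassandra-1 -- grep -A1 truncate_all_tables /opt/ibm/backup_scripts/cassandra_functions.sh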
      
  2. Locate the timestamps of the backups from each Cassandra node to restore.
    Each node's backup was started at a similar time, so the timestamps may differ by a few seconds. In the following example a backup was performed at about 2019-06-11 09:36, and grep is then used to filter to these backup archives.
    Tip: You can ignore this step if you are about to apply the most recent backup. If you do, the -t parameter can be omitted during all subsequent steps.
    kPodLoop noi-cassandra "ls -larth \${CASSANDRA_DATA}/../backup_tar | grep 2019-06-11-09"
    
    Pods found: noi-cassandra-0 noi-cassandra-1 noi-cassandra-2
    
    ===== EXECUTING COMMAND in pod: noi-cassandra-0                            =====
    -rwxrwxr-x 1 cassandra cassandra 524M Jun 11 09:37 cassandra_noi-topology-cassandra-0_KS_system_schema_KS_system_KS_system_distributed_KS_system_auth_KS_janusgraph_KS_system_traces_date_2019-06-11-0936-04.tar
    ________________________________________________________________________________
    
    ===== EXECUTING COMMAND in pod: noi-cassandra-1                            =====
    -rwxrwxr-x 1 cassandra cassandra 565M Jun 11 09:37 cassandra_noi-topology-cassandra-1_KS_system_schema_KS_system_KS_system_distributed_KS_system_auth_KS_janusgraph_KS_system_traces_date_2019-06-11-0936-07.tar
    ________________________________________________________________________________
    
    ===== EXECUTING COMMAND in pod: noi-cassandra-2                            =====
     -rwxrwxr-x 1 cassandra cassandra 567M Jun 11 09:37 cassandra_noi-topology-cassandra-2_KS_system_schema_KS_system_KS_system_distributed_KS_system_auth_KS_janusgraph_KS_system_traces_date_2019-06-11-0936-07.tar
    ________________________________________________________________________________
    
  3. Working across each Cassandra node, restore the relevant backup of the janusgraph keyspace.
    Note: For information about the system_auth keyspace see the Secrets and system_auth keyspace note.
    The specific pod names can vary, depending on your installation and the release names used.
    1. noi-cassandra-0
      Remember: This will cause the existing data in the janusgraph keyspace tables to be truncated.
      kPodLoop noi-cassandra-0 "/opt/ibm/backup_scripts/restore_cassandra.sh -k janusgraph -f"
      
       kPodLoop noi-cassandra-0 "nodetool repair --full janusgraph"
      
    2. noi-cassandra-1
      kPodLoop noi-cassandra-1 "/opt/ibm/backup_scripts/restore_cassandra.sh -k janusgraph -f"
      
       kPodLoop noi-cassandra-1 "nodetool repair --full janusgraph"
      
    3. noi-cassandra-2
      kPodLoop noi-cassandra-2 "/opt/ibm/backup_scripts/restore_cassandra.sh -k janusgraph -f"
      
       kPodLoop noi-cassandra-2 "nodetool repair --full janusgraph"
      
  4. Scale up the services to the original level.
    The original level was obtained in a previous step. Include in this scale up any additional observers that you scaled down earlier.
    kubectl scale deployment --replicas=1 noi-topology-topology
    kubectl scale deployment --replicas=1 noi-topology-layout
    kubectl scale deployment --replicas=1 noi-topology-merge
    kubectl scale deployment --replicas=1 noi-topology-status
    kubectl scale deployment --replicas=1 noi-topology-search
    kubectl scale deployment --replicas=1 noi-topology-observer-service
    kubectl scale deployment --replicas=1 noi-topology-noi-gateway
    kubectl scale deployment --replicas=1 noi-topology-noi-probe
    kubectl scale deployment --replicas=1 noi-topology-ui-api
    kubectl scale deployment --replicas=1 noi-topology-dns-observer
    kubectl scale deployment --replicas=1 noi-topology-file-observer
    kubectl scale deployment --replicas=1 noi-topology-rest-observer
    kubectl scale deployment --replicas=1 noi-topology-kubernetes-observer
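    To confirm that the services are back at their original levels, rerun the command that you used earlier to note the scaling, and compare the counts:
    kubectl get pods --field-selector=status.phase=Running --no-headers=true --output=custom-columns=CNAME:.metadata.ownerReferences[0].name | grep noi-topology | uniq --count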
    
  5. After you have run the restore script and scaled up the services, clear historical alert data using the statusClear crawler.
    https://master_fqdn/1.0/topology/swagger#!/Crawlers/statusClear

Restore services

  1. Rebroadcast data to Elasticsearch (that is, re-index Elasticsearch).
    If data in Elasticsearch is out of sync with data in the Cassandra database, resynchronize it by calling the rebroadcast API of the topology service. This triggers the rebroadcast of all known resources on Kafka, and the Search service will then index those resources in Elasticsearch. Call the rebroadcast API of the Topology service, specifying a tenantId:
    https://master_fqdn/1.0/topology/swagger#!/Crawlers/rebroadcastTopology