Restoring database data in a geo-redundant deployment

You can restore existing topology data for an Agile Service Manager installation on Red Hat® OpenShift®, provided that the data was backed up earlier. Backing up data is helpful during a system update or during maintenance.

Before you begin

Before restoring a Cassandra database, you must back it up as described in the following topic: Backing up database data (OCP)

About this task

Assumption: The backup and restore procedures in these topics assume a standard Agile Service Manager Netcool® Operations Insight® deployment (and not a standalone deployment), a shared use of Cassandra, and that the release name used is 'noi'. Adjust the samples provided to your circumstances.
Backup
The backup procedure documented here performs a backup of all the keyspaces in the Cassandra database, including those not specific to Agile Service Manager.
Restore
The restore procedures focus on restoring only the keyspace that is relevant to Agile Service Manager (that is, 'janusgraph').
Secrets and system_auth keyspace:

During Agile Service Manager deployment, a secret called {release}-topology-cassandra-auth-secret is generated if none exists with that name. Cassandra is protected with the username and password stored in this secret, which the Agile Service Manager services use to connect to Cassandra.

In the restore scenarios described, it is assumed that Agile Service Manager is deployed in a standard way, meaning that the connection to Cassandra is set with the described secret. If you want to restore the system_auth keyspace (instead of just the janusgraph keyspace), you must make sure that the username and password in that secret match the credentials in the restored version of the keyspace.

The following command 'exports' the secret from your Kubernetes cluster. However, the process of backing up and restoring secrets is not described here, as it is contingent on your company security policies, which you must follow.
kubectl get secret noi-cassandra-auth-secret -o yaml
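Values under the data section of a Kubernetes secret are base64-encoded, so comparing them with the credentials in a restored system_auth keyspace requires decoding them first. The following is a minimal sketch only: the helper name decode_secret_value is hypothetical, the key name 'username' in the usage comment is an assumption (check the YAML output for the actual key names), and the -d flag assumes GNU coreutils base64. Handle any decoded values according to your security policies.

```shell
# Hypothetical helper: decode one base64-encoded value, as stored under
# .data in a Kubernetes secret, and write the plain text to stdout.
decode_secret_value() {
  printf '%s' "$1" | base64 -d
}

# Usage sketch (assumption: the secret stores the user name under the
# key 'username'; adjust the secret name and key to your deployment):
#   encoded=$(kubectl get secret noi-cassandra-auth-secret -o jsonpath='{.data.username}')
#   decode_secret_value "$encoded"
```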
Restore scenarios: You might encounter various typical data restore scenarios while administering Agile Service Manager on Red Hat OpenShift.
Same cluster data restore (rollback)
This scenario covers the restore of a database backup to the same cluster from where it was taken, as documented in the procedure; essentially rolling back your deployment to a previous state.
You perform a data 'rollback' due to data corruption, or the need to revert some changes that are made to your Agile Service Manager deployment.
Restoring to a different cluster
In this scenario, you recreate your Agile Service Manager deployment data in a cluster other than the one where the backup was taken.
You must have Agile Service Manager successfully deployed on your new target cluster.
Typically, you would restore data to a different cluster in a disaster recovery situation, where your primary cluster is not accessible; or in a situation where you want to clone a production system to a test environment.
The backup procedure stores the files that are generated through backup inside the Agile Service Manager Cassandra pods in the /opt/ibm/cassandra/data/backup_tar/ directory. Ensure that these backup files are present in the target cluster Cassandra pods before attempting this scenario; either copy them to the new location, or mount the external storage to that location. Once the backup files are in the correct target location, you restore the backed up data as documented in the procedure.
Tip: The backup_tar directory might not exist if you did not set a storageClass for cassandrabak during the Agile Service Manager installation.
Losing a Cassandra node in your Red Hat OpenShift cluster
This scenario describes the steps to perform when you lose a worker node in your Red Hat OpenShift cluster where one of the Agile Service Manager Cassandra pods is running, thereby effectively losing one of your Cassandra replica nodes.
This might happen for various reasons and leaves your Cassandra cluster with two remaining functioning nodes, to which you then add a node to restore your three-node configuration.

Procedure

Preparing your system for data restoration

Restoring to a different cluster: The backup procedure stores the backup-generated files inside the Agile Service Manager Cassandra pods inside the /opt/ibm/cassandra/data/backup_tar/ directory. If you are restoring your data to a different cluster, ensure that these backup files are present in the target cluster Cassandra pods: either copy them to the new location, or mount the external storage to that location.
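If you choose to copy the archives rather than mount external storage, the copy could be scripted along the following lines. This is a sketch under stated assumptions, not part of the documented procedure: the function name backup_copy_command is hypothetical, the archives are assumed to be available as local .tar files on the machine where you run oc, and the function only prints the oc cp command so that it can be reviewed before it is run. Which archive belongs in which target pod depends on your deployment and is not decided by the sketch.

```shell
# Hypothetical helper: print (without running) the 'oc cp' command that
# would place one local backup archive into the backup_tar directory of
# one Cassandra pod in the target cluster.
backup_copy_command() {
  archive=$1   # local path to a backup .tar file (assumed to exist)
  pod=$2       # name of the target Cassandra pod
  # Keep only the file name for the destination path
  printf 'oc cp %s %s:/opt/ibm/cassandra/data/backup_tar/%s\n' \
    "$archive" "$pod" "${archive##*/}"
}

# Usage sketch: review the printed command, then run it yourself.
#   backup_copy_command /tmp/backups/example.tar primary-cassandra-0-0
```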

  1. Authenticate into the Agile Service Manager Kubernetes namespace on the primary data center.
  2. Deploy the kPodLoop bash shell function.
    kPodLoop is a bash shell function that allows a command to be run against matching Kubernetes containers. You can copy it into the shell.
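    kPodLoop() {
     # $1: pattern matched against pod names, $2: command to run in each matching pod
     __podPattern=$1
     __podCommand=$2
     # List only pods in the Running phase, then keep the names that match the pattern
     __podList=$( kubectl get pods --field-selector=status.phase=Running --no-headers=true --output=custom-columns=NAME:.metadata.name | grep "${__podPattern}" )
     printf "Pods found: %s\n" "$(echo -n ${__podList})"
     for pod in ${__podList}; do
        # Run the command in the first container listed in the pod spec
        container=$(oc get pods "${pod}" -o jsonpath='{.spec.containers[*].name}' | tr ' ' '\n' | head -1)
        printf "\n===== EXECUTING COMMAND in pod: %-42s =====\n" "${pod}"
        oc exec -c "${container}" "${pod}" -- bash -c "${__podCommand}"
        # Print an 80-character separator line
        printf '_%.0s' {1..80}
        printf "\n"
     done;
    }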
    The kPodLoop function runs the command only against pods that are in the 'Running' phase. This filter ensures that pods that run only as part of your installation, such as the secret generator pod, are skipped.
  3. Make a note of the scaling of Agile Service Manager pods.
    oc get pods --field-selector=status.phase=Running --no-headers=true --output=custom-columns=CNAME:.metadata.ownerReferences[0].name | grep primary-topology | uniq --count
    Example output:
    1 primary-topology-dns-observer
    1 primary-topology-docker-observer
    3 primary-topology-elasticsearch
    1 primary-topology-file-observer
    1 primary-topology-kubernetes-observer
    1 primary-topology-layout
    1 primary-topology-merge
    1 primary-topology-noi-gateway
    1 primary-topology-noi-probe
    1 primary-topology-observer-service
    1 primary-topology-search
    1 primary-topology-status
    1 primary-topology-topology
    1 primary-topology-ui-api
  4. Verify access to each Cassandra database (this command returns a list of keyspaces from each Cassandra node).
    kPodLoop primary-cassandra "CASSANDRA_USER=\$(cat \$CASSANDRA_AUTH_USERNAME_FILE); CASSANDRA_PASS=\$(cat \$CASSANDRA_AUTH_PASSWORD_FILE); cqlsh -u \${CASSANDRA_USER} -p \${CASSANDRA_PASS} \${CASSANDRA_CLIENT_ENCRYPTION_ENABLED:+--ssl} -e \"DESC KEYSPACES;\""
    
  5. Scale down Agile Service Manager pods.
    oc scale deployment --replicas=0 primary-topology-dns-observer
    oc scale deployment --replicas=0 primary-topology-file-observer
    oc scale deployment --replicas=0 primary-topology-kubernetes-observer
    oc scale deployment --replicas=0 primary-topology-observer-service
    oc scale deployment --replicas=0 primary-topology-noi-gateway
    oc scale deployment --replicas=0 primary-topology-noi-probe
    oc scale deployment --replicas=0 primary-topology-layout
    oc scale deployment --replicas=0 primary-topology-merge
    oc scale deployment --replicas=0 primary-topology-status
    oc scale deployment --replicas=0 primary-topology-search
    oc scale deployment --replicas=0 primary-topology-ui-api
    oc scale deployment --replicas=0 primary-topology-topology

    The Cassandra pods are left active. They must be kept running so that their data can be restored.

    Important: Include in this scale down any additional observers that you installed in your deployment.
  6. Verify that scaling down was successful.
    oc get pods --field-selector=status.phase=Running  | grep primary-topology
    The Agile Service Manager services have now been scaled down, and the Cassandra database contents will not be modified anymore.
  7. Repeat steps 1 - 6 for the backup data center.

Restore data

  1. On the primary data center, update the Cassandra restore script to suppress the truncation of restored data and the copying of data to the backup data center.
    Note: The restore_cassandra.sh tool truncates all the data in the target table each time it is used, and despite the restore being targeted at one Cassandra node only, the truncate is propagated to all nodes. In order to suppress the truncate step, you must update the restore script on all the nodes except the first node.
    1. Copy cassandra_functions.sh out of one of the asm-cassandra nodes.
      oc cp primary-cassandra-0-0:/opt/ibm/backup_scripts/cassandra_functions.sh /tmp/cassandra_functions.sh

      Edit cassandra_functions.sh

      vi /tmp/cassandra_functions.sh
    2. Locate the call to truncate_all_tables within the restore() function and comment out the appropriate lines, as in the following example:
      printf "`date` Starting Restore \n"
      
      #### truncate_all_tables
      #### testResult $? "truncate tables"
    3. Save the file. Then, copy the file back to all nodes, except the first Cassandra node.
      oc cp /tmp/cassandra_functions.sh primary-cassandra-1-0:/opt/ibm/backup_scripts/cassandra_functions.sh
      oc cp /tmp/cassandra_functions.sh primary-cassandra-2-0:/opt/ibm/backup_scripts/cassandra_functions.sh
  2. Locate the timestamps of the backups that you want to restore on each Cassandra node.
    Each node's backup was started at a similar time, so the timestamps may differ by a few seconds. In the following example a backup was performed at about 2019-06-11 09:36, and grep is then used to filter to these backup archives.
    Tip: You can ignore this step if you are about to apply the most recent backup. If you do, the -t parameter can be omitted during all subsequent steps.
    kPodLoop primary-cassandra "ls -larth \${CASSANDRA_DATA}/../backup_tar | grep 2019-06-11-09"
    
    Pods found: primary-cassandra-0-0 primary-cassandra-1-0 primary-cassandra-2-0
    
    ===== EXECUTING COMMAND in pod: primary-cassandra-0-0                            =====
    -rwxrwxr-x 1 cassandra cassandra 524M Jun 11 09:37 cassandra_primary-topology-cassandra-0-0_KS_system_schema_KS_system_KS_system_distributed_KS_system_auth_KS_janusgraph_KS_system_traces_date_2019-06-11-0936-04.tar
    ________________________________________________________________________________
    
    ===== EXECUTING COMMAND in pod: primary-cassandra-1-0                            =====
    -rwxrwxr-x 1 cassandra cassandra 565M Jun 11 09:37 cassandra_primary-topology-cassandra-1-0_KS_system_schema_KS_system_KS_system_distributed_KS_system_auth_KS_janusgraph_KS_system_traces_date_2019-06-11-0936-07.tar
    ________________________________________________________________________________
    
    ===== EXECUTING COMMAND in pod: primary-cassandra-2-0                            =====
    -rwxrwxr-x 1 cassandra cassandra 567M Jun 11 09:37 cassandra_primary-topology-cassandra-2-0_KS_system_schema_KS_system_KS_system_distributed_KS_system_auth_KS_janusgraph_KS_system_traces_date_2019-06-11-0936-07.tar
    ________________________________________________________________________________
  3. Working across each Cassandra node in the primary data center, restore the relevant backup of the janusgraph keyspace.
    Note: For information about the system_auth keyspace, see the Secrets and system_auth keyspace note.
    The specific node pod names can vary depending on your installation and the release names used.
    1. primary-cassandra-0-0
      Remember: This will cause the existing data in the janusgraph keyspace tables to be truncated.
      kPodLoop primary-cassandra-0-0 "CASSANDRA_USER=\$(cat \$CASSANDRA_AUTH_USERNAME_FILE); CASSANDRA_PASS=\$(cat \$CASSANDRA_AUTH_PASSWORD_FILE); /opt/ibm/backup_scripts/restore_cassandra.sh -k janusgraph  -t 2019-06-11-0936-04 -u \${CASSANDRA_USER} -p \${CASSANDRA_PASS} -f"
      
    2. primary-cassandra-1-0
      kPodLoop primary-cassandra-1-0 "CASSANDRA_USER=\$(cat \$CASSANDRA_AUTH_USERNAME_FILE); CASSANDRA_PASS=\$(cat \$CASSANDRA_AUTH_PASSWORD_FILE); /opt/ibm/backup_scripts/restore_cassandra.sh -k janusgraph  -t 2019-06-11-0936-07 -u \${CASSANDRA_USER} -p \${CASSANDRA_PASS} -f"
      
    3. primary-cassandra-2-0
      kPodLoop primary-cassandra-2-0 "CASSANDRA_USER=\$(cat \$CASSANDRA_AUTH_USERNAME_FILE); CASSANDRA_PASS=\$(cat \$CASSANDRA_AUTH_PASSWORD_FILE); /opt/ibm/backup_scripts/restore_cassandra.sh -k janusgraph  -t 2019-06-11-0936-07 -u \${CASSANDRA_USER} -p \${CASSANDRA_PASS} -f"
      
  4. For one node in the primary data center, repair the janusgraph keyspace.
    kPodLoop primary-cassandra-0 "nodetool repair --full --local janusgraph"
  5. For each node in the backup data center, rebuild the Cassandra node from the primary data center.
    1. backup-cassandra-0-0
      kPodLoop backup-cassandra-0-0 "nodetool rebuild -- dc-1"
    2. backup-cassandra-1-0
      kPodLoop backup-cassandra-1-0 "nodetool rebuild -- dc-1"
    3. backup-cassandra-2-0
      kPodLoop backup-cassandra-2-0 "nodetool rebuild -- dc-1"
  6. After you run the rebuild scripts, clear historical alert data by using the statusClear crawler from the primary data center.
    https://master_fqdn/1.0/topology/swagger#!/Crawlers/statusClear
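In the example listing, the value passed to the -t parameter corresponds to the timestamp at the end of a backup archive name, after the _date_ marker. The following sketch shows one way to extract that value from an archive name instead of retyping it. The function name backup_timestamp is hypothetical, and the sketch assumes that archive names always end in _date_<timestamp>.tar as in the example output.

```shell
# Hypothetical helper: extract the timestamp from a backup archive name
# of the form <prefix>_date_<timestamp>.tar
backup_timestamp() {
  name=${1##*_date_}             # drop everything up to and including "_date_"
  printf '%s\n' "${name%.tar}"   # drop the ".tar" suffix
}

# Usage sketch: pass an archive name as listed in the backup_tar directory,
# then use the printed value with the -t parameter of restore_cassandra.sh.
#   backup_timestamp "<archive name>"
```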

Restore services

  1. Scale up the services to the original level.
    The original level was obtained in a previous step.
    oc scale deployment --replicas=1 primary-topology-topology
    oc scale deployment --replicas=1 primary-topology-layout
    oc scale deployment --replicas=1 primary-topology-merge
    oc scale deployment --replicas=1 primary-topology-status
    oc scale deployment --replicas=1 primary-topology-search
    oc scale deployment --replicas=1 primary-topology-observer-service
    oc scale deployment --replicas=1 primary-topology-noi-gateway
    oc scale deployment --replicas=1 primary-topology-noi-probe
    oc scale deployment --replicas=1 primary-topology-ui-api
    oc scale deployment --replicas=1 primary-topology-dns-observer
    oc scale deployment --replicas=1 primary-topology-file-observer
    oc scale deployment --replicas=1 primary-topology-rest-observer
    oc scale deployment --replicas=1 primary-topology-kubernetes-observer
  2. Scale up the services in the backup data center to the original level.
  3. Rebroadcast data to inventory.
    If data in the inventory service is out of sync with data in the Cassandra database, resynchronize it by calling the rebroadcast API of the topology service and specifying a tenantId. This triggers the rebroadcast of all known resources on Kafka, and the inventory service then indexes those resources in PostgreSQL.
    https://master_fqdn/1.0/topology/swagger#!/Crawlers/rebroadcastTopology
    For more information, see Backing up database data (OCP).
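If you saved the listing of replica counts that you noted while preparing the system, the scale-up commands can be generated from it rather than typed by hand. The following is a minimal sketch: the function name scale_up_commands is hypothetical, it expects lines in the "<count> <name>" form shown in the example output, and it assumes that every listed name is a deployment. The example listing also contains primary-topology-elasticsearch, which the scale commands in this procedure do not touch, so review and filter the printed commands before running them.

```shell
# Hypothetical helper: read "<count> <name>" lines on stdin and print one
# 'oc scale' command per line, using the recorded count as the replica
# number. The commands are printed only, not run.
scale_up_commands() {
  while read -r count name; do
    # Skip blank or incomplete lines
    [ -n "$name" ] || continue
    printf 'oc scale deployment --replicas=%s %s\n' "$count" "$name"
  done
}

# Usage sketch (assumption: the listing was saved to a file of your choice):
#   scale_up_commands < /tmp/asm_scaling.txt
```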