Restoring database data (OCP)

You can restore previously backed up topology data for the Agile Service Manager OCP installation. This can be helpful when updating your system, or for maintenance reasons.

Before you begin

Before restoring a Cassandra database, you must first back it up, as described in the following topic: Backing up database data (OCP)

About this task

Assumption: The backup and restore procedures in these topics assume a standard Agile Service Manager NOI deployment (not a standalone deployment), a shared use of Cassandra, and that the release name used is 'noi'. Adjust the samples provided to your circumstances.
Backup
The backup procedure documented here performs a backup of all the keyspaces in the Cassandra database, including those not specific to Agile Service Manager.
Restore
The restore procedures focus on restoring only the keyspace that is relevant to Agile Service Manager (that is, 'janusgraph').
Secrets and system_auth keyspace:

During Agile Service Manager deployment, a secret called {release}-topology-cassandra-auth-secret is generated, if none already exists with that name. Cassandra is protected with the user and password of that secret, which will be used by the Agile Service Manager services to connect to the database.

In the restore scenarios described, it is assumed that Agile Service Manager is deployed in a standard way, meaning that the connection to Cassandra is set with the described secret. If you were to restore the system_auth keyspace (instead of just the janusgraph keyspace), you would have to make sure that the user and password in the mentioned secret match the credentials contained in the keyspace for the version being restored.

The following command 'exports' the secret from your Kubernetes cluster. However, the process of backing up and restoring secrets is not described here, as it is contingent on your company security policies, which you should follow.
kubectl get secret noi-cassandra-auth-secret -o yaml
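If you need to inspect the stored credentials, for example to confirm that they match a restored system_auth keyspace, you can decode them from the secret. This is a minimal sketch; it assumes the credentials are stored under the keys 'username' and 'password', so check the YAML output above for the key names actually used in your deployment.
kubectl get secret noi-cassandra-auth-secret -o jsonpath='{.data.username}' | base64 -d; echo
kubectl get secret noi-cassandra-auth-secret -o jsonpath='{.data.password}' | base64 -d; echo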
Restore scenarios: You may encounter a variety of typical data restore scenarios while administering Agile Service Manager on OCP.
Same cluster data restore (rollback)
This scenario covers the restore of a database backup to the same cluster from where it was taken, as documented in the procedure; essentially rolling back your deployment to a previous state.
You typically perform such a data 'rollback' due to data corruption, or the need to revert some changes made to your Agile Service Manager deployment.
Restoring to a different cluster
In this scenario, you recreate your Agile Service Manager deployment data in a different cluster from the one where the backup was taken.
You must have Agile Service Manager successfully deployed on your new target cluster.
Typically, you would restore data to a different cluster in a disaster recovery situation, where your primary cluster is not accessible; or in a situation where you want to clone a production system to a test environment.
The backup procedure stores the backup generated files inside the Agile Service Manager Cassandra pods inside the /opt/ibm/cassandra/data/backup_tar/ directory. Ensure that these backup files are present in the target cluster Cassandra pods before attempting this scenario; either copy them to the new location, or mount the external storage to that location. Once the backup files are in the correct target location, you restore the backed up data as documented in the procedure.
Tip: The backup_tar directory may not exist if you did not set a storageClass for cassandrabak during the Agile Service Manager installation.
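To confirm that the backup files are visible inside a target cluster Cassandra pod before you start the restore, you can list the backup directory directly. This is a sketch that assumes the default backup location and the pod name noi-cassandra-0:
kubectl exec noi-cassandra-0 -- bash -c "ls -larth /opt/ibm/cassandra/data/backup_tar/"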
Losing a Cassandra node in your OCP cluster
This scenario describes the steps to perform should you lose a worker node in your OCP cluster where one of the Agile Service Manager Cassandra pods is running, thereby effectively losing one of your Cassandra replica nodes.
This might happen for a variety of reasons and will leave your Cassandra cluster with two remaining functioning nodes, to which you then add a new node to restore your three-node configuration.
See the Replacing a Cassandra node in your OCP cluster steps for more details.

Procedure

Preparing your system for data restoration

Restoring to a different cluster: The backup procedure stores the backup-generated files inside the Agile Service Manager Cassandra pods inside the /opt/ibm/cassandra/data/backup_tar/ directory. If you are restoring your data to a different cluster, ensure that these backup files are present in the target cluster Cassandra pods: either copy them to the new location, or mount the external storage to that location.
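For example, if you have copied a backup archive from the source cluster to your workstation, you can place it into the corresponding Cassandra pod on the target cluster with kubectl cp. This is a minimal sketch: the archive name is a placeholder, and you repeat the copy for each Cassandra pod with that pod's own archive.
kubectl cp /tmp/<backup-archive>.tar noi-cassandra-0:/opt/ibm/cassandra/data/backup_tar/<backup-archive>.tar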

  1. Authenticate into the Agile Service Manager Kubernetes namespace.
  2. Deploy the kPodLoop bash shell function.
    kPodLoop is a bash shell function that allows a command to be run against matching Kubernetes containers. You can copy it into the shell.
    kPodLoop() {
     __podPattern=$1
     __podCommand=$2
     __podList=$( kubectl get pods --field-selector=status.phase=Running --no-headers=true --output=custom-columns=NAME:.metadata.name | grep ${__podPattern} )
     printf "Pods found: $(echo -n ${__podList})\n"
     for pod in ${__podList}; do
        printf "\n===== EXECUTING COMMAND in pod: %-42s =====\n" ${pod}
        kubectl exec ${pod} -- bash -c "${__podCommand}"
        printf '_%.0s' {1..80}
        printf "\n"
     done;
    }
    
    The kPodLoop bash shell function runs the command only against pods that are in the 'Running' phase. This filter ensures that configuration pods that run only during installation, such as the secret generator pod, are skipped.
  3. Make a note of the scaling of Agile Service Manager pods.
    kubectl get pods --field-selector=status.phase=Running --no-headers=true --output=custom-columns=CNAME:.metadata.ownerReferences[0].name | grep noi-topology | uniq --count
    Example output:
    1 noi-topology-dns-observer
    1 noi-topology-docker-observer
    3 noi-topology-elasticsearch
    1 noi-topology-file-observer
    1 noi-topology-kubernetes-observer
    1 noi-topology-layout
    1 noi-topology-merge
    1 noi-topology-noi-gateway
    1 noi-topology-noi-probe
    1 noi-topology-observer-service
    1 noi-topology-search
    1 noi-topology-status
    1 noi-topology-topology
    1 noi-topology-ui-api
    
  4. Verify access to each Cassandra database (this command will return a list of keyspaces from each Cassandra node).
    kPodLoop noi-cassandra "cqlsh -u \${CASSANDRA_USER} -p \${CASSANDRA_PASS} -e \"DESC KEYSPACES;\""
  5. Scale down Agile Service Manager pods.
    kubectl scale deployment --replicas=0 noi-topology-dns-observer
    kubectl scale deployment --replicas=0 noi-topology-file-observer
    kubectl scale deployment --replicas=0 noi-topology-kubernetes-observer
    kubectl scale deployment --replicas=0 noi-topology-observer-service
    kubectl scale deployment --replicas=0 noi-topology-noi-gateway
    kubectl scale deployment --replicas=0 noi-topology-noi-probe
    kubectl scale deployment --replicas=0 noi-topology-layout
    kubectl scale deployment --replicas=0 noi-topology-merge
    kubectl scale deployment --replicas=0 noi-topology-status
    kubectl scale deployment --replicas=0 noi-topology-search
    kubectl scale deployment --replicas=0 noi-topology-ui-api
    kubectl scale deployment --replicas=0 noi-topology-topology
    
    The Cassandra and ElasticSearch pods (noi-cassandra and noi-topology-elasticsearch) are left active. The Cassandra pods need to be running in order to execute the restore of their data, whereas the ElasticSearch pods neither interact with nor influence the Cassandra contents, so they can be kept running.
    Important: Include in this scale down any additional observers you have installed in your deployment.
  6. Verify that scaling down was successful.
    kubectl get pods --field-selector=status.phase=Running  | grep noi-topology
    The Agile Service Manager services have now been scaled down, and the Cassandra database contents will not be modified anymore.

Restore data

  1. Update the Cassandra restore script to suppress the truncation of restored data.
    Note: The restore_cassandra.sh tool truncates all data in the target table each time it is used, and despite the restore being targeted at one Cassandra node only, the truncate is propagated to all nodes. In order to suppress the truncate step, you must update the restore script on all but the first node.
    1. Copy cassandra_functions.sh out of one of the noi-cassandra pods.
      kubectl cp noi-cassandra-0:/opt/ibm/backup_scripts/cassandra_functions.sh /tmp/cassandra_functions.sh
    2. Edit cassandra_functions.sh
      vi /tmp/cassandra_functions.sh
      Locate the call to truncate_all_tables within the restore() function and comment out the appropriate lines, as in the following example:
       printf "`date` Starting Restore \n"
      
      #### truncate_all_tables
      #### testResult $? "truncate tables"
      
      repair_keyspace
      
    3. Save the file, then copy the file back to all nodes, except the first Cassandra node.
      kubectl cp /tmp/cassandra_functions.sh noi-cassandra-1:/opt/ibm/backup_scripts/cassandra_functions.sh
      
      kubectl cp /tmp/cassandra_functions.sh noi-cassandra-2:/opt/ibm/backup_scripts/cassandra_functions.sh
      
  2. Locate the timestamps of the backups from each Cassandra node to restore.
    Each node's backup was started at a similar time, so the timestamps may differ by a few seconds. In the following example a backup was performed at about 2019-06-11 09:36, and grep is then used to filter to these backup archives.
    Tip: You can ignore this step if you are about to apply the most recent backup. If you do, the -t parameter can be omitted during all subsequent steps.
    kPodLoop noi-cassandra "ls -larth \${CASSANDRA_DATA}/../backup_tar | grep 2019-06-11-09"
    
    Pods found: noi-cassandra-0 noi-cassandra-1 noi-cassandra-2
    
    ===== EXECUTING COMMAND in pod: noi-cassandra-0                            =====
    -rwxrwxr-x 1 cassandra cassandra 524M Jun 11 09:37 cassandra_noi-topology-cassandra-0_KS_system_schema_KS_system_KS_system_distributed_KS_system_auth_KS_janusgraph_KS_system_traces_date_2019-06-11-0936-04.tar
    ________________________________________________________________________________
    
    ===== EXECUTING COMMAND in pod: noi-cassandra-1                            =====
    -rwxrwxr-x 1 cassandra cassandra 565M Jun 11 09:37 cassandra_noi-topology-cassandra-1_KS_system_schema_KS_system_KS_system_distributed_KS_system_auth_KS_janusgraph_KS_system_traces_date_2019-06-11-0936-07.tar
    ________________________________________________________________________________
    
    ===== EXECUTING COMMAND in pod: noi-cassandra-2                            =====
     -rwxrwxr-x 1 cassandra cassandra 567M Jun 11 09:37 cassandra_noi-topology-cassandra-2_KS_system_schema_KS_system_KS_system_distributed_KS_system_auth_KS_janusgraph_KS_system_traces_date_2019-06-11-0936-07.tar
    ________________________________________________________________________________
    
  3. Working across each Cassandra node, restore the relevant backup of the janusgraph keyspace.
    Note: For information about the system_auth keyspace see the Secrets and system_auth keyspace note.
    The specific pod names can vary depending on your installation and the release names used.
    1. noi-cassandra-0
      Remember: This will cause the existing data in the janusgraph keyspace tables to be truncated.
      kPodLoop noi-cassandra-0 "/opt/ibm/backup_scripts/restore_cassandra.sh -k janusgraph  -t 2019-06-11-0936-04 -u \${CASSANDRA_USER} -p \${CASSANDRA_PASS} -f"
      
       kPodLoop noi-cassandra-0 "nodetool repair --full janusgraph"
      
    2. noi-cassandra-1
      kPodLoop noi-cassandra-1 "/opt/ibm/backup_scripts/restore_cassandra.sh -k janusgraph  -t 2019-06-11-0936-07 -u \${CASSANDRA_USER} -p \${CASSANDRA_PASS} -f"
      
       kPodLoop noi-cassandra-1 "nodetool repair --full janusgraph"
      
    3. noi-cassandra-2
      kPodLoop noi-cassandra-2 "/opt/ibm/backup_scripts/restore_cassandra.sh -k janusgraph  -t 2019-06-11-0936-07 -u \${CASSANDRA_USER} -p \${CASSANDRA_PASS} -f"
      
       kPodLoop noi-cassandra-2 "nodetool repair --full janusgraph"
      
  4. After you have run the restore script, clear historical alert data using the statusClear crawler.
    https://master_fqdn/1.0/topology/swagger#!/Crawlers/statusClear
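    As a hedged sketch only (the exact REST path, headers, and credentials shown here are assumptions, so confirm them in the Swagger UI linked above), the crawler could be triggered with curl:
    curl -k -X POST -u <asm_user>:<asm_password> -H 'X-TenantID: <tenantId>' 'https://master_fqdn/1.0/topology/crawlers/statusClear'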

Restore services

  1. Scale up the services to the original level.
    The original level was obtained in a previous step.
    kubectl scale deployment --replicas=1 noi-topology-topology
    kubectl scale deployment --replicas=1 noi-topology-layout
    kubectl scale deployment --replicas=1 noi-topology-merge
    kubectl scale deployment --replicas=1 noi-topology-status
    kubectl scale deployment --replicas=1 noi-topology-search
    kubectl scale deployment --replicas=1 noi-topology-observer-service
    kubectl scale deployment --replicas=1 noi-topology-noi-gateway
    kubectl scale deployment --replicas=1 noi-topology-noi-probe
    kubectl scale deployment --replicas=1 noi-topology-ui-api
    kubectl scale deployment --replicas=1 noi-topology-dns-observer
    kubectl scale deployment --replicas=1 noi-topology-file-observer
    kubectl scale deployment --replicas=1 noi-topology-rest-observer
    kubectl scale deployment --replicas=1 noi-topology-kubernetes-observer
    
  2. Rebroadcast data to ElasticSearch (that is, re-index Elasticsearch).
    If data in Elasticsearch is out of sync with data in the Cassandra database, resynchronize it by calling the rebroadcast API of the topology service. This triggers the rebroadcast of all known resources on Kafka, and the Search service will then index those resources in Elasticsearch. Call the rebroadcast API of the Topology service, specifying a tenantId:
    https://master_fqdn/1.0/topology/swagger#!/Crawlers/rebroadcastTopology
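    As a hedged sketch (again, the exact REST path, headers, and credentials are assumptions to be confirmed in the Swagger UI linked above), the call could look like this:
    curl -k -X POST -u <asm_user>:<asm_password> -H 'X-TenantID: <tenantId>' 'https://master_fqdn/1.0/topology/crawlers/rebroadcast'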

Replacing a Cassandra node in your OCP cluster

The Agile Service Manager configuration for Cassandra sets up a three-node cluster with a replication factor of three, which means your deployment would still be fully functional should you lose one node. To mitigate the risk of a second node failing, however, perform the steps documented here to restore your deployment to a full three-node Cassandra cluster.
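If you want to confirm the replication settings of the janusgraph keyspace in your own deployment, you can describe the keyspace. This sketch reuses the kPodLoop helper defined in the restore procedure and assumes the shared noi-cassandra pods; the first line of the output shows the replication class and replication factor.
kPodLoop noi-cassandra-0 "cqlsh -u \${CASSANDRA_USER} -p \${CASSANDRA_PASS} -e \"DESC KEYSPACE janusgraph;\""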

About this task

Important: These steps provide basic instructions on how to diagnose and recover from the described situation. In your specific situation, however, you may encounter differences in how your OCP cluster or Cassandra behaves under such a failure, in which case you should engage your company's Cassandra administrators.

Procedure

Verify the state of your Cassandra cluster

  1. Authenticate into the Kubernetes namespace where Agile Service Manager is deployed as part of your solution.
  2. Check the status of your Cassandra pods.
    The following example system output shows a pod being terminated because a node on the cluster has failed:
    NAME    READY   STATUS        RESTARTS   AGE
    noi-cassandra-0    1/1     Running       1          4d2h
    noi-cassandra-1    1/1     Running       1          4d2h
    noi-cassandra-2    1/1     Terminating   1          4d2h 
    
  3. Verify that the Cassandra node is down.
    For the pods that are still running, use a command like the following example:
    kubectl exec noi-cassandra-0 -- bash -c "/opt/ibm/cassandra/bin/nodetool status -r"
    Sample system output, where the line starting with 'DN' indicates a node that is down:
    Datacenter: datacenter1
    =======================
    Status=Up/Down
    |/ State=Normal/Leaving/Joining/Moving
    --  Address       Load       Tokens       Owns (effective)  Host ID                               Rack
    UN  noi-cassandra-0.noi-cassandra.<project>.svc.cluster.local  64.34 MiB  256          100.0%            f6e6f151-ca7b-4117-be87-97245e61d7e9  rack1
    
    UN  noi-cassandra-1.noi-cassandra.<project>.svc.cluster.local 64.34 MiB  256          100.0%            989027b6-896b-4622-b282-9aa1dc2d9e39  rack1
    
    DN  10.254.4.4    64.31 MiB  256          100.0%            ce054185-d72a-4b48-9c34-e8199b6e1559  rack1
    

Restore a three-node Cassandra cluster

  1. Authenticate into the Kubernetes namespace where Agile Service Manager is deployed as part of your solution.
  2. Remove the Cassandra node that is down from the Cassandra cluster before the new one comes online.
    Use a command like the following example (using your node ID).

    This command runs Cassandra's nodetool utility in one of the nodes still up (in this case noi-cassandra-0) to remove the Cassandra node that is marked as down.

    kubectl exec noi-cassandra-0 -- bash -c "/opt/ibm/cassandra/bin/nodetool removenode ce054185-d72a-4b48-9c34-e8199b6e1559"
  3. Confirm the deletion using a command like the following example against one of the running pods:
    kubectl exec noi-cassandra-0 -- bash -c "/opt/ibm/cassandra/bin/nodetool status -r"
  4. Delete the pod that was running on the lost node and is reporting the Terminating state:
    kubectl delete pod noi-cassandra-2 --grace-period=0 --force
    Tip: You may not need to delete the pod, depending on your cluster configuration and the outcome of your investigation. It will, however, be necessary to delete it if the pod is permanently stuck in the 'Terminating' state, as in this example.
  5. Bring the new node online in your OCP cluster.
    The container is initialized to join the Cassandra cluster, replacing the removed node.
    Troubleshooting: Check whether the newly added Cassandra node lists itself as a seed. Run the following command to check the configured seeds. In this example, the added node is the noi-cassandra-2 pod.
    kubectl exec noi-cassandra-2 -- bash -c "grep seeds: /opt/ibm/cassandra/conf/cassandra.yaml"
    System output example listing the seeds configured for the node:
        - seeds: "10.254.12.2,10.254.8.7"
    If the newly added node is listing itself as a seed, it can report inconsistent information with unexpected results, should it be the node queried by the Agile Service Manager services. To limit the potential impact, follow the Preparing your system for data restoration steps to stop your Agile Service Manager services from accessing your data until you have stabilized the new node.
  6. Perform a full repair.
    The following command instructs Cassandra to perform a repair of the data:
    kubectl exec noi-cassandra-0 -- bash -c "/opt/ibm/cassandra/bin/nodetool repair --full"
    Tip: For large data sets, it is preferable to run the previous repair command several times, each time for a limited range of tokens. You can get a list of tokens with the following command:
    kubectl exec noi-cassandra-0 -- bash -c "/opt/ibm/cassandra/bin/nodetool info --tokens"
    Example system output:
    ID                     : e2494466-0cc9-4268-a5f8-5d5fa363faaa
    …
    Token                  : -9026954462746495840
    Token                  : -8998340199710379626
    …
    Token                  : 9099714334544743528
    Token                  : 9120502118133589206
    
    Run the repair command several times, each time specifying a different range of tokens, for example:
    kubectl exec noi-cassandra-0 -- bash -c "/opt/ibm/cassandra/bin/nodetool repair --start-token -9026954462746495840 --end-token -8998340199710379626"
    You can check the progress of the repair in the Cassandra pods logs.
  7. Perform a checksum verification of your data.
    For example:
    kubectl exec noi-cassandra-0 -- bash -c "/opt/ibm/cassandra/bin/nodetool verify -e"
    If any errors are returned, repeat the full repair step until the checksum verification no longer returns any errors.