Backing up database data (OCP)

You can back up (and later restore) existing topology data for the Agile Service Manager OCP installation. This can be helpful when updating your system, as part of your company's data management best practices, or for maintenance reasons.

About this task

Agile Service Manager stores its information in the Cassandra database, and uses PostgreSQL to provide inventory-like search and query capabilities over the same data. The backup and restore procedures are therefore based on backing up the Cassandra database (where all the information is stored) and, after restoring it, triggering a topology re-broadcast that repopulates PostgreSQL so that the information in both stores remains consistent.

Note: Back up Postgres data before you back up Cassandra data. If you back up in a different order, the Postgres backup might contain information that does not exist in the topology. Because of this ordering, it is also recommended that you restore Postgres data before you restore Cassandra data. After you restore in this order, any gaps between Postgres data and Cassandra data can be rebuilt with the resync crawler.
Agile Service Manager can be installed on Red Hat OpenShift Container Platform (OCP) following different deployment models:
  • As part of a Netcool Operations Insight solution
  • Or as a stand-alone product (that is, without NOI) as part of a Watson AIOps deployment
Naming conventions for Agile Service Manager on OCP (as part of NOI or AIOps Event Manager)
When Agile Service Manager is deployed as part of NOI, the naming convention the Agile Service Manager pods follow is: {releaseName}-topology-{resource}-{suffix}, where 'suffix' is a number for the Stateful Sets (such as a Cassandra database) that represents the pod replica number, or a hash-like uid for the Deployment pods.
For example, Cassandra pods might be {releaseName}-topology-cassandra-0 for the first replica of the set, whereas the topology service pod could be {releaseName}-topology-topology-57477f4978-qp9gz.
In cases where Cassandra is shared, the shared instance pods Agile Service Manager uses would be {releaseName}-cassandra-{suffix} instead.
Naming conventions for Agile Service Manager on OCP standalone (pre-reqs for AIOps AI Manager)
When Agile Service Manager is installed as a standalone product, the naming convention for its pods is {releaseName}-{resource}-{suffix}.
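
The naming conventions above can be sketched as a small bash helper. This is only an illustration: the function name asm_pod_prefix and its model keywords (noi, shared, standalone) are hypothetical and are not part of the product. It returns the pod name without the {suffix} part, which is the pattern that the procedures below pass to grep.

```shell
# Hypothetical helper (not part of the product): prints the pod-name prefix
# for a deployment model, following the naming conventions described above.
#   $1 = model: noi (part of NOI), shared (shared Cassandra), or standalone
#   $2 = release name
#   $3 = resource, for example cassandra or topology
asm_pod_prefix() {
  local model=$1 release=$2 resource=$3
  case "${model}" in
    noi)
      # {releaseName}-topology-{resource}
      printf '%s-topology-%s\n' "${release}" "${resource}" ;;
    shared|standalone)
      # {releaseName}-{resource}
      printf '%s-%s\n' "${release}" "${resource}" ;;
    *)
      echo "unknown deployment model: ${model}" >&2
      return 1 ;;
  esac
}
```

For example, asm_pod_prefix shared noi cassandra prints noi-cassandra, which is the pattern used in the sample commands in this topic.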

When deployed as part of Netcool Operations Insight, and depending on whether you are installing for the first time or upgrading, you can share the Cassandra database instances between all the NOI products, or deploy separate instances for Agile Service Manager and NOI. When sharing the Cassandra instances, the database will contain data from the rest of the components in NOI that make use of Cassandra, as well as the Agile Service Manager data.

Assumption: The backup and restore procedures in these topics assume a standard Agile Service Manager NOI deployment (and not a standalone deployment), a shared use of Cassandra, and that the release name used is 'noi'. Adjust the samples provided to your circumstances.
Backup
The backup procedure documented here performs a backup of all the keyspaces in the Cassandra database, including those not specific to Agile Service Manager.
Restore
The restore procedures focus on restoring only the keyspace that is relevant to Agile Service Manager (that is, 'janusgraph').

Procedure

Preparing your system for backup

  1. Authenticate into the Kubernetes namespace where Agile Service Manager is deployed as part of your solution.
  2. Deploy the following kPodLoop bash shell function.
    kPodLoop is a bash shell function that allows a command to be run against all the matching Kubernetes containers. You can copy it into the shell.
    kPodLoop() {
     __podPattern=$1
     __podCommand=$2
     __podList=$( kubectl get pods --field-selector=status.phase=Running --no-headers=true --output=custom-columns=NAME:.metadata.name | grep "${__podPattern}" )
     printf "Pods found: %s\n" "$(echo -n ${__podList})"
     for pod in ${__podList}; do
        printf "\n===== EXECUTING COMMAND in pod: %-42s =====\n" "${pod}"
        kubectl exec "${pod}" -- bash -c "${__podCommand}"
        printf '_%.0s' {1..80}
        printf "\n"
     done;
    }
    
    This kPodLoop bash shell function filters the pods to run the commands against only those that are in a 'Running' phase. This filter ensures that the configuration pods that are only run as part of your installation, like the secret generator pod, are skipped.
  3. Make a note of the scaling of Agile Service Manager pods.
    kubectl get pods --field-selector=status.phase=Running --no-headers=true --output=custom-columns=CNAME:.metadata.ownerReferences[0].name | grep noi-topology | uniq --count
    Example output:
    1 noi-topology-dns-observer
    1 noi-topology-docker-observer
    3 noi-topology-elasticsearch
    1 noi-topology-file-observer
    1 noi-topology-kubernetes-observer
    1 noi-topology-layout
    1 noi-topology-merge
    1 noi-topology-noi-gateway
    1 noi-topology-noi-probe
    1 noi-topology-observer-service
    1 noi-topology-search
    1 noi-topology-status
    1 noi-topology-topology
    1 noi-topology-ui-api
    
  4. Verify access to each Cassandra database (this command will return a list of keyspaces from each Cassandra node). Adjust the Cassandra pod names based on your deployment model naming convention.
    Example commands:
    kPodLoop noi-cassandra "CASSANDRA_USER=\$(cat \$CASSANDRA_AUTH_USERNAME_FILE); CASSANDRA_PASS=\$(cat \$CASSANDRA_AUTH_PASSWORD_FILE); cqlsh -u \${CASSANDRA_USER} -p \${CASSANDRA_PASS} -e \"DESC KEYSPACES;\""
    
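    Note: For SSL-enabled environments, also pass the --ssl option to cqlsh:
    kPodLoop aiops-topology-cassandra "CASSANDRA_USER=\$(cat \$CASSANDRA_AUTH_USERNAME_FILE); CASSANDRA_PASS=\$(cat \$CASSANDRA_AUTH_PASSWORD_FILE); cqlsh -u \${CASSANDRA_USER} -p \${CASSANDRA_PASS} -e \"DESC KEYSPACES;\" --ssl"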
  5. Scale down Agile Service Manager pods.
    kubectl scale deployment --replicas=0 noi-topology-dns-observer
    kubectl scale deployment --replicas=0 noi-topology-file-observer
    kubectl scale deployment --replicas=0 noi-topology-kubernetes-observer
    kubectl scale deployment --replicas=0 noi-topology-observer-service
    kubectl scale deployment --replicas=0 noi-topology-noi-gateway
    kubectl scale deployment --replicas=0 noi-topology-noi-probe
    kubectl scale deployment --replicas=0 noi-topology-layout
    kubectl scale deployment --replicas=0 noi-topology-merge
    kubectl scale deployment --replicas=0 noi-topology-status
    kubectl scale deployment --replicas=0 noi-topology-search
    kubectl scale deployment --replicas=0 noi-topology-ui-api
    kubectl scale deployment --replicas=0 noi-topology-topology
    
    The Cassandra and Elasticsearch pods (noi-cassandra and noi-topology-elasticsearch) are left active. The Cassandra pods must be running so that their data can be backed up, whereas the Elasticsearch pods neither interact with nor influence the Cassandra contents, so they can be kept running.
    Important: Include in this scale down any additional observers in your deployment.
  6. Verify that scaling down was successful.
    kubectl get pods --field-selector=status.phase=Running  | grep noi-topology
    The Agile Service Manager services have now been scaled down, and the Cassandra database contents will not be modified anymore.
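
Because the scale-down list must include any additional observers in your deployment, it can help to generate the commands from the actual Deployment names instead of typing them. The following is a minimal sketch, not part of the product: the function name scale_down_cmds is hypothetical, and it only prints the commands so that you can review them before running them. It skips any name that contains cassandra or elasticsearch, because those pods are left active.

```shell
# Hypothetical helper (not part of the product): reads Deployment names on
# stdin and prints one scale-down command for each name that starts with the
# given prefix. Nothing is executed; the commands are only printed.
scale_down_cmds() {
  local prefix=$1 name
  while read -r name; do
    case "${name}" in
      *cassandra*|*elasticsearch*)
        ;;   # left active during the backup, so never scaled down
      "${prefix}"*)
        printf 'kubectl scale deployment --replicas=0 %s\n' "${name}" ;;
    esac
  done
}

# Usage (review the printed commands before you run them):
#   kubectl get deployments --no-headers=true \
#     --output=custom-columns=NAME:.metadata.name | scale_down_cmds noi-topology-
```

The noi-topology- prefix follows the release name assumed in this topic; adjust it to your deployment model.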

Backing up data

  1. Deploy the pbkc bash shell function.
    The pbkc function attempts to back up the Cassandra database on all nodes as close to simultaneously as possible. You can copy it into the shell. In the following example script, set RELEASE so that ${RELEASE}-cassandra matches your installation's Cassandra pod names, according to your deployment model's naming convention.
    pbkc() {
     ## Parallel Backup of Kubernetes Cassandra
     RELEASE=noi
    
     DATE=$( date +"%F-%H-%M-%S" )
     LOGFILEBASE=/tmp/clusteredCassandraBackup-${DATE}-
     declare -A LOG
    
     ## get the current list of cassandra pods.
     podlist=$( oc get pods --field-selector=status.phase=Running --no-headers=true --output=custom-columns=NAME:.metadata.name | grep "${RELEASE}-cassandra" )
     for pod in ${podlist}; do
      echo -e "BACKING UP CASSANDRA IN POD ${pod}"
      oc exec ${pod} -- bash -c "/opt/ibm/backup_scripts/backup_cassandra.sh -f > /dev/null 2> /dev/null &"  
     done
    
     printf "Waiting for backups to complete:"
    
     COMPLETE="false"
     while [ $COMPLETE = "false" ]; do
      COMPLETE="true"
      for pod in ${podlist}; do
        log=`oc exec $pod -- bash -c "ps -ef | grep backup_cassandra.sh- | grep -v grep | awk '{print \\\$NF}'"`
        if [ $? != 0 ]; then
          COMPLETE="false"
        elif [ -n "$log" ]; then
         COMPLETE="false"
         LOG[$pod]=$log
        fi
      done
      printf "-"
      if [ $COMPLETE = "false" ]; then
        sleep 30
      fi
     done
    
     printf "\n"
    
     for pod in ${podlist}; do
      log=${LOGFILEBASE}${pod}.log
      oc cp ${pod}:${LOG[$pod]} ${log}
      echo -e "Backup of ${pod} completed, please verify via log file ($log)"
     done
    }
  2. Run a clean-up on all keyspaces in all Cassandra instances.
    Example Cassandra keyspaces cleanup:
    kPodLoop noi-cassandra "nodetool cleanup system_schema"
    kPodLoop noi-cassandra "nodetool cleanup system"
    kPodLoop noi-cassandra "nodetool cleanup system_distributed"
    kPodLoop noi-cassandra "nodetool cleanup system_auth"
    kPodLoop noi-cassandra "nodetool cleanup janusgraph"
    kPodLoop noi-cassandra "nodetool cleanup system_traces"
    
  3. Run backup on all Cassandra instances (using the pbkc shell function just deployed).
    pbkc
  4. Check the final output in the log file for each backup.
    Adjust the date in the grep command as appropriate.
    grep  "BACKUP DONE SUCCESSFULLY" /tmp/clusteredCassandraBackup-2019-06-14-14-09-50*
    /tmp/clusteredCassandraBackup-2019-06-14-14-09-50-noi-cassandra-0.log:Fri Jun 14 14:11:04 UTC 2019 BACKUP DONE SUCCESSFULLY !!!
    /tmp/clusteredCassandraBackup-2019-06-14-14-09-50-noi-cassandra-1.log:Fri Jun 14 14:11:16 UTC 2019 BACKUP DONE SUCCESSFULLY !!!
    /tmp/clusteredCassandraBackup-2019-06-14-14-09-50-noi-cassandra-2.log:Fri Jun 14 14:11:16 UTC 2019 BACKUP DONE SUCCESSFULLY !!!
    
    Tip: For additional information about backing up secrets and the system_auth keyspace, see the Secrets and system_auth keyspace note in the 'Restoring database data (OCP)' topic.
    When the backup has completed successfully, you can restore your Agile Service Manager services to normal operation.
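
The grep check in step 4 lists only the logs that contain the success message, so a failed backup shows up as a missing line. The following minimal sketch makes that case explicit. The function name check_backup_logs is hypothetical; the only things it takes from this topic are the log file prefix and the BACKUP DONE SUCCESSFULLY message.

```shell
# Hypothetical helper (not part of the product): checks every log file whose
# path starts with the given prefix for the success message.
check_backup_logs() {
  local prefix=$1 log failed=0 found=0
  for log in "${prefix}"*; do
    [ -f "${log}" ] || continue      # skips the unexpanded pattern if no file matches
    found=1
    if grep -q "BACKUP DONE SUCCESSFULLY" "${log}"; then
      echo "OK      ${log}"
    else
      echo "FAILED  ${log}"
      failed=1
    fi
  done
  # Succeeds only if at least one log was found and none lacked the message.
  [ "${found}" -eq 1 ] && [ "${failed}" -eq 0 ]
}

# Usage (adjust the date to match your backup run):
#   check_backup_logs /tmp/clusteredCassandraBackup-2019-06-14-14-09-50
```

The function returns a non-zero status if any log lacks the message or if no log file matches the prefix, so it can be used in a script before the services are scaled up again.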

Restoring services

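  1. Scale up the services to the original level.
    Use the replica counts that you noted in step 3 of 'Preparing your system for backup'. The following example assumes one replica for each deployment. Adjust the list and the counts to match your deployment, including any additional observers that you scaled down.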
    kubectl scale deployment --replicas=1 noi-topology-topology
    kubectl scale deployment --replicas=1 noi-topology-layout
    kubectl scale deployment --replicas=1 noi-topology-merge
    kubectl scale deployment --replicas=1 noi-topology-status
    kubectl scale deployment --replicas=1 noi-topology-search
    kubectl scale deployment --replicas=1 noi-topology-observer-service
    kubectl scale deployment --replicas=1 noi-topology-noi-gateway
    kubectl scale deployment --replicas=1 noi-topology-noi-probe
    kubectl scale deployment --replicas=1 noi-topology-ui-api
    kubectl scale deployment --replicas=1 noi-topology-dns-observer
    kubectl scale deployment --replicas=1 noi-topology-file-observer
    kubectl scale deployment --replicas=1 noi-topology-rest-observer
    kubectl scale deployment --replicas=1 noi-topology-kubernetes-observer
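
If you saved the output of the scaling check to a file before you scaled down, the scale-up commands can be derived from it instead of being typed by hand. The following is a minimal sketch under that assumption. The function name scale_up_cmds and the file name are hypothetical, and the sketch assumes that the recorded names are Deployment names, as in the example output shown earlier. It only prints the commands so that you can review them first.

```shell
# Hypothetical helper (not part of the product): reads "<count> <name>" lines,
# which is the format produced by uniq --count, and prints one scale-up
# command per entry. Nothing is executed; the commands are only printed.
scale_up_cmds() {
  local count name
  while read -r count name; do
    [ -n "${name}" ] || continue     # ignore blank or malformed lines
    case "${name}" in
      *cassandra*|*elasticsearch*) continue ;;   # these were never scaled down
    esac
    printf 'kubectl scale deployment --replicas=%s %s\n' "${count}" "${name}"
  done
}

# Usage, assuming the counts were saved to a file named asm-scaling.txt:
#   scale_up_cmds < asm-scaling.txt
```

Printing the commands rather than running them keeps the operator in control of the order in which services are restarted.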
    

Results

The backup procedure stores the generated backup files inside the Agile Service Manager Cassandra pods, in the /opt/ibm/cassandra/data/backup_tar/ directory.
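
To confirm that the files exist, you can list that directory in each Cassandra pod. The following is a minimal sketch: the function name list_backup_files is hypothetical, and it assumes that the ls command is available in the Cassandra containers. The pod names in the usage line follow the 'noi' release name assumed in this topic.

```shell
# Hypothetical helper (not part of the product): lists the backup directory
# in each pod that is named on the command line.
list_backup_files() {
  local pod
  for pod in "$@"; do
    echo "== ${pod} =="
    kubectl exec "${pod}" -- ls -lh /opt/ibm/cassandra/data/backup_tar/
  done
}

# Usage:
#   list_backup_files noi-cassandra-0 noi-cassandra-1 noi-cassandra-2
```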

What to do next

You can restore your backed-up data as and when required.