You can restore existing topology data for the Agile Service Manager Red Hat® OpenShift® installation, if you backed it up earlier. Backing up data is helpful during a system update or during maintenance.
About this task
Assumption: The backup and restore procedures in these topics assume a standard Agile Service Manager Netcool® Operations Insight® deployment (and not a standalone deployment), a shared use of Cassandra, and that the release name used is 'noi'. Adjust the samples provided to your circumstances.
- Backup
- The backup procedure documented here backs up all the keyspaces in the Cassandra database, including those not specific to Agile Service Manager.
- Restore
- The restore procedures focus on restoring only the keyspace that is relevant to Agile Service Manager (that is, 'janusgraph').
Secrets and system_auth keyspace:
During Agile Service Manager deployment, a secret called {release}-topology-cassandra-auth-secret is generated, if none exists with that name. Cassandra is protected with the username and password stored in this secret, which the Agile Service Manager services use to connect to the database.
In the restore scenarios described, it is assumed that Agile Service Manager is deployed in a standard way, meaning that the connection to Cassandra is set up with the described secret. If you want to restore the system_auth keyspace (instead of just the janusgraph keyspace), you must make sure that the username and password in that secret match the credentials in the keyspace for the restored version.
The following command 'exports' the secret from your Kubernetes cluster. However, the process of backing up and restoring secrets is not described here, as it is contingent on your company security policies, which you must follow.
kubectl get secret noi-topology-cassandra-auth-secret -o yaml
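For example, assuming the secret stores its credentials under the conventional 'username' and 'password' data keys (verify the actual key names in the YAML output above), you can decode them as follows:
kubectl get secret noi-topology-cassandra-auth-secret -o jsonpath='{.data.username}' | base64 -d
kubectl get secret noi-topology-cassandra-auth-secret -o jsonpath='{.data.password}' | base64 -d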
Restore scenarios: You might encounter various typical data restore scenarios while administering Agile Service Manager on Red Hat OpenShift.
- Same cluster data restore (rollback)
- This scenario covers the restore of a database backup to the same cluster from which it was taken, as documented in the procedure; essentially rolling back your deployment to a previous state.
- You perform a data 'rollback' because of data corruption, or because you need to revert changes made to your Agile Service Manager deployment.
- Restoring to a different cluster
- In this scenario, you recreate your Agile Service Manager deployment data in a different cluster from the one where you took the backup.
- You must have Agile Service Manager successfully deployed on your new target cluster.
- Typically, you would restore data to a different cluster in a disaster recovery situation, where your primary cluster is not accessible; or in a situation where you want to clone a production system to a test environment.
- The backup procedure stores the files that are generated during backup inside the Agile Service Manager Cassandra pods, in the /opt/ibm/cassandra/data/backup_tar/ directory. Ensure that these backup files are present in the target cluster Cassandra pods before attempting this scenario: either copy them to the new location, or mount the external storage to that location. Once the backup files are in the correct target location, you restore the backed-up data as documented in the procedure.
Tip: The backup_tar directory might not exist if you did not set a storageClass for cassandrabak during the Agile Service Manager installation.
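To check whether the directory exists on a given Cassandra pod, you can run a quick listing; the pod name used here ('primary-cassandra-0-0') follows the naming in the samples below and might differ in your deployment:
oc exec primary-cassandra-0-0 -- ls -ld /opt/ibm/cassandra/data/backup_tar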
- Losing a Cassandra node in your Red Hat OpenShift cluster
- This scenario describes the steps to perform when you lose a worker node in your Red Hat OpenShift cluster on which one of the Agile Service Manager Cassandra pods is running, thereby effectively losing one of your Cassandra replica nodes.
- This might happen for various reasons, and it leaves your Cassandra cluster with two remaining functioning nodes, to which you then add a node to restore your three-node configuration.
Procedure
Preparing your system for data restoration
Restoring to a different cluster: The backup procedure stores the backup-generated files inside the Agile Service Manager Cassandra pods, in the /opt/ibm/cassandra/data/backup_tar/ directory. If you are restoring your data to a different cluster, ensure that these backup files are present in the target cluster Cassandra pods: either copy them to the new location, or mount the external storage to that location.
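For example, a backup archive can be copied into a target cluster pod with oc cp; the archive name below is a placeholder for one of your actual backup tar files, and the pod name must be adjusted to match your target cluster:
oc cp /tmp/cassandra_backup.tar primary-cassandra-0-0:/opt/ibm/cassandra/data/backup_tar/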
- Authenticate into the Agile Service Manager Kubernetes namespace on the primary data center.
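For example (a sketch; the API server URL, user, and namespace are placeholders for your environment):
oc login https://api.mycluster.example.com:6443 -u myuser
oc project noi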
- Deploy the kPodLoop bash shell function.
kPodLoop is a bash shell function that runs a given command against matching Kubernetes containers. You can copy it into the shell.
kPodLoop() {
  __podPattern=$1
  __podCommand=$2
  # List all running pods whose names match the supplied pattern
  __podList=$( kubectl get pods --field-selector=status.phase=Running --no-headers=true --output=custom-columns=NAME:.metadata.name | grep ${__podPattern} )
  printf "Pods found: $(echo -n ${__podList})\n"
  for pod in ${__podList}; do
    # Pick the first container in the pod to run the command in
    container=$(oc get pods ${pod} -o jsonpath='{.spec.containers[*].name}' | tr ' ' '\n' | head -1)
    printf "\n===== EXECUTING COMMAND in pod: %-42s =====\n" ${pod}
    oc exec -c ${container} ${pod} -- bash -c "${__podCommand}"
    printf '_%.0s' {1..80}
    printf "\n"
  done;
}
The kPodLoop bash shell function runs the commands only against pods that are in a 'Running' phase. This filter ensures that configuration pods that run only as part of your installation, like the secret generator pod, are skipped.
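For example, to run a standard Cassandra status check across all Cassandra pods:
kPodLoop primary-cassandra "nodetool status"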
- Make a note of the scaling of Agile Service Manager pods.
oc get pods --field-selector=status.phase=Running --no-headers=true --output=custom-columns=CNAME:.metadata.ownerReferences[0].name | grep primary-topology | uniq --count
Example output:
1 primary-topology-dns-observer
1 primary-topology-docker-observer
3 primary-topology-elasticsearch
1 primary-topology-file-observer
1 primary-topology-kubernetes-observer
1 primary-topology-layout
1 primary-topology-merge
1 primary-topology-noi-gateway
1 primary-topology-noi-probe
1 primary-topology-observer-service
1 primary-topology-search
1 primary-topology-status
1 primary-topology-topology
1 primary-topology-ui-api
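Optionally, you can capture this output in a file (the path here is just an example) so that you can refer to it when you scale the services back up later:
oc get pods --field-selector=status.phase=Running --no-headers=true --output=custom-columns=CNAME:.metadata.ownerReferences[0].name | grep primary-topology | uniq --count > /tmp/asm-scaling.txt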
- Verify access to each Cassandra database (this command returns a list of keyspaces from each Cassandra node).
kPodLoop primary-cassandra "CASSANDRA_USER=\$(cat \$CASSANDRA_AUTH_USERNAME_FILE); CASSANDRA_PASS=\$(cat \$CASSANDRA_AUTH_PASSWORD_FILE); cqlsh -u \${CASSANDRA_USER} -p \${CASSANDRA_PASS} \${CASSANDRA_CLIENT_ENCRYPTION_ENABLED:+--ssl} -e \"DESC KEYSPACES;\""
- Scale down Agile Service Manager pods.
oc scale deployment --replicas=0 primary-topology-dns-observer
oc scale deployment --replicas=0 primary-topology-file-observer
oc scale deployment --replicas=0 primary-topology-kubernetes-observer
oc scale deployment --replicas=0 primary-topology-observer-service
oc scale deployment --replicas=0 primary-topology-noi-gateway
oc scale deployment --replicas=0 primary-topology-noi-probe
oc scale deployment --replicas=0 primary-topology-layout
oc scale deployment --replicas=0 primary-topology-merge
oc scale deployment --replicas=0 primary-topology-status
oc scale deployment --replicas=0 primary-topology-search
oc scale deployment --replicas=0 primary-topology-ui-api
oc scale deployment --replicas=0 primary-topology-topology
The Cassandra and Elasticsearch pods (primary-cassandra and primary-topology-elasticsearch) are left active. Cassandra pods must be kept running in order to execute the restore of their data, whereas the Elasticsearch pods have no interaction with, nor influence on, the Cassandra contents, so they can also be kept running.
Important: Include in this scale down any additional observers that you installed in your deployment, as in the sample loop below.
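If you are unsure which observers are deployed, a loop like the following (a sketch that assumes observer deployment names contain 'observer') scales down everything that matches:
for d in $(oc get deployments --no-headers --output=custom-columns=NAME:.metadata.name | grep primary-topology | grep observer); do
  oc scale deployment --replicas=0 ${d}
done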
- Verify that scaling down was successful.
oc get pods --field-selector=status.phase=Running | grep primary-topology
The Agile Service Manager services have now been scaled down, and the Cassandra database contents will no longer be modified.
- Repeat steps 1 - 6 for the backup data center.
Restore data
- On the primary data center, update the Cassandra restore script to suppress the truncation of restored data and the copying of data to the backup data center.
Note: The restore_cassandra.sh tool truncates all the data in the target table each time it is used, and although the restore is targeted at one Cassandra node only, the truncation is propagated to all nodes. To suppress the truncate step, you must update the restore script on all the nodes except the first node.
- Copy cassandra_functions.sh out of one of the asm-cassandra nodes.
oc cp primary-cassandra-0-0:/opt/ibm/backup_scripts/cassandra_functions.sh /tmp/cassandra_functions.sh
Edit cassandra_functions.sh:
vi /tmp/cassandra_functions.sh
- Locate the call to truncate_all_tables within the restore() function and comment out the appropriate lines, as in the following example:
printf "`date` Starting Restore \n"
#### truncate_all_tables
#### testResult $? "truncate tables"
- Save the file. Then, copy the file back to all nodes, except the first Cassandra node.
oc cp /tmp/cassandra_functions.sh primary-cassandra-1-0:/opt/ibm/backup_scripts/cassandra_functions.sh
oc cp /tmp/cassandra_functions.sh primary-cassandra-2-0:/opt/ibm/backup_scripts/cassandra_functions.sh
- Locate the timestamps of the backups to restore from each Cassandra node.
Each node's backup was started at a similar time, so the timestamps may differ by a few seconds. In the following example a backup was performed at about 2019-06-11 09:36, and grep is then used to filter the listing to these backup archives.
Tip: You can skip this step if you are about to apply the most recent backup. If you do, the -t parameter can be omitted during all subsequent steps.
kPodLoop primary-cassandra "ls -larth \${CASSANDRA_DATA}/../backup_tar | grep 2019-06-11-09"
Pods found: primary-cassandra-0-0 primary-cassandra-1-0 primary-cassandra-2-0
===== EXECUTING COMMAND in pod: primary-cassandra-0-0 =====
-rwxrwxr-x 1 cassandra cassandra 524M Jun 11 09:37 cassandra_primary-topology-cassandra-0-0_KS_system_schema_KS_system_KS_system_distributed_KS_system_auth_KS_janusgraph_KS_system_traces_date_2019-06-11-0936-04.tar
________________________________________________________________________________
===== EXECUTING COMMAND in pod: primary-cassandra-1-0 =====
-rwxrwxr-x 1 cassandra cassandra 565M Jun 11 09:37 cassandra_primary-topology-cassandra-1-0_KS_system_schema_KS_system_KS_system_distributed_KS_system_auth_KS_janusgraph_KS_system_traces_date_2019-06-11-0936-07.tar
________________________________________________________________________________
===== EXECUTING COMMAND in pod: primary-cassandra-2-0 =====
-rwxrwxr-x 1 cassandra cassandra 567M Jun 11 09:37 cassandra_primary-topology-cassandra-2-0_KS_system_schema_KS_system_KS_system_distributed_KS_system_auth_KS_janusgraph_KS_system_traces_date_2019-06-11-0936-07.tar
________________________________________________________________________________
- Working across each Cassandra node in the primary data center, restore the relevant backup of the janusgraph keyspace.
- primary-cassandra-0-0
Remember: This will cause the existing data in the janusgraph keyspace tables to be truncated.
kPodLoop primary-cassandra-0-0 "CASSANDRA_USER=\$(cat \$CASSANDRA_AUTH_USERNAME_FILE); CASSANDRA_PASS=\$(cat \$CASSANDRA_AUTH_PASSWORD_FILE); /opt/ibm/backup_scripts/restore_cassandra.sh -k janusgraph -t 2019-06-11-0936-04 -u \${CASSANDRA_USER} -p \${CASSANDRA_PASS} -f"
- primary-cassandra-1-0
kPodLoop primary-cassandra-1-0 "CASSANDRA_USER=\$(cat \$CASSANDRA_AUTH_USERNAME_FILE); CASSANDRA_PASS=\$(cat \$CASSANDRA_AUTH_PASSWORD_FILE); /opt/ibm/backup_scripts/restore_cassandra.sh -k janusgraph -t 2019-06-11-0936-07 -u \${CASSANDRA_USER} -p \${CASSANDRA_PASS} -f"
- primary-cassandra-2-0
kPodLoop primary-cassandra-2-0 "CASSANDRA_USER=\$(cat \$CASSANDRA_AUTH_USERNAME_FILE); CASSANDRA_PASS=\$(cat \$CASSANDRA_AUTH_PASSWORD_FILE); /opt/ibm/backup_scripts/restore_cassandra.sh -k janusgraph -t 2019-06-11-0936-07 -u \${CASSANDRA_USER} -p \${CASSANDRA_PASS} -f"
- For one node in the primary data center, repair the janusgraph keyspace.
kPodLoop primary-cassandra-0-0 "nodetool repair --full --local janusgraph"
- For each node in the backup data center, rebuild the Cassandra node from the primary data center.
backup-cassandra-0-0
kPodLoop backup-cassandra-0-0 "nodetool rebuild -- dc-1"
backup-cassandra-1-0
kPodLoop backup-cassandra-1-0 "nodetool rebuild -- dc-1"
backup-cassandra-2-0
kPodLoop backup-cassandra-2-0 "nodetool rebuild -- dc-1"
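The source data center name ('dc-1' in these samples) must match your actual Cassandra topology; you can confirm the data center names with nodetool status, which lists each node under its data center:
kPodLoop backup-cassandra-0-0 "nodetool status"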
- After you run the rebuild scripts, clear historical alert data by using the statusClear crawler from the primary data center.
https://master_fqdn/1.0/topology/swagger#!/Crawlers/statusClear
Restore services
- Scale up the services to the original level.
The original level was obtained in a previous step.
oc scale deployment --replicas=1 primary-topology-topology
oc scale deployment --replicas=1 primary-topology-layout
oc scale deployment --replicas=1 primary-topology-merge
oc scale deployment --replicas=1 primary-topology-status
oc scale deployment --replicas=1 primary-topology-search
oc scale deployment --replicas=1 primary-topology-observer-service
oc scale deployment --replicas=1 primary-topology-noi-gateway
oc scale deployment --replicas=1 primary-topology-noi-probe
oc scale deployment --replicas=1 primary-topology-ui-api
oc scale deployment --replicas=1 primary-topology-dns-observer
oc scale deployment --replicas=1 primary-topology-file-observer
oc scale deployment --replicas=1 primary-topology-rest-observer
oc scale deployment --replicas=1 primary-topology-kubernetes-observer
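You can confirm that the pods are running again by repeating the earlier check:
oc get pods --field-selector=status.phase=Running | grep primary-topology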
- Scale up the services in the backup data center to the original level.
- Rebroadcast data to Elasticsearch (that is, re-index Elasticsearch).
If data in Elasticsearch is out of sync with data in the Cassandra database, resynchronize it by calling the rebroadcast API of the Topology service, specifying a tenantId. This triggers the rebroadcast of all known resources on Kafka, and the Search service then indexes those resources in Elasticsearch.
https://master_fqdn/1.0/topology/swagger#!/Crawlers/rebroadcastTopology
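As an illustration only, such a call might look like the following sketch; the endpoint path, credentials, and tenant ID value are assumptions, so confirm the exact operation in the Swagger UI linked above before use:
curl -X POST -u ${ASM_USER}:${ASM_PASS} -H "X-TenantID: ${TENANT_ID}" "https://master_fqdn/1.0/topology/crawlers/rebroadcast"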