If you backed up your topology data earlier, you can restore it to your Agile Service Manager OCP installation. This can be helpful when updating your system, or for maintenance reasons.
About this task
Assumption: The backup and restore procedures in these topics assume a standard Agile Service Manager NOI deployment (and not a standalone deployment), a shared use of Cassandra, and that the release name used is 'noi'. Adjust the samples provided to your circumstances.
- Backup
- The backup procedure documented here performs a backup of all the keyspaces in the
Cassandra database, including those not specific to Agile Service Manager.
- Restore
- The restore procedures focus on restoring only the keyspace that is relevant to Agile
Service Manager (that is, 'janusgraph').
Secrets and system_auth keyspace:
During Agile Service Manager deployment, a secret called {release}-topology-cassandra-auth-secret is generated, if none already exists with that name. Cassandra is protected with the user and password of that secret, which are used by the Agile Service Manager services to connect to the database.
The restore scenarios described here assume that Agile Service Manager is deployed in a standard way, meaning that the connection to Cassandra is set with the described secret. If you were to restore the system_auth keyspace (instead of just the janusgraph keyspace), you would have to make sure that the user and password in that secret match the credentials contained in the keyspace for the version being restored.
The following command 'exports' the secret from your Kubernetes cluster. However, the process of backing up and restoring secrets is not described here, as it is contingent on your company security policies, which you should follow.
kubectl get secret noi-topology-cassandra-auth-secret -o yaml
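If your security policy permits storing the manifest, one possible sketch of the export step strips the cluster-specific metadata so the secret can be re-applied on a target cluster. The helper below is an illustration, not part of the product scripts, and the secret name assumes a release name of 'noi'.

```shell
# Illustrative helper (not a product script): filter cluster-specific
# metadata out of an exported secret manifest so that it can be applied
# cleanly on another cluster.
sanitize_secret_manifest() {
  grep -Ev '^[[:space:]]*(resourceVersion|uid|creationTimestamp|selfLink):'
}

# Example usage, run against the source cluster:
# kubectl get secret noi-topology-cassandra-auth-secret -o yaml \
#   | sanitize_secret_manifest > cassandra-auth-secret.yaml
# Then, on the target cluster:
# kubectl apply -f cassandra-auth-secret.yaml
```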
Restore scenarios: You may encounter a variety of typical data restore
scenarios while administering Agile Service Manager on OCP.
- Same cluster data restore (rollback)
- This scenario covers the restore of a database backup to the same cluster from where it was
taken, as documented in the procedure; essentially rolling back your deployment to a previous state.
- You typically perform such a data 'rollback' due to data corruption, or the need to revert some
changes made to your Agile Service Manager deployment.
- Restoring to a different cluster
- In this scenario you recreate your Agile Service Manager deployment data in a different cluster from the one where you took the backup.
- You must have Agile Service Manager successfully deployed on your new target cluster.
- Typically, you would restore data to a different cluster in a disaster recovery situation, where
your primary cluster is not accessible; or in a situation where you want to clone a production
system to a test environment.
- The backup procedure stores the backup-generated files inside the Agile Service Manager Cassandra pods, in the /opt/ibm/cassandra/data/backup_tar/ directory. Ensure that these backup files are present in the target cluster Cassandra pods before attempting this scenario: either copy them to the new location, or mount the external storage to that location. Once the backup files are in the correct target location, you restore the backed-up data as documented in the procedure.
Tip: The backup_tar directory might not exist if you did not set a storageClass for cassandrabak during the Agile Service Manager installation.
- Losing a Cassandra node in your OCP cluster
- This scenario describes the steps to perform should you lose a worker node in your OCP cluster
where one of the Agile Service Manager Cassandra pods is running, thereby effectively losing one of
your Cassandra replica nodes.
- This might happen for a variety of reasons and will leave your Cassandra cluster with two
remaining functioning nodes, to which you then add a new node to restore your three-node
configuration.
- See the
Replacing a Cassandra node in your OCP cluster topic for more details.
Procedure
Preparing your system for data restoration
Restoring to a different cluster: The backup procedure stores the backup-generated files inside the Agile Service Manager Cassandra pods, in the /opt/ibm/cassandra/data/backup_tar/ directory. If you are restoring your data to a different cluster, ensure that these backup files are present in the target cluster Cassandra pods: either copy them to the new location, or mount the external storage to that location.
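As an illustration of the copy approach, a small helper can generate the kubectl cp command for each target Cassandra pod. This helper is an assumption for illustration, not a product script; the pod names and target path assume the 'noi' release used throughout this topic.

```shell
# Illustrative helper: print the command that copies one backup archive
# into a target-cluster Cassandra pod.
backup_copy_cmd() {
  local idx=$1 tarfile=$2
  echo "kubectl cp ${tarfile} noi-cassandra-${idx}:/opt/ibm/cassandra/data/backup_tar/"
}

# Example: print (then run) the copy commands for a three-node cluster.
# for i in 0 1 2; do backup_copy_cmd "$i" "backups/node${i}.tar"; done
```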
-
Authenticate into the Agile Service Manager Kubernetes namespace.
-
Deploy the kPodLoop bash shell function.
kPodLoop is a bash shell function that allows a command to be run against matching
Kubernetes containers. You can copy it into the shell.
kPodLoop() {
  __podPattern=$1
  __podCommand=$2
  __podList=$( kubectl get pods --field-selector=status.phase=Running --no-headers=true --output=custom-columns=NAME:.metadata.name | grep ${__podPattern} )
  printf "Pods found: $(echo -n ${__podList})\n"
  for pod in ${__podList}; do
    printf "\n===== EXECUTING COMMAND in pod: %-42s =====\n" ${pod}
    kubectl exec ${pod} -- bash -c "${__podCommand}"
    printf '_%.0s' {1..80}
    printf "\n"
  done;
}
This kPodLoop bash shell function runs the command only against pods that are in the 'Running' phase. This filter ensures that configuration pods that run only as part of your installation, such as the secret generator pod, are skipped.
-
Make a note of the scaling of Agile Service Manager pods.
kubectl get pods --field-selector=status.phase=Running --no-headers=true --output=custom-columns=CNAME:.metadata.ownerReferences[0].name | grep noi-topology | uniq --count
Example output:
1 noi-topology-dns-observer
1 noi-topology-docker-observer
1 noi-topology-file-observer
1 noi-topology-kubernetes-observer
1 noi-topology-layout
1 noi-topology-merge
1 noi-topology-noi-gateway
1 noi-topology-noi-probe
1 noi-topology-observer-service
1 noi-topology-search
1 noi-topology-status
1 noi-topology-topology
1 noi-topology-ui-api
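To make it easier to return to these exact levels after the restore, you could record the replica counts to a file and replay them later. The helpers below are a sketch under the assumption of a 'noi' release; they are not part of the product scripts.

```shell
# Sketch (assumed helpers): save the current replica count per noi-topology
# deployment, then replay the saved counts after the restore completes.
save_replicas() {
  kubectl get deployments --no-headers \
    --output=custom-columns=NAME:.metadata.name,REPLICAS:.spec.replicas \
    | grep noi-topology > /tmp/asm-replicas.txt
}

restore_replicas() {
  # Re-apply the saved counts line by line.
  while read -r name replicas; do
    kubectl scale deployment --replicas="${replicas}" "${name}"
  done < /tmp/asm-replicas.txt
}
```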
-
Verify access to each Cassandra database (this command will return a list of keyspaces from
each Cassandra node).
kPodLoop noi-cassandra "CASSANDRA_USER=\$(cat \$CASSANDRA_AUTH_USERNAME_FILE); CASSANDRA_PASS=\$(cat \$CASSANDRA_AUTH_PASSWORD_FILE); cqlsh -u \${CASSANDRA_USER} -p \${CASSANDRA_PASS} -e \"DESC KEYSPACES;\""
-
Scale down Agile Service Manager pods.
kubectl scale deployment --replicas=0 noi-topology-dns-observer
kubectl scale deployment --replicas=0 noi-topology-file-observer
kubectl scale deployment --replicas=0 noi-topology-kubernetes-observer
kubectl scale deployment --replicas=0 noi-topology-observer-service
kubectl scale deployment --replicas=0 noi-topology-noi-gateway
kubectl scale deployment --replicas=0 noi-topology-noi-probe
kubectl scale deployment --replicas=0 noi-topology-layout
kubectl scale deployment --replicas=0 noi-topology-merge
kubectl scale deployment --replicas=0 noi-topology-status
kubectl scale deployment --replicas=0 noi-topology-search
kubectl scale deployment --replicas=0 noi-topology-ui-api
kubectl scale deployment --replicas=0 noi-topology-topology
The Cassandra pods (noi-cassandra) are left active; they must be running for the restore scripts to be executed against their data.
Important: Include in this scale down any additional observers you have installed in your
deployment.
-
Verify that scaling down was successful.
kubectl get pods --field-selector=status.phase=Running | grep noi-topology
The Agile Service Manager services have now been scaled down, and the Cassandra database contents are no longer being modified.
Restore data
-
Update the Cassandra restore script to suppress the truncation of restored data.
Note: The restore_cassandra.sh tool truncates all data in the target table each time it is used, and even though the restore is targeted at one Cassandra node only, the truncation is propagated to all nodes. To suppress the truncate step, you must update the restore script on all but the first node.
-
Copy cassandra_functions.sh out of one of the noi-cassandra nodes.
kubectl cp noi-cassandra-0:/opt/ibm/backup_scripts/cassandra_functions.sh /tmp/cassandra_functions.sh
-
Edit cassandra_functions.sh
vi /tmp/cassandra_functions.sh
Locate the call to truncate_all_tables within the restore() function and comment out the appropriate lines, as in the following example:
printf "`date` Starting Restore \n"
#### truncate_all_tables
#### testResult $? "truncate tables"
repair_keyspace
-
Save the file, then copy the file back to all nodes, except the first Cassandra node.
kubectl cp /tmp/cassandra_functions.sh noi-cassandra-1:/opt/ibm/backup_scripts/cassandra_functions.sh
kubectl cp /tmp/cassandra_functions.sh noi-cassandra-2:/opt/ibm/backup_scripts/cassandra_functions.sh
-
Locate the timestamps of the backups from each Cassandra node to restore.
Each node's backup was started at a similar time, so the timestamps may differ by a few
seconds. In the following example a backup was performed at about 2019-06-11 09:36, and grep is then
used to filter to these backup archives.
Tip: You can ignore this step if you are about to apply the most recent backup. If you do, the -t parameter can be omitted during all subsequent steps.
kPodLoop noi-cassandra "ls -larth \${CASSANDRA_DATA}/../backup_tar | grep 2019-06-11-09"
Pods found: noi-cassandra-0 noi-cassandra-1 noi-cassandra-2
===== EXECUTING COMMAND in pod: noi-cassandra-0 =====
-rwxrwxr-x 1 cassandra cassandra 524M Jun 11 09:37 cassandra_noi-topology-cassandra-0_KS_system_schema_KS_system_KS_system_distributed_KS_system_auth_KS_janusgraph_KS_system_traces_date_2019-06-11-0936-04.tar
________________________________________________________________________________
===== EXECUTING COMMAND in pod: noi-cassandra-1 =====
-rwxrwxr-x 1 cassandra cassandra 565M Jun 11 09:37 cassandra_noi-topology-cassandra-1_KS_system_schema_KS_system_KS_system_distributed_KS_system_auth_KS_janusgraph_KS_system_traces_date_2019-06-11-0936-07.tar
________________________________________________________________________________
===== EXECUTING COMMAND in pod: noi-cassandra-2 =====
-rwxrwxr-x 1 cassandra cassandra 567M Jun 11 09:37 cassandra_noi-topology-cassandra-2_KS_system_schema_KS_system_KS_system_distributed_KS_system_auth_KS_janusgraph_KS_system_traces_date_2019-06-11-0936-07.tar
________________________________________________________________________________
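If you need to target a specific backup, the timestamp portion of the archive name is what the -t parameter expects (based on the archive naming shown above; verify against your restore script's usage). A small helper, shown as an illustration rather than a product script, can extract it:

```shell
# Illustrative helper: extract the timestamp suffix from a backup archive
# name, e.g. '..._date_2019-06-11-0936-04.tar' -> '2019-06-11-0936-04'.
backup_timestamp() {
  local f=$1
  f=${f##*_date_}      # drop everything up to and including '_date_'
  echo "${f%.tar}"     # drop the trailing '.tar'
}

# Example:
# backup_timestamp cassandra_noi-topology-cassandra-0_KS_janusgraph_date_2019-06-11-0936-04.tar
```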
-
Working across each Cassandra node, restore the relevant backup of the janusgraph keyspace.
-
noi-cassandra-0
Remember: This will cause the existing data in the janusgraph keyspace tables to be truncated.
kPodLoop noi-cassandra-0 "/opt/ibm/backup_scripts/restore_cassandra.sh -k janusgraph -f"
kPodLoop noi-cassandra-0 "nodetool repair --full janusgraph"
-
noi-cassandra-1
kPodLoop noi-cassandra-1 "/opt/ibm/backup_scripts/restore_cassandra.sh -k janusgraph -f"
kPodLoop noi-cassandra-1 "nodetool repair --full janusgraph"
-
noi-cassandra-2
kPodLoop noi-cassandra-2 "/opt/ibm/backup_scripts/restore_cassandra.sh -k janusgraph -f"
kPodLoop noi-cassandra-2 "nodetool repair --full janusgraph"
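The three per-node restores above follow the same pattern, so they can also be expressed as a loop. This sketch assumes the kPodLoop function deployed earlier and that the truncation-suppression edit is in place on nodes 1 and 2.

```shell
# Sketch: run the janusgraph restore and a full repair on each Cassandra
# node in turn (node 0 still performs the truncation).
restore_all_nodes() {
  local i
  for i in 0 1 2; do
    kPodLoop "noi-cassandra-${i}" "/opt/ibm/backup_scripts/restore_cassandra.sh -k janusgraph -f"
    kPodLoop "noi-cassandra-${i}" "nodetool repair --full janusgraph"
  done
}
```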
-
Scale up the services to the original level. The original level was obtained in a previous step.
kubectl scale deployment --replicas=1 noi-topology-topology
kubectl scale deployment --replicas=1 noi-topology-layout
kubectl scale deployment --replicas=1 noi-topology-merge
kubectl scale deployment --replicas=1 noi-topology-status
kubectl scale deployment --replicas=1 noi-topology-search
kubectl scale deployment --replicas=1 noi-topology-observer-service
kubectl scale deployment --replicas=1 noi-topology-noi-gateway
kubectl scale deployment --replicas=1 noi-topology-noi-probe
kubectl scale deployment --replicas=1 noi-topology-ui-api
kubectl scale deployment --replicas=1 noi-topology-dns-observer
kubectl scale deployment --replicas=1 noi-topology-file-observer
kubectl scale deployment --replicas=1 noi-topology-rest-observer
kubectl scale deployment --replicas=1 noi-topology-kubernetes-observer
- After you have run the restore script and scaled up the services, clear historical alert
data using the statusClear crawler.
https://master_fqdn/1.0/topology/swagger#!/Crawlers/statusClear
Restore services
-
Rebroadcast data to Inventory.
If data in Inventory is out of sync with data in the Cassandra database, resynchronize it by calling the rebroadcast API of the Topology service. This triggers the rebroadcast of all known resources on Kafka, and the Inventory service then indexes those resources in PostgreSQL. Call the rebroadcast API of the Topology service, specifying a tenantId:
https://master_fqdn/1.0/topology/swagger#!/Crawlers/rebroadcastTopology
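As an illustration only: the exact path, HTTP method, and authentication for the rebroadcast call are shown in the Swagger UI linked above, so treat everything in this sketch (host and credential variables, endpoint path, and tenant id) as assumptions to verify against your own Swagger page.

```shell
# Hypothetical sketch of the rebroadcast call; verify the real path and
# method in your Swagger UI before use. All variables are assumptions.
rebroadcast_topology() {
  curl -k -u "${ASM_USER}:${ASM_PASS}" -X POST \
    "https://${MASTER_FQDN}/1.0/topology/crawlers/rebroadcast?tenantId=${TENANT_ID}"
}

# Example (all values illustrative):
# MASTER_FQDN=master.example.com ASM_USER=asm ASM_PASS=pass \
#   TENANT_ID=your-tenant-id rebroadcast_topology
```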