Recovering from a long-term site outage
Use these troubleshooting instructions to recover a site from a long-term outage on a geo-redundancy-enabled cluster.
About this task
If a site is nonoperational for longer than gc_grace_seconds, then the nodetool repair functionality cannot be used, as the deleted resources would reappear. In such cases, you can recover the site by completing the steps that are listed in the following procedure.
Note: The default value for gc_grace_seconds is 864000 seconds (10 days), and is stored in Cassandra per table in system_schema.tables.
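For example, you can check the per-table value from cqlsh on any running node (a minimal check; add --ssl if client encryption is enabled in your deployment):

-- Show gc_grace_seconds per table. The 'janusgraph' keyspace name is
-- taken from the sample output later in this procedure.
SELECT keyspace_name, table_name, gc_grace_seconds
FROM system_schema.tables
WHERE keyspace_name = 'janusgraph';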
Procedure
- Log in to a running Cassandra node in the operational data center.
oc rsh <release-name>-cassandra-0-0
- Run the Python utility script to remove the affected data center. The script takes a
while to run.
python3 /opt/ibm/datacenter.py --remove <affected-data-center-name>
See the sample output. In this example, primary is the release name of the operational data center and dc-2 is the name of the affected data center.

$ oc rsh primary-cassandra-0-0
sh-4.4$ python3 /opt/ibm/datacenter.py --remove dc-2
Client encryption enabled
Connected : 'localhost'
Current replication strategy of keyspace 'system_distributed' : {'dc-1': 3, 'dc-2': 3}
Changed replication strategy for 'system_distributed' to : 'dc-1': 3
Current replication strategy of keyspace 'janusgraph' : {'dc-1': 3, 'dc-2': 3}
Changed replication strategy for 'janusgraph' to : 'dc-1': 3
Current replication strategy of keyspace 'system_traces' : {'dc-1': 2, 'dc-2': 2}
Changed replication strategy for 'system_traces' to : 'dc-1': 2
Current replication strategy of keyspace 'system_auth' : {'dc-1': 3, 'dc-2': 3}
Changed replication strategy for 'system_auth' to : 'dc-1': 3
Datacenter: dc-1
================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load        Tokens  Owns (effective)  Host ID                               Rack
UN  20.20.1.1  375.9 KiB   16      100.0%            496db012-a177-4929-a3b9-44f9ee088285  rack-1
UN  20.20.1.2  235.05 KiB  16      100.0%            12705a85-23a1-4cbe-935e-6aa3bdbe3a45  rack-1
UN  20.20.1.3  259.1 KiB   16      100.0%            17136c9a-76e6-4c3b-ae40-74546d927e4a  rack-1
Datacenter: dc-2
================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load        Tokens  Owns (effective)  Host ID                               Rack
DN  20.20.2.1  342.49 KiB  16      0.0%              3dcf8b32-0773-427b-8856-b0e085e0a5c2  rack-1
DN  20.20.2.2  337.68 KiB  16      0.0%              d26d8ed5-a179-44e7-a660-e63dd2929daa  rack-1
DN  20.20.2.3  311.86 KiB  16      0.0%              5086a32d-82aa-4d3a-b11b-8099fcdcaa33  rack-1
Removing node with host id : 3dcf8b32-0773-427b-8856-b0e085e0a5c2
Removing node with host id : d26d8ed5-a179-44e7-a660-e63dd2929daa
Removing node with host id : 5086a32d-82aa-4d3a-b11b-8099fcdcaa33
Datacenter: dc-1
================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load        Tokens  Owns (effective)  Host ID                               Rack
UN  20.20.1.1  375.89 KiB  16      100.0%            496db012-a177-4929-a3b9-44f9ee088285  rack-1
UN  20.20.1.2  235.36 KiB  16      100.0%            12705a85-23a1-4cbe-935e-6aa3bdbe3a45  rack-1
UN  20.20.1.3  265.27 KiB  16      100.0%            17136c9a-76e6-4c3b-ae40-74546d927e4a  rack-1
sh-4.4$ exit
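After the script completes, you can rerun nodetool status from the same pod to confirm that only the operational data center remains in the ring:

sh-4.4$ nodetool status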
- If the Cassandra cluster in the nonoperational data center is running, shut it down. Then, delete the Cassandra cluster data from the nonoperational data center.
You can use different storage locations when you deploy Netcool® Operations Insight® on OpenShift®, and different storage locations have different ways to delete data. For local storage, the Persistent Volume Claim (PVC) and the data on the physical volumes must be deleted. For some configurations of Red Hat® Ceph®, deleting the PVC is enough.
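The following is a sketch only, with hypothetical PVC names; verify the exact PVC names for your deployment before deleting anything:

# List the Cassandra PVCs in the nonoperational data center's project,
# then delete each one. For local storage, also remove the leftover
# data from the physical volumes on each node.
oc get pvc | grep cassandra
oc delete pvc <cassandra-pvc-name>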
- Start the nonoperational data center.
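The start mechanism depends on how the deployment is managed. As a sketch only, assuming the Cassandra nodes run as statefulsets named <release-name>-cassandra-<node> (the naming pattern used in the later steps) and are not scaled up automatically by an operator, you might scale them up explicitly:

# Scale each Cassandra statefulset in the nonoperational data center
# back up to one replica.
oc scale sts <release-name>-cassandra-0 --replicas=1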
- Wait for either node 0 or node 1 to completely start (Ready 2/2).
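For example, you can watch the node 0 pod (pod naming pattern from step 1) until the READY column shows 2/2:

oc get pod <release-name>-cassandra-0-0 -w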
- Edit the statefulset for the node that is not started.
oc edit sts <release-name>-cassandra-<node>
- Change the Cassandra seeds to include all seeds, and add the START_NOW value.
- name: CASSANDRA_SEEDS
  value: <IP-address-of-primary-node 0>,<IP-address-of-primary-node 1>,<IP-address-of-backup-node 0>,<IP-address-of-backup-node 1>
- name: CASSANDRA_START_NOW
  value: "true"
Note: You can obtain the list of all seeds from any running node by running the oc logs <release-name>-cassandra-0-0 -c <release-name>-cassandra | grep 'Seed list' command.
- Wait for the node to be ready (Ready 2/2).
- Repeat steps 6, 7, and 8 for node 2.
- Run the Python utility script on a running node within the operational data center, such
as the Cassandra pod, to add back the nonoperational data center.
python3 /opt/ibm/datacenter.py --add <previously-affected-data-center-name>
See the sample output. In this example, primary is the release name of the operational data center and dc-2 is the name of the nonoperational data center.

$ oc rsh primary-cassandra-0-0
sh-4.4$ python /opt/ibm/datacenter.py --add dc-2
Client encryption enabled
Connected : 'localhost'
Current replication strategy of keyspace 'system_auth' : {u'dc-1': 3}
Changed replication strategy for 'system_auth' to : 'dc-1': 3, 'dc-2': 3
Current replication strategy of keyspace 'janusgraph' : {u'dc-1': 3}
Changed replication strategy for 'janusgraph' to : 'dc-1': 3, 'dc-2': 3
Current replication strategy of keyspace 'system_distributed' : {u'dc-1': 3}
Changed replication strategy for 'system_distributed' to : 'dc-1': 3, 'dc-2': 3
Current replication strategy of keyspace 'system_traces' : {u'dc-1': 2}
Changed replication strategy for 'system_traces' to : 'dc-1': 2, 'dc-2': 2
sh-4.4$ exit
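To double-check the new replication settings, you can query the schema tables from cqlsh on any running node (add --ssl if client encryption is enabled, as in the sample output):

-- The replication map per keyspace should now list both data centers.
SELECT keyspace_name, replication FROM system_schema.keyspaces;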
- For each node in the nonoperational data center, rebuild the node from the operational
data center.
nodetool rebuild -- <operational-data-center>
See the sample output. In this example, backup is the release name of the nonoperational data center and dc-1 is the name of the operational data center.

$ oc rsh backup-cassandra-0-0
Defaulted container "backup-cassandra" out of: backup-cassandra, backup-cassandra-nodetool-api
sh-4.4$ nodetool rebuild -- dc-1
sh-4.4$ tail /opt/ibm/cassandra/logs/system.log
INFO [RMI TCP Connection(2)-127.0.0.1] 2022-07-21 16:24:21,245 StorageService.java:1281 - rebuild from dc: dc-1, (All keyspaces), (All tokens)
INFO [RMI TCP Connection(2)-127.0.0.1] 2022-07-21 16:24:21,299 StreamResultFuture.java:90 - [Stream #9385d8f0-0911-11ed-994f-3162bd2d4064] Executing streaming plan for Rebuild
INFO [StreamConnectionEstablisher:2] 2022-07-21 16:24:21,302 StreamSession.java:282 - [Stream #9385d8f0-0911-11ed-994f-3162bd2d4064] Starting streaming to /20.20.1.2
INFO [StreamConnectionEstablisher:1] 2022-07-21 16:24:21,302 StreamSession.java:282 - [Stream #9385d8f0-0911-11ed-994f-3162bd2d4064] Starting streaming to /20.20.1.1
INFO [StreamConnectionEstablisher:1] 2022-07-21 16:24:21,493 StreamCoordinator.java:270 - [Stream #9385d8f0-0911-11ed-994f-3162bd2d4064, ID#0] Beginning stream session with /20.20.1.1
INFO [StreamConnectionEstablisher:2] 2022-07-21 16:24:21,495 StreamCoordinator.java:270 - [Stream #9385d8f0-0911-11ed-994f-3162bd2d4064, ID#0] Beginning stream session with /20.20.1.2
INFO [STREAM-IN-/20.20.1.1:7001] 2022-07-21 16:24:21,506 StreamResultFuture.java:173 - [Stream #9385d8f0-0911-11ed-994f-3162bd2d4064 ID#0] Prepare completed. Receiving 27 files(73.867KiB), sending 0 files(0.000KiB)
INFO [STREAM-IN-/20.20.1.2:7001] 2022-07-21 16:24:21,539 StreamResultFuture.java:187 - [Stream #9385d8f0-0911-11ed-994f-3162bd2d4064] Session with /20.20.1.2 is complete
INFO [StreamReceiveTask:1] 2022-07-21 16:24:22,153 StreamResultFuture.java:187 - [Stream #9385d8f0-0911-11ed-994f-3162bd2d4064] Session with /20.20.1.1 is complete
INFO [StreamReceiveTask:1] 2022-07-21 16:24:22,162 StreamResultFuture.java:219 - [Stream #9385d8f0-0911-11ed-994f-3162bd2d4064] All sessions completed
Note: You can use the nodetool netstats command to monitor the progress.
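For example, you can filter the output for the active rebuild streams (a minimal sketch):

sh-4.4$ nodetool netstats | grep -i -E 'rebuild|receiving'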
- Start or scale up the services in the nonoperational data center to the original level.
- Rebroadcast data to inventory in the affected data center. If data in the inventory service is out of sync with data in the Cassandra database, resynchronize it by calling the rebroadcast API of the topology service. This triggers the rebroadcast of all known resources on Kafka, and the inventory service will then index those resources in PostgreSQL. Call the rebroadcast API of the topology service by specifying a tenantId:

https://master_fqdn/1.0/topology/swagger#!/Crawlers/rebroadcastTopology

For more information, see Backing up topology database data.
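As a sketch only, a call to the rebroadcast API might look like the following; the endpoint path, credentials, and tenant header shown here are assumptions, so confirm the exact operation in the Swagger UI before use:

# Hypothetical call sketch; verify path, method, and authentication
# in the Swagger UI (rebroadcastTopology under Crawlers).
curl -k -X POST \
  -u <user>:<password> \
  -H 'X-TenantID: <tenantId>' \
  'https://master_fqdn/1.0/topology/crawlers/rebroadcast'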