Recovering from a long-term site outage

Use these troubleshooting instructions to recover a site from a long-term outage on a geo-redundancy enabled cluster.

About this task

If a geo-redundancy enabled site is down for longer than the gc_grace_seconds period, the nodetool repair functionality cannot be used. By then, the tombstones that mark deletions might already be purged from the operational site, so a repair would cause the deleted resources to reappear. In such cases, recover the site by completing the following procedure.
Note: The default value for gc_grace_seconds is 864000 seconds (10 days), and is stored in Cassandra per table in system_schema.tables.
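
The following Python sketch illustrates the decision that is described above: whether the outage lasted longer than gc_grace_seconds, in which case this procedure applies instead of nodetool repair. The function name outage_exceeds_gc_grace and the choice to compare against the smallest per-table value are assumptions made for illustration and are not part of the product. The CQL string reads the per-table values from system_schema.tables and is meant to be run separately, for example in cqlsh.

```python
from datetime import datetime, timedelta, timezone

# Default from the note above: 864000 seconds (10 days).
DEFAULT_GC_GRACE_SECONDS = 864000

# CQL that lists the per-table values; run it separately (for example in cqlsh).
GC_GRACE_QUERY = (
    "SELECT keyspace_name, table_name, gc_grace_seconds "
    "FROM system_schema.tables;"
)


def outage_exceeds_gc_grace(outage_start, now, gc_grace_seconds_per_table):
    """Return True if the outage lasted longer than the smallest gc_grace_seconds.

    Assumption: comparing against the smallest per-table value is the
    conservative choice. If no values are supplied, the default is used.
    """
    smallest = min(gc_grace_seconds_per_table, default=DEFAULT_GC_GRACE_SECONDS)
    return (now - outage_start) > timedelta(seconds=smallest)


if __name__ == "__main__":
    went_down = datetime(2022, 7, 1, tzinfo=timezone.utc)
    checked_at = datetime(2022, 7, 21, tzinfo=timezone.utc)
    if outage_exceeds_gc_grace(went_down, checked_at, [864000]):
        print("Outage is longer than gc_grace_seconds: follow this procedure.")
    else:
        print("Outage is within gc_grace_seconds.")
```

The values passed in gc_grace_seconds_per_table would come from the result of the query; the dates in the usage example are placeholders.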

Procedure

  1. Log in to a running Cassandra node in the operational data center.
    oc rsh <release-name>-cassandra-0-0
  2. Run the Python utility script to remove the affected data center. The script takes a while to run.
    python3 /opt/ibm/datacenter.py --remove <affected-data-center-name>
    See the sample output. In this example, primary is the release name of the operational data center and dc-2 is the name of the affected data center.
    $ oc rsh primary-cassandra-0-0
    sh-4.4$ python3 /opt/ibm/datacenter.py --remove dc-2
    Client encryption enabled
    Connected : 'localhost'
    Current replication strategy of keyspace 'system_distributed' : {'dc-1': 3, 'dc-2': 3}
    Changed replication strategy for 'system_distributed' to : 'dc-1': 3
    Current replication strategy of keyspace 'janusgraph' : {'dc-1': 3, 'dc-2': 3}
    Changed replication strategy for 'janusgraph' to : 'dc-1': 3
    Current replication strategy of keyspace 'system_traces' : {'dc-1': 2, 'dc-2': 2}
    Changed replication strategy for 'system_traces' to : 'dc-1': 2
    Current replication strategy of keyspace 'system_auth' : {'dc-1': 3, 'dc-2': 3}
    Changed replication strategy for 'system_auth' to : 'dc-1': 3
    Datacenter: dc-1
    ================
    Status=Up/Down
    |/ State=Normal/Leaving/Joining/Moving
    --  Address    Load       Tokens       Owns (effective)  Host ID                               Rack
    UN  20.20.1.1  375.9 KiB  16           100.0%            496db012-a177-4929-a3b9-44f9ee088285  rack-1
    UN  20.20.1.2  235.05 KiB  16           100.0%            12705a85-23a1-4cbe-935e-6aa3bdbe3a45  rack-1
    UN  20.20.1.3  259.1 KiB  16           100.0%            17136c9a-76e6-4c3b-ae40-74546d927e4a  rack-1
    Datacenter: dc-2
    ================
    Status=Up/Down
    |/ State=Normal/Leaving/Joining/Moving
    --  Address    Load       Tokens       Owns (effective)  Host ID                               Rack
    DN  20.20.2.1  342.49 KiB  16           0.0%              3dcf8b32-0773-427b-8856-b0e085e0a5c2  rack-1
    DN  20.20.2.2  337.68 KiB  16           0.0%              d26d8ed5-a179-44e7-a660-e63dd2929daa  rack-1
    DN  20.20.2.3  311.86 KiB  16           0.0%              5086a32d-82aa-4d3a-b11b-8099fcdcaa33  rack-1
    
    Removing node with host id : 3dcf8b32-0773-427b-8856-b0e085e0a5c2
    Removing node with host id : d26d8ed5-a179-44e7-a660-e63dd2929daa
    Removing node with host id : 5086a32d-82aa-4d3a-b11b-8099fcdcaa33
    Datacenter: dc-1
    ================
    Status=Up/Down
    |/ State=Normal/Leaving/Joining/Moving
    --  Address    Load       Tokens       Owns (effective)  Host ID                               Rack
    UN  20.20.1.1  375.89 KiB  16           100.0%            496db012-a177-4929-a3b9-44f9ee088285  rack-1
    UN  20.20.1.2  235.36 KiB  16           100.0%            12705a85-23a1-4cbe-935e-6aa3bdbe3a45  rack-1
    UN  20.20.1.3  265.27 KiB  16           100.0%            17136c9a-76e6-4c3b-ae40-74546d927e4a  rack-1
    
    sh-4.4$ exit 
  3. In the nonoperational data center, shut down the Cassandra cluster if it is running, and then delete the Cassandra cluster data.
    You can use different storage locations when you deploy Netcool® Operations Insight® on OpenShift®, and the way to delete data depends on the storage location. For local storage, you must delete the Persistent Volume Claim (PVC) and the data on the physical volumes. For some configurations of Red Hat® Ceph®, deleting the PVC is enough.
  4. Start the nonoperational data center.
  5. Wait for either node 0 or node 1 to start completely (Ready 2/2).
  6. Edit the statefulset for the node that is not started.
    oc edit sts <release-name>-cassandra-<node>
  7. Change the CASSANDRA_SEEDS value to include all seeds, and add the CASSANDRA_START_NOW variable with a value of "true".
    
    - name: CASSANDRA_SEEDS
      value: <IP-address-of-primary-node 0>,<IP-address-of-primary-node 1>,<IP-address-of-backup-node 0>,<IP-address-of-backup-node 1>
    - name: CASSANDRA_START_NOW
      value: "true"
    Note: You can obtain the list of all seeds from any running node by running the oc logs <release-name>-cassandra-0-0 -c <release-name>-cassandra | grep 'Seed list' command.
  8. Wait for the node to be ready (Ready 2/2).
  9. Repeat steps 6, 7, and 8 for node 2.
  10. From a running Cassandra pod in the operational data center, run the Python utility script to add the nonoperational data center back.
    /opt/ibm/datacenter.py --add <previously-affected-data-center>
    See the sample output. In this example, primary is the release name of the operational data center and dc-2 is the name of the nonoperational data center.
    $ oc rsh primary-cassandra-0-0
    sh-4.4$ python /opt/ibm/datacenter.py --add dc-2
    Client encryption enabled
    Connected : 'localhost'
    Current replication strategy of keyspace 'system_auth' : {u'dc-1': 3}
    Changed replication strategy for 'system_auth' to : 'dc-1': 3, 'dc-2': 3
    Current replication strategy of keyspace 'janusgraph' : {u'dc-1': 3}
    Changed replication strategy for 'janusgraph' to : 'dc-1': 3, 'dc-2': 3
    Current replication strategy of keyspace 'system_distributed' : {u'dc-1': 3}
    Changed replication strategy for 'system_distributed' to : 'dc-1': 3, 'dc-2': 3
    Current replication strategy of keyspace 'system_traces' : {u'dc-1': 2}
    Changed replication strategy for 'system_traces' to : 'dc-1': 2, 'dc-2': 2
    sh-4.4$ exit
  11. For each node in the nonoperational data center, rebuild the node from the operational data center.
    nodetool rebuild -- <operational-data-center>
    See the sample output. In this example, backup is the release name of the nonoperational data center and dc-1 is the name of the operational data center.
    $ oc rsh backup-cassandra-0-0
    Defaulted container "backup-cassandra" out of: backup-cassandra, backup-cassandra-nodetool-api
    sh-4.4$ nodetool rebuild -- dc-1
    sh-4.4$ tail /opt/ibm/cassandra/logs/system.log 
    INFO  [RMI TCP Connection(2)-127.0.0.1] 2022-07-21 16:24:21,245 StorageService.java:1281 - rebuild from dc: dc-1, (All keyspaces), (All tokens)
    INFO  [RMI TCP Connection(2)-127.0.0.1] 2022-07-21 16:24:21,299 StreamResultFuture.java:90 - [Stream #9385d8f0-0911-11ed-994f-3162bd2d4064] Executing streaming plan for Rebuild
    INFO  [StreamConnectionEstablisher:2] 2022-07-21 16:24:21,302 StreamSession.java:282 - [Stream #9385d8f0-0911-11ed-994f-3162bd2d4064] Starting streaming to /20.20.1.2
    INFO  [StreamConnectionEstablisher:1] 2022-07-21 16:24:21,302 StreamSession.java:282 - [Stream #9385d8f0-0911-11ed-994f-3162bd2d4064] Starting streaming to /20.20.1.1
    INFO  [StreamConnectionEstablisher:1] 2022-07-21 16:24:21,493 StreamCoordinator.java:270 - [Stream #9385d8f0-0911-11ed-994f-3162bd2d4064, ID#0] Beginning stream session with /20.20.1.1
    INFO  [StreamConnectionEstablisher:2] 2022-07-21 16:24:21,495 StreamCoordinator.java:270 - [Stream #9385d8f0-0911-11ed-994f-3162bd2d4064, ID#0] Beginning stream session with /20.20.1.2
    INFO  [STREAM-IN-/20.20.1.1:7001] 2022-07-21 16:24:21,506 StreamResultFuture.java:173 - [Stream #9385d8f0-0911-11ed-994f-3162bd2d4064 ID#0] Prepare completed. Receiving 27 files(73.867KiB), sending 0 files(0.000KiB)
    INFO  [STREAM-IN-/20.20.1.2:7001] 2022-07-21 16:24:21,539 StreamResultFuture.java:187 - [Stream #9385d8f0-0911-11ed-994f-3162bd2d4064] Session with /20.20.1.2 is complete
    INFO  [StreamReceiveTask:1] 2022-07-21 16:24:22,153 StreamResultFuture.java:187 - [Stream #9385d8f0-0911-11ed-994f-3162bd2d4064] Session with /20.20.1.1 is complete
    INFO  [StreamReceiveTask:1] 2022-07-21 16:24:22,162 StreamResultFuture.java:219 - [Stream #9385d8f0-0911-11ed-994f-3162bd2d4064] All sessions completed
    Note: You can use the nodetool netstats command to monitor the progress.
  12. Start or scale up the services in the nonoperational data center to the original level.
  13. Rebroadcast data to inventory in the affected data center.
    If data in the inventory service is out of sync with data in the Cassandra database, resynchronize it by calling the rebroadcast API of the topology service and specifying a tenantId. The call triggers a rebroadcast of all known resources on Kafka, and the inventory service then indexes those resources in PostgreSQL.
    https://master_fqdn/1.0/topology/swagger#!/Crawlers/rebroadcastTopology
    For more information, see Backing up topology database data.
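
After the procedure completes, it can be useful to confirm that every node in both data centers reports UN (Up/Normal). The following Python sketch parses status text in the layout that is shown in the sample output of step 2, which is the layout that nodetool status prints. The function names parse_nodetool_status and all_up_normal are hypothetical helpers for illustration. The sketch assumes that you captured the status text yourself; it does not run any command.

```python
def parse_nodetool_status(text):
    """Map each data center name to a list of (status, address) tuples."""
    nodes_by_dc = {}
    current_dc = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Datacenter:"):
            current_dc = line.split(":", 1)[1].strip()
            nodes_by_dc[current_dc] = []
            continue
        fields = line.split()
        # Node rows start with a two-letter code: U or D (up/down)
        # followed by N, L, J, or M (normal/leaving/joining/moving).
        if (
            current_dc is not None
            and len(fields) >= 2
            and len(fields[0]) == 2
            and fields[0][0] in "UD"
            and fields[0][1] in "NLJM"
        ):
            nodes_by_dc[current_dc].append((fields[0], fields[1]))
    return nodes_by_dc


def all_up_normal(nodes_by_dc, dc_name):
    """Return True only if the data center has nodes and all of them are UN."""
    nodes = nodes_by_dc.get(dc_name, [])
    return bool(nodes) and all(status == "UN" for status, _ in nodes)


if __name__ == "__main__":
    import sys

    # Usage assumption: status text is piped in on standard input.
    parsed = parse_nodetool_status(sys.stdin.read())
    for dc in parsed:
        state = "all nodes UN" if all_up_normal(parsed, dc) else "not all nodes UN"
        print(f"{dc}: {state}")
```

A data center that is missing from the output is treated as not healthy, so a rebuilt site that did not rejoin the cluster is reported rather than silently skipped.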