Troubleshooting restoration of management database

You can troubleshoot problems with restoring the management database.

Review the information on this page to understand the steps you can take to troubleshoot a failed restore. Be sure to first read the Overview of restore process for management database to understand the logging and error reporting.

Note that for Version 2018.4.1.10 there is a known limitation with error reporting. See Known limitation: failed restores not reporting properly.

If you cannot resolve a failed restore, contact IBM Support for assistance.

Overview of restore process for management database

The API Connect restore process is started by the command apicup subsys exec <management_subsys> restore <backupID>. The apicup installer starts a Kubernetes job named restore-<id>, which in turn starts a pod named restore-<job-id>-<pod-id>.
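
For example, after you start a restore you can locate the job and pod that the installer creates. These are generic kubectl commands, and the restore-* names that they return follow the pattern described above:

kubectl get jobs -n <namespace> | grep restore
kubectl get pods -n <namespace> | grep restore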

API Connect restore process

The management restoration pod consists of:

  • cassandra-restore - The main responsibility of this init container is to perform the Cassandra database restore. If this container encounters an error, the restore job is marked as Failed and the pod status is Init:Error (0/2).
    Note: For Version 2018.4.1.10 and earlier, this container is called job-container.
  • cassandra-health-status - An init container that verifies that the Cassandra cluster is in a healthy state.
  • lur-upgrade-job - This container checks for schema mismatches between the restored data and the currently running installation. If necessary, the container performs a schema upgrade.
  • apim-upgrade-job - This container checks for schema mismatches between the restored data and the currently running installation. If necessary, the container performs a schema upgrade. The container also resyncs all the gateway services.

The init containers start sequentially. The second init container starts only after the first init container has succeeded. The containers inside the pod start only after all the init containers in the pod have completed successfully. For more information, see the Kubernetes documentation: Understanding pod status
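
To see how far a restore pod has progressed through these containers, check the pod status column (for example, Init:1/2, as shown in the examples later in this topic), or describe the pod for per-container detail. Both are generic kubectl commands, not specific to API Connect:

kubectl get pod <restore-pod> -n <namespace>
kubectl describe pod <restore-pod> -n <namespace>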

Cassandra restore process
The Cassandra restore process is performed by the init container cassandra-restore. Backups are required from each Cassandra pod. The container process flow is:
  1. Perform preliminary retrieval checks.
  2. Download the backup tar file onto the Cassandra container.
  3. Verify the backup tar file integrity.
  4. Stop the Cassandra process.
  5. Perform the Cassandra restore.
  6. Start the Cassandra process with restored data.

If any of the preliminary checks fail, error messages are returned. The error conditions must be fixed, and then the restore process can be run again.

Note that API Connect Version 2018.4.1.10 (and later) performs extensive preliminary checks prior to beginning a restore. Running the checks extends the time required to complete a restore. In particular, restoration of large backup files (larger than 5 GB) takes longer.
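
Because a restore of a large backup can run for a long time, it can be useful to watch the restore pod while the checks and the restore run. This is a generic kubectl watch, not an API Connect command:

# Watch the restore-<job-id>-<pod-id> pod as it moves through the Init phases
kubectl get pods -n <namespace> --watch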

Logging

Both init container and container logs are available during the restore process. To obtain logs from a restore pod:

kubectl logs <restore-pod> -n <namespace> -c <init container/container name> 
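
For example, to read the logs of the Cassandra restore init container, using the pod name from the sample output later in this topic:

kubectl logs restore-hhrsc-2slbm -n <namespace> -c cassandra-restore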

The apicup command produces logs from all init containers and main containers sequentially, as soon as they finish, either successfully or with errors.

Cassandra restore logging

The logs for the init container are updated only upon completion (success or failure). To get accurate status information for the current state of the Cassandra restore process, review the ClusterRestoreStatus field in the CassandraClusters Custom Resource:

kubectl get cc -n <namespace> -o yaml | grep -A 2 ClusterRestoreStatus
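
If you want to print only the status value, the following variant uses a jsonpath expression with recursive descent, so you do not need to know where the field nests inside the Custom Resource. This is a generic kubectl output option, not an API Connect command:

kubectl get cc -n <namespace> -o jsonpath='{..ClusterRestoreStatus}{"\n"}'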

Error: invalid backup host credentials

Example output when this problem occurs:

./apicup subsys exec mgmt restore 1583268283534609276
			
Cluster                     Namespace   Backup Name           Backup Retrieval Timeout(hrs)   Status
rdd94fb4a21-apiconnect-cc   niharns1    1583268283534609276   24                              Started
													
Restore failed											
Error: rpc error: code = Unknown desc =								
Pod name: rdd94fb4a21-apiconnect-cc-0						
		
Error:												
 ssh: Could not resolve hostname 9.30.251.186.xxx: Name or service not known		
Couldn't read packet: Connection reset by peer							
**** [ Wed Mar  4 04:04:25 UTC 2020 ]prelimRetrieve 0: Preliminary Retrieve checks	 FAILED ****																								
[ Wed Mar  4 04:04:25 UTC 2020 ] Preliminary retrieve checks failed on all 1 attempts.    ABORTING restore

Error: unable to get log stream for container cassandra-health-status, pod restore-hhrsc-2slbm, job restore-hhrsc: container "cassandra-health-status" in pod "restore-hhrsc-2slbm" is waiting to start: PodInitializing

ClusterRestoreStatus in Cassandra Clusters CR:

kubectl get cc -o yaml | grep -A 1 ClusterRestoreStatus						
 ClusterRestoreStatus: 'Restore Failed: Retrieve preliminary checks failed for rdd94fb4a21-apiconnect-cc-0'

kubectl get pods | grep restore						
 restore-hhrsc-2slbm                           0/2     Init:Error   0          26m

kubectl get jobs | grep restore					
 restore-hhrsc                                 0/1           27m        27m	

Error: insufficient disk space

Issue: The preliminary checks on cc-0 can reject the restore process because of insufficient disk space.

Workaround: On a fresh installation, increase the disk space that is allocated to all Cassandra nodes (at least 4 times the size of the Cassandra backup tar file), and then re-attempt the restore. See Freespace check.
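
To check how much space is available on the Cassandra data volume before you re-attempt the restore, you can run df inside each Cassandra pod. This is a generic check that assumes the data directory is /var/db, the same path that is used in the cleanup commands later in this topic:

kubectl exec -it <cassandra-pod-X> -n <namespace> -- df -h /var/db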

Error: failure on preliminary backup checks on cc-1 and cc-2

Issue: A failure of the preliminary backup checks on cc-1 and cc-2 can leave the system in a non-ready state.

Workaround:
  1. Always look at ClusterRestoreStatus in the CassandraClusters Custom Resource for the failed or completed Cassandra restore status, regardless of the restore job status. In this case, the restore job is stuck on the health status check, and Cassandra pod cc-0 is in a non-ready state.
  2. Execute the following command on each Cassandra pod sequentially starting with cc-0. Wait for the Cassandra pod to become ready (1/1) before executing on other Cassandra pods. If the Cassandra pod is already in ready state (1/1), you do not need to wait, just run the command.
    kubectl exec -it <cassandra-pod-X> -n <namespace> -- sh -c 'rm -rf /var/db/.restore && rm -rf /var/db/restore/*'

    In this command, the X in <cassandra-pod-X> stands for the ordinal of the Cassandra pod, such as 0, 1, or 2. A scripted version of this sequence is shown after this list.

  3. Review the Cassandra operator logs to figure out why restore has failed and fix the problem.

    To see an example of this type of failure, review Example 2: Corrupted backup tar file cc-1 not reporting properly.
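
The following is a minimal sketch of the cleanup sequence from step 2. It assumes three Cassandra pods that follow the naming pattern shown in the examples in this topic (for example, rdd94fb4a21-apiconnect-cc-0); adjust the pod name prefix, namespace, and timeout for your installation:

NAMESPACE=<namespace>
CC_PREFIX=<release>-apiconnect-cc

for i in 0 1 2; do
  POD="${CC_PREFIX}-${i}"
  # Remove the restore marker and any partially restored data on this pod
  kubectl exec "${POD}" -n "${NAMESPACE}" -- sh -c 'rm -rf /var/db/.restore && rm -rf /var/db/restore/*'
  # Wait for the pod to report Ready (1/1) before moving on to the next pod
  kubectl wait --for=condition=Ready "pod/${POD}" -n "${NAMESPACE}" --timeout=15m
done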

Known limitation: failed restores not reporting properly

Issue: A failure of the preliminary backup checks on cc-0, such as a corrupt tar file or an incomplete download of a backup tar file, can cause the restore job to report completion even though the underlying ClusterRestoreStatus is marked as Restore Failed: <Reason>.

Workaround: Always check ClusterRestoreStatus in the CassandraClusters Custom Resource for the failed or completed Cassandra restore status, regardless of the restore job status. Consult the Cassandra operator logs to figure out why the restore failed, and fix the problem.
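
For example, you can check the restore status and then search the operator logs for the failure reason. The operator pod name varies by installation; this sketch assumes that you can find it by filtering on cassandra-operator:

kubectl get cc -n <namespace> -o yaml | grep -A 2 ClusterRestoreStatus
kubectl get pods -n <namespace> | grep cassandra-operator
kubectl logs <cassandra-operator-pod> -n <namespace> | grep -i 'restore failed'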

Example 1: Corrupted backup tar file cc-0 not reporting properly

In this example, restoration was started with apicup:

./apicup subsys exec mgmt restore 1583268283534609276
  1. View the initial status of the restore job and restore pod:
    kubectl get jobs | grep restore				
    restore-xs85r                                 0/1           38s        38s           
    
    kubectl get pods | grep restore						
    restore-xs85r-st554                           0/2     Init:1/2     0          92s
    
  2. Note that the restore completes, according to the job and pod status:
    kubectl get pods | grep restore							
    restore-xs85r-st554                           0/2     Completed    0          3m43s
    
    kubectl get jobs | grep restore					
    restore-xs85r                                 1/1           2m51s      4m1s  
    
  3. Observe, however, that the apicup restore command output includes the following failure status:
    ./apicup subsys exec mgmt restore 1583268283534609276					
    Cluster                     Namespace   Backup Name           Backup Retrieval Timeout(hrs)   Status
    rdd94fb4a21-apiconnect-cc   niharns1    1583268283534609276   24                              Started
    													
    Cluster                     Namespace   Backup Name           Status
    rdd94fb4a21-apiconnect-cc   niharns1    1583268283534609276   Restore Failed: Backup retrieve checks failed
    
  4. Check the value of ClusterRestoreStatus in the Custom Resource. Note that ClusterRestoreStatus is marked as Restore Failed.
    kubectl get cc -o yaml | grep -A 1 ClusterRestoreStatus			
      ClusterRestoreStatus: 'Restore Failed: Backup retrieve checks failed'
    
  5. Check the Cassandra operator pod logs and see errors:
    time="2020-03-04T04:30:33Z" level=error msg="Restore Failed: Backup retrieve checks failed with an error: \n
    Pod name: rdd94fb4a21-apiconnect-cc-0\n
    Error: \n
    gzip: stdin: not in gzip format\n
    tar: Child returned status 1\n
    tar: Error is not recoverable: exiting now\n
    retrieveCheck 0: RetrieveCheck of backup file 1583268283534609276-0-3.tar.gz FAILED\n
    **** [ Wed Mar  4 04:30:32 UTC 2020 ] retrieveCheck 0: Retrieve process END ****\n\n"									
    time="2020-03-04T04:30:33Z" level=info msg="Updating status to Restore Failed: Backup retrieve checks failed"
    

    Explanation: Even though the restore job reports that it completed, the Cassandra restore never completed because of the corrupt backup tar file. The restore job believes it reached completion, but in this case the error reporting is incorrect. Therefore, you must check ClusterRestoreStatus in the CassandraCluster Custom Resource to see whether the restore really completed. Because the ClusterRestoreStatus does not state exactly why the restore failed, you must also look at the Cassandra operator pod logs.

    Table 1. Limitation with error reporting when the Cassandra tar file for cc-0 is corrupted

    Expected flow:
      1. Cassandra operator detects that the cc-0 backup is corrupt.
      2. Operator updates ClusterRestoreStatus as Restore Failed.
      3. Operator passes an error message back to the restore init container (cassandra-restore).
      4. The apicup restore command exits with an error message that explains why the restore init container (cassandra-restore) failed. The restore pod status is Init:Error.
      5. The restore job is marked as Failed.

    Actual flow:
      1. Cassandra operator detects that the cc-0 backup is corrupt.
      2. Operator updates ClusterRestoreStatus as Restore Failed.
      3. Operator does not pass an error message back to the restore init container (cassandra-restore).
      4. The restore pod moves forward and executes the other init container and the upgrade containers.
      5. The restore pod is marked as Completed.
  6. Workaround: The Cassandra cluster remains up and running because the restore process failed internally, so you must review the Cassandra operator logs to determine exactly why the Cassandra restore failed. In this case, the Cassandra restore failed because of a corrupt backup tar file; locate and use a non-corrupted Cassandra backup.
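
Before you re-run the restore with a different backup, you can check the integrity of a backup archive locally. This is a general check that uses standard gzip and tar options, not an API Connect command; the file name follows the pattern shown in the operator log above:

gzip -t 1583268283534609276-0-3.tar.gz && tar -tzf 1583268283534609276-0-3.tar.gz > /dev/null && echo "Backup archive looks intact"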

Example 2: Corrupted backup tar file cc-1 not reporting properly

In this example, restoration was started with apicup:

./apicup subsys exec mgmt restore 1583268283534609276
  1. View the initial status of the restore job and restore pod:
    kubectl get jobs | grep restore
    restore-v7nvt 0/1 81s 81s
    
    kubectl get pods | grep restore
    restore-v7nvt-2jdzq 0/2 Init:0/2 0 101s
  2. Continue to view pods, and observe that the restore pod is stuck.

    The restore pod's first init container (cassandra-restore), which performs the Cassandra restore, is marked as complete. The second init container is stuck waiting for the Cassandra cluster to become healthy.

    kubectl get pods | grep restore
    restore-v7nvt-2jdzq 0/2 Init:1/2 0 4m8s

    The status Init:1/2 means that the first init container completed, but the pod is waiting for the second init container to finish.

  3. Note that the output from the apicup restore command gives a status of Restore Failed:
    ./apicup subsys exec mgmt restore 1583268283534609276
    Cluster                     Namespace   Backup Name           Backup Retrieval Timeout(hrs)   Status
    rdd94fb4a21-apiconnect-cc   niharns1    1583268283534609276   24                              Started
    
    Cluster                     Namespace   Backup Name           Status
    rdd94fb4a21-apiconnect-cc   niharns1    1583268283534609276   Restore Failed: Backup retrieve checks failed
    
  4. View the ClusterRestoreStatus in the Cassandra Cluster Custom Resource:
    kubectl get cc -o yaml | grep -A 1 ClusterRestoreStatus
      ClusterRestoreStatus: 'Restore Failed: Backup retrieve checks failed'
    
  5. View the Cassandra operator pod logs:
    time="2020-03-04T05:04:18Z" level=error msg="Restore Failed: Backup retrieve checks failed with an error: \n
    Pod name: rdd94fb4a21-apiconnect-cc-1\n
    Error: \ngzip: stdin: not in gzip format\n
    tar: Child returned status 1\n
    tar: Error is not recoverable: exiting now\n
    retrieveCheck 0: RetrieveCheck of backup file 1583268283534609276-1-3.tar.gz FAILED\n
    **** [ Wed Mar  4 05:04:18 UTC 2020 ] retrieveCheck 0: Retrieve process END ****\n\n"
    time="2020-03-04T05:04:18Z" level=info msg="Updating status to Restore Failed: Backup retrieve checks failed"
    
  6. View the Cassandra pod status:
    rdd94fb4a21-apiconnect-cc-0                         0/1     Running      3          11h
    rdd94fb4a21-apiconnect-cc-1                         1/1     Running      2          11h
    rdd94fb4a21-apiconnect-cc-2                         1/1     Running      2          11h
    

    Explanation: In this scenario, the restore process is stuck waiting for the Cassandra cluster to become healthy, and Cassandra pod cc-0 is in a non-ready state. Due to a known limitation with error logging, the only way to accurately determine whether the restore is complete is to view ClusterRestoreStatus in the CassandraCluster Custom Resource. Because the ClusterRestoreStatus does not state exactly why the restore failed, you must look at the Cassandra operator pod logs.

    Table 2. Limitation with error reporting when the Cassandra backup tar file for cc-1 is corrupted

    Expected flow:
      1. Cassandra operator detects that the cc-1 backup is corrupt.
      2. Operator updates ClusterRestoreStatus as Restore Failed.
      3. Operator cleans up the restore process content and makes sure that the Cassandra cluster is healthy using the existing data.
      4. Operator passes an error message back to the restore init container (cassandra-restore) to indicate that the restore has failed.
      5. The apicup restore command exits with a proper error message that explains why the restore init container (cassandra-restore) failed. The restore pod status is Init:Error. The restore job is marked as Failed.

    Actual flow:
      1. Cassandra operator detects that the cc-1 backup is corrupt.
      2. Operator updates ClusterRestoreStatus as Restore Failed.
      3. Operator does not perform any cleanup and does not make sure that the Cassandra cluster is in a healthy state.
      4. Operator does not pass an error message back to the restore init container (cassandra-restore) to indicate that the restore has failed.
      5. The restore pod moves forward from the first init container but becomes stuck on the second init container.
  7. Workaround:
    1. Since the Cassandra cluster is in a degraded state and the Cassandra pod cc-0 is in a non-ready state (0/1), run the following command to bring the Cassandra cluster back up.
      kubectl exec -it <cassandra-pod-X> -n <namespace> -- sh -c 'rm -rf /var/db/.restore && rm -rf /var/db/restore/*'

      Note that you must run this command sequentially on all Cassandra pods, starting with cc-0. Make sure that the cc-x status is (1/1) before you run this command on the next pod (cc-x+1).

    2. Review the Cassandra operator logs to determine why the Cassandra restore failed. In this scenario, since the Cassandra restore failed due to a corrupt backup tar file, the solution is to choose a non-corrupted Cassandra backup.

    Note that this Cassandra restore issue is not specific to a corrupted backup tar file. Any restore failure on a Cassandra cc-(x+1) pod (where x is a numerical value starting with 0) can leave the preceding Cassandra pods in a non-ready state, and the same workaround applies.

Frequently asked questions

Where can I find the status of the restore process?
See: Cassandra restore logging
In what cases can I re-run the restore process on existing installations?
The answer depends on which stage of the restore failed. If the restore process failed because of a corrupted tar file, you can re-initiate the restore process using a different backup ID.
In what cases do I need to redeploy a whole new cluster before re-attempting a failed restore?
If the restore failed because of space allocation problems, or for any reason other than a corrupted backup tar file, a complete new installation is needed. You must fix the problem that is reported by the restore process.
What action should I take if I get the error message: Not enough space to download the tar file?
You must re-install the management subsystem, allocating sufficient disk space. See the system requirements section for recommended disk sizes.
What if the downloaded backup tar file is corrupted?
Use a different backup. Run backup tar integrity checks prior to attempting the restore.

If the backup tar integrity check succeeds on your local system but the restore process is still failing, gather the required logs and contact IBM Support.

What if there is not enough space to perform restore using the downloaded backup tar file?
You must re-install the management subsystem, allocating sufficient disk space. See Error: insufficient disk space.
What if the restore job finishes but I still don't see any restored data?
See Known limitation: failed restores not reporting properly.
What if the restore process remains stuck on the health check status for a very long time?
This might be due to a known limitation. See Example 2: Corrupted backup tar file cc-1 not reporting properly.
What if the restore process fails in the lur-upgrade-job or apim-upgrade-job containers?
Run the restore process again. If the error persists, contact IBM Support.