Troubleshooting restoration of management database
You can troubleshoot problems with restoring the management database.
Review the information on this page to understand the steps you can take to troubleshoot a failed restore. Be sure to first read the Overview of restore process for management database to understand the logging and error reporting.
Note that for Version 2018.4.1.10 there is a known limitation with error reporting. See Known limitation: failed restores not reporting properly.
- Overview of restore process for management database
- Error: invalid backup host credentials
- Error: insufficient disk space
- Error: failure on preliminary backup checks on cc-1 and cc-2
- Known limitation: failed restores not reporting properly
- Frequently asked questions
If you cannot resolve a failed restore, contact IBM Support for assistance.
Overview of restore process for management database
The API Connect restore process is started by the command apicup subsys exec <management_subsys> restore <backupID>. The apicup installer starts a Kubernetes job named restore-<id>, which in turn starts a pod named restore-<job-id>-<pod-id>.
- API Connect restore process
-
The management restoration pod consists of:
- cassandra-restore - The main responsibility of this init container is to perform the Cassandra database restore. If this container encounters an error, the restoration job is marked Failed with a pod status of Init:Error (0/2). Note: For Version 2018.4.1.10 and earlier, this container is called job-container.
- cassandra-health-check - An init container that verifies that the Cassandra cluster is in a healthy state.
- lur-upgrade-job - This container checks for schema mismatches between the restored data and the currently running installation. If necessary, the container performs a schema upgrade.
- apim-upgrade-job - This container checks for schema mismatches between the restored data and the currently running installation. If necessary, the container performs a schema upgrade. The container also resyncs all the gateway services.
The init containers start sequentially; the second init container starts only after the first init container has succeeded. The main containers inside the pod start only after all the init containers have completed. For more information, see the Kubernetes documentation: Understanding pod status.
- Cassandra restore process
- The Cassandra restore process is performed by the init container cassandra-restore. Backups are
required from each Cassandra pod. The container process flow is:
- Perform preliminary retrieval checks.
- Download the backup tar file onto the Cassandra container.
- Verify the backup tar file integrity.
- Stop the Cassandra process.
- Perform the Cassandra restore.
- Start the Cassandra process with restored data.
If any of the preliminary checks fail, error messages are returned. The error conditions must be fixed, and then the restore process can be run again.
Note that API Connect Version 2018.4.1.10 (and later) performs extensive preliminary checks prior to beginning a restore. Running the checks extends the time required to complete a restore. In particular, restoration of large backup files (larger than 5 GB) takes longer.
- Logging
-
Both init container and container logs are available during the restore process. To obtain logs from a restore pod:
kubectl logs <restore-pod> -n <namespace> -c <init container/container name>
The apicup command produces logs from all init and main containers sequentially as soon as they finish, either successfully or with errors.
- Cassandra restore logging
-
The logs for the init container are updated only upon completion (success or failure). To get accurate status information for the current state of the Cassandra restore process, review the ClusterRestoreStatus field in the CassandraClusters Custom Resource:
kubectl get cc -n <namespace> -o yaml | grep -A 2 ClusterRestoreStatus
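As a convenience, the per-container log retrieval described above can be wrapped in a small shell helper. This is a sketch, not part of apicup; the function name is illustrative, the container names follow the restore-process description earlier on this page (they may differ by version, e.g. job-container in 2018.4.1.10 and earlier), and kubectl must be configured for the target cluster.

```shell
# Sketch: dump the logs of each init container and container of a restore pod.
# Container names are taken from the restore-process description above and may
# vary by version; kubectl must be on PATH and pointed at the right cluster.
collect_restore_logs() {
  pod="$1"   # restore pod name, e.g. restore-hhrsc-2slbm
  ns="$2"    # namespace of the management subsystem
  for c in cassandra-restore cassandra-health-check lur-upgrade-job apim-upgrade-job; do
    echo "==== logs: $c ===="
    kubectl logs "$pod" -n "$ns" -c "$c"
  done
}
```

For example: collect_restore_logs restore-hhrsc-2slbm <namespace>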
Error: invalid backup host credentials
Example output when this problem occurs:
./apicup subsys exec mgmt restore 1583268283534609276
Cluster Namespace Backup Name Backup Retrieval Timeout(hrs) Status
rdd94fb4a21-apiconnect-cc niharns1 1583268283534609276 24 Started
Restore failed
Error: rpc error: code = Unknown desc =
Pod name: rdd94fb4a21-apiconnect-cc-0
Error:
ssh: Could not resolve hostname 9.30.251.186.xxx: Name or service not known
Couldn't read packet: Connection reset by peer
**** [ Wed Mar 4 04:04:25 UTC 2020 ]prelimRetrieve 0: Preliminary Retrieve checks FAILED ****
[ Wed Mar 4 04:04:25 UTC 2020 ] Preliminary retrieve checks failed on all 1 attempts. ABORTING restore
Error: unable to get log stream for container cassandra-health-status, pod restore-hhrsc-2slbm, job restore-hhrsc: container "cassandra-health-status" in pod "restore-hhrsc-2slbm" is waiting to start: PodInitializing
ClusterRestoreStatus in the CassandraClusters CR:
kubectl get cc -o yaml | grep -A 1 ClusterRestoreStatus
ClusterRestoreStatus: 'Restore Failed: Retrieve preliminary checks failed for rdd94fb4a21-apiconnect-cc-0'
kubectl get pods | grep restore
restore-hhrsc-2slbm 0/2 Init:Error 0 26m
kubectl get jobs | grep restore
restore-hhrsc 0/1 27m 27m
Error: insufficient disk space
Issue: The cc-0 preliminary checks can reject the restore process because of insufficient disk space.
Workaround: On a fresh install, increase the disk space allocated to all Cassandra nodes to at least 4 times the size of the Cassandra backup tar file, and re-attempt the restore. See Freespace check.
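The 4x sizing rule above can be checked locally before re-attempting the restore. The following is a sketch, not part of apicup; the function name is illustrative, and it assumes the backup tar file is available locally and that the given directory is on the Cassandra data volume.

```shell
# Sketch: verify that a volume has at least 4x the backup tar file size,
# matching the freespace guidance above. Illustrative helper, not apicup.
check_restore_freespace() {
  backup_tar="$1"   # path to the Cassandra backup tar file
  data_dir="$2"     # a directory on the Cassandra data volume
  backup_kb=$(( ($(wc -c < "$backup_tar") + 1023) / 1024 ))
  required_kb=$(( backup_kb * 4 ))
  # 'df -Pk' prints available space in 1 KB blocks (POSIX-portable output)
  avail_kb=$(df -Pk "$data_dir" | awk 'NR==2 {print $4}')
  if [ "$avail_kb" -ge "$required_kb" ]; then
    echo "OK: ${avail_kb} KB free, ${required_kb} KB required"
  else
    echo "INSUFFICIENT: ${avail_kb} KB free, ${required_kb} KB required"
    return 1
  fi
}
```

For example: check_restore_freespace 1583268283534609276-0-3.tar.gz /var/db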
Error: failure on preliminary backup checks on cc-1 and cc-2
Issue: Failure on the preliminary backup checks on cc-1 and cc-2 can leave the system in a non-ready state.
- Always look at ClusterRestoreStatus in the CassandraClusters Custom Resource for failed or completed Cassandra restore status, regardless of the restore job status. In this case, the restore job is stuck on the health status check, and Cassandra pod cc-0 will be in a non-ready state.
- Execute the following command on each Cassandra pod sequentially, starting with cc-0. Wait for the Cassandra pod to become ready (1/1) before executing the command on the other Cassandra pods. If the Cassandra pod is already in the ready state (1/1), you do not need to wait; just run the command.
kubectl exec -it <cassandra-pod-X> -n <namespace> -- sh -c 'rm -rf /var/db/.restore && rm -rf /var/db/restore/*'
In this command, the X in <cassandra-pod-X> stands for the ordinal of the Cassandra pod, such as 0, 1, or 2.
- Review the Cassandra operator logs to figure out why restore has failed and fix the problem.
To see an example of this type of failure, review Example 2: Corrupted backup tar file cc-1 not reporting properly.
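The sequential cleanup described above can be sketched as a small shell loop. This is a sketch under stated assumptions: the function name, pod-name prefix, and replica count are illustrative, kubectl must be configured for the cluster, and 'kubectl wait' is used in place of manually watching for the 1/1 ready state.

```shell
# Sketch: run the restore-state cleanup on each Cassandra pod in ordinal
# order, waiting for readiness between pods, as the steps above describe.
# Illustrative helper, not part of apicup.
cleanup_cassandra_restore_state() {
  ns="$1"          # namespace
  prefix="$2"      # Cassandra pod name prefix, e.g. rdd94fb4a21-apiconnect-cc
  replicas="$3"    # number of Cassandra pods, e.g. 3
  i=0
  while [ "$i" -lt "$replicas" ]; do
    pod="${prefix}-${i}"
    kubectl exec "$pod" -n "$ns" -- sh -c 'rm -rf /var/db/.restore && rm -rf /var/db/restore/*'
    # Wait until this pod reports Ready before moving to the next ordinal.
    kubectl wait --for=condition=Ready "pod/${pod}" -n "$ns" --timeout=600s
    i=$((i + 1))
  done
}
```

For example: cleanup_cassandra_restore_state <namespace> rdd94fb4a21-apiconnect-cc 3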
Known limitation: failed restores not reporting properly
Issue: Failure on the preliminary backup checks on cc-0, such as a corrupt tar file or an incomplete download of a backup tar file, can cause the restore job to complete even though the underlying ClusterRestoreStatus is marked as Restore Failed: <Reason>.
Workaround: Always check ClusterRestoreStatus in the CassandraClusters custom resource for failed or completed Cassandra restore status, regardless of the restore job status. Consult the Cassandra operator logs to determine why the restore failed, and fix the problem.
- Example 1: Corrupted backup tar file cc-0 not reporting properly
-
In this example, restoration was started with apicup:
./apicup subsys exec mgmt restore 1583268283534609276
- View the initial status of restore job and restore pod:
kubectl get jobs | grep restore
restore-xs85r   0/1   38s   38s

kubectl get pods | grep restore
restore-xs85r-st554   0/2   Init:1/2   0   92s
- Note that the restore completes, according to the job and pod
status:
kubectl get pods | grep restore
restore-xs85r-st554   0/2   Completed   0   3m43s

kubectl get jobs | grep restore
restore-xs85r   1/1   2m51s   4m1s
- Observe, however, that the apicup restore command output includes the following failure status:
./apicup subsys exec mgmt restore 1583268283534609276
Cluster                     Namespace   Backup Name           Backup Retrieval Timeout(hrs)   Status
rdd94fb4a21-apiconnect-cc   niharns1    1583268283534609276   24                              Started

Cluster                     Namespace   Backup Name           Status
rdd94fb4a21-apiconnect-cc   niharns1    1583268283534609276   Restore Failed: Backup retrieve checks failed
- Check the value of ClusterRestoreStatus in the Custom Resource. Note that ClusterRestoreStatus is marked as Restore Failed.
kubectl get cc -o yaml | grep -A 1 ClusterRestoreStatus
ClusterRestoreStatus: 'Restore Failed: Backup retrieve checks failed'
- Check the Cassandra operator pod logs and see the errors:
time="2020-03-04T04:30:33Z" level=error msg="Restore Failed: Backup retrieve checks failed with an error: \n Pod name: rdd94fb4a21-apiconnect-cc-0\n Error: \n gzip: stdin: not in gzip format\n tar: Child returned status 1\n tar: Error is not recoverable: exiting now\n retrieveCheck 0: RetrieveCheck of backup file 1583268283534609276-0-3.tar.gz FAILED\n **** [ Wed Mar 4 04:30:32 UTC 2020 ] retrieveCheck 0: Retrieve process END ****\n\n" time="2020-03-04T04:30:33Z" level=info msg="Updating status to Restore Failed: Backup retrieve checks failed"
Explanation: Even though the restore job says it completed, the Cassandra restore never completed, due to the corrupt backup tar file. The restore job believes it reached completion, but in this case the error logging is incorrect. Therefore, you must check ClusterRestoreStatus in the CassandraClusters Custom Resource to see whether the restore really completed. Because the restore status does not state exactly why the restore failed, you must also look at the Cassandra operator pod logs.

Table 1. Limitation with error reporting when the Cassandra tar file for cc-0 is corrupted

Expected flow | Actual flow
1. Cassandra operator detects that the cc-0 backup is corrupt. | 1. Cassandra operator detects that the cc-0 backup is corrupt.
2. Operator updates ClusterRestoreStatus as Restore Failed. | 2. Operator updates ClusterRestoreStatus as Restore Failed.
3. Operator passes an error message back to the restore init container (cassandra-restore). | 3. Operator does not pass an error message back to the restore init container (cassandra-restore).
4. The apicup restore command line should exit with an error message on why the restore init container (cassandra-restore) failed. The restore pod status should be Init:Error. | 4. Restore pod moves forward and executes the other init container and the upgrade containers.
5. Restore job is marked as Failed. | 5. Restore pod is marked as Completed.

- Workaround: The Cassandra cluster remains up and running because the restore process failed internally, so you must review the Cassandra operator logs to determine why exactly the Cassandra restore failed. In this case, the Cassandra restore failed due to a corrupt backup tar file. Locate and use a non-corrupted Cassandra backup.
- Example 2: Corrupted backup tar file cc-1 not reporting properly
-
In this example, restoration was started with apicup:
./apicup subsys exec mgmt restore 1583268283534609276
- View the initial status of restore job and restore pod
kubectl get jobs | grep restore
restore-v7nvt   0/1   81s   81s

kubectl get pods | grep restore
restore-v7nvt-2jdzq   0/2   Init:0/2   0   101s
- Continue to view the pods, and observe that the restore pod is stuck. The restore pod's first init container (cassandra-restore), which performs the Cassandra restore, is marked as complete. The second init container is stuck waiting for the Cassandra cluster to become healthy.
kubectl get pods | grep restore
restore-v7nvt-2jdzq   0/2   Init:1/2   0   4m8s
The status Init:1/2 means that the first init container completed, but the process is waiting on the second init container to finish.
- Note that the output from the apicup restore command gives a status of Restore Failed:
./apicup subsys exec mgmt restore 1583268283534609276
Cluster                     Namespace   Backup Name           Backup Retrieval Timeout(hrs)   Status
rdd94fb4a21-apiconnect-cc   niharns1    1583268283534609276   24                              Started

Cluster                     Namespace   Backup Name           Status
rdd94fb4a21-apiconnect-cc   niharns1    1583268283534609276   Restore Failed: Backup retrieve checks failed
- View the ClusterRestoreStatus in the CassandraClusters Custom Resource:
kubectl get cc -o yaml | grep -A 1 ClusterRestoreStatus
ClusterRestoreStatus: 'Restore Failed: Backup retrieve checks failed'
- View the Cassandra operator pod logs:
time="2020-03-04T05:04:18Z" level=error msg="Restore Failed: Backup retrieve checks failed with an error: \n Pod name: rdd94fb4a21-apiconnect-cc-1\n Error: \ngzip: stdin: not in gzip format\n tar: Child returned status 1\n tar: Error is not recoverable: exiting now\n retrieveCheck 0: RetrieveCheck of backup file 1583268283534609276-1-3.tar.gz FAILED\n **** [ Wed Mar 4 05:04:18 UTC 2020 ] retrieveCheck 0: Retrieve process END ****\n\n" time="2020-03-04T05:04:18Z" level=info msg="Updating status to Restore Failed: Backup retrieve checks failed"
- View the Cassandra pod status:
rdd94fb4a21-apiconnect-cc-0   0/1   Running   3   11h
rdd94fb4a21-apiconnect-cc-1   1/1   Running   2   11h
rdd94fb4a21-apiconnect-cc-2   1/1   Running   2   11h
Explanation: In this scenario, the restore process is stuck waiting for the Cassandra cluster to become healthy. Cassandra pod cc-0 is in a non-ready state. Due to a known limitation with error logging, the only way to accurately determine whether the restore is complete is to view ClusterRestoreStatus in the CassandraClusters Custom Resource. Because the restore status does not state exactly why the restore failed, you must look at the Cassandra operator pod logs.

Table 2. Limitation with error reporting when the Cassandra backup tar file for cc-1 is corrupted

Expected flow | Actual flow
1. Cassandra operator detects that the cc-1 backup is corrupt. | 1. Cassandra operator detects that the cc-1 backup is corrupt.
2. Operator updates ClusterRestoreStatus as Restore Failed. | 2. Operator updates ClusterRestoreStatus as Restore Failed.
3. Operator cleans up the restore process content and makes sure that the Cassandra cluster is healthy using the existing data. | 3. Operator does not perform any cleanup and does not make sure that the Cassandra cluster is in a healthy state.
4. Operator passes an error message back to the restore init container (cassandra-restore) to indicate that the restore has failed. | 4. Operator does not pass an error message back to the restore init container (cassandra-restore) to indicate that the restore has failed.
5. The apicup restore command line should exit with a proper error message on why the restore init container (cassandra-restore) failed. The restore pod status should be Init:Error. The restore job should be marked as Failed. | 5. Restore pod moves forward from the init container but becomes stuck on the second init container.

- Workaround:
- Since the Cassandra cluster is in a degraded state, and the Cassandra pod cc-0 is in a non-ready state (0/1), run the following command to bring the Cassandra cluster up and running:
kubectl exec -it <cassandra-pod-X> -n <namespace> -- sh -c 'rm -rf /var/db/.restore && rm -rf /var/db/restore/*'
Note that you must run this command sequentially on all Cassandra pods, starting with cc-0. Make sure that the cc-x status is 1/1 before running this command on the next pod (cc-x+1).
- Review the Cassandra operator logs to determine why the Cassandra restore failed. In this scenario, because the Cassandra restore failed due to a corrupt backup tar file, the solution is to choose a non-corrupted Cassandra backup.
Note that this Cassandra restore issue applies not only to a corrupted backup tar file, but also to any restore issue that occurs on the cc-(x+1) pod (where x is a numerical value starting with 0) and can leave previous Cassandra pods in a non-ready state.
Frequently asked questions
- Where can I find the status of the restore process?
- See: Cassandra restore logging
- In what cases can I re-run the restore process on existing installations?
- The answer depends on which stage of the restore failed. If the restore process failed because of a corrupted tar file, you can re-initiate the restore process using a different backup ID.
- In what cases do I need to redeploy a whole new cluster before re-attempting a failed restore?
- If the restore failed due to space allocation problems, or any other reason except a corrupted backup tar file, a complete new installation is needed. You must fix the problem reported by the restore process.
- What action should I take if I get the error message: Not enough space to download the tar file?
- You must re-install the management subsystem, allocating sufficient disk space. See the system requirements section for recommended disk sizes.
- What if the downloaded backup tar file is corrupted?
- Use a different backup. Run backup tar integrity checks prior to attempting the restore.
If the backup tar integrity check succeeds on your local system but the restore process still fails, gather the required logs and contact IBM Support.
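The integrity check mentioned above can be sketched as a small shell helper. The function name is illustrative; 'tar -tzf' lists the archive contents and fails on a truncated or non-gzip file, which is the same failure mode the restore's retrieve checks report (for example, "gzip: stdin: not in gzip format").

```shell
# Sketch: local integrity check for a backup tar file before attempting a
# restore. Illustrative helper; 'tar -tzf' fails on truncated or non-gzip files.
verify_backup_tar() {
  if tar -tzf "$1" > /dev/null 2>&1; then
    echo "OK: $1 is a readable gzip tar archive"
  else
    echo "CORRUPT: $1 failed the tar/gzip integrity check"
    return 1
  fi
}
```

For example: verify_backup_tar 1583268283534609276-0-3.tar.gz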
- What if there is not enough space to perform restore using the downloaded backup tar file?
- You must re-install the management subsystem, allocating sufficient disk space. See Error: insufficient disk space.
- What if the restore job finishes but I still don't see any restored data?
- See Known limitation: failed restores not reporting properly.
- What if the restore process remains stuck on the health check status for a very long time?
- This might be due to a known limitation. See Example 2: Corrupted backup tar file cc-1 not reporting properly.
- What if the restore process fails on the lur-upgrade-job and apim-upgrade-job containers?
- Run the restore process again. If the error persists, contact IBM Support.