Troubleshooting MongoDB
Learn how to troubleshoot MongoDB when the MongoDB pods fail to run.
Symptom
MongoDB pod or pods fail to run.
Cause
When a MongoDB pod fails to run, the failure might be caused by either a storage problem or a data condition problem.
Resolving the problem
First check and fix the storage issues. If the problem persists, check and fix the data condition.
Checking and fixing the storage issues
Check the logs of the failing pod by running the oc describe command and looking at the Events section at the end of the output.
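For example, if the failing pod is icp-mongodb-0 (replace the pod name and namespace with the values from your cluster), you can view the events with the following command:
oc describe pod icp-mongodb-0 -n <your-foundational-services-namespace>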
Check whether an error that is similar to one of the following examples is displayed in the pod logs:
MountVolume.MountDevice failed for volume "pvc-0b160909-c8fd-4703-b952-4a3bd1a1454f" : rpc error: code = Aborted desc = an operation with the given Volume ID 0001-0009-rook-ceph-0000000000000003-99aa25d9-ad2c-11eb-9715-0a580a830621 already exists
Unable to attach or mount volumes: unmounted volumes=[mongodbdir], unattached volumes=[ibm-mongodb-operand-token-rqxpx ca mongodbdir config install keydir tmp-metrics configdir init tmp-mongodb]: timed out waiting for the condition
If you see either of these errors in the log, the storage provider is preventing the PV that is used by the MongoDB pod from being created or mounted.
Resolving the problem
The issue needs to be fixed on the storage provider's side.
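Before you engage the storage provider, you can confirm the state of the PersistentVolumeClaim that backs the failing MongoDB pod. The claim name in the following example is an assumption based on the mongodbdir volume and the pod name; use the claim name that is shown in your own pod description:
oc get pvc | grep mongodbdir
oc describe pvc mongodbdir-icp-mongodb-0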
What's next
If no storage issue is found, or if resolving the storage issues does not help, proceed to checking and fixing the data condition.
Checking and fixing the data condition
Check whether the data is corrupted or stale. Data corruption might be a result of an unexpected pod restart or a networking issue.
Check whether one of the following errors is displayed in the pod logs:
2021-03-15T19:03:34.709+0000 I STORAGE [initandlisten] exception in initAndListen: DBPathInUse: Unable to lock the lock file: /data/db/mongod.lock (Resource temporarily unavailable). Another mongod instance is already running on the /data/db directory, terminating
2021-03-15T19:03:34.709+0000 I NETWORK [initandlisten] shutdown: going to close listening sockets...
2021-03-15T19:03:34.709+0000 I NETWORK [initandlisten] removing socket file: /tmp/mongodb-27017.sock
2021-03-15T19:03:34.709+0000 I CONTROL [initandlisten] now exiting
2021-03-15T19:03:34.709+0000 I CONTROL [initandlisten] shutting down with code:100
2021-03-15T11:05:46.301+0000 E STORAGE [rsBackgroundSync] WiredTiger error (22) [1615806346:301782][37:0x7f682553f700], WT_CONNECTION.set_timestamp: __wt_txn_global_set_timestamp, 452: set_timestamp: oldest timestamp 604b5bd300000005 must not be later than commit timestamp 604b5bc600000003: Invalid argument Raw: [1615806346:301782][37:0x7f682553f700], WT_CONNECTION.set_timestamp: __wt_txn_global_set_timestamp, 452: set_timestamp: oldest timestamp 604b5bd300000005 must not be later than commit timestamp 604b5bc600000003: Invalid argument
2021-03-15T11:05:46.301+0000 F - [rsBackgroundSync] Invariant failure: conn->set_timestamp(conn, commitTSConfigString) resulted in status BadValue: 22: Invalid argument at src/mongo/db/storage/wiredtiger/wiredtiger_record_store.cpp 1885
2021-03-15T11:05:46.301+0000 F - [rsBackgroundSync]
***aborting after invariant() failure
2021-03-15T11:05:46.313+0000 F - [rsBackgroundSync] Got signal: 6 (Aborted).
0x564340f40211 0x564340f3f429 0x564340f3f90d 0x7f68446b9dd0 0x7f684431c70f 0x7f6844306b25 0x56433f50252c 0x56433f5f39e7 0x56433fc9c38d 0x56433f7fff55 0x56433f805021 0x56433f8a09a0 0x56433f86ca50 0x56433f86e58e 0x56433f8719d7 0x56433f871fee 0x56433f8721ca 0x56434104f610 0x7f68446af2de 0x7f68443e0e83
----- BEGIN BACKTRACE -----
If you see any of these errors in the log, the data is either corrupted or stale. If you have 3 or more replicas in the ReplicaSet, this problem can be easily fixed. A pod can become stale if it does not connect to the MongoDB ReplicaSet for over 24 hours. If a MongoDB pod fails because it is stale, you should see the word stale in the work-dir/log.txt file in the bootstrap container.
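For example, assuming the failing pod is icp-mongodb-1 and the log file inside the bootstrap container is located at /work-dir/log.txt, you can search for the word stale with a command similar to the following:
oc exec -it icp-mongodb-1 -c bootstrap -- grep -i stale /work-dir/log.txt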
Resolving the problem
If only one MongoDB pod is failing, complete the following steps to resolve the problem.
1. Check whether the Primary MongoDB pod is running.
   a. Run the following command on a running MongoDB pod:
      oc exec -it icp-mongodb-# -c icp-mongodb -- bash
   b. Run the following command in the container to connect to the MongoDB shell:
      mongo --host rs0/mongodb:27017 --username $ADMIN_USER --password $ADMIN_PASSWORD --authenticationDatabase admin --ssl --sslCAFile /data/configdb/tls.crt --sslPEMKeyFile /work-dir/mongo.pem
   c. When you get a prompt from the MongoDB shell, run the following command:
      rs.status()
      In the output, check which pod is the current Primary and how many replicas there are in the ReplicaSet. For a shorter way to identify the Primary, see the example command after these steps.
   d. Exit the MongoDB shell and the container.
   e. If the Primary MongoDB pod is running, go to step 2.
   f. (Optional) If the Primary MongoDB pod fails to run, you need to recover all MongoDB pods. For more information, see Restoring all MongoDB pods when the Primary pod fails.
2. (Optional) If the failing pod is not currently running, restart the pod by deleting it.
3. Run the following command to get into the failing pod's bootstrap container:
   oc exec -it icp-mongodb-# -c bootstrap -- bash
   Note: Run the next command immediately as the bootstrap container might not be running for long.
4. Use the following command to remove the data in the database:
   rm -rf data/db/*
5. Exit and restart the pod. If a single MongoDB pod was failing, the problem should be resolved.
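As an alternative to reading the full rs.status() output, the following command prints only the name of the current Primary. This is a minimal sketch that reuses the connection options from the interactive command and must be run from inside the icp-mongodb container; adjust the options if your deployment differs:
mongo --host rs0/mongodb:27017 --username $ADMIN_USER --password $ADMIN_PASSWORD --authenticationDatabase admin --ssl --sslCAFile /data/configdb/tls.crt --sslPEMKeyFile /work-dir/mongo.pem --quiet --eval 'rs.status().members.filter(function (m) { return m.stateStr === "PRIMARY"; }).forEach(function (m) { print(m.name); })'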
Restoring all MongoDB pods when the Primary pod fails
If the Primary MongoDB pod fails to run, you need to recover all MongoDB pods. Complete the following steps.
1. Run the following command to scale the operator down to 0:
   CSV=$(oc get csv | grep mongodb | cut -d ' ' -f1)
   oc patch csv $CSV --type='json' -p='[{"op": "replace","path": "/spec/install/spec/deployments/0/spec/replicas","value": 0}]'
2. Get the statefulset YAML with the following command:
   oc get sts icp-mongodb -o yaml > sts.yaml
3. Run the following command to delete the statefulset:
   oc delete sts icp-mongodb
4. Open sts.yaml and remove the following lines:
   - The creation time stamp:
     creationTimestamp: "2021-05-05T16:54:19Z"
     generation: 1
   - The managed fields section with over 400 lines:
     managedFields:
     - apiVersion: apps/v1
       fieldsType: FieldsV1
       460 more lines...
       manager: manager
       operation: Update
       time: "2021-05-05T16:54:19Z"
     Note: By removing the entire section, you also remove the content that is hidden under 460 more lines....
   - The owner references:
     ownerReferences:
     - apiVersion: operator.ibm.com/v1alpha1
       blockOwnerDeletion: true
       controller: true
       kind: MongoDB
       name: ibm-mongodb
       uid: a314e26d-6b24-4a27-9d3c-d69489c23df6
     resourceVersion: "114138"
     selfLink: /apis/apps/v1/namespaces/<your-foundational-services-namespace>/statefulsets/icp-mongodb
     uid: 1195a29a-c5e8-4461-8719-14d9f9979f44
   - The status at the end of the file:
     status:
       phase: Pending
     status:
       collisionCount: 0
       currentReplicas: 1
       currentRevision: icp-mongodb-567f599f6c
       observedGeneration: 1
       replicas: 1
       updateRevision: icp-mongodb-567f599f6c
       updatedReplicas: 1
5. In sts.yaml, change the podManagementPolicy from OrderedReady to Parallel:
   podManagementPolicy: Parallel
6. Run the following command to apply the statefulset:
   oc apply -f sts.yaml
7. At this point, all three MongoDB pods should be running. However, one of the icp-mongodb pods can still be down. If one of the pods is down, but the other two MongoDB pods are recovered, resolve the data condition issue to recover the remaining pod. To check the current state of the pods, see the example command after these steps.
8. Once you recover the remaining pod, restore the Primary MongoDB pod and go back to using the operator. Complete the steps in Restoring the Primary MongoDB pod.
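To check the current state of the MongoDB pods after you apply the statefulset, you can filter the pod list by the statefulset name, for example:
oc get pods | grep icp-mongodb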
Restoring the Primary MongoDB pod
1. Check which pod is currently the Primary:
   a. Run the following command on a running MongoDB pod:
      oc exec -it icp-mongodb-# -c icp-mongodb -- bash
   b. Run the following command in the container to connect to the MongoDB shell:
      mongo --host rs0/mongodb:27017 --username $ADMIN_USER --password $ADMIN_PASSWORD --authenticationDatabase admin --ssl --sslCAFile /data/configdb/tls.crt --sslPEMKeyFile /work-dir/mongo.pem
   c. When you get a prompt from the MongoDB shell, run the following command:
      rs.status()
      In the output, check which pod is the current Primary.
2. If the Primary is not on icp-mongodb-0, delete the pod that is currently the Primary. When the prompt returns, count to 5 and delete the same pod again. Repeat 3 times. By repeating the steps, the pod stays offline long enough for the other two pods to run an election and move the Primary. For a sketch of this sequence, see the example after these steps.
3. Check whether icp-mongodb-0 became the Primary by repeating step 1. If not, repeat step 2.
4. Once the Primary is back on icp-mongodb-0, run the following command to scale the statefulset down to 0:
   oc scale --replicas=0 sts/icp-mongodb
5. Delete the statefulset with the following command:
   oc delete sts icp-mongodb
6. To restore the operator, run the following commands:
   CSV=$(oc get csv | grep mongodb | cut -d ' ' -f1)
   oc patch csv $CSV --type='json' -p='[{"op": "replace","path": "/spec/install/spec/deployments/0/spec/replicas","value": 1}]'
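The following is a minimal sketch of the delete-and-wait sequence from step 2. It assumes that the current Primary is icp-mongodb-2; replace the pod name with the Primary that you identified in step 1:
for i in 1 2 3; do
  oc delete pod icp-mongodb-2
  sleep 5
done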