
MongoDB in Common Services stuck

Troubleshooting


Problem

Customer was unable to access the Cloud Pak for Security login page

Symptom

The customer got a spinning wheel on the login page.

Cause

MongoDB in Common Services was stuck; only the secondary pod came up:

oc get pods | grep -i mongo

icp-mongodb-0 2/2 Running

icp-mongodb-1 0/2 Init:1/2

icp-mongodb-2 0/2 Init:1/2

As a result, one of the auth pods was crashlooping.
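The crashlooping pod can be spotted with a simple filter; the exact pod names differ in each environment.

# Look for CrashLoopBackOff in the STATUS column.
oc get pods | grep -i auth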

Possible root cause

We noticed on the call that the OCP nodes were being shut down after working hours (to save costs) without first scaling down the CP4S application, which possibly caused the database corruption.
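This is not the official CP4S shutdown procedure (refer to the IBM documentation for that); purely as an illustration of a more graceful approach, worker nodes can be cordoned and drained before being powered off. The node name below is a placeholder.

# Hypothetical node name; repeat for every worker that will be powered off.
NODE=worker-0.example.com

# Prevent new pods from being scheduled on the node.
oc adm cordon "$NODE"

# Evict the running pods gracefully before shutting the node down.
oc adm drain "$NODE" --ignore-daemonsets --delete-emptydir-data

# After the node is powered back on, allow scheduling again.
oc adm uncordon "$NODE"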

Environment

The customer environment was in ROSA, but it could have happened in any OCP environment.

Diagnosing The Problem

Check the status of the working MongoDB pod:

oc exec icp-mongodb-0 -c icp-mongodb -- /bin/bash -c 'mongo --username $ADMIN_USER --password $ADMIN_PASSWORD --authenticationDatabase admin --ssl --sslCAFile /data/configdb/tls.crt --sslPEMKeyFile /work-dir/mongo.pem platform-db --eval "rs.status()"' | grep stateStr

"stateStr" : "SECONDARY",

"stateStr" : "(not reachable/healthy)",

"stateStr" : "(not reachable/healthy)",
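To see which replica set member each state belongs to, the same command can also grep for the member name; this is only a variation of the check above, not an extra required step.

oc exec icp-mongodb-0 -c icp-mongodb -- /bin/bash -c 'mongo --username $ADMIN_USER --password $ADMIN_PASSWORD --authenticationDatabase admin --ssl --sslCAFile /data/configdb/tls.crt --sslPEMKeyFile /work-dir/mongo.pem platform-db --eval "rs.status()"' | grep -E '"(name|stateStr)"'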

The bootstrap log on icp-mongodb-1 shows a fatal recovery error:

oc exec icp-mongodb-1 -c bootstrap -n ibm-common-services -- tail -100 work-dir/log.txt

.. Fatal Assertion 40313 at src/mongo/db/repl/replication_recovery.cpp 350

Because of this, it is better to destroy the corrupted replicas so they can copy from the primary, assuming we can trust icp-mongodb-0 to have the full data.

Alternatively, we could have simply removed the corrupted data through the failing pod's bootstrap container (see the alternative solution at the end of this document).

Resolving The Problem

First, scale down the StatefulSet:

oc scale --replicas=1 sts/icp-mongodb

Then delete the PVCs associated with icp-mongodb-1 and icp-mongodb-2:

oc get pvc | grep mongo

mongodbdir-icp-mongodb-0 Bound

mongodbdir-icp-mongodb-1 Bound

mongodbdir-icp-mongodb-2 Bound

oc delete pvc mongodbdir-icp-mongodb-1 mongodbdir-icp-mongodb-2
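If the delete command hangs, the claims are usually still held by a terminating pod through the PVC protection finalizer; an optional check under that assumption:

# See whether a mongo pod is still terminating and holding the claims.
oc get pods | grep -i mongo

# A PVC stuck in Terminating normally still carries the kubernetes.io/pvc-protection finalizer.
oc get pvc mongodbdir-icp-mongodb-1 -o jsonpath='{.metadata.finalizers}{"\n"}'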

If icp-mongodb-1 had been the pod that was still running (instead of icp-mongodb-0), we would also need to change the StatefulSet startup order to Parallel.
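Assuming the startup order mentioned above corresponds to the StatefulSet's podManagementPolicy (OrderedReady versus Parallel), the current setting can be checked as shown below. Note that this field is immutable on an existing StatefulSet, so changing it generally means recreating the StatefulSet (or letting the operator recreate it) rather than patching it in place.

# Show the current startup policy of the StatefulSet (OrderedReady or Parallel).
oc get sts icp-mongodb -o jsonpath='{.spec.podManagementPolicy}{"\n"}'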

Then scale the MongoDB operator back up. It had previously been scaled down to 0 with

CSV=$(oc get csv | grep mongodb | cut -d ' ' -f1)

oc patch csv $CSV --type='json' -p='[{"op": "replace","path": "/spec/install/spec/deployments/0/spec/replicas","value": 0}]'

Set it back to 1:

CSV=$(oc get csv | grep mongodb | cut -d ' ' -f1)

oc patch csv $CSV --type='json' -p='[{"op": "replace","path": "/spec/install/spec/deployments/0/spec/replicas","value": 1}]'
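To confirm the patch took effect before scaling the StatefulSet back up (an optional check), read the replica count back from the CSV; this reuses the $CSV variable set above.

# Should print 1 once the operator deployment has been scaled back up.
oc get csv $CSV -o jsonpath='{.spec.install.spec.deployments[0].spec.replicas}{"\n"}'

Then scale the StatefulSet back up to 3 replicas: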

oc scale --replicas=3 sts/icp-mongodb

It will take some time for the operator to recreate pods icp-mongodb-1 and icp-mongodb-2.

oc get pods | grep -i mongo

icp-mongodb-0 2/2 Running

icp-mongodb-1 0/2 Init:1/2

icp-mongodb-2 0/2 Init:1/2

oc get pods | grep -i mongo

icp-mongodb-0 2/2 Running

icp-mongodb-1 2/2 Running

icp-mongodb-2 2/2 Running
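Instead of polling with oc get pods as above, you can optionally block until the recreated pods report Ready; a minimal sketch using oc wait:

# Wait for both recreated pods to become Ready (or give up after 10 minutes).
oc wait --for=condition=Ready pod/icp-mongodb-1 pod/icp-mongodb-2 --timeout=10m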

oc exec icp-mongodb-0 -c icp-mongodb -- /bin/bash -c 'mongo --username $ADMIN_USER --password $ADMIN_PASSWORD --authenticationDatabase admin --ssl --sslCAFile /data/configdb/tls.crt --sslPEMKeyFile /work-dir/mongo.pem platform-db --eval "rs.status()"' | grep stateStr

"stateStr" : "PRIMARY",

"stateStr" : "SECONDARY",

"stateStr" : "SECONDARY",

That looks good. Now check the auth pods

oc get pods | grep -i auth

auth-idp-.. 4/4 Running

auth-pap-.. 2/2 Running

auth-pdp-.. 2/2 Running

After this, the customer was able to log in to the console again.

Alternative solution

If the bootstrap log on icp-mongodb-1 shows a fatal recovery error, you can recover the MongoDB cluster by simply removing the corrupted data through the failing pod's bootstrap container.

oc exec icp-mongodb-1 -c bootstrap -n ibm-common-services -- tail -100 work-dir/log.txt 
.. Fatal Assertion 40313 at src/mongo/db/repl/replication_recovery.cpp 350

Remove the corrupted data through the failing pod's bootstrap container and delete the pod:

oc exec icp-mongodb-1 -c bootstrap -- rm -rf data/db/* 
oc delete pod icp-mongodb-1

Wait about 5 minutes and ensure that icp-mongodb-1 reaches Running status.
Then do the same for the icp-mongodb-2 pod: remove the corrupted data through its bootstrap container and delete the pod.
oc exec icp-mongodb-2 -c bootstrap -- rm -rf data/db/*
oc delete pod icp-mongodb-2
Wait about 5 minutes and ensure that icp-mongodb-2 reaches Running status.

Finally, verify that there is one PRIMARY and two SECONDARY members in the MongoDB cluster:

oc exec icp-mongodb-0 -c icp-mongodb -- /bin/bash -c 'mongo --username $ADMIN_USER --password $ADMIN_PASSWORD --authenticationDatabase admin --ssl --sslCAFile /data/configdb/tls.crt --sslPEMKeyFile /work-dir/mongo.pem platform-db --eval "rs.status()"' | grep stateStr
"StateStr" : "PRIMARY",
"StateStr" : "SECONDARY",
"StateStr" : "SECONDARY",

Document Location

Worldwide

[{"Type":"MASTER","Line of Business":{"code":"LOB77","label":"Automation Platform"},"Business Unit":{"code":"BU048","label":"IBM Software"},"Product":{"code":"SSTDPP","label":"IBM Cloud Pak for Security"},"ARM Category":[{"code":"a8m3p000000F8yvAAC","label":"Cloud Pak for Security (CP4S)"},{"code":"a8m3p000000F8zAAAS","label":"Cloud Pak for Security (CP4S)-\u003EAuthentication"}],"ARM Case Number":"","Platform":[{"code":"PF025","label":"Platform Independent"}],"Version":"1.10.0"}]

Product Synonym

Cloud Pak for Security;CP4S;QRadar Suite

Document Information

Modified date:
28 February 2025

UID

ibm17184499