Troubleshooting
Problem
Customer was unable to access the Cloud Pak for Security login page
Symptom
The customer got a spinning wheel on the login page.
Cause
MongoDB in Common Services was stuck; only the secondary pod came up:
oc get pods | grep -i mongo
icp-mongodb-0 2/2 Running
icp-mongodb-1 0/2 Init:1/2
icp-mongodb-2 0/2 Init:1/2
As a result, one of the auth pods was crashlooping.
Possible root cause
We noticed on the call that the OCP nodes were being shut down (to save money after working hours) without first scaling down the CP4S application, possibly causing the database corruption.
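A minimal pre-shutdown sketch, using the resource names seen in this environment (the scale_down_before_shutdown helper and the DRY_RUN switch are illustrative, not part of CP4S):

```shell
#!/bin/sh
# Sketch: scale the MongoDB stateful set to 0 before the OCP nodes are
# stopped, so mongod can shut down cleanly instead of being killed mid-write.
# With DRY_RUN=1 the command is printed instead of executed.
scale_down_before_shutdown() {
  run="${DRY_RUN:+echo}"
  $run oc scale sts/icp-mongodb --replicas=0 -n ibm-common-services
}
```

On power-up, the same pattern can scale the stateful set back to 3 replicas before the application pods are started.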
Environment
Diagnosing The Problem
oc exec icp-mongodb-0 -c icp-mongodb -- /bin/bash -c 'mongo --username $ADMIN_USER --password $ADMIN_PASSWORD --authenticationDatabase admin --ssl --sslCAFile /data/configdb/tls.crt --sslPEMKeyFile /work-dir/mongo.pem platform-db --eval "rs.status()"' | grep stateStr
"stateStr" : "SECONDARY",
"stateStr" : "(not reachable/healthy)",
"stateStr" : "(not reachable/healthy)",
The bootstrap log on icp-mongodb-1 is showing a fatal recovery error:
oc exec icp-mongodb-1 -c bootstrap -n ibm-common-services -- tail -100 work-dir/log.txt
.. Fatal Assertion 40313 at src/mongo/db/repl/replication_recovery.cpp 350
Since recovery fails, it is better to destroy the corrupted members so their data can be copied back from the primary, assuming icp-mongodb-0 can be trusted to hold the full data set.
Alternatively, we could have simply removed the corrupted data through the failing pod's bootstrap container.
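The assertion check can be scripted; a sketch, where has_fatal_assertion is an illustrative helper that reads the bootstrap log on stdin:

```shell
#!/bin/sh
# Succeed when the bootstrap log on stdin contains the replication-recovery
# fatal assertion seen in this case.
has_fatal_assertion() {
  grep -q 'Fatal Assertion 40313'
}
```

For example: oc exec icp-mongodb-1 -c bootstrap -n ibm-common-services -- tail -100 work-dir/log.txt | has_fatal_assertion && echo "recovery failed, member must be rebuilt"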
Resolving The Problem
First scale down the stateful set
oc scale --replicas=1 sts/icp-mongodb
then delete the PVCs associated with mongodb-1 and mongodb-2:
oc get pvc | grep mongo
mongodbdir-icp-mongodb-0 Bound
mongodbdir-icp-mongodb-1 Bound
mongodbdir-icp-mongodb-2 Bound
oc delete pvc mongodbdir-icp-mongodb-1 mongodbdir-icp-mongodb-2
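The PVC names can also be derived from the pod listing instead of being typed by hand; a sketch, assuming the mongodbdir-&lt;pod&gt; naming shown above (stale_pvcs is an illustrative helper):

```shell
#!/bin/sh
# From an `oc get pods` listing on stdin, print the PVC name for every
# icp-mongodb pod whose READY column is not n/n (i.e. not fully up).
stale_pvcs() {
  awk '$1 ~ /^icp-mongodb-/ { split($2, r, "/"); if (r[1] != r[2]) print "mongodbdir-" $1 }'
}
```

For example: oc get pods | grep -i mongo | stale_pvcs — review the printed names before piping them into oc delete pvc.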
If icp-mongodb-1 had been the surviving member instead, we would also need to change the stateful set's pod management policy to Parallel, since the default OrderedReady policy requires pod 0 to start before pod 1.
Then scale the MongoDB operator back up; it had previously been scaled down to 0 with:
CSV=$(oc get csv | grep mongodb | cut -d ' ' -f1)
oc patch csv $CSV --type='json' -p='[{"op": "replace","path": "/spec/install/spec/deployments/0/spec/replicas","value": 0}]'
Set it back to 1:
CSV=$(oc get csv | grep mongodb | cut -d ' ' -f1)
oc patch csv $CSV --type='json' -p='[{"op": "replace","path": "/spec/install/spec/deployments/0/spec/replicas","value": 1}]'
oc scale --replicas=3 sts/icp-mongodb
It will take some time for the operator to recreate pods 1 and 2:
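The wait can be scripted rather than watched by eye; a sketch, where all_mongo_ready is an illustrative helper that checks a pod listing on stdin:

```shell
#!/bin/sh
# Succeed only when every pod line on stdin is Running with all of its
# containers ready (READY column like 2/2).
all_mongo_ready() {
  awk '{ split($2, r, "/"); if (r[1] != r[2] || $3 != "Running") bad = 1 } END { exit bad }'
}
```

For example: until oc get pods | grep -i mongo | all_mongo_ready; do sleep 30; done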
oc get pods | grep -i mongo
icp-mongodb-0 2/2 Running
icp-mongodb-1 0/2 Init:1/2
icp-mongodb-2 0/2 Init:1/2
oc get pods | grep -i mongo
icp-mongodb-0 2/2 Running
icp-mongodb-1 2/2 Running
icp-mongodb-2 2/2 Running
oc exec icp-mongodb-0 -c icp-mongodb -- /bin/bash -c 'mongo --username $ADMIN_USER --password $ADMIN_PASSWORD --authenticationDatabase admin --ssl --sslCAFile /data/configdb/tls.crt --sslPEMKeyFile /work-dir/mongo.pem platform-db --eval "rs.status()"' | grep stateStr
"stateStr" : "PRIMARY",
"stateStr" : "SECONDARY",
"stateStr" : "SECONDARY",
That looks good. Now check the auth pods
oc get pods | grep -i auth
auth-idp-.. 4/4 Running
auth-pap-.. 2/2 Running
auth-pdp-.. 2/2 Running
After this, the customer was able to login to the console again.
Alternative solution
oc exec icp-mongodb-1 -c bootstrap -n ibm-common-services -- tail -100 work-dir/log.txt
.. Fatal Assertion 40313 at src/mongo/db/repl/replication_recovery.cpp 350
You can simply remove the corrupted data through the failing pod's bootstrap container:
oc exec icp-mongodb-1 -c bootstrap -- rm -rf data/db/*
oc delete pod icp-mongodb-1
Wait about 5 minutes and ensure that icp-mongodb-1 reaches Running status. Then repeat the same steps for icp-mongodb-2: remove the corrupted data through its bootstrap container and delete the pod.
Finally, verify that there is one PRIMARY and two SECONDARY members in the MongoDB cluster:
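That final check can also be reduced to an exit code; a sketch, where replset_healthy is an illustrative helper that reads the grep stateStr output on stdin:

```shell
#!/bin/sh
# Succeed only when the stateStr lines on stdin show exactly one PRIMARY
# and two SECONDARY members.
replset_healthy() {
  states=$(cat)
  p=$(printf '%s\n' "$states" | grep -c '"PRIMARY"')
  s=$(printf '%s\n' "$states" | grep -c '"SECONDARY"')
  [ "$p" -eq 1 ] && [ "$s" -eq 2 ]
}
```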
Document Location
Worldwide
Product Synonym
Cloud Pak for Security;CP4S;QRadar Suite
Document Information
Modified date:
28 February 2025
UID
ibm17184499