Guardium Data Security Center v3.4.x - Db2 pod stuck in pending state
Occasionally, a db2u pod may become stuck in the Pending state because the node labeled for the db2u pod does not have enough resources to run the pod. This troubleshooting topic explains how to resolve this issue.
Symptoms
This content applies only to Guardium Data Security Center Version 3.4.x and later.
During the installation of Guardium Data Security Center, you verify the creation of your Guardium Data Security Center instance. The expected output is a Completed Reconciliation message. If, instead, the message remains at Pending, Db2 may be stuck.
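As an optional quick check (not part of the original procedure), you can list any pods that are stuck in the Pending phase across all namespaces; if the db2u pod appears here, continue with the diagnosis below.
# Optional: list all pods in the Pending phase (standard oc/kubectl field selector)
oc get pods -A --field-selector=status.phase=Pending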
Causes
The node labeled for the db2u pod does not have enough resources to run the pod.
Diagnosing the problem
When this problem occurs, the first step is to get a list of all nodes labeled for Db2 by running this command:
oc get nodes -licp4data
The expected output is similar to:
NAME STATUS ROLES AGE VERSION
10.166.204.57 Ready master,worker 37d v1.19.0+1cec875
10.166.204.59 Ready master,worker 37d v1.19.0+1cec875
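If you also want to see each node's icp4data label value (for example, database-db2wh, which is the value used in the taint later in this procedure), you can add a label column. This variant is a suggestion, not part of the original steps:
# Optional: show the icp4data label value as an extra column
oc get nodes -l icp4data -L icp4data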
Next, determine which Db2 pod has the resource issue by running this command:
oc get po -owide | grep db2u
The expected output is similar to:
c-sysqa-db2-db2u-0 1/1 Running 0 3h22m 172.30.111.6 10.166.204.59 <none> <none>
c-sysqa-db2-db2u-1 0/1 Pending 0 3h22m <none> <none>
In this sample output, you can see that the c-sysqa-db2-db2u-1 pod is stuck in the Pending state; this is the pod that you need to free up resources for. You also need to determine which node the pending pod should be scheduled on. Because the running pod (c-sysqa-db2-db2u-0) is on the 10.166.204.59 node, the other labeled node, 10.166.204.57, is the node that should hold the second Db2 pod.
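To confirm that the pod is Pending because of insufficient resources (rather than, for example, a missing node label), you can inspect the pod's events; a FailedScheduling event normally reports which resource is insufficient. This check is optional and not part of the original procedure:
# Optional: show the scheduling events for the pending pod
oc describe pod c-sysqa-db2-db2u-1 | grep -A 10 Events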
Resolving the problem
In the above example, you determined that pods on the 10.166.204.57 node can be removed to free up resources. Issue this command to list the pods running on that node:
oc get po -A -owide | grep 10.166.204.57 | grep -v Completed
The expected output is similar to:
sysqa audit-logging-fluentd-ds-hfnr5 1/1 Running 0 22h 172.30.151.153 10.166.204.57 <none> <none>
sysqa auth-idp-584b98776f-l86bf 4/4 Running 0 3h23m 172.30.151.133 10.166.204.57 <none> <none>
sysqa cert-manager-controller-57595b9fb7-784pv 0/1 Running 0 17s 172.30.151.159 10.166.204.57 <none> <none>
sysqa db2u-operator-manager-cd64f96bf-rs7zm 0/1 Running 0 17s 172.30.151.135 10.166.204.57 <none> <none>
sysqa ibm-cert-manager-operator-7b5b5c89d8-n4tnp 1/1 Running 0 17s 172.30.151.131 10.166.204.57 <none> <none>
sysqa ibm-common-service-operator-86c795d9bd-5dbnw 0/1 Running 0 17s 172.30.151.129 10.166.204.57 <none> <none>
sysqa ibm-common-service-webhook-699b6cf98b-vxd68 1/1 Running 0 17s 172.30.151.164 10.166.204.57 <none> <none>
sysqa ibm-events-operator-v3.10.0-8575477cf5-gd92c 0/1 Running 0 17s 172.30.151.132 10.166.204.57 <none> <none>
sysqa ibm-healthcheck-operator-5c4d597-dqzrt 1/1 Running 0 17s 172.30.151.141 10.166.204.57 <none> <none>
sysqa ibm-iam-operator-798774d4f8-kdrbh 0/1 ContainerCreating 0 17s <none> 10.166.204.57 <none> <none>
sysqa ibm-monitoring-grafana-5cdcc65d85-pdcpv 4/4 Terminating 0 3h24m 172.30.151.168 10.166.204.57 <none> <none>
sysqa ibm-monitoring-grafana-5cdcc65d85-rqqdr 0/4 PodInitializing 0 17s 172.30.151.150 10.166.204.57 <none> <none>
sysqa icp-mongodb-2 0/2 Init:0/2 0 3s <none> 10.166.204.57 <none> <none>
calico-system calico-node-9p4xk 1/1 Running 0 21d 10.166.204.57 10.166.204.57 <none> <none>
calico-system calico-typha-647686bb4c-p8k7v 1/1 Running 0 13d 10.166.204.57 10.166.204.57 <none> <none>
sysqa audit-logging-fluentd-ds-kfzgz 1/1 Running 0 2m42s 172.30.151.140 10.166.204.57 <none> <none>
ibm-system ibm-cloud-provider-ip-169-55-180-58-5475776686-2bl4r 1/1 Running 0 3h26m 10.166.204.57 10.166.204.57 <none> <none>
kube-system ibm-keepalived-watcher-7tqdp 1/1 Running 0 21d 10.166.204.57 10.166.204.57 <none> <none>
kube-system ibm-master-proxy-static-10.166.204.57 2/2 Running 0 13d 10.166.204.57 10.166.204.57 <none> <none>
kube-system ibmcloud-block-storage-driver-9hzjm 1/1 Running 0 21d 10.166.204.57 10.166.204.57 <none> <none>
kube-system norootsquash-msvwq 1/1 Running 0 36d 172.30.151.174 10.166.204.57 <none> <none>
openshift-cluster-node-tuning-operator tuned-hw6jp 1/1 Running 0 21d 10.166.204.57 10.166.204.57 <none> <none>
openshift-dns dns-default-rd57f 3/3 Running 0 21d 172.30.151.179 10.166.204.57 <none> <none>
openshift-image-registry node-ca-hz2n8 1/1 Running 0 21d 10.166.204.57 10.166.204.57 <none> <none>
openshift-kube-proxy openshift-kube-proxy-29q8p 1/1 Running 0 21d 10.166.204.57 10.166.204.57 <none> <none>
openshift-marketplace community-operators-9qgc6 1/1 Running 0 149m 172.30.151.186 10.166.204.57 <none> <none>
openshift-marketplace ibm-operator-catalog-p95s4 0/1 ImagePullBackOff 0 27h 172.30.151.175 10.166.204.57 <none> <none>
openshift-monitoring node-exporter-glwrz 2/2 Running 0 21d 10.166.204.57 10.166.204.57 <none> <none>
openshift-multus multus-7brdx 1/1 Running 0 21d 10.166.204.57 10.166.204.57 <none> <none>
openshift-multus multus-admission-controller-6pz5j 2/2 Running 0 21d 172.30.151.178 10.166.204.57 <none> <none>
openshift-multus network-metrics-daemon-8nplx 2/2 Running 0 21d 172.30.151.184 10.166.204.57 <none> <none>
sysqa c-sysqa-db2-db2u-1 1/1 Running 0 3h23m 172.30.151.156 10.166.204.57 <none> <none>
sysqa c-sysqa-db2-restore-morph-tx7mf 1/1 Running 0 2m12s 172.30.151.155 10.166.204.57 <none> <none>
sysqa sysqa-entity-operator-57b8448f7c-wzrfp 3/3 Running 0 3h28m 172.30.151.170 10.166.204.57 <none> <none>
You can safely remove the pods returned by this command:
oc get pod -o=jsonpath='{.items[?(@.metadata.annotations.productName=="IBM Cloud Platform Common Services")].metadata.name}'
For example:
oc get pod -o=jsonpath='{.items[?(@.metadata.annotations.productName=="IBM Cloud Platform Common Services")].metadata.name}'
common-web-ui-969465588-qspzl ibm-common-service-operator-5c65546bfc-rsvt7
ibm-commonui-operator-7ddd78d576-j9rp2 ibm-events-operator-v4.9.0-7cdcd87689-42z5l ibm-iam-operator-6f59cf9d4-7vdz4
ibm-mongodb-operator-8d4dc7588-cf7v8 icp-mongodb-0 icp-mongodb-1 icp-mongodb-2 operand-deployment-lifecycle-manager-748bfc7ccc-sd7th
platform-auth-service-7bbff54f9d-v92b8 platform-identity-management-64d8ccd96b-f7n2g platform-identity-provider-c66999b5b-9g5h6
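If you want the listing to include each pod's namespace (useful when pods from several namespaces appear on the node), a variant of the same jsonpath query is shown below; this is a suggested variation, not part of the original procedure:
# Optional variant: print the namespace and name of each matching pod
oc get pod -A -o=jsonpath='{range .items[?(@.metadata.annotations.productName=="IBM Cloud Platform Common Services")]}{.metadata.namespace}{"\t"}{.metadata.name}{"\n"}{end}'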
- Before removing the pods, taint the node so that the pods are not rescheduled on it:
oc adm taint node 10.166.204.57 icp4data=database-db2wh:NoSchedule
- Next, remove the pods by running this sample command:
oc delete po `oc get po -A -owide | grep 10.166.204.57 | grep -v Completed | grep ibm-common-services | awk '{print $2}'` -n=ibm-common-services
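If you want to preview which pods would be deleted before actually removing them, you can append --dry-run=client to the same command; this preview step is optional and not part of the original procedure:
# Optional: preview the deletion without removing any pods
oc delete po `oc get po -A -owide | grep 10.166.204.57 | grep -v Completed | grep ibm-common-services | awk '{print $2}'` -n=ibm-common-services --dry-run=client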
There should now be enough room on the node for the db2u pod to be scheduled.
You can also check the resources by running this command:
oc describe node 10.166.204.57
The expected results are similar to:
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 3317m (20%) 7890m (49%)
memory 7007763Ki (11%) 7252256Ki (12%)
ephemeral-storage 1Gi (1%) 10Gi (11%)
Events: <none>
You can then check the available resources on the node in the Allocated resources section and compare them against the Db2 requirements to confirm that there are enough resources on the node for the db2u pod to run.
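As an optional shortcut for this comparison, you can print only the node's allocatable CPU and memory, which are the totals that the request percentages in the Allocated resources section are measured against. This command is a suggestion and not part of the original procedure:
# Optional: print the node's allocatable CPU and memory
oc get node 10.166.204.57 -o jsonpath='{.status.allocatable.cpu}{" cpu, "}{.status.allocatable.memory}{" memory"}{"\n"}'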
Finally, you can verify that the db2u pod is up by running this command:
oc get po -owide | grep db2u
The expected output is similar to:
c-sysqa-db2-db2u-0 1/1 Running 0 3h22m 172.30.111.6 10.166.204.59 <none> <none>
c-sysqa-db2-db2u-1 1/1 Running 0 3h22m 172.30.151.156 10.166.204.57 <none> <none>
If the pod is running, you can remove the taint on the node:
oc adm taint node 10.166.204.57 icp4data=database-db2wh:NoSchedule-
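Optionally, you can confirm that the taint was removed; if the node no longer has any taints, this command returns empty output. This verification step is a suggestion and not part of the original procedure:
# Optional: list any taints that remain on the node
oc get node 10.166.204.57 -o jsonpath='{.spec.taints}'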