Guardium Data Security Center v3.4.x - Db2 pod stuck in pending state

Occasionally a db2u pod may become stuck in the Pending state because the node labeled for the db2u pod does not have enough resources to run the pod. This troubleshooting topic explains how to resolve this issue.

Symptoms

This content applies only to Guardium Data Security Center Version 3.4.x and later.

During the installation of Guardium Data Security Center, you verify the creation of your Guardium Data Security Center instance. The expected output of this verification is a Completed Reconciliation message. If the status instead remains at Pending, a db2u pod might be stuck.

Causes

The node labeled for the db2u pod does not have enough resources to run the pod.

Diagnosing the problem

When this problem occurs, the first step is to get a list of all nodes labeled for Db2 by running this command:

oc get nodes -licp4data

The expected output is similar to:

NAME            STATUS   ROLES           AGE   VERSION
10.166.204.57   Ready    master,worker   37d   v1.19.0+1cec875
10.166.204.59   Ready    master,worker   37d   v1.19.0+1cec875

Note: This command lists only nodes labeled for Db2.

Next, you can determine which Db2 pod has the resource issue by running this command:

oc get po -owide | grep db2u

The expected output is similar to:

c-sysqa-db2-db2u-0     1/1     Running     0     3h22m   172.30.111.6     10.166.204.59   <none>           <none>
c-sysqa-db2-db2u-1     0/1     Pending     0     3h22m                                    <none>           <none>

In this sample output, you can see that the c-sysqa-db2-db2u-1 pod is stuck in the Pending state and it is the pod that you need to free up resources for.
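
As a minimal sketch, the Pending db2u pods can also be picked out of that listing with a small filter. The function name pending_db2u_pods is hypothetical, and the filter assumes the column layout shown in the sample output above (pod name in column 1, status in column 3).

```shell
# Hypothetical helper: reads "oc get po -owide" output on stdin and
# prints the names of db2u pods whose STATUS column is Pending.
# Assumes column 1 is NAME and column 3 is STATUS, as in the sample output.
pending_db2u_pods() {
    awk '$1 ~ /db2u/ && $3 == "Pending" { print $1 }'
}

# Usage (requires a logged-in oc session):
#   oc get po -owide | pending_db2u_pods
```

Running oc describe pod on a pod name that the filter prints typically shows scheduling events that explain why the pod could not be placed.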

You also need to determine which node the pending pod should run on. In the above example, the running c-sysqa-db2-db2u-0 pod is assigned to the 10.166.204.59 node, so the other labeled node, 10.166.204.57, is the node that should hold the second Db2 pod.
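
The comparison between the labeled nodes and the running db2u pods can be sketched as follows. The function name nodes_without_db2u and the temporary file paths are hypothetical, and the column positions are assumed from the sample output above (node name in column 7 of the pod listing for Running pods).

```shell
# Hypothetical helper: prints labeled nodes that have no Running db2u pod.
#   $1 = file holding "oc get nodes -licp4data" output
#   $2 = file holding "oc get po -owide" output
# Assumes NODE is column 7 of the pod listing, as in the sample output.
nodes_without_db2u() {
    awk '
        # First file: remember every labeled node, skipping the header row.
        FILENAME == ARGV[1] { if ($1 != "NAME") nodes[$1] = 1; next }
        # Second file: drop nodes that already host a Running db2u pod.
        $1 ~ /db2u/ && $3 == "Running" { delete nodes[$7] }
        END { for (n in nodes) print n }
    ' "$1" "$2"
}

# Usage (requires a logged-in oc session):
#   oc get nodes -licp4data > /tmp/db2-nodes.txt
#   oc get po -owide > /tmp/db2-pods.txt
#   nodes_without_db2u /tmp/db2-nodes.txt /tmp/db2-pods.txt
```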

Resolving the problem

In the above example, you determined that resources must be freed up on the 10.166.204.57 node. To list the pods that are running on that node, issue this command:

oc get po -A -owide | grep 10.166.204.57  | grep -v Completed

The expected output is similar to:

sysqa                                              audit-logging-fluentd-ds-hfnr5                                    1/1     Running                      0          22h     172.30.151.153   10.166.204.57   <none>           <none>
sysqa                                              auth-idp-584b98776f-l86bf                                         4/4     Running                      0          3h23m   172.30.151.133   10.166.204.57   <none>           <none>
sysqa                                              cert-manager-controller-57595b9fb7-784pv                          0/1     Running                      0          17s     172.30.151.159   10.166.204.57   <none>           <none>
sysqa                                              db2u-operator-manager-cd64f96bf-rs7zm                             0/1     Running                      0          17s     172.30.151.135   10.166.204.57   <none>           <none>
sysqa                                              ibm-cert-manager-operator-7b5b5c89d8-n4tnp                        1/1     Running                      0          17s     172.30.151.131   10.166.204.57   <none>           <none>
sysqa                                              ibm-common-service-operator-86c795d9bd-5dbnw                      0/1     Running                      0          17s     172.30.151.129   10.166.204.57   <none>           <none>
sysqa                                              ibm-common-service-webhook-699b6cf98b-vxd68                       1/1     Running                      0          17s     172.30.151.164   10.166.204.57   <none>           <none>
sysqa                                              ibm-events-operator-v3.10.0-8575477cf5-gd92c                      0/1     Running                      0          17s     172.30.151.132   10.166.204.57   <none>           <none>
sysqa                                              ibm-healthcheck-operator-5c4d597-dqzrt                            1/1     Running                      0          17s     172.30.151.141   10.166.204.57   <none>           <none>
sysqa                                              ibm-iam-operator-798774d4f8-kdrbh                                 0/1     ContainerCreating            0          17s     <none>           10.166.204.57   <none>           <none>
sysqa                                              ibm-monitoring-grafana-5cdcc65d85-pdcpv                           4/4     Terminating                  0          3h24m   172.30.151.168   10.166.204.57   <none>           <none>
sysqa                                              ibm-monitoring-grafana-5cdcc65d85-rqqdr                           0/4     PodInitializing              0          17s     172.30.151.150   10.166.204.57   <none>           <none>
sysqa                                              icp-mongodb-2                                                     0/2     Init:0/2                     0          3s      <none>           10.166.204.57   <none>           <none>
calico-system                                      calico-node-9p4xk                                                 1/1     Running                      0          21d     10.166.204.57    10.166.204.57   <none>           <none>
calico-system                                      calico-typha-647686bb4c-p8k7v                                     1/1     Running                      0          13d     10.166.204.57    10.166.204.57   <none>           <none>
sysqa                                              audit-logging-fluentd-ds-kfzgz                                    1/1     Running                      0          2m42s   172.30.151.140   10.166.204.57   <none>           <none>
ibm-system                                         ibm-cloud-provider-ip-169-55-180-58-5475776686-2bl4r              1/1     Running                      0          3h26m   10.166.204.57    10.166.204.57   <none>           <none>
kube-system                                        ibm-keepalived-watcher-7tqdp                                      1/1     Running                      0          21d     10.166.204.57    10.166.204.57   <none>           <none>
kube-system                                        ibm-master-proxy-static-10.166.204.57                             2/2     Running                      0          13d     10.166.204.57    10.166.204.57   <none>           <none>
kube-system                                        ibmcloud-block-storage-driver-9hzjm                               1/1     Running                      0          21d     10.166.204.57    10.166.204.57   <none>           <none>
kube-system                                        norootsquash-msvwq                                                1/1     Running                      0          36d     172.30.151.174   10.166.204.57   <none>           <none>
openshift-cluster-node-tuning-operator             tuned-hw6jp                                                       1/1     Running                      0          21d     10.166.204.57    10.166.204.57   <none>           <none>
openshift-dns                                      dns-default-rd57f                                                 3/3     Running                      0          21d     172.30.151.179   10.166.204.57   <none>           <none>
openshift-image-registry                           node-ca-hz2n8                                                     1/1     Running                      0          21d     10.166.204.57    10.166.204.57   <none>           <none>
openshift-kube-proxy                               openshift-kube-proxy-29q8p                                        1/1     Running                      0          21d     10.166.204.57    10.166.204.57   <none>           <none>
openshift-marketplace                              community-operators-9qgc6                                         1/1     Running                      0          149m    172.30.151.186   10.166.204.57   <none>           <none>
openshift-marketplace                              ibm-operator-catalog-p95s4                                        0/1     ImagePullBackOff             0          27h     172.30.151.175   10.166.204.57   <none>           <none>
openshift-monitoring                               node-exporter-glwrz                                               2/2     Running                      0          21d     10.166.204.57    10.166.204.57   <none>           <none>
openshift-multus                                   multus-7brdx                                                      1/1     Running                      0          21d     10.166.204.57    10.166.204.57   <none>           <none>
openshift-multus                                   multus-admission-controller-6pz5j                                 2/2     Running                      0          21d     172.30.151.178   10.166.204.57   <none>           <none>
openshift-multus                                   network-metrics-daemon-8nplx                                      2/2     Running                      0          21d     172.30.151.184   10.166.204.57   <none>           <none>
sysqa                                              c-sysqa-db2-db2u-1                                                1/1     Running                      0          3h23m   172.30.151.156   10.166.204.57   <none>           <none>
sysqa                                              c-sysqa-db2-restore-morph-tx7mf                                   1/1     Running                      0          2m12s   172.30.151.155   10.166.204.57   <none>           <none>
sysqa                                              sysqa-entity-operator-57b8448f7c-wzrfp                            3/3     Running                      0          3h28m   172.30.151.170   10.166.204.57   <none>           <none>

You can safely remove the pods returned by this command:

oc get pod -o=jsonpath='{.items[?(@.metadata.annotations.productName=="IBM Cloud Platform Common Services")].metadata.name}'

For example:

oc get pod -o=jsonpath='{.items[?(@.metadata.annotations.productName=="IBM Cloud Platform Common Services")].metadata.name}' 
common-web-ui-969465588-qspzl ibm-common-service-operator-5c65546bfc-rsvt7 
   ibm-commonui-operator-7ddd78d576-j9rp2 ibm-events-operator-v4.9.0-7cdcd87689-42z5l ibm-iam-operator-6f59cf9d4-7vdz4 
   ibm-mongodb-operator-8d4dc7588-cf7v8 icp-mongodb-0 icp-mongodb-1 icp-mongodb-2 operand-deployment-lifecycle-manager-748bfc7ccc-sd7th 
   platform-auth-service-7bbff54f9d-v92b8 platform-identity-management-64d8ccd96b-f7n2g platform-identity-provider-c66999b5b-9g5h6
  1. Before removing the pods, taint the node so that the pods are not rescheduled on it:
    oc adm taint node 10.166.204.57 icp4data=database-db2wh:NoSchedule 
  2. Next you will remove the pods by running this sample command:
    oc delete po `oc get po -A -owide | grep 10.166.204.57  | grep -v Completed | grep ibm-common-services | awk '{print $2}'` -n=ibm-common-services
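
Before deleting anything, it can help to preview exactly which pod names the pipeline in step 2 would select. The following is a minimal sketch of that selection step only; the function name pods_on_node is hypothetical, and it assumes the "oc get po -A -owide" column layout shown in the sample output above (namespace, name, and status in columns 1, 2, and 4, and node in column 8).

```shell
# Hypothetical helper: reads "oc get po -A -owide" output on stdin and
# prints the names of pods in one namespace that are placed on one node,
# skipping pods whose STATUS is Completed.
#   $1 = node name, $2 = namespace
# Assumes NAMESPACE=$1, NAME=$2, STATUS=$4, NODE=$8, as in the sample output.
pods_on_node() {
    awk -v node="$1" -v ns="$2" \
        '$1 == ns && $8 == node && $4 != "Completed" { print $2 }'
}

# Usage (requires a logged-in oc session); review the list before deleting:
#   oc get po -A -owide | pods_on_node 10.166.204.57 ibm-common-services
```

Matching on exact column values instead of grep avoids accidental matches, for example a pod whose name or IP address happens to contain the node address.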

There should now be enough room on the node for the db2u pod to get scheduled.

You can also check the resources by running this command:

oc describe node 10.166.204.57 

The expected results are similar to:

Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests         Limits
  --------           --------         ------
  cpu                3317m (20%)      7890m (49%)
  memory             7007763Ki (11%)  7252256Ki (12%)
  ephemeral-storage  1Gi (1%)         10Gi (11%)
Events:              <none>

You can then check available resources on the node in the Allocated resources section and compare against the Db2 requirements to confirm that there are enough resources on the node for the db2u pod to run.
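
To make that comparison easier to script, the requested percentage for a resource can be pulled out of the Allocated resources section. This is a minimal sketch; the function name allocated_request_pct is hypothetical, and it assumes the section layout shown in the sample output above, where the Requests percentage is the third whitespace-separated field.

```shell
# Hypothetical helper: reads "oc describe node" output on stdin and prints
# the Requests percentage for one resource from the Allocated resources
# section (for example cpu or memory).
#   $1 = resource name as shown in the Resource column
allocated_request_pct() {
    awk -v res="$1" '
        /^Allocated resources:/ { in_section = 1; next }
        # Third field looks like "(20%)"; strip the parentheses and percent sign.
        in_section && $1 == res { gsub(/[()%]/, "", $3); print $3; exit }
    '
}

# Usage (requires a logged-in oc session):
#   oc describe node 10.166.204.57 | allocated_request_pct cpu
```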

Finally, you can verify that the db2u pod is up by running this command:

oc get po -owide | grep db2u

The expected output is similar to:

c-sysqa-db2-db2u-0     1/1     Running     0          3h22m   172.30.111.6     10.166.204.59   <none>           <none>
c-sysqa-db2-db2u-1     1/1     Running     0          3h22m   172.30.151.156   10.166.204.57   <none>           <none>

If the pod is running, you can remove the taint on the node:

oc adm taint node 10.166.204.57 icp4data=database-db2wh:NoSchedule-