IBM Support

Resolving common issues of Cloud Pak for Data on Cloud Pak for Data Systems

Troubleshooting


Problem

This technote covers three common issues:
1) The UI keeps loading (spinning circle).
2) Unbalanced distribution of pods across the cluster nodes.
3) Pods from network-diag namespaces are in ImagePullBackOff or ErrImagePull state.

Symptom

1)
Slow UI navigation and 404 errors can be seen in the browser's Console and Network tabs.
This is often caused by an invalid image being pulled by the zen-core deployment.
Check the output of the following command:
  oc describe po zen-core | grep Image | grep zen-core@
The expected output shows the same Image ID listed three times, like so:
      Image ID:      docker-pullable://docker-registry.default.svc:5000/test/zen-core@sha256:a136d6fc976790aafb07d9edf25aef4e85c90724c4212f92b7fa2911a1c7aef8
      Image ID:      docker-pullable://docker-registry.default.svc:5000/test/zen-core@sha256:a136d6fc976790aafb07d9edf25aef4e85c90724c4212f92b7fa2911a1c7aef8
      Image ID:      docker-pullable://docker-registry.default.svc:5000/test/zen-core@sha256:a136d6fc976790aafb07d9edf25aef4e85c90724c4212f92b7fa2911a1c7aef8
If the IDs differ, follow the steps in 'Resolving The Problem'.
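That comparison can also be scripted. The sketch below counts the distinct digests; any result other than 1 indicates a mismatch. The sample output and its digests are illustrative only; in practice you would pipe in the real output of the oc describe command above.

```shell
# Sketch: count distinct zen-core image digests; more than 1 means a mismatch.
# The sample below is illustrative; feed it real output from:
#   oc describe po zen-core | grep Image | grep zen-core@
sample='Image ID: docker-registry.default.svc:5000/test/zen-core@sha256:aaa111
Image ID: docker-registry.default.svc:5000/test/zen-core@sha256:aaa111
Image ID: docker-registry.default.svc:5000/test/zen-core@sha256:bbb222'
distinct=$(printf '%s\n' "$sample" | awk -F'@' '{print $2}' | sort -u | wc -l | tr -d ' ')
echo "distinct zen-core digests: $distinct"   # 2 for this sample, i.e. a mismatch
```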

2)

This can be observed when at least a couple of Cloud Pak for Data services are deployed, and it becomes more significant as the number of pods increases.

To determine if you are affected by this, run:

  oc describe node | sed -n -e '/Allocated/,/Events/p' -e 's/Name:/\n---:>&/p'

and

  oc get po --all-namespaces -o wide --no-headers | grep -v Completed | awk '{print $(NF-1)}' | sort | uniq -c

Note whether one node (in most cases a worker) is significantly more utilized than the rest. For example:

         22 e1n1-1-control.fbond
         22 e1n2-1-control.fbond
         22 e1n3-1-control.fbond
        125 e1n4-1-worker.fbond
         44 e2n1-1-worker.fbond
         55 e2n2-1-worker.fbond
         49 e2n3-1-worker.fbond
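As a quick way of quantifying the skew, the sketch below compares the busiest worker with the worker average. The counts are hard-coded to the example values above; in practice they would come from the oc get po pipeline.

```shell
# Sketch: quantify the imbalance using the example worker pod counts above.
counts="125 44 55 49"   # worker nodes only (example values, not live data)
total=0
max=0
for c in $counts; do
  total=$((total + c))
  if [ "$c" -gt "$max" ]; then max=$c; fi
done
workers=4
avg=$((total / workers))
echo "busiest worker: $max pods, worker average: $avg pods"
```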

Such heavy utilization of a single node can lead to many negative consequences, for example:

https://www.ibm.com/support/pages/node/6098818


3)

Running:

  oc get po --no-headers --all-namespaces -o wide| grep -Ev '([[:digit:]])/\1.*R' | grep -v 'Completed'

shows pods stuck in ImagePullBackOff or ErrImagePull state.

Resolving The Problem

1)

We are still working to completely resolve this issue in our repository; currently, the best workaround is to update the imagePullPolicy of the zen-core deployment from IfNotPresent to Always:

  oc patch deploy zen-core -n zen -p '{"spec": {"template": {"spec":{"containers":[{"name": "zen-core-container", "imagePullPolicy":"Always"}]}}}}'

2)

To even out the distribution and reduce the workload on the busiest worker, consider the following steps:

    a) Modify the following parameters in /etc/origin/master/yosemite-appmgnt-scheduler.json on each master node:

  MasterLastPriority - weight: 20
  SelectorSpreadPriority - weight: 10
  LeastRequestedPriority - weight: 10
  BalancedResourceAllocation - weight: 5
  Zone - weight: 20

     Run the following commands on each master node after making the changes:

       master-restart api
       master-restart controllers

     This should have a positive impact on the default scheduler's behavior (more balanced workload distribution).

     b) Estimate the optimal value of the max-pods parameter based on the current workload.

     Sum up all the pods on the workers (in our example: 125+44+55+49 = 273), multiply the sum by a factor of 1.1 (273 * 1.1 ≈ 300), and finally divide by the number of workers minus one (300 / (4-1) = 100). This number is only an estimate; if you plan to add more services in the near future, you may want to increase it.
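The arithmetic in step b can be written out as a small script. This is just a sketch using the example numbers above (273 pods, 4 workers, 1.1 headroom factor), with the factor expressed as a percentage so the shell's integer math suffices:

```shell
# Sketch of the step-b estimate, using the example numbers from above.
total=273       # sum of pods currently on the workers (125+44+55+49)
workers=4       # number of worker nodes
factor_pct=110  # the 1.1 headroom factor, as a percentage for integer math
padded=$((total * factor_pct / 100))    # 273 * 1.1 = 300 (rounded down)
max_pods=$((padded / (workers - 1)))    # 300 / 3 = 100
echo "suggested max-pods: $max_pods"
```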

     c) Modify the worker node config-map as per https://docs.openshift.com/container-platform/3.11/admin_guide/manage_nodes.html#modifying-nodes

       oc edit cm node-config-compute-crio -n openshift-node

     add the following entries under kubeletArguments (make sure the indentation is correct):

    max-pods:
    - "<result of step b>"
    kube-reserved:
    - "cpu=500m,memory=512Mi"
    system-reserved:
    - "cpu=500m,memory=512Mi"
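For example, with the step-b result of 100 from the calculation above, the kubeletArguments section of the config-map would look like this (an illustrative sketch using the reserved values suggested above; adjust them to your own cluster):

```yaml
kubeletArguments:
  max-pods:
  - "100"
  kube-reserved:
  - "cpu=500m,memory=512Mi"
  system-reserved:
  - "cpu=500m,memory=512Mi"
```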

     After this change, you should see many pods in OutOfpods state being migrated to less busy worker nodes.


3)

Run the following script to update the images with valid tags:

  for node in $(oc get nodes | awk '/ Ready/ {print $1}')
  do
    echo $node
    ssh $node podman tag registry.access.redhat.com/openshift3/ose-deployer:v3.11.154 registry.redhat.io/openshift3/ose-deployer:v3.11
    ssh $node podman tag registry.access.redhat.com/openshift3/ose-control-plane:v3.11.154 registry.access.redhat.com/openshift3/ose-control-plane:v3.11
  done
Note that this script contains values specific to the OpenShift version (v3.11.154 in this case).
Check the output of oc version or podman images to verify the image versions on your cluster.

Document Location

Worldwide

[{"Business Unit":{"code":"BU053","label":"Cloud & Data Platform"},"Product":{"code":"SSHGYS","label":"IBM Cloud Pak for Data"},"ARM Category":[{"code":"a8m0z000000GoylAAC","label":"Troubleshooting"}],"ARM Case Number":"","Platform":[{"code":"PF025","label":"Platform Independent"}],"Version":"All Version(s)","Line of Business":{"code":"LOB10","label":"Data and AI"}},{"Business Unit":{"code":"BU053","label":"Cloud & Data Platform"},"Product":{"code":"SSHDA9","label":"IBM Cloud Pak for Data System"},"ARM Category":[],"Platform":[{"code":"PF025","label":"Platform Independent"}],"Version":"All Version(s)","Line of Business":{"code":"LOB10","label":"Data and AI"}}]

Document Information

Modified date:
08 June 2020

UID

ibm16218910