Known issues and limitations
Review the known issues for version 3.2.1.
- Installation, configuration, and upgrade
- Security and compliance
- Network
- Catalog
- Storage
- Monitoring and logging
- Platform management
- Cluster management
- Management console
- IBM Cloud Private CLI (cloudctl)
- Known limitations of IBM Cloud Private on Linux on IBM Z and LinuxONE
- Cannot deploy more than one instance of IBM Cloud Private Certificate Manager
Installation, configuration, and upgrade
IBM Multicloud Manager endpoint pods not running properly during fresh installation
When performing a fresh installation of fix pack version 3.2.1.2203 or 3.2.2.2203 with IBM Multicloud Manager enabled, the following pods might not start properly:
endpoint-svcreg-coredns-74dcccd969-m88xs 0/1 ImagePullBackOff 0 8m7s
endpoint-topology-weave-scope-app-54d76cf657-th9m5 1/2 ImagePullBackOff 0 8m9s
endpoint-topology-weave-scope-app-6dd54dbc64-65pj5 1/2 ImagePullBackOff 0 8m
Workaround:
- Go to the node that the pod is assigned to, and retag the following images:
docker tag mycluster.icp:8500/ibmcom/coredns-amd64:1.9.1 mycluster.icp:8500/ibmcom/coredns-amd64:1.2.6.1
docker tag mycluster.icp:8500/ibmcom/icp-management-ingress-amd64:2.4.0.2203 mycluster.icp:8500/ibmcom/icp-management-ingress-amd64:2.4.0.2105
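If you are not sure which node a failing pod is assigned to, the node name appears in the NODE column of the wide pod listing. The following is a minimal sketch using a standard kubectl command:
kubectl get pods --all-namespaces -o wide | grep endpoint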
Adding a node to the cluster fails
Adding a node to the cluster with the following command fails:
# Log in to the production master node louapplps228 via PuTTY or another SSH tool
# Become root
sudo su -
# Install node (on louapplps228)
cd /opt/icp310/cluster
# install single node(s)
docker run -e LICENSE=accept --net=host -v "$(pwd)":/installer/cluster ibmcom/icp-inception-amd64:3.1.0-ee worker -l 10.97.38.82
To resolve this issue, perform the following steps to remove the check for the nfnetlink module when using the inception installer.
- Get the network file from the inception image:
cd /opt/ibm-cloud-private-3.1.0/cluster
sudo docker run -v $(pwd):/data -e LICENSE=accept ibmcom/icp-inception-amd64:3.1.0-ee cp -r playbook/roles/network-check/tasks/calico.yaml /data
- Edit the network check file to remove the line - nfnetlink from the Calico Validation - Preparing list of required kernel modules task:
vi calico.yaml
- When you add a worker node or run any other installer command, you must mount the edited file into the container. For example:
docker run -e LICENSE=accept --net=host -v "$(pwd)":/installer/cluster -v "$(pwd)"/calico.yaml:/installer/playbook/roles/network-check/tasks/calico.yaml ibmcom/icp-inception-amd64:3.1.0-ee worker -l 10.97.38.82
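After the installer completes, you can optionally confirm that the new worker joined the cluster. The following is a minimal sketch, assuming kubectl is configured on the master node; the IP address is the example node from the preceding command:
kubectl get nodes -o wide | grep 10.97.38.82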
For more information on installing IBM Cloud Private on Red Hat Enterprise Linux version 7.9, see Supported operating systems and platforms.
Container fails to start due to Docker issue
Installation fails during container creation due to a Docker 18.03.1 issue. If you have a subPath in the volume mount, you might receive the following error from the kubelet service, which fails to start the container:
Error: failed to start container "heketi": Error response from daemon: OCI runtime create failed: container_linux.go:348: starting container process caused "process_linux.go:402: container init caused \"rootfs_linux.go:58: mounting \\\"/var/lib/kubelet/pods/7e9cb34c-b2bf-11e8-a9eb-0050569bdc9f/volume-subpaths/heketi-db-secret/heketi/0\\\" to rootfs \\\"/var/lib/docker/overlay2/ca0a54812c6f5718559cc401d9b73fb7ebe43b2055a175ee03cdffaffada2585/merged\\\" at \\\"/var/lib/docker/overlay2/ca0a54812c6f5718559cc401d9b73fb7ebe43b2055a175ee03cdffaffada2585/merged/backupdb/heketi.db.gz\\\" caused \\\"no such file or directory\\\"\"": unknown
For more information, see the Kubernetes documentation.
To resolve this issue, delete the failed pod and try the installation again.
Statefulset pod fails to mount the volume on IBM Cloud Private on OpenShift Container Platform PowerVC cluster in Linux on Power environment
In an IBM Power environment, this issue appears when storageclass: "ibm-powervc-k8s-volume-default" is defined in the cluster. To identify this issue, complete the following steps:
- Enter the following command to list your storage classes:
# kubectl get sc
The result lists the storage classes, which appear similar to the following example:
NAME                                        PROVISIONER                           AGE
ibm-powervc-k8s-volume-default (default)    ibm/powervc-k8s-volume-provisioner    8d
- When the Statefulset pod is not in the Running state, check the pods by using the kubectl command, as shown in the following example:
# kubectl get pods -n cem | grep -v Running
You can see that the pod is not running in the following example:
scao-ibm-cem-datalayer-0   0/1   ContainerCreating   0   17h
- Check the pod description by using the kubectl command, as shown in the following example:
# kubectl describe pod scao-ibm-cem-datalayer-0 -n cem
The response might be similar to the following example:
Warning  FailedMount  33s (x14 over 14m)  kubelet, kt-ocp-with-fvd-massive-cod-worker-2.novalocal  MountVolume.MountDevice failed for volume "pvc-7264cfc9-d56c-11e9-960a-fad85bbb5a20" : mountdevice command failed, status: Failed, reason: Could not create file system on attached volume directory /dev/dm-9. Error is exit status 1
This confirms the issue. The failure occurs because the FlexVolume driver does not correctly remove the mounts and multipath devices after the FlexVolume driver LUN is unmapped or deleted from the worker node.
To resolve this issue, complete the following steps:
- Log in to the worker nodes where the pod was scheduled.
- Manually delete the multipath drive and devices.
- Determine which device is mapped to /dev/dm-x by using the multipath command, as shown in the following example:
# multipath -ll
The response might look like the following content:
mpathj (36005076d0281005ef000000000002db0) dm-9 IBM     ,2145
size=5.0G features='1 queue_if_no_path' hwhandler='0' wp=rw
|-+- policy='service-time 0' prio=50 status=active
| |- 1:0:0:4 sdq 65:0  active ready running
| `- 2:0:1:4 sdv 65:80 active ready running
`-+- policy='service-time 0' prio=10 status=enabled
  |- 1:0:1:4 sdr 65:16 active ready running
  `- 2:0:0:4 sdt 65:48 active ready running
- Determine whether the multipath drive, mpathj, is an unmapped LUN from the SAN storage by running multipath on the drive. For example:
# multipath /dev/mapper/mpathj
If the LUN is unmapped from the SAN, the response might be similar to the following example:
Sep 13 02:50:19 | sdx: couldn't get target port group
Sep 13 02:50:19 | sdag: couldn't get target port group
Sep 13 02:50:19 | sds: couldn't get target port group
Sep 13 02:50:19 | sdah: couldn't get target port group
Sep 13 02:50:19 | sdu: couldn't get target port group
Sep 13 02:50:19 | sdai: couldn't get target port group
Sep 13 02:50:19 | sdw: couldn't get target port group
Sep 13 02:50:19 | sdaj: couldn't get target port group
Sep 13 02:50:19 | 65:0: path wwid appears to have changed. Using map wwid.
Sep 13 02:50:19 | 65:80: path wwid appears to have changed. Using map wwid.
Sep 13 02:50:19 | 65:16: path wwid appears to have changed. Using map wwid.
Sep 13 02:50:19 | 65:48: path wwid appears to have changed. Using map wwid.
- Delete the multipath drive by entering the following command:
# multipath -f mpathj
- Delete the block device by entering the following commands:
# echo 1 > /sys/block/sdq/device/delete
# echo 1 > /sys/block/sdv/device/delete
# echo 1 > /sys/block/sdr/device/delete
# echo 1 > /sys/block/sdt/device/delete
- Confirm that the pod status is in the running state by entering the following command:
# kubectl get pods -n cem | grep "scao-ibm-cem-datalayer-0"
The result should confirm that the pod is in the running state, as shown in the following example:
scao-ibm-cem-datalayer-0   1/1   Running   0   17h
Certificate errors occur when you upgrade with kubelet dynamic configuration enabled
If you upgrade to IBM Cloud Private Version 3.2.1 from an earlier version where kubelet dynamic configuration is enabled, you might encounter certificate-related errors when you try to call the kubelet API or run Kubernetes CLI commands.
By default in IBM Cloud Private Version 3.2.1, kubelet server certification rotation is enabled. If you enabled the kubelet dynamic configuration in your earlier version of IBM Cloud Private, the dynamic configuration does not retrieve the latest
kubelet service configuration when you upgrade to version 3.2.1. This results in a mismatch of the certificates between the Kubernetes API server (kube-apiserver) and the kubelet service where the dynamic configuration is enabled.
For more information about this issue, see Troubleshooting: Certificate errors occur after upgrade with kubelet dynamic configuration.
After an upgrade, a kubelet use of closed network connection error causes pods to remain in Terminating status
When you upgrade to the 3.2.2.2006 or 3.2.2.2008 fix pack, you might encounter errors that affect the kubelet and that cause pods to remain in a terminating status. If this occurs, you might see use of closed network connection errors
within the kubelet logs.
To resolve the errors and restore the nodes to a ready state, routinely restart the kubelet on the affected nodes. To automatically restart the kubelet when this error occurs, run the following bash script on a schedule, such as every five minutes.
#!/bin/bash
output=$(journalctl -u kubelet -n 1 | grep "use of closed network connection")
if [[ $? != 0 ]]; then
echo "Error not found in logs"
elif [[ $output ]]; then
echo "Restart kubelet"
systemctl restart kubelet
fi
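To run the script on a schedule, you can use cron on each affected node. The following entry is a sketch that assumes the script is saved as /usr/local/bin/restart-kubelet-check.sh and is executable:
*/5 * * * * /usr/local/bin/restart-kubelet-check.sh >> /var/log/restart-kubelet-check.log 2>&1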
IBM Cloud Private version is not updated after rollback
If you upgraded IBM Cloud Private Version 3.2.1 to Version 3.2.1.2008 by applying the fix pack, and then rolled back to IBM Cloud Private Version 3.2.1, the IBM Cloud Private version still shows as 3.2.1.2008. You see the wrong version only in the CLI. The management console shows the correct version.
Following is the CLI command to view the IBM Cloud Private version:
cat /opt/ibm/cfc/version
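For example, after the rollback the file still reports the fix pack level, similar to the following output:
# cat /opt/ibm/cfc/version
3.2.1.2008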
After an upgrade, the Vulnerability Advisor pod is in CrashLoopBackOff when updating the default policy
After applying a fix pack, you can encounter an error when updating the default Vulnerability Advisor policy. This error affects the vulnerability-advisor-sas-apiserver pod and can cause the policy update to fail and the pod to be in a CrashLoopBackOff state. To resolve this issue, regenerate the Elasticsearch template and indices by completing the following steps:
- Stop sas-api by updating the replica value to 0:
kubectl edit deployment sas-api-server --namespace=kube-system
Ensure that the pod stops after the update.
kubectl --namespace=kube-system get pods | grep sas-api
- Log in to Kibana Dev Tools and delete the sas_info index and sas_info template:
DELETE sas_info
DELETE _template/sas_info
- Redeploy sas-api by updating the replica value to 1:
kubectl edit deployment sas-api-server --namespace=kube-system
Ensure that the pod successfully restarts after the update.
kubectl --namespace=kube-system get pods | grep sas-api
multicluster-hub-etcd and multicluster-hub-core pods fail to start after an install or upgrade
After you install or upgrade to the 3.2.2.2012 fix pack version with IBM Multicloud Manager enabled, the multicluster-hub-etcd and multicluster-hub-core pods can fail to start and be in a CrashLoopBackOff state.
This error occurs because the mount directory does not have the required file permission (700) set. To resolve this issue, mount a local directory as a replacement:
- Create a local directory, such as /var/lib/etcd-mcm, and set the directory permission to 700 (a sketch of these commands follows these steps).
- Modify the etcd statefulset to use the local path as the storage volume by running the following command:
kubectl edit statefulset multicluster-hub-etcd -n kube-system
The statefulset contains the following etcd-data volume:
volumes:
- name: etcd-data
  persistentVolumeClaim:
    claimName: multicluster-hub-ibm-mcm-prod-etcd-pvc
Replace the preceding etcd-data volume with the following setting, where the path is your local directory:
volumes:
- name: etcd-data
  hostPath:
    path: /var/lib/etcd-mcm
    type: ""
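The following is a minimal sketch of the commands for the first step, assuming /var/lib/etcd-mcm is the local directory; run them on the node where the etcd pod is scheduled:
mkdir -p /var/lib/etcd-mcm
chmod 700 /var/lib/etcd-mcm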
Security and compliance
- Vulnerability Advisor cross-architecture image scanning does not work with glibc version earlier than 2.22
- Vulnerability Advisor policy resets to the default setting after you upgrade from 3.2.0 in ppc64le cluster
- ACME HTTP issuer cannot issue certificates in OpenShift clusters
- ACME HTTP issuer image is not copied to the worker nodes
- Cannot get secret by using kubectl command when encryption of secret data at rest is enabled
- LDAP user names are case-sensitive
- Vulnerability Advisor cannot scan unsupported container images
- Image security enforcement is only supported by IBM Multicloud Manager registries
Vulnerability Advisor cross-architecture image scanning does not work with glibc version earlier than 2.22
Vulnerability Advisor (VA) now supports cross-architecture image scanning with QEMU (Quick EMUlator). You can scan Linux® on Power® (ppc64le) CPU architecture images with VA running on Linux® nodes. Alternatively, you can scan Linux CPU architecture images with VA running on Linux® on Power® (ppc64le) nodes.
When scanning Linux images, you must use glibc version 2.22 or later. If you use a glibc version earlier than 2.22, the scan might not work when VA runs on Linux® on Power® (ppc64le) nodes. glibc versions earlier than 2.22 make certain syscalls (time/vgetcpu/gettimeofday) by using the vsyscall mechanism. The syscall implementation attempts to access a hardcoded static address, which QEMU fails to translate while running in emulation mode.
Vulnerability Advisor policy resets to default setting after upgrade from 3.2.0 in ppc64le cluster
If you enabled Vulnerability Advisor (VA) on your Linux® on Power® (ppc64le) cluster in 3.2.0, the Vulnerability Advisor policy resets to the default setting when you upgrade to 3.2.1. To fix this issue, reset the VA policy in the management console.
ACME HTTP issuer cannot issue certificates in OpenShift clusters
IBM Cloud Private Version 3.2.1 does not apply the required permissions to the default service account for the certificate manager service in OpenShift clusters. This limitation prevents the ACME HTTP issuer from being able to process challenge requests, which prevents certificates from being issued from this issuer.
ACME HTTP issuer image is not copied to the worker nodes
The ACME HTTP issuer is added in IBM Cloud Private Version 3.2.1. You can configure the ACME HTTP issuer in your cluster to create certificates from a trusted certificate authority (CA). This feature is optional. If you choose to configure this feature in your cluster, you must complete either of the following steps:
- Manually push the following Docker image to all the worker nodes in your cluster:
ibmcom/icp-cert-manager-acmesolver:0.7.0
- Create an image pull secret and associate it with the default service account for the namespace where you are creating the certificates. For more information, see Add ImagePullSecrets to a service account.
Cannot get secret by using kubectl command when encryption of secret data at rest is enabled
When you enable encryption of secret data at rest and use the kubectl command to get a secret, you sometimes might not be able to retrieve the secret. You might see the following error message in kube-apiserver:
Internal error occurred: invalid padding on input
This error occurs because kube-apiserver failed to decrypt the encrypted data in etcd. For more information about the issue, see Random "invalid padding on input" errors when attempting various kubectl operations.
To resolve the issue, delete the secret and re-create it. Use the following command:
kubectl -n <namespace> delete secret <secret>
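Re-creating the secret depends on its original contents. The following is a minimal sketch for a generic secret with a single hypothetical key-value pair; adjust it to match the data that the deleted secret contained:
kubectl -n <namespace> create secret generic <secret> --from-literal=<key>=<value>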
For more information about encrypting secret data at rest, see Encrypting Secret Data at Rest.
LDAP user names are case-sensitive
User names are case-sensitive. You must use the name exactly the way it is configured in your LDAP directory.
Vulnerability Advisor cannot scan unsupported container images
Container images that are not supported by the Vulnerability Advisor fail the security scan.
The Security Scan column on the Container Images page in the management console displays Failed. When you select the failed container image name to view more details, zero issues are detected.
Image security enforcement is only supported by IBM Multicloud Manager registries
When you enable Vulnerability Advisor (VA) scanning in the ImagePolicy and ClusterImagePolicy specification, you are unable to create workloads in the associated namespaces. The VA scanning integration with image security enforcement only supports the built-in IBM Multicloud Manager registry. For more information, see Scanning an image registry with the Vulnerability Advisor (VA).
Network
- Cookie affinity does not work when FIPS is enabled
- IPv6 is not supported
- Calico prefix limitation on Linux® on Power® (ppc64le) nodes
- Enable Ingress Controller to use a new annotation prefix
- Intermittent failure when you log in to the management console in HA clusters that use NSX-T 2.3 or 2.4
- Encrypting cluster data network traffic with IPsec does not work on SLES 12 SP3 operating system
- Elasticsearch does not work with GlusterFS
- NGINX ingress rewrite-target annotation fails when you upgrade to IBM Cloud Private Version 3.2.1
- Pods not reachable from NGINX ingress controller in OpenShift Version 3.11 multitenant mode
- kube-dns pods are not evicted even though the node is down
Cookie affinity does not work when FIPS is enabled
When Federal Information Processing Standard (FIPS) is enabled, cookie affinity doesn't work because nginx.ingress.kubernetes.io/session-cookie-hash can be set only to sha1/md5/index, which is not supported in FIPS
mode.
IPv6 is not supported
IBM Cloud Private cannot use IPv6 networks. Comment out the settings in the /etc/hosts file on each cluster node to remove the IPv6 settings. For more information, see Configuring your cluster.
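For example, the IPv6 entries that typically appear in /etc/hosts can be commented out as in the following sketch; your file might contain different entries:
# ::1         localhost ip6-localhost ip6-loopback
# fe00::0     ip6-localnet
# ff00::0     ip6-mcastprefix
# ff02::1     ip6-allnodes
# ff02::2     ip6-allrouters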
Calico prefix limitation on Linux® on Power® (ppc64le) nodes
If you install IBM Cloud Private on PowerVM Linux LPARs and your virtual Ethernet devices use the ibmveth prefix, you must set the network adapter to use Calico networking. During installation, be sure to set a calico_ip_autodetection_method parameter value in the config.yaml file. The setting resembles the following content:
calico_ip_autodetection_method: interface=<device_name>
The <device_name> parameter is the name of your network adapter. You must specify the ibmveth0 interface on each node of the cluster, including the worker nodes.
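To find the value to use for <device_name>, you can list the network interfaces on a node. The following is a minimal sketch that filters for the ibmveth prefix:
ip -o link show | grep ibmveth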
Note: If you used PowerVC to deploy your cluster node, this issue does not affect you.
Enable Ingress Controller to use a new annotation prefix
- The NGINX ingress annotation contains a new prefix, nginx.ingress.kubernetes.io, in version 0.9.0, which is used in IBM Cloud Private 3.2.1. This change uses a flag to avoid breaking deployments that are running.
- To avoid breaking a running NGINX ingress controller, add the --annotations-prefix=ingress.kubernetes.io flag to the NGINX ingress controller deployment. The IBM Cloud Private ingress controller accepts the flag by default.
- If you want to use the new ingress annotation, update the ingress controller by removing the --annotations-prefix=ingress.kubernetes.io flag. To remove the flag, run the following commands.
Note: Run the following commands from the master node.
For Linux®, run the following command:
kubectl edit ds nginx-ingress-lb-amd64 -n kube-system
For Linux® on Power® (ppc64le), run the following command:
kubectl edit ds nginx-ingress-lb-ppc64le -n kube-system
Save and exit to implement the change. The ingress controller restarts to receive the new configuration.
Intermittent failure when you log in to the management console in HA clusters that use NSX-T 2.3 or 2.4
In HA clusters that use NSX-T 2.3 or 2.4, you might not be able to log in to the management console. After you specify the login credentials, you are redirected to the login page. You might have to try logging in multiple times until you succeed. This issue is intermittent.
Encrypting cluster data network traffic with IPsec does not work on SLES 12 SP3 operating system
strongSwan version 5.3.3 or higher is necessary to deploy IPsec mesh configuration for cluster data network traffic encryption. In SUSE Linux Enterprise Server (SLES) 12 SP3, the default strongSwan version is 5.1.3, which is not suitable for IPsec mesh configuration.
Elasticsearch does not work with GlusterFS
Elasticsearch does not work correctly with GlusterFS that is configured in an IBM® Cloud Private environment. This issue is due to the following AlreadyClosedException error. For more information, see Red Hat Bugzilla – Bug 1430659.
[2019-01-17T10:53:49,750][WARN ][o.e.c.a.s.ShardStateAction] [logging-elk-master-7df4b7bdfc-5spqc] \
[logstash-2019.01.16][3] received shard failed for shard id [[logstash-2019.01.16][3]], allocation id \
[n9ZpABWfS4qJCyUIfEgHWQ], primary term [0], message [shard failure, reason \
[already closed by tragic event on the index writer]], \
failure [AlreadyClosedException[Underlying file changed by an external force at 2019-01-17T10:44:48.410502Z,\
(lock=NativeFSLock(path=/usr/share/elasticsearch/data/nodes/0/indices/R792nkojQ7q1UCYSEO4trQ/3/index/write.lock,\
impl=sun.nio.ch.FileLockImpl[0:9223372036854775807 exclusive valid],creationTime=2019-01-17T10:44:48.410324Z))]]
org.apache.lucene.store.AlreadyClosedException: Underlying file changed by an external force at 2019-01-17T10:44:48.410502Z,\
(lock=NativeFSLock(path=/usr/share/elasticsearch/data/nodes/0/indices/R792nkojQ7q1UCYSEO4trQ/3/index/write.lock,\
impl=sun.nio.ch.FileLockImpl[0:9223372036854775807 exclusive valid],creationTime=2019-01-17T10:44:48.410324Z))
NGINX ingress rewrite-target annotation fails when you upgrade to IBM Cloud Private Version 3.2.1
IBM® Cloud Private Version 3.2.1 uses NGINX Ingress Controller Version 0.23.0. Starting in NGINX Ingress Controller Version 0.22.0, ingress definitions that use the annotation nginx.ingress.kubernetes.io/rewrite-target are not compatible
with an earlier version. For more information, see Rewrite Target.
When you upgrade to IBM Cloud Private Version 3.2.1, you must replace the ingress.kubernetes.io/rewrite-target annotation with the following piece of code:
ingress.kubernetes.io/use-regex: "true"
ingress.kubernetes.io/configuration-snippet: |
  rewrite "(?i)/old/(.*)" /new/$1 break;
  rewrite "(?i)/old$" /new/ break;
Where old is the path that is defined in your ingress resource, and new is the URI to access your application.
For example, if web/nginx is the path for your NGINX application in the ingress source, and the URI to access your application is /, then you rewrite the annotation as shown in the following example:
# kubectl get ingress nginx -o yaml
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  annotations:
    ingress.kubernetes.io/configuration-snippet: |
      rewrite "(?i)/web/nginx/(.*)" /$1 break;
      rewrite "(?i)/web/nginx$" / break;
    ingress.kubernetes.io/use-regex: "true"
  name: nginx
  namespace: default
spec:
  rules:
  - host: demo.nginx.net
    http:
      paths:
      - backend:
          serviceName: nginx
          servicePort: 80
        path: /web/nginx
status:
  loadBalancer:
    ingress:
    - ip: 9.30.118.39
# curl http://demo.nginx.net/web/nginx
<!DOCTYPE html>
<html>
<head>
<title>Welcome to nginx!</title>
<style>
body {
width: 35em;
margin: 0 auto;
font-family: Tahoma, Verdana, Arial, sans-serif;
}
</style>
</head>
<body>
<h1>Welcome to nginx!</h1>
Pods not reachable from NGINX ingress controller in OpenShift Version 3.11 multitenant mode
In OpenShift Version 3.11 clusters with multitenant isolation mode, each project is isolated by default. Network traffic is not allowed between pods or services in different projects.
To resolve the issue, disable network isolation in the kube-system project.
oc adm pod-network make-projects-global kube-system
kube-dns pods are not evicted even though the node is down
In IBM Cloud Private 3.2.0 and 3.2.1, kube-dns is deployed as a DaemonSet. The controller automatically adds the following tolerations to the pods:
tolerations:
- effect: NoSchedule
  key: node.kubernetes.io/memory-pressure
  operator: Exists
- effect: NoExecute
  key: node.kubernetes.io/not-ready
  operator: Exists
- effect: NoExecute
  key: node.kubernetes.io/unreachable
  operator: Exists
- effect: NoSchedule
  key: node.kubernetes.io/disk-pressure
  operator: Exists
- effect: NoSchedule
  key: node.kubernetes.io/unschedulable
  operator: Exists
Because the node taints node.kubernetes.io/not-ready and node.kubernetes.io/unreachable are tolerated, the kube-dns pods are not evicted when the node is down or when the Docker or kubelet service is stopped. As a result, DNS requests are still routed to the kube-dns pods that are not running.
The workaround for this problem is to add tolerationSeconds to the kube-dns DaemonSet:
- effect: NoExecute
  key: node.kubernetes.io/not-ready
  operator: Exists
  tolerationSeconds: 120
- effect: NoExecute
  key: node.kubernetes.io/unreachable
  operator: Exists
  tolerationSeconds: 120
Setting tolerationSeconds to 120 forces the pods to be evicted after 120 seconds.
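The following is a sketch of how you might apply this change; the exact DaemonSet name can vary by architecture, so list the DaemonSets first:
kubectl get ds -n kube-system | grep kube-dns
kubectl edit ds <kube-dns-daemonset-name> -n kube-system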
Catalog
Namespace deletion fails when Catalog is enabled
You cannot delete a namespace in your IBM Cloud Private cluster. Messages resembling the following example indicate that the namespace is Terminating.
# kubectl get ns services-2168 -o yaml
apiVersion: v1
kind: Namespace
...
spec:
finalizers:
- kubernetes
status:
conditions:
- lastTransitionTime: "2020-07-17T03:23:29Z"
message: All resources successfully discovered
reason: ResourcesDiscovered
status: "False"
type: NamespaceDeletionDiscoveryFailure
- lastTransitionTime: "2020-07-17T03:23:29Z"
message: All legacy kube types successfully parsed
reason: ParsedGroupVersions
status: "False"
type: NamespaceDeletionGroupVersionParsingFailure
- lastTransitionTime: "2020-07-17T03:23:29Z"
message: 'Failed to delete all resource types, 5 remaining: object *v1beta1.ServiceBindingList
does not implement the protobuf marshalling interface and cannot be encoded
to a protobuf message, object *v1beta1.ServiceBrokerList does not implement
the protobuf marshalling interface and cannot be encoded to a protobuf message,
object *v1beta1.ServiceClassList does not implement the protobuf marshalling
interface and cannot be encoded to a protobuf message, object *v1beta1.ServiceInstanceList
does not implement the protobuf marshalling interface and cannot be encoded
to a protobuf message, object *v1beta1.ServicePlanList does not implement the
protobuf marshalling interface and cannot be encoded to a protobuf message'
reason: ContentDeletionFailed
status: "True"
type: NamespaceDeletionContentFailure
phase: Terminating
This event occurs when the Catalog is not updated to be consistent with the latest Kubernetes release. Objects that are managed by the Catalog apiserver do not implement the protobuf marshalling interface. For more information, see Kubernetes community issue 86666.
To work around this issue, delete the Catalog API service.
Note: This action impacts you if you use the Catalog or the Cloud Automation Manager (CAM) that is available with IBM Cloud Pak for Multicloud Management. The API service cannot reach the Catalog apiserver.
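The following is a sketch of how you might locate and delete the Catalog API service; the group name matched by the first command is an assumption, so verify the exact API service name before you delete anything:
kubectl get apiservices | grep servicecatalog
kubectl delete apiservice <catalog-apiservice-name>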
Target cluster is required message is displayed when deploying a Helm chart in the Catalog
When you deploy a Helm chart into a namespace shortly after deploying a different Helm chart into the same namespace, you might see a message that reads Target Cluster is required in the Target Cluster field. This is because
the cluster is being refreshed in the background, and you cannot deploy to this namespace while the cluster is being refreshed. After the refresh is complete, you can select a Target Cluster, and the message is removed. This situation does not
prevent you from deploying the chart.
Storage
- Cannot restart node when using vSphere storage that has no replica
- GlusterFS cluster becomes unusable if you configure a vSphere Cloud Provider after installing IBM Cloud Private
- live-crawler pods reach the memory resource limit on Linux on Power (ppc64le)
Cannot restart node when using vSphere storage that has no replica
Shutting down a cluster node in an IBM Cloud Private environment that uses the vSphere Cloud Provider moves the pod to another node in your cluster. However, the vSphere volume that the pod uses on the original node is not detached from the node. An error might occur when you try to restart the node.
To resolve the issue, first detach the volume from the node. Then, restart the node.
GlusterFS cluster becomes unusable if you configure a vSphere Cloud Provider after installing IBM Cloud Private
By default, the kubelet uses the IP address of the node as the node name. When you configure a vSphere Cloud Provider, kubelet uses the host name of the node as the node name. If you had your GlusterFS cluster set up during installation of IBM Cloud Private, Heketi creates a topology by using the IP address of the node.
When you configure a vSphere Cloud Provider after you install IBM Cloud Private, your GlusterFS cluster becomes unusable because the kubelet identifies nodes by their host names, but Heketi still uses IP addresses to identify the nodes.
If you plan to use both GlusterFS and a vSphere Cloud Provider in your IBM Cloud Private cluster, ensure that you set kubelet_nodename: hostname in the config.yaml file during installation.
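For example, the setting is a single line in config.yaml, set before you run the installer:
# config.yaml
kubelet_nodename: hostname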
live-crawler pods reach the memory resource limit on Linux on Power (ppc64le)
vulnerability-advisor-live-crawler pods reach the memory resource limit on Linux on Power (ppc64le).
You can increase the memory resource limit for the vulnerability-advisor-live-crawler DaemonSet. Complete the following steps to increase the memory resource limit:
- Edit the vulnerability-advisor-live-crawler DaemonSet by running the following command:
kubectl edit ds vulnerability-advisor-live-crawler -n kube-system
- Update the memory resource limit to 2Gi. Your vulnerability-advisor-live-crawler DaemonSet might resemble the following configuration:
resources:
  limits:
    cpu: 500m
    memory: 2Gi
Your memory resource limit is increased.
Monitoring and logging
- Elasticsearch type mapping limitations
- Alerting, logging, or monitoring pages displays 500 Internal Server Error
- Monitoring data is not retained if you use a dynamically provisioned volume during upgrade
- Prometheus data source is lost during a rollback of IBM Cloud Private
- Logs not working after logging pods are restarted
- getFreePhysicalMemorySize() is not supported in IBM® Z platforms
- custom-metrics-adapter data does not match the CPU and memory usage for pods
- Prometheus and Alertmanager pod recovery when node crashes
- Rolling back to older versions
- CrashLoopBackOff status for monitoring-prometheus-operator pod in OpenShift environment
- Prometheus and Alertmanager persistent volumes remain in Pending status with GlusterFS
- VolumeMounts error when you upgrade a monitoring chart
Elasticsearch type mapping limitations
The IBM Cloud Private logging component uses Elasticsearch to store and index logs that are received from all the running containers in the cluster. If containers emit logs in JSON format, each field in the JSON is indexed by Elasticsearch to allow queries to use the fields. However, if two containers define the same field while they send different data types, Elasticsearch is not able to index the field correctly. The first type that is received for a field each day sets the accepted type for the rest of the day. This action results in two problems:
- In IBM Cloud Private version 3.1.2 and earlier, log messages with non-matching types are discarded. In IBM Cloud Private version 3.2.0 and later, the log messages are accepted but the non-matching fields are not indexed. If you run a query using
that field, you do not find the non-matching documents. Some scenarios, primarily involving fields that are sometimes objects, can still result in discarded log messages. For more information, see Elasticsearch issue 12366.
- If the type for a field is different over several days, queries from Kibana can result in errors such as 5 of 30 shards failed. To work around this issue, complete the following steps to force Kibana to recognize the type mismatch:
  - From the Kibana navigation menu, click Management.
  - Select Index patterns.
  - Click Refresh field list.
Alerting, logging, or monitoring pages displays 500 Internal Server Error
To resolve this issue, complete the following steps from the master node:
- Create an alias for the kubectl command by running the following command:
alias kc='kubectl -n kube-system'
- Edit the configuration map for Kibana. Run the following command:
kc edit cm kibana-nginx-config
Locate the following section:
upstream kibana {
  server localhost:5602;
}
Change localhost to 127.0.0.1.
- Locate and restart the Kibana pod by running the following commands:
kc get pod | grep -i kibana
kc delete pod <kibana-POD_ID>
- Edit the configuration map for Grafana by running the following command:
kc edit cm grafana-router-nginx-config
Locate the following section:
upstream grafana {
  server localhost:3000;
}
Change localhost to 127.0.0.1.
- Locate and restart the Grafana pod by running the following commands:
kc get pod | grep -i monitoring-grafana
kc delete pod <monitoring-grafana-POD_ID>
- Edit the configuration map for the Alertmanager by running the following command:
kc edit cm alertmanager-router-nginx-config
Locate the following section:
upstream alertmanager {
  server localhost:9093;
}
Change localhost to 127.0.0.1.
- Locate and restart the Alertmanager pod by running the following commands:
kc get pod | grep -i monitoring-prometheus-alertmanager
kc delete pod <monitoring-prometheus-alertmanager-POD_ID>
Monitoring data is not retained if you use a dynamically provisioned volume during upgrade
If you use a dynamically provisioned persistent volume to store monitoring data, the data is lost after you upgrade the monitoring service from 2.1.0.2 to 2.1.0.3.
Prometheus data source is lost during a rollback of IBM Cloud Private
When you roll back from IBM Cloud Private Version 3.2.1 to 3.2.0, the Prometheus data source in Grafana is lost. The Grafana dashboards do not display any metrics.
To resolve the issue, add back the Prometheus data source by completing the steps in the Manually configure a Prometheus data source in Grafana section.
Logs not working after logging pods are restarted
You might encounter the following problems:
- The Kibana web UI shows Elasticsearch health status as red.
- The Elasticsearch client pod log messages indicate that Search Guard is not initialized. Note that the same error repeats every few seconds. The messages resemble the following:
[2018-11-08T20:43:54,380][ERROR][c.f.s.a.BackendRegistry ] Not yet initialized (you may need to run sgadmin)
[2018-11-08T20:43:54,487][ERROR][c.f.s.a.BackendRegistry ] Not yet initialized (you may need to run sgadmin)
[2018-11-08T20:43:54,488][ERROR][c.f.s.a.BackendRegistry ] Not yet initialized (you may need to run sgadmin)
- If Vulnerability Advisor (VA) is installed, an error message appears in your VA logs that resembles the following:
2018-10-31 07:25:12,083 ERROR 229 <module>: Error: TransportError(503, u'Search Guard not initialized (SG11). See https://github.com/floragunncom/search-guard-docs/blob/master/sgadmin.md', None)
To resolve this issue, complete the following steps to run a Search Guard initialization job:
- Save the existing Search Guard initialization job to a file.
kubectl get job.batch/<RELEASE_PREFIX>-elasticsearch-searchguard-init -n kube-system -o yaml > sg-init-job.yaml
Logging in IBM Cloud Private version 3.2.1 changed to remove the job after completion. If you do not have an existing job from which to extract the settings to a file, you can save the following YAML to the sg-init-job.yaml file.
apiVersion: batch/v1
kind: Job
metadata:
  labels:
    app: <RELEASE_PREFIX>-elasticsearch
    chart: ibm-icplogging-2.2.0 # Update to the correct version of logging installed. Current chart version can be found in the Service Catalog
    component: searchguard-init
    heritage: Tiller
    release: logging
  name: <RELEASE_PREFIX>-elasticsearch-searchguard-my-init-job # change this to a unique value
  namespace: kube-system
spec:
  backoffLimit: 6
  completions: 1
  parallelism: 1
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: <RELEASE_PREFIX>-elasticsearch
        chart: ibm-icplogging
        component: searchguard-init
        heritage: Tiller
        release: logging
        role: initialization
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: beta.kubernetes.io/arch
                operator: In
                values:
                - amd64
                - ppc64le
                - s390x
              - key: management
                operator: In
                values:
                - "true"
      containers:
      - env:
        - name: APP_KEYSTORE_PASSWORD
          value: Y2hhbmdlbWU=
        - name: CA_TRUSTSTORE_PASSWORD
          value: Y2hhbmdlbWU=
        - name: ES_INTERNAL_PORT
          value: "9300"
        image: ibmcom/searchguard-init:2.0.1-f2 # This value may be different from the one on your system; double check by running docker images | grep searchguard-init
        imagePullPolicy: IfNotPresent
        name: searchguard-init
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /usr/share/elasticsearch/config/searchguard
          name: searchguard-config
        - mountPath: /usr/share/elasticsearch/config/tls
          name: certs
          readOnly: true
      dnsPolicy: ClusterFirst
      restartPolicy: OnFailure
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
      tolerations:
      - effect: NoSchedule
        key: dedicated
        operator: Exists
      volumes:
      - configMap:
          defaultMode: 420
          name: <RELEASE_PREFIX>-elasticsearch-searchguard-config
        name: searchguard-config
      - name: certs
        secret:
          defaultMode: 420
          secretName: <RELEASE_PREFIX>-certs
Notes:
- Modify your chart version to the version that is installed on your system. You can find the current chart version in the Service Catalog.
- This image might be different from the one on your system: image: ibmcom/searchguard-init:2.0.1-f2. Run the command docker images | grep searchguard-init to confirm that the correct image is installed on your system.
- The <RELEASE_PREFIX> value for managed mode logging instances is different from the value for standard mode logging instances.
  - For managed logging instances that are installed with the IBM Cloud Private installer, the value is logging-elk.
  - For standard logging instances that are installed after IBM Cloud Private installation from either the Service Catalog or by using the Helm CLI, the value is <RELEASE-NAME>-ibm-icplogging. <RELEASE-NAME> is the name that is given to the Helm release when this logging instance is installed.
- Edit the job file.
  - Remove everything under metadata.* except for the following parameters: metadata.name, metadata.namespace, metadata.labels.*
  - Change metadata.name and spec.template.metadata.job-name to new names.
  - Remove spec.selector and spec.template.metadata.labels.controller-uid.
  - Remove status.*
- Save the file.
- Run the job.
kubectl apply -f sg-init-job.yaml
getFreePhysicalMemorySize() is not supported in IBM® Z platforms
Messages that resemble the following appear in Elasticsearch logs on IBM Z platforms:
[2019-07-07T12:39:53,698][WARN ][o.e.d.c.u.ByteSizeValue ] [logging-elk-client-54595cd8c-rfw5s]Values less than -1 bytes are deprecated and will not be supported in the next major version: [-5938528256b]
[2019-07-07T12:39:55,254][WARN ][o.e.d.c.u.ByteSizeValue ] [logging-elk-client-54595cd8c-rfw5s]Values less than -1 bytes are deprecated and will not be supported in the next major version: [-5915230208b]
The messages appear because the Java virtual machine (JVM) on an IBM Z platform reports free memory greater than total memory in Java Management Extensions (JMX). getFreePhysicalMemorySize() is not supported on IBM Z platforms.
custom-metrics-adapter data does not match the CPU and memory usage for pods
The custom-metric-adapter API readings do not match the CPU and memory usage for pods.
Get information about your custom metric by running the following command:
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/custom_metric" | jq .
Compare the result with your CPU and memory usage.
For example, get your CPU usage by running the following command:
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/cpu_usage" | jq .
Your output might resemble the following content:
{
"kind": "MetricValueList",
"apiVersion": "custom.metrics.k8s.io/v1beta1",
"metadata": {
"selfLink": "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/%2A/cpu_usage"
},
"items": [
{
"describedObject": {
"kind": "Pod",
"namespace": "default",
"name": "podinfo-69c8c4b64f-l69lc",
"apiVersion": "/__internal"
},
"metricName": "cpu_usage",
"timestamp": "2019-09-12T14:17:26Z",
"value": "0"
},
{
"describedObject": {
"kind": "Pod",
"namespace": "default",
"name": "podinfo-69c8c4b64f-wf25x",
"apiVersion": "/__internal"
},
"metricName": "cpu_usage",
"timestamp": "2019-09-12T14:17:26Z",
"value": "0"
  }
 ]
}
Prometheus and Alertmanager pod recovery when node crashes
A management node with active Prometheus (prometheus-monitoring-prometheus-0) and Alertmanager (alertmanager-monitoring-prometheus-alertmanager-0) pods might crash, and the pods cannot automatically recover on another management node. To manually trigger the recovery of the pods, you must first forcefully delete them.
kubectl delete pods prometheus-monitoring-prometheus-0 --namespace=kube-system --force --grace-period=0
kubectl delete pods alertmanager-monitoring-prometheus-alertmanager-0 --namespace=kube-system --force --grace-period=0
Rolling back to older versions
If your current monitoring service comes from an older version upgrade (1.5.x or earlier), you can roll back to the original version. However, if a persistent volume is enabled for Prometheus or Alertmanager, you must manually run the following commands as part of the rollback.
If persistent volume is enabled for Prometheus, run the following commands. monitoring-datanode-prometheus is the sample name for the PersistentVolume that is used by Prometheus.
kubectl delete pvc/prometheus-monitoring-prometheus-db-prometheus-monitoring-prometheus-0 &
kubectl patch pvc prometheus-monitoring-prometheus-db-prometheus-monitoring-prometheus-0 -p '{"metadata":{"finalizers": null}}'
kubectl patch pv monitoring-datanode-prometheus --type=json -p='[{"op": "remove", "path": "/spec/claimRef"}]'
If persistent volume is enabled for Alertmanager, run the following commands. monitoring-datanode-alertmanager is the sample name for the PersistentVolume that is used by Alertmanager.
kubectl delete pvc/alertmanager-monitoring-prometheus-alertmanager-db-alertmanager-monitoring-prometheus-alertmanager-0 &
kubectl patch pvc alertmanager-monitoring-prometheus-alertmanager-db-alertmanager-monitoring-prometheus-alertmanager-0 -p '{"metadata":{"finalizers": null}}'
kubectl patch pv monitoring-datanode-alertmanager --type=json -p='[{"op": "remove", "path": "/spec/claimRef"}]'
CrashLoopBackOff status for monitoring-prometheus-operator pod in OpenShift environment
You might find the monitoring-prometheus-operator pod in CrashLoopBackOff status in your OpenShift environment. This situation occurs when persistent volumes are enabled on OpenShift monitoring components such as Prometheus
or Alertmanager.
There are two containers in the pod. If you run into this issue, the prometheus-operator container continues to run. Only the prometheus-operator-controller container crashes. To confirm this outcome, check the prometheus-operator-controller log. Error messages such as invalid memory address or nil pointer dereference appear in your log. For example:
time="2019-09-25T14:31:35Z" level=info msg="Controller.processNextItem: object created detected: openshift-monitoring/main"
time="2019-09-25T14:31:35Z" level=info msg=MonitoringHandler.ObjectCreated
E0925 14:31:35.528032 1 runtime.go:73] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 12 [running]:
github.ibm.com/IBMPrivateCloud/prometheus-operator-controller/vendor/k8s.io/apimachinery/pkg/util/runtime.logPanic(0x113a140, 0x1f20050)
/home/travis/build/IBMPrivateCloud/src/github.ibm.com/IBMPrivateCloud/prometheus-operator-controller/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:69 +0x82
github.ibm.com/IBMPrivateCloud/prometheus-operator-controller/vendor/k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
To bypass this issue, you must disable the persistent volumes in your OpenShift Prometheus k8s, and Alertmanager main resources. For more information, see Configuring cluster monitoring in OpenShift.
Ignore this issue if persistent volumes are not enabled for your IBM Cloud Private monitoring service.
Prometheus and Alertmanager persistent volumes remain in Pending status with GlusterFS
With GlusterFS as the storage provider, the persistent volume claims for Prometheus and Alertmanager stay in PENDING status due to a known GlusterFS problem.
Create new Prometheus and Alertmanager resources to bypass the issue.
- Export the existing Prometheus and Alertmanager resources.
kubectl get prometheus monitoring-prometheus -n kube-system -o yaml > prometheus.yaml
kubectl get alertmanager monitoring-prometheus-alertmanager -n kube-system -o yaml > alertmanager.yaml
- Update the exported .yaml files. Delete creationTimestamp, generation, resourceVersion, selfLink, and uid in metadata.
- Update the values for metadata.name and spec.PodMetadata.labels.app to a new string whose length is no more than 6 characters. For example, in prometheus.yaml, change the values from monitoring-prometheus to prometheus. In alertmanager.yaml, change the values from monitoring-prometheus-alertmanager to alertmanager.
- Update the Prometheus and Alertmanager services. Update the value of spec.selector.prometheus/alertmanager to the new shorter strings:
kubectl edit svc monitoring-prometheus -n kube-system
kubectl edit svc monitoring-prometheus-alertmanager -n kube-system
- Using a new name, re-create the Secret for Alertmanager:
kubectl get secret alertmanager-monitoring-prometheus-alertmanager -n kube-system -o yaml > alert-config.yaml
- Update alert-config.yaml.
  - Delete creationTimestamp, resourceVersion, selfLink, and uid in metadata.
  - Update metadata.name from alertmanager-monitoring-prometheus-alertmanager to alertmanager-{NEW_NAME}. {NEW_NAME} is the shorter name that you want to use for Alertmanager.
kubectl apply -n kube-system -f alert-config.yaml
- Delete the existing Prometheus and Alertmanager resources, and create the new resources.
kubectl delete prometheus monitoring-prometheus -n kube-system
kubectl delete alertmanager monitoring-prometheus-alertmanager -n kube-system
kubectl apply -n kube-system -f prometheus.yaml
kubectl apply -n kube-system -f alertmanager.yaml
VolumeMounts error when you upgrade a monitoring chart
When you upgrade a monitoring chart from the version that is used in IBM Cloud Private Version 3.2.1 or earlier, the Helm upgrade fails with an error that resembles the following message:
Error: UPGRADE FAILED: Failed to recreate resource: Deployment.apps "monitoring-grafana" is invalid: spec.template.spec.containers[0].volumeMounts[5].name: Not found: "monitoring-certs"
To resolve the problem during a cluster upgrade, add the following section to your config.yaml:
upgrade_override:
  monitoring:
    tls:
      enabled: true
Platform management
- Resource quota might not update
- The Key Management Service must deploy to a management node in a Linux® platform
- Synchronizing repositories might not update Helm chart contents
- Helm repository names cannot contain DBCS GB18030 characters
- Container fails to operate or a kernel panic occurs
- Timeouts and blank screens when displaying 80+ namespaces
- Cloning an IBM Cloud Private worker node is not supported
- LDAP search does not automatically show suggestions on keypress
- Pod liveness or readiness check might fail because Docker failed to run some commands in the container
- Pods show CreateContainerConfigError
- Some Pods not starting or log TLS handshake errors in IBM Power environment
- The web-terminal alerts user with goodbye if no namespace permission is assigned
- Permission issue with Docker Version 18.03 with Ubuntu 16.04 LTS
- Pod goes into a CrashLoopBackOff state when a modified subpath configmap mount fails
Resource quota might not update
You might find that the resource quota is not updating in the cluster. This is due to an issue in the kube-controller-manager. The workaround is to stop the kube-controller-manager leader container on the master nodes and let it restart. If high availability is configured for the cluster, you can check the kube-controller-manager log to find the leader. Only the leader kube-controller-manager is working. The other controllers wait to be elected as the new leader once the current leader is down.
For example:
# docker ps | grep hyperkube | grep controller-manager
97bccea493ea 4c7c25836910 "/hyperkube controll…" 7 days ago Up 7 days k8s_controller-manager_k8s-master-9.111.254.104_kube-system_b0fa31e0606015604c409c09a057a55c_2
To stop the leader, run the following command with the ID of the Docker process:
docker rm -f 97bccea493ea
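After you remove the container, it restarts and a new leader is elected. You can confirm the restart by running the same listing command again:
docker ps | grep hyperkube | grep controller-manager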
The Key Management Service must deploy to a management node in a Linux® platform
The Key Management Service is deployed to the management node and is supported only on the Linux® platform. If there is no amd64 management node in the cluster, the Key Management Service is not deployed.
Synchronizing repositories might not update Helm chart contents
Synchronizing repositories takes several minutes to complete. While synchronization is in progress, there might be an error if you try to display the readme file. After synchronization completes, you can view the readme file and deploy the chart.
Helm repository names cannot contain DBCS GB18030 characters
Do not use DBCS GB18030 characters in the Helm repository name when you add the repository.
Container fails to operate or a kernel panic occurs
The following error might occur from the IBM Cloud Private node console or kernel log:
kernel:unregister_netdevice: waiting for <eth0> to become free.
If you receive this error, the kernel log displays kernel:unregister_netdevice: waiting for <eth0> to become free, and containers fail to operate. Continue to troubleshoot. If you meet all required conditions, reboot the node.
View https://github.com/kubernetes/kubernetes/issues/64743 to learn about the Linux kernel bug that causes the error.
Timeouts and blank screens when displaying more than 80 namespaces
If a cluster has a large number of namespaces (more than 80), you might see the following issues:
- The namespace overview page might time out and display a blank screen.
- The Chart deployment configuration page might time out and not load all the namespaces in the drop-down list. Only the default namespace is shown for the deployment.
Cloning an IBM Cloud Private worker node is not supported
IBM Cloud Private does not support cloning an existing IBM Cloud Private worker node. You cannot change the host name and IP address of a node on your existing cluster.
You must add a new worker node. For more information, see Adding an IBM Cloud Private cluster node.
LDAP search does not automatically show suggestions on keypress
When you add users or user groups to your team, you can search for individual users and groups. As you type into the LDAP search bar, suggestions that are associated with the search query do not automatically appear. You must press the enter key to obtain results from the LDAP server. For more information, see Create teams.
Pod liveness or readiness check might fail because Docker failed to run some commands in the container
After you upgrade from IBM Cloud Private version 3.1.0, 3.1.1, or 3.1.2 to version 3.2.1, the readiness or liveness checks for some pods might fail. This failure can also happen when you deploy a workload in the cluster, or when you shut down or restart the management node in the cluster. This issue might occur on Prometheus pods, Grafana pods, or other pods.
Depending on factors like network stability, the readiness and liveness probes can take longer to start than the time allowed before a readiness request is sent. If they are not started when a request is sent, then they don't return a ready status.
The returned status request looks similar to the following example:
# kubectl get pods -o wide --all-namespaces |grep monitor
kube-system monitoring-grafana-59bfb7859b-f9zrd 2/3 Running 0 43m 10.1.1.1 9.1.1.1 <none> <none>
kube-system monitoring-prometheus-75b7444496-zzl7b 3/4 Running 0 43m 10.1.1.1 9.1.1.1 <none> <none>
Check the event log for the pod to see whether there are entries that are similar to the following content:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning Unhealthy 2m29s (x23 over 135m) kubelet, 9.1.1.1 Readiness probe errored: rpc error: code = DeadlineExceeded desc = context deadline exceeded
Warning Unhealthy 2m13s (x23 over 135m) kubelet, 9.1.1.1 Liveness probe errored: rpc error: code = DeadlineExceeded desc = context deadline exceeded
To work around this issue, remove the failed pod and let it deploy again. You can also restart the Docker service on the cluster.
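The following is a minimal sketch of both options, with placeholder names:
kubectl -n <namespace> delete pod <failing-pod-name>
# Or, on the node where the pod is scheduled:
systemctl restart docker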
Pods show CreateContainerConfigError
After you install IBM Cloud Private, the following pods show CreateContainerConfigError error:
# kubectl get pods -o wide --all-namespaces |grep -v "Running" |grep -v "Completed"
NAMESPACE NAME READY STATUS
kube-system logging-elk-kibana-init-6z95k 0/1 CreateContainerConfigError
kube-system metering-dm-79d6f5894d-q2qpm 0/1 Init:CreateContainerConfigError
kube-system metering-reader-4tzgz 0/1 Init:CreateContainerConfigError
kube-system metering-reader-5hjvm 0/1 Init:CreateContainerConfigError
kube-system metering-reader-gsm44 0/1 Init:CreateContainerConfigError
kube-system metering-ui-7dd45b4b6c-th2pg 0/1 Init:CreateContainerConfigError
kube-system secret-watcher-6bd4675db7-mcb64 0/1 CreateContainerConfigError
kube-system security-onboarding-262cp 0/1 CreateContainerConfigError
The issue occurs when the pods are unable to create the IAM API key secret.
To resolve the issue, restart the iam-onboarding pod.
Complete the following steps:
- Install kubectl. For more information, see Installing the Kubernetes CLI (kubectl).
- Get the iam-onboarding pod ID and make a note of the pod ID.
kubectl -n kube-system get pods -o wide | grep iam-onboarding
- Delete the iam-onboarding pod.
kubectl -n kube-system delete pod <iam-onboarding-pod-id>
- Wait for two minutes and check the pod status.
kubectl -n kube-system get pods -o wide | grep iam-onboarding
The pod status shows as Running.
Some Pods not starting or log TLS handshake errors in IBM Power environment
In some cases when you are using IP-IP tunneling in an IBM Power environment, some of your Pods do not start or contain log entries that indicate TLS handshake errors.
If you notice either of these issues, complete the following steps to resolve the issue.
-
Run the
ifconfigcommand or thenetstatcommand to view the statistics of the tunnel device. The tunnel device is often named tunl0. -
Note the changes in the TX dropped count that is displayed when you run the
ifconfigcommand or thenetstatcommand.If you use the
netstatcommand, enter a command similar to the following command:netstat --interface=tunl0The output should be similar to the following content:
Kernel Interface table Iface MTU RX-OK RX-ERR RX-DRP RX-OVR TX-OK TX-ERR TX-DRP TX-OVR Flg tunl0 1300 904416 0 0 0 714067 0 806 0 ORUIf you use the
ifconfigcommand, run a command similar to the following command:ifconfig tunl0The output should be similar to the following content:
tunl0: flags=193<UP,RUNNING,NOARP> mtu 1300 inet 10.1.125.192 netmask 255.255.255.255 tunnel txqueuelen 1000 (IPIP Tunnel) RX packets 904377 bytes 796710714 (759.8 MiB) RX errors 0 dropped 0 overruns 0 frame 0 TX packets 714034 bytes 125963495 (120.1 MiB) TX errors 0 dropped 806 overruns 0 carrier 0 collisions 0 -
Run the command again and note the change in the TX dropped count that is displayed when you run the
ifconfigcommand, or in the TX-DRP count that is displayed when you run thenetstatcommand.If the value is continuously increasing, there is an MTU issue. To resolve it, lower the MTU settings of the tunnel and Pod interfaces, based on your network characteristics.
- Complete the following steps to change the Calico IP-IP tunnel MTU after it is deployed:
  - Update the settings for veth_mtu and tunnel_mtu by running the following command:
    kubectl edit cm calico-config -n kube-system
    Change the settings to the following values:
    tunnel_mtu: "1400"
    veth_mtu: "1400"
  - Restart the calico-node pods for the changes to take effect by entering the following command:
    kubectl patch ds calico-node -n kube-system -p '{"spec":{"template":{"spec":{"containers":[{"name":"calico-node","env":[{"name":"RESTART_","value":"'$(date +%s)'"}]}]}}}}'
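After the calico-node pods restart, you can verify that the new MTU is in effect and that the dropped-packet count has stopped increasing. A minimal sketch, assuming the tunnel device is named tunl0 and that the calico-node pods carry the standard k8s-app=calico-node label:
# Confirm that the calico-node pods restarted and are running
kubectl get pods -n kube-system -l k8s-app=calico-node
# Confirm the tunnel MTU and recheck the TX dropped count
ifconfig tunl0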
The web terminal alerts the user with goodbye if no permission is assigned
The web terminal does not work for users who do not have permission to at least one namespace. The result is early termination with goodbye displayed. You need access to a namespace to access the web terminal. Contact a cluster administrator for access.
Permission issue with Docker version 18.03 on Ubuntu 16.04 LTS
If you use Docker version 18.03 or later with Ubuntu 16.04 LTS, containers that run as non-root users might have permission issues. This issue appears to be caused by a problem between the overlay storage driver and the kernel.
Pod goes into a CrashLoopBackOff state when a modified subpath configmap mount fails
A pod goes into a CrashLoopBackOff state during the restart of the Docker service on a worker node. If you run the kubectl get pods command to check the pod that is in the CrashLoopBackOff state, you get the following error message:
level=error msg="Handler for POST /v1.31/containers/4a46aa25deac4af4bf60813d2c763e54499c0d12b9cd28b3d1990843e1e6c3d5/start returned error: OCI runtime create failed: container_linux.go:348: starting container process caused \"process_linux.go:402: container init caused \\\"rootfs_linux.go:58: mounting \\\\\\\"/var/lib/kubelet/pods/78db74ec-2a7b-11ea-8ada-72f600a81a05/volume-subpaths/app-keystore/ich-mobilebanking-secured/6\\\\\\\" to rootfs \\\\\\\"/var/lib/docker/overlay2/31f618303ba3762398bbee05e51657c0f62f096d4cc9f600fb3ec18f047f94ba/merged\\\\\\\" at \\\\\\\"/var/lib/docker/overlay2/31f618303ba3762398bbee05e51657c0f62f096d4cc9f600fb3ec18f047f94ba/merged/opt/ibm/wlp/usr/servers/defaultServer/resources/security/app-truststore.jks\\\\\\\" caused \\\\\\\"no such file or directory\\\\\\\"\\\"\": unknown"
A pod in the CrashLoopBackOff state that has a subPath cannot be recovered if its volume contents are changed.
To recover the pod, delete the pod that has the CrashLoopBackOff error by using the following commands. When you delete the pod, it is re-created in a good state.
- Get information about the pods that are in the CrashLoopBackOff state.
  kubectl get pods -n <namespace> | grep CrashLoopBackOff
- Delete the pod that has the CrashLoopBackOff error.
  kubectl delete pod <pod_name> -n <namespace>
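After you delete the pod, you can confirm that the replacement pod starts correctly. For example, assuming the re-created pod keeps the same name prefix:
kubectl get pods -n <namespace> | grep <pod_name_prefix>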
Cluster management
- Parameter kube-host for cloudctl mc cluster import does not work
- 'Delete cluster' button does not work for IBM Cloud Private 3.2.x fix pack versions on Linux on Power (ppc64le)
Parameter kube-host for cloudctl mc cluster import does not work
For the current version with no fix pack, there is no workaround.
'Delete cluster' button does not work for IBM Cloud Private 3.2.x fix pack versions on Linux on Power (ppc64le)
After you install or upgrade to fix pack 3.2.1.2003 or a newer 3.2.1.x fix pack, or to fix pack 3.2.2.2008, in an IBM Cloud Private environment on Linux on Power (ppc64le) that includes IBM Multicloud Manager, the 'Delete cluster' button on the Clusters dashboard in the management console does not work for deleting a cluster.
As an alternative, you can delete a cluster by using the cloudctl and kubectl CLI tools.
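A minimal sketch of deleting a managed cluster from the CLI, under the assumption that the cluster is registered as a Cluster resource in its own namespace on the hub cluster; the address, cluster name, and namespace are placeholders, and the exact set of resources to remove depends on your IBM Multicloud Manager configuration:
# Log in to the hub cluster
cloudctl login -a https://<master_ip>:8443 --skip-ssl-validation
# List the registered clusters, then delete the cluster that you want to remove
kubectl get clusters --all-namespaces
kubectl delete cluster <cluster_name> -n <cluster_namespace>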
Management console
- The management console displays 502 Bad Gateway Error
- Truncated labels are displayed on the dashboard for some languages
- Visual Web Terminal is not working in the Microsoft Edge browser
- If a cluster contains a large number of namespaces, a cluster administrator might not be able to view certain resources
- Visual Web Terminal might not display a large number of returned items
The management console displays 502 Bad Gateway Error
The management console displays a 502 Bad Gateway Error after installing or rebooting the master node.
If you recently installed IBM Cloud Private, wait a few minutes and reload the page.
If you rebooted the master node, take the following steps:
- Obtain the IP addresses of the icp-ds pods. From the master node, run the following command:
  kubectl get pods -o wide -n kube-system | grep "icp-ds"
  The output resembles the following text:
  icp-ds-0    1/1    Running    0    1d    10.1.231.171    10.10.25.134
  In this example, 10.1.231.171 is the IP address of the pod.
  In high availability (HA) environments, an icp-ds pod exists for each master node.
- From the master node, ping the icp-ds pods. Check each icp-ds pod by running the following command for each IP address:
  ping 10.1.231.171
  If the output resembles the following text, you must delete the pod:
  connect: Invalid argument
- From the master node, delete each pod that is unresponsive by running the following command:
  kubectl delete pods icp-ds-0 -n kube-system
  In this example, icp-ds-0 is the name of the unresponsive pod.
  Important: In HA installations, you might have to delete the pod for each master node.
- From the master node, obtain the IP address of the replacement pod or pods by running the following command:
  kubectl get pods -o wide -n kube-system | grep "icp-ds"
  The output resembles the following text:
  icp-ds-0    1/1    Running    0    1d    10.1.231.172    10.10.2
- From the master node, ping the pods again by running the following command for each icp-ds pod IP address:
  ping 10.1.231.172
  If all icp-ds pods are responsive, you can access the IBM Cloud Private management console when the pods enter the available state.
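To confirm from the command line that the console is reachable again, you can check the HTTP response code of the console endpoint. This is a minimal check that assumes the default console port 8443; substitute the address of your master node or cluster virtual IP:
# A 200 or 302 response code indicates that the console is serving requests again
curl -k -s -o /dev/null -w "%{http_code}\n" https://<master_ip>:8443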
Truncated labels are displayed on the dashboard for some languages
If you access the IBM Cloud Private dashboard in languages other than English from the Mozilla Firefox browser on a system that uses a Windows™ operating system, some labels might be truncated.
If a cluster contains a large number of namespaces, a cluster administrator might not be able to view certain resources
Accessing management console pages that include all namespaces in the dropdown menu might cause an error. For example, the Daemonsets, Deployments, Resource quotas, and similar pages might fail because an increased number of namespaces, such as 100 or 200, is tied to an increased number of resources. To view these resources, use the Search feature or run kubectl.
For example, to use Search as a workaround to view resource quotas for all namespaces, search for kind:resourcequota. Search is also available for other resources, such as kind:deployment and kind:daemonset.
To use kubectl, run commands similar to the following:
kubectl get quota --all-namespaces
kubectl get resourcequota --all-namespaces
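The same approach works for the other affected resource types. For example:
kubectl get deployments --all-namespaces
kubectl get daemonsets --all-namespaces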
Visual Web Terminal might not display a large number of returned items
The limit for the returned content of a command that is issued with the Visual Web Terminal is 200 KB. If the returned information exceeds 200 KB, an error is displayed. The workaround is to enter the command using a terminal window that is outside of the Visual Web Terminal.
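For example, to capture a large amount of output, you can log in from an external terminal and redirect the result to a file. The cluster address is a placeholder, and writing to all-pods.yaml is only an illustration:
cloudctl login -a https://<cluster_address>:8443 --skip-ssl-validation
kubectl get pods --all-namespaces -o yaml > all-pods.yaml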
IBM Cloud Private CLI (cloudctl)
IAM resource that was added from the CLI is overwritten by the management console
If you update a team resource that has a Helm release resource assigned to it from both the command-line interface (CLI) and the management console, the resource is unassigned. If you manage Helm release resources, add the resource from the CLI. If you manage Helm release resources from the management console, you might notice that a Helm release resource is incorrectly listed as a Namespace. For more information, see Managing Helm releases.
Manage your Helm release resource from the CLI for the most accurate team resource information. For more information, see Working with charts.
Known limitations of IBM Cloud Private on Linux on IBM Z and LinuxONE
IBM Cloud Private on Linux on IBM Z and LinuxONE has the following limitations:
- A mixed architecture, such as a master node on IBM Z or LinuxONE and worker or proxy nodes on Linux or Linux® on Power® (ppc64le), cannot be used in a production environment because it is a technology preview feature.
- For a list of supported platforms and features, see Supported operating systems and platforms.
- For the hardware requirements for Linux on IBM Z and LinuxONE environment, see Linux on IBM Z and LinuxONE environment requirements.
Cannot deploy more than one instance of IBM Cloud Private Certificate Manager (cert-manager)
Only one instance of cert-manager is deployed by default. If more than one instance is deployed (more than one pod, for example), remove the extra instances; otherwise, cert-manager cannot function properly.
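A minimal sketch of locating and removing an extra cert-manager instance, assuming the duplicate was installed as a separate Helm release; the release name is a placeholder, and you should keep the default instance that was deployed with the cluster:
# Find cert-manager pods across all namespaces to spot duplicates
kubectl get pods --all-namespaces | grep cert-manager
# Identify the extra Helm release and remove it (Helm v2 with TLS, as used by IBM Cloud Private)
helm list --tls | grep cert-manager
helm delete --purge <extra_cert_manager_release> --tls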