Known issues and limitations
Review the known issues for version 3.1.1.
- Logs older than what is specified in log retention policy are recreated if Filebeat is restarted
- Elasticsearch type mapping limitations
- Metering data shows large numbers for available and capped cores
- Kubernetes API Server vulnerability
- Resource quota might not update
- Container fails to start due to Docker issue
- The Key Management Service must deploy to a management node in a Linux® x86_64 platform
- Dynamic configuration limitation on Linux® on Power® (ppc64le) and IBM® Z nodes
- Cookie affinity does not work when FIPS is enabled
- Sticky sessions must be manually set on Linux® on IBM® Z and LinuxONE and Linux® on Power® (ppc64le)
- The Grafana UI cannot be opened after upgrading the monitoring service release version
- Tiller 2.7.2 does not support the upgrade or install of Kubernetes 1.8 resources
- Alerting, logging, or monitoring pages display 500 Internal Server Error
- IPv6 is not supported
- Cannot log in to the management console with an LDAP user after restarting the leading master
- Calico prefix limitation on Linux® on Power® (ppc64le) nodes
- StatefulSets remain in Terminating state after a worker node shuts down
- Synchronizing repositories might not update Helm chart contents
- Some features are not available from the new management console
- Containers fail to start or a kernel panic occurs
- The management console displays 502 Bad Gateway Error
- Enable Ingress Controller to use a new annotation prefix
- Monitoring data is not retained if you use a dynamically provisioned volume during upgrade
- Cannot restart node when using vSphere storage
- Truncated labels are displayed on the dashboard for some languages
- Helm repository names cannot contain DBCS GB18030 characters
- GlusterFS cluster becomes unusable if you configure a vSphere Cloud Provider after installing IBM Cloud Private
- A failed upgrade or rollback of IBM Cloud Private creates two release entries with different statuses
- Prometheus data source is lost during a rollback of IBM Cloud Private
- Vulnerability Advisor cross-architecture image scanning does not work with glibc version earlier than 2.22
- Container fails to operate or a kernel panic occurs
- Intermittent failure when you log in to the management console in HA clusters that use NSX-T 2.3
- Vulnerability Advisor policy resets to the default setting after you upgrade from 3.1.0 in ppc64le cluster
- Containers can crash when running IBM Cloud Private on KVM on Power guests.
- Linux kernel memory leak
- In an NSX-T environment, when you restart a master node, the management console becomes inaccessible.
- Logging ELK pods are in CrashLoopBackOff state
- Logs not working after logging pods are restarted
- Deploying contents in the default namespace might not be secure
- Timeouts and blank screens when displaying 80+ namespaces
- Required to create cluster roles for custom pod security policies before associating them to new namespaces
- Encrypting cluster data network traffic with IPsec does not work on SLES 12 SP3 operating system
- A namespace is stuck in the Terminating state
- Cloning an IBM Cloud Private worker node is not supported
- Some Pods not starting or log TLS handshake errors
- Installation can fail with a helm-api setup error
- Certain cloudctl cm commands may not work accurately. Use kubectl instead
- Cannot get secret by using kubectl command when encryption of secret data at rest is enabled
- Vulnerability Advisor cannot scan unsupported container images
Logs older than what is specified in log retention policy are recreated if Filebeat is restarted
A curator background job is deployed as part of the IBM Cloud Private logging service. To free disk space, the job runs once a day to remove old log data based on your retention settings.
If Filebeat pods are restarted, Filebeat finds all existing log files and reprocesses and reingests them, including log entries that are older than what is specified by the log retention policy. This behavior can cause older logs to be reindexed into Elasticsearch and to appear in the logging console until the curator job runs again. If this is a problem, you can manually delete indices that are older than your retention settings. For more information, see Manually removing log indices.
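If you need to remove reindexed data before the curator job runs again, the following minimal sketch shows one way to delete a single day's index. It assumes the default logstash-YYYY.MM.DD index naming and that you run the command from inside the Elasticsearch client pod; the pod name, container name, index date, and endpoint are illustrative and depend on your logging configuration.

```
# Hedged sketch: delete one day's index that is older than the retention window.
# The pod name, container name, index date, and endpoint are illustrative.
kubectl -n kube-system exec -it <logging-elk-client-pod> -c es-client -- \
  curl -s -XDELETE "http://localhost:9200/logstash-2018.11.01"
```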
Elasticsearch type mapping limitations
The IBM Cloud Private logging component uses Elasticsearch to store and index logs that are received from all the running containers in the cluster. If containers emit logs in JSON format, each field in the JSON is indexed by Elasticsearch so that queries can use the fields. However, if two containers define the same field but send different data types, Elasticsearch cannot index the field consistently. The first type that is received for a field each day sets the accepted type for the rest of the day. For example, if one container logs a numeric status field (status: 200) and another container later logs the same field as a string (status: "OK"), only values that match the first type are indexed for that day. This behavior results in two problems:
- In IBM Cloud Private version 3.1.2 and earlier, log messages with non-matching types are discarded. In IBM Cloud Private version 3.2.0 and later, the log messages are accepted but the non-matching fields are not indexed. If you run a query that uses that field, you do not find the non-matching documents. Some scenarios, primarily involving fields that are sometimes objects, can still result in discarded log messages. For more information, see Elasticsearch issue 12366.
- If the type for a field is different over several days, queries from Kibana can result in errors such as 5 of 30 shards failed. To work around this issue, complete the following steps to force Kibana to recognize the type mismatch:
  1. From the Kibana navigation menu, click Management.
  2. Select Index patterns.
  3. Click Refresh field list.
Metering data shows large numbers for available and capped cores
The metering service can report a large number when the number of available cores that are allocated to a node is not a whole-core increment. In some clusters, you might decide to reserve some number of CPU cores for system processes. This reservation can leave partial cores available for workloads on a node. When this happens, the metering service reports the available and capped processor values inflated by a factor of 1000. For instance, 7.2 cores are metered as 7200 cores. To work around this issue, reserve whole-core increments. Alternatively, contact IBM support.
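To check whether a node is affected, you can inspect its allocatable CPU. This is a minimal sketch; the node name is illustrative. A millicore value such as 7200m (7.2 cores) indicates the partial-core allocation that triggers the inflated metering numbers.

```
# Hedged check: print the allocatable CPU for a node.
# A millicore value such as "7200m" indicates a partial-core allocation.
kubectl get node <node_name> -o jsonpath='{.status.allocatable.cpu}'
```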
Kubernetes API Server vulnerability
IBM Cloud Private has a patch (icp-3.1.1-build508531) on IBM® Fix Central to address the Kubernetes security vulnerability, where the proxy request handling in the Kubernetes API Server can leave vulnerable TCP connections. For full details, see the Kubernetes kube-apiserver vulnerability issue. After you apply the patch, you do not need to redeploy either IBM Cloud Private or your Helm releases. You must reapply the patch if you replace your master node.
Resource quota might not update
You might find that the resource quota is not updating in the cluster. This problem is due to an issue in the kube-controller-manager. The workaround is to stop the kube-controller-manager leader container on the master nodes and let it restart. If high availability is configured for the cluster, you can check the kube-controller-manager logs to find the leader, as shown in the sketch after the following example. Only the leader kube-controller-manager does the work; the other controllers wait to be elected as the new leader when the current leader goes down.
For example:
# docker ps | grep hyperkube | grep controller-manager
97bccea493ea 4c7c25836910 "/hyperkube controll…" 7 days ago Up 7 days k8s_controller-manager_k8s-master-9.111.254.104_kube-system_b0fa31e0606015604c409c09a057a55c_2
To stop the leader, run the following command with the ID of the Docker process:
docker rm -f 97bccea493ea
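If you are not sure which instance is the leader, you can check each controller-manager container's logs for leader-election activity before you stop it. This is a minimal sketch; the container ID is illustrative and comes from the docker ps output shown above.

```
# Hedged sketch: list the controller-manager containers, then check a container's
# logs for leader-election messages. The leader logs that it acquired the lease.
docker ps | grep hyperkube | grep controller-manager
docker logs <container_id> 2>&1 | grep -i leader
```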
Container fails to start due to Docker issue
Installation fails during container creation due to a Docker 18.03.1 issue. If you have a subPath in the volume mount, you might receive the following error from the kubelet service, which fails to start the container:
Error: failed to start container "heketi": Error response from daemon: OCI runtime create failed: container_linux.go:348: starting container process caused "process_linux.go:402: container init caused \"rootfs_linux.go:58: mounting \\\"/var/lib/kubelet/pods/7e9cb34c-b2bf-11e8-a9eb-0050569bdc9f/volume-subpaths/heketi-db-secret/heketi/0\\\" to rootfs \\\"/var/lib/docker/overlay2/ca0a54812c6f5718559cc401d9b73fb7ebe43b2055a175ee03cdffaffada2585/merged\\\" at \\\"/var/lib/docker/overlay2/ca0a54812c6f5718559cc401d9b73fb7ebe43b2055a175ee03cdffaffada2585/merged/backupdb/heketi.db.gz\\\" caused \\\"no such file or directory\\\"\"": unknown
For more information, see the Kubernetes documentation.
To resolve this issue, delete the failed pod and try the installation again.
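A minimal sketch of that recovery step follows; the namespace and pod name are illustrative and should match the pod that reported the error.

```
# Hedged sketch: delete the failed pod so that it is re-created, then retry the installation.
kubectl -n kube-system delete pod <failed-pod-name>
```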
The Key Management Service must deploy to a management node in a Linux® x86_64 platform
The Key Management Service is deployed to the management node and is supported only on the Linux® x86_64 platform. If there is no amd64 management node in the cluster, the Key Management Service is not deployed.
Dynamic configuration limitation on Linux® on Power® (ppc64le) and IBM® Z nodes
For IBM Cloud Private version 3.1.0, the NGINX ingress controller is upgraded to version 0.16.2. Because LuaJIT is not available on IBM® Z (s390x) and Linux® on Power® (ppc64le) architectures, the NGINX controller disables the dynamic configuration features during startup. For more details, see https://github.com/kubernetes/ingress-nginx/blob/nginx-0.16.2/cmd/nginx/flags.go#L217-L224.
Cookie affinity does not work when FIPS is enabled
When Federal Information Processing Standard (FIPS) is enabled, cookie affinity does not work because nginx.ingress.kubernetes.io/session-cookie-hash can be set only to sha1/md5/index, which is not supported in FIPS mode.
Sticky sessions must be manually set on Linux® on IBM® Z and LinuxONE and Linux® on Power® (ppc64le)
Because LuaJIT is unavailable, session affinity is handled by the nginx-sticky-module-ng module. You must enable sticky sessions manually. For more information, see Cannot set the cookie for sticky sessions.
The Grafana UI cannot be opened after upgrading the monitoring service release version
If a persistent volume is enabled for the monitoring service and the Grafana password is not set during the monitoring service installation, the Grafana dashboard is not accessible after you upgrade to a later version. You see the error message {"message":"Invalid username or password"} when you try to access the Grafana dashboard. You can resolve this problem either before you upgrade the monitoring release or after you upgrade.
Pre-upgrade steps

1. Obtain the Grafana password from monitoring-grafana-secret:
   export PASSWORD=$(kubectl get -n kube-system secret/monitoring-grafana-secret -o yaml|grep password|awk -F': ' '{print $2}'|base64 -d)
2. Set the Grafana password during upgrade. If you upgrade by using the IBM Cloud Private dashboard, set the Grafana password in the configuration page during upgrade. If you upgrade by using the Helm CLI, follow these steps:
   1. Obtain the values.yaml of the existing release:
      helm get values --tls monitoring >> values.yaml
   2. Insert the Grafana password from Step 1 into the values.yaml:
      grafana:
        password: PASSWORD
   3. Run the Helm upgrade command:
      helm upgrade --tls monitoring -f values.yaml ibm-icpmonitoring-1.3.0.tar.gz

Post-upgrade steps

1. Obtain the Grafana password from monitoring-grafana-secret:
   export PASSWORD=$(kubectl get -n kube-system secret/monitoring-grafana-secret -o yaml|grep password|awk -F': ' '{print $2}'|base64 -d)
   Note: The -d option for the base64 command is used to decode the encoded password. The option might vary on different operating systems. For example, in macOS, the option is -D.
2. Obtain the Grafana pod name:
   export GRAFANA_POD=$(kubectl get pod -n kube-system|grep grafana|grep Running|awk '{split($0, a, " "); print a[1]}')
3. Reset the password for the Grafana container:
   kubectl exec -n kube-system $GRAFANA_POD -c grafana -it -- grafana-cli admin reset-admin-password --homepath "/usr/share/grafana" $PASSWORD
Note: If you resolve this problem post-upgrade, the problem might occur again after rollback. You must follow the same post-upgrade steps to fix it.
Tiller 2.7.2 does not support the upgrade or install of Kubernetes 1.8 resources
Tiller version 2.7.2 is installed with IBM Cloud Private version 3.1.1. Tiller 2.7.2 uses Kubernetes API version 1.7. You cannot install or upgrade Helm charts that use only Kubernetes version 1.8 resources.
You might encounter a Helm release upgrade error. The error message will resemble the following content:
Error: UPGRADE FAILED: failed to create patch: unable to find api field in struct Unstructured for the json field "spec"
If you encounter this error message, you must delete the release and install a new version of the chart.
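A minimal sketch of that recovery path follows, using Helm 2 commands with the --tls flag that IBM Cloud Private requires; the release name, chart, and namespace are illustrative.

```
# Hedged sketch: delete the failed release, then install the newer chart version again.
helm delete --purge <release_name> --tls
helm install <chart_name> --name <release_name> --namespace <namespace> --tls
```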
Alerting, logging, or monitoring pages display 500 Internal Server Error
To resolve this issue, complete the following steps from the master node:
1. Create an alias for the insecure kubectl API login by running the following command:
   alias kc='kubectl -n kube-system'
2. Edit the configuration map for Kibana. Run the following command:
   kc edit cm kibana-nginx-config
   Add the following updates:
   upstream kibana {
     server localhost:5602;
   }
   Change localhost to 127.0.0.1.
3. Locate and restart the Kibana pod by running the following commands:
   kc get pod | grep -i kibana
   kc delete pod <kibana-POD_ID>
4. Edit the configuration map for Grafana by running the following command:
   kc edit cm grafana-router-nginx-config
   Add the following updates:
   upstream grafana {
     server localhost:3000;
   }
   Change localhost to 127.0.0.1.
5. Locate and restart the Grafana pod by running the following commands:
   kc get pod | grep -i monitoring-grafana
   kc delete pod <monitoring-grafana-POD_ID>
6. Edit the configuration map for the Alertmanager by running the following command:
   kc edit cm alertmanager-router-nginx-config
   Add the following updates:
   upstream alertmanager {
     server localhost:9093;
   }
   Change localhost to 127.0.0.1.
7. Locate and restart the Alertmanager pod by running the following commands:
   kc get pod | grep -i monitoring-prometheus-alertmanager
   kc delete pod <monitoring-prometheus-alertmanager-POD_ID>
IPv6 is not supported
IBM Cloud Private cannot use IPv6 networks. To remove the IPv6 settings, comment out the IPv6 entries in the /etc/hosts file on each cluster node. For more information, see Configuring your cluster.
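For reference, a hedged example of what a cluster node's /etc/hosts file can look like after the IPv6 entries are commented out; the exact entries vary by distribution.

```
127.0.0.1   localhost
# ::1         localhost ip6-localhost ip6-loopback
# ff02::1     ip6-allnodes
# ff02::2     ip6-allrouters
```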
Cannot log in to the management console with an LDAP user after restarting the leading master
If you cannot log in to the management console after you restart the leading master node in a high availability cluster, take the following actions:
1. Log in to the management console with the cluster administrator credentials. The user name is admin, and the password is admin.
2. Click Menu > Manage > Identity & Access.
3. Click Edit and then click Save.

Note: LDAP users can now log in to the management console.
If the problem persists, MongoDB, MariaDB, and the auth-idp pods that depend on them might not be running. Follow these instructions to identify the cause.
1. Check whether the MongoDB and MariaDB pods are running without any errors.
   1. Use the following command to check the pod status. All pods must show the status as 1/1 Running. Check the logs, if required.
      kubectl -n kube-system get pods | grep -e mariadb -e mongodb
   2. If the pods do not show the status as 1/1 Running, restart all the pods by deleting them.
      kubectl -n kube-system delete pod -l k8s-app=mariadb
      kubectl -n kube-system delete pod -l app=icp-mongodb
      Wait for a minute or two for the pods to restart. Check the pod status by using the following command. The status must show 1/1 Running.
      kubectl -n kube-system get pods | grep -e mariadb -e mongodb
2. After the MongoDB and MariaDB pods are running, restart the auth-idp pods by deleting them.
   kubectl -n kube-system delete pod -l k8s-app=auth-idp
   Wait for a minute or two for the pods to restart. Check the pod status by using the following command. The status must show 4/4 Running.
   kubectl -n kube-system get pods | grep auth-idp
Calico prefix limitation on Linux® on Power® (ppc64le) nodes
If you install IBM Cloud Private on PowerVM Linux LPARs and your virtual Ethernet devices use the ibmveth prefix, you must set the network adapter to use Calico networking. During installation, be sure to set a calico_ip_autodetection_method parameter value in the config.yaml file. The setting resembles the following content:
calico_ip_autodetection_method: interface=<device_name>
The <device_name> parameter is the name of your network adapter. You must specify the ibmveth0 interface on each node of the cluster, including the worker nodes.
Note: If you used PowerVC to deploy your cluster node, this issue does not affect you.
StatefulSets remain in Terminating state after a worker node shuts down
If the node where a StatefulSet pod is running shuts down, the pod enters a Terminating state. You must manually delete the pod that is stuck in the Terminating state to force it to be re-created on another node.
To delete the pod, run the following command:
kubectl -n <namespace> delete pods --grace-period=0 --force <pod_name>
For more information about Kubernetes pod safety management, see Pod Safety, Consistency Guarantees, and Storage Implications in the Kubernetes community feature specs.
Synchronizing repositories might not update Helm chart contents
Synchronizing repositories takes several minutes to complete. While synchronization is in progress, there might be an error if you try to display the readme file. After synchronization completes, you can view the readme file and deploy the chart.
Some features are not available from the new management console
IBM Cloud Private 3.1.1 supports the new management console only. Some options from the previous console are not yet available. To access those functions, you must use the kubectl CLI.
Containers fail to start or a kernel panic occurs
For Red Hat Enterprise Linux (RHEL) only: Containers fail to start or a kernel panic occurs, and a no space left on device error message is displayed. This issue is a known Docker engine issue that is caused by leaking cgroups. For more information about this issue, see https://github.com/moby/moby/issues/29638 and https://github.com/kubernetes/kubernetes/issues/61937.
To fix this issue, you must restart the host.
The management console displays 502 Bad Gateway Error
The management console displays a 502 Bad Gateway Error after installing or rebooting the master node.
If you recently installed IBM Cloud Private, wait a few minutes and reload the page.
If you rebooted the master node, take the following steps:
1. Obtain the IP addresses of the icp-ds pods. From the master node, run the following command:
   kubectl get pods -o wide -n kube-system | grep "icp-ds"
   The output resembles the following text:
   icp-ds-0   1/1   Running   0   1d   10.1.231.171   10.10.25.134
   In this example, 10.1.231.171 is the IP address of the pod.
   In high availability (HA) environments, an icp-ds pod exists for each master node.
2. From the master node, ping the icp-ds pods. Check the IP address for each icp-ds pod by running the following command for each IP address:
   ping 10.1.231.171
   If the output resembles the following text, you must delete the pod:
   connect: Invalid argument
3. From the master node, delete each pod that is unresponsive by running the following command:
   kubectl delete pods icp-ds-0 -n kube-system
   In this example, icp-ds-0 is the name of the unresponsive pod.
   Important: In HA installations, you might have to delete the pod for each master node.
4. From the master node, obtain the IP address of the replacement pod or pods by running the following command:
   kubectl get pods -o wide -n kube-system | grep "icp-ds"
   The output resembles the following text:
   icp-ds-0   1/1   Running   0   1d   10.1.231.172   10.10.2
5. From the master node, ping the pods again and check the IP address for each icp-ds pod by running the following command for each IP address:
   ping 10.1.231.172
   If all icp-ds pods are responsive, you can access the IBM Cloud Private management console when the pods enter the available state.
Enable Ingress Controller to use a new annotation prefix
- The NGINX ingress annotation contains a new prefix in version 0.9.0 that is used in IBM Cloud Private 3.1.1: nginx.ingress.kubernetes.io. This change uses a flag to avoid breaking deployments that are running.
  - To avoid breaking a running NGINX ingress controller, add the --annotations-prefix=ingress.kubernetes.io flag to the NGINX ingress controller deployment. The IBM Cloud Private ingress controller accepts the flag by default.
- If you want to use the new ingress annotation, update the ingress controller by removing the --annotations-prefix=ingress.kubernetes.io flag. To remove the flag, run the following commands:
  Note: Run the following commands from the master node.
  For Linux® x86_64, run the following command:
  kubectl edit ds nginx-ingress-lb-amd64 -n kube-system
  For Linux® on Power® (ppc64le), run the following command:
  kubectl edit ds nginx-ingress-lb-ppc64le -n kube-system
  Save and exit to implement the change. The ingress controller restarts to receive the new configuration.
Monitoring data is not retained if you use a dynamically provisioned volume during upgrade
If you use a dynamically provisioned persistent volume to store monitoring data, the data is lost after you upgrade the monitoring service from 2.1.0.2 to 2.1.0.3.
Cannot restart node when using vSphere storage
When you shut down a node in an IBM Cloud Private environment that uses vSphere storage, the pods on that node move to another node in your cluster. However, the vSphere volumes that the pods use are not detached from the original node. An error might occur when you try to restart the node.
To resolve the issue, first detach the volume from the node. Then, restart the node.
Truncated labels are displayed on the dashboard for some languages
If you access the IBM Cloud Private dashboard in languages other than English from the Mozilla Firefox browser on a system that uses a Windows™ operating system, some labels might be truncated.
Helm repository names cannot contain DBCS GB18030 characters
Do not use DBCS GB18030 characters in the Helm repository name when you add the repository.
GlusterFS cluster becomes unusable if you configure a vSphere Cloud Provider after installing IBM Cloud Private
By default, the kubelet uses the IP address of the node as the node name. When you configure a vSphere Cloud Provider, kubelet uses the host name of the node as the node name. If you had your GlusterFS cluster set up during installation of IBM Cloud Private, Heketi creates a topology by using the IP address of the node.
When you configure a vSphere Cloud Provider after you install IBM Cloud Private, your GlusterFS cluster becomes unusable because the kubelet identifies nodes by their host names, but Heketi still uses IP addresses to identify the nodes.
If you plan to use both GlusterFS and a vSphere Cloud Provider in your IBM Cloud Private cluster, ensure that you set kubelet_nodename: hostname in the config.yaml file during installation.
A failed upgrade or rollback of IBM Cloud Private creates two release entries with different statuses
A failed upgrade or rollback results in two listed releases with the same name: one successful release that has not been upgraded or rolled back, and the failed upgraded or rolled back release.
These two releases with the same name are two revisions of the same release, so deleting one deletes the other. This issue of showing more than one revision of a release is a known community Helm 2.7.2 issue. For more information, see https://github.com/kubernetes/helm/issues/2941.
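If you want to confirm that the two entries are revisions of the same release before you act on them, a minimal sketch follows; the release name is illustrative.

```
# Hedged check: list all revisions of the release, including the failed one.
helm history <release_name> --tls
```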
Prometheus data source is lost during a rollback of IBM Cloud Private
When you roll back from IBM Cloud Private Version 3.1.1 to 3.1.0, the Prometheus data source in Grafana is lost. The Grafana dashboards do not display any metric.
To resolve the issue, add back the Prometheus data source by completing the steps in the Manually configure a Prometheus data source in Grafana section.
Vulnerability Advisor cross-architecture image scanning does not work with glibc version earlier than 2.22
Vulnerability Advisor (VA) now supports cross-architecture image scanning with QEMU (Quick EMUlator). You can scan Linux® on Power® (ppc64le) CPU architecture images with VA running on Linux® x86_64 nodes. Alternatively, you can scan Linux® x86_64 CPU architecture images with VA running on Linux® on Power® (ppc64le) nodes.
When scanning Linux® x86_64 images, you must use glibc version 2.22 or later. If you use a glibc version earlier than 2.22, the scan might not work when VA runs on Linux® on Power® (ppc64le) nodes. Glibc versions earlier than 2.22 make certain syscalls (time/vgetcpu/gettimeofday) by using the vsyscall mechanism. The syscall implementation attempts to access a hardcoded static address, which QEMU fails to translate while running in emulation mode.
Container fails to operate or a kernel panic occurs
The following error might occur from the IBM Cloud Private node console or kernel log:
kernel:unregister_netdevice: waiting for <eth0> to become free.
If you receive this error, the log repeatedly displays the kernel:unregister_netdevice: waiting for <eth0> to become free message and containers fail to operate. Continue to troubleshoot; if you meet both conditions, reboot the node.
See https://github.com/kubernetes/kubernetes/issues/64743 to learn about the Linux kernel bug that causes the error.
Intermittent failure when you log in to the management console in HA clusters that use NSX-T 2.3
In HA clusters that use NSX-T 2.3, you might not be able to log in to the management console. After you specify the login credentials, you are redirected to the login page. You might have to try logging in multiple times until you succeed. This issue is intermittent.
Vulnerability Advisor policy resets to default setting after upgrade from 3.1.0 in ppc64le cluster
If you enabled Vulnerability Advisor (VA) on your Linux® on Power® (ppc64le) cluster in 3.1.0, the Vulnerability Advisor policy resets to the default setting when you upgrade to 3.1.1. To fix this issue, reset the VA policy in the management console.
Containers can crash when running IBM Cloud Private on KVM on POWER guests.
If you are running IBM Cloud Private on KVM on Power guests, some containers might crash because of an issue with how Transactional Memory is handled. You can work around this issue by turning off Transactional Memory support for the KVM on Power guests, using one of the following methods:
- If you are using the QEMU emulator directly to run the virtual machine, enable the cap-htm=off option.
- If you are using the libvirt library, add the following XML attribute to the domain definition:
  <features>
    <htm state='off'/>
  </features>
  See the libvirt documentation for detailed instructions about adding this libvirt attribute.

Note: This issue is specific to KVM on Power guests and does not occur when you use POWER9 bare metal or POWER9 PowerVM LPARs.
Linux kernel memory leak
Linux kernels older than release 4.17.17 contain a bug that causes kernel memory leaks in cgroup (community link). When pods in the host are restarted multiple times, the host can run out of kernel memory. This problem causes pod start failures and hung systems.
As shown in the following example, you can check your kernel core dump file and view the core stack:
[700556.898399] Call Trace:
[700556.898406] [<ffffffff8184bdb0>] ? bit_wait+0x60/0x60
[700556.898408] [<ffffffff8184b5b5>] schedule+0x35/0x80
[700556.898411] [<ffffffff8184e746>] schedule_timeout+0x1b6/0x270
[700556.898415] [<ffffffff810f90ee>] ? ktime_get+0x3e/0xb0
[700556.898417] [<ffffffff8184bdb0>] ? bit_wait+0x60/0x60
[700556.898420] [<ffffffff8184ad24>] io_schedule_timeout+0xa4/0x110
[700556.898422] [<ffffffff8184bdcb>] bit_wait_io+0x1b/0x70
[700556.898425] [<ffffffff8184b95f>] __wait_on_bit+0x5f/0x90
[700556.898429] [<ffffffff8119200b>] wait_on_page_bit+0xcb/0xf0
[700556.898433] [<ffffffff810c6de0>] ? autoremove_wake_function+0x40/0x40
[700556.898435] [<ffffffff81192123>] __filemap_fdatawait_range+0xf3/0x160
[700556.898437] [<ffffffff811921a4>] filemap_fdatawait_range+0x14/0x30
[700556.898439] [<ffffffff8119414f>] filemap_write_and_wait_range+0x3f/0x70
[700556.898444] [<ffffffff8129af08>] ext4_sync_file+0x108/0x350
[700556.898447] [<ffffffff812486de>] vfs_fsync_range+0x4e/0xb0
[700556.898449] [<ffffffff8124879d>] do_fsync+0x3d/0x70
[700556.898451] [<ffffffff81248a63>] SyS_fdatasync+0x13/0x20
[700556.898453] [<ffffffff8184f788>] entry_SYSCALL_64_fastpath+0x1c/0xbb
[700599.233973] mptscsih: ioc0: attempting task abort! (sc=ffff880fd344e100)
To work around the failures, you can restart the host. However, you might encounter the problem again. To avoid the problem, upgrade your Linux kernel to release 4.17.17 or later, which contains fixes for this kernel bug.
See Changing the cgroup driver to systemd on Red Hat Enterprise Linux on the IBM Cloud Private troubleshooting page for more information.
In an NSX-T environment, when you restart a master node, the management console becomes inaccessible.
In an NSX-T environment, when you restart a master node, the management console is inaccessible even though all the service pods are in a good state. This issue occurs because the iptables NAT rules that enable host port and pod communication through the host IP are not persistent. NSX-T does not support host ports, and IBM Cloud Private uses a host port for the management console.
To resolve the issue, run the following commands on all the master nodes. Use the network CIDR that you specified in the /<installation_directory>/cluster/config.yaml file.
iptables -t nat -N ICP-NSXT
iptables -t nat -A POSTROUTING -j ICP-NSXT
iptables -t nat -A ICP-NSXT ! -s <network_cidr> -d <network_cidr> -j MASQUERADE
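For example, assuming a hypothetical pod network CIDR of 10.1.0.0/16 (substitute the network_cidr value from your own config.yaml), the commands look like the following sketch:

```
# Hedged example with an illustrative CIDR; substitute your configured network_cidr.
iptables -t nat -N ICP-NSXT
iptables -t nat -A POSTROUTING -j ICP-NSXT
iptables -t nat -A ICP-NSXT ! -s 10.1.0.0/16 -d 10.1.0.0/16 -j MASQUERADE
```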
Logging ELK pods are in CrashLoopBackOff state
Logging ELK pods continue to appear in CrashLoopBackOff state after upgrading to the current version and increasing memory.
This is a known issue in Elasticsearch 5.5.1.
Note: If you have more than one data pod, repeat steps 1-8 for each pod. For example, logging-elk-data-0, logging-elk-data-1, or logging-elk-data-2.
Complete the following steps to resolve this issue.
1. Check the log to find the problematic file that contains the permission issue.
   java.io.IOException: failed to write in data directory [/usr/share/elasticsearch/data/nodes/0/indices/dT4Nc7gvRLCjUqZQ0rIUDA/0/translog] write permission is required
2. Get the IP address of the management node where the logging-elk-data-1 pod is running.
   kubectl -n kube-system get pods -o wide | grep logging-elk-data-1
3. Use SSH to log in to the management node.
4. Navigate to the /var/lib/icp/logging/elk-data directory.
   cd /var/lib/icp/logging/elk-data
5. Find all .es_temp_file files.
   find ./ -name "*.es_temp_file"
6. Delete all *.es_temp_file files that you found in step 5.
   rm -rf *.es_temp_file
7. Delete the old logging-elk-data-1 pod.
   kubectl -n kube-system delete pods logging-elk-data-1
8. Wait 3-5 minutes for the new logging-elk-data-1 pod to restart.
   kubectl -n kube-system get pods -o wide | grep logging-elk-data-1
Logs not working after logging pods are restarted
You might encounter the following problems:
- The Kibana web UI shows the Elasticsearch health status as red.
- The Elasticsearch client pod log messages indicate that Search Guard is not initialized. The same error repeats every few seconds. The messages resemble the following text:
  [2018-11-08T20:43:54,380][ERROR][c.f.s.a.BackendRegistry ] Not yet initialized (you may need to run sgadmin)
  [2018-11-08T20:43:54,487][ERROR][c.f.s.a.BackendRegistry ] Not yet initialized (you may need to run sgadmin)
  [2018-11-08T20:43:54,488][ERROR][c.f.s.a.BackendRegistry ] Not yet initialized (you may need to run sgadmin)
- If Vulnerability Advisor (VA) is installed, an error message that resembles the following text appears in your VA logs:
  2018-10-31 07:25:12,083 ERROR 229 <module>: Error: TransportError(503, u'Search Guard not initialized (SG11). See https://github.com/floragunncom/search-guard-docs/blob/master/sgadmin.md', None)
To resolve this issue, complete the following steps to run a Search Guard initialization job:
1. Save the existing Search Guard initialization job to a file.
   kubectl get job.batch/logging-elk-elasticsearch-tls-init -n kube-system -o yaml > sg-init-job.yaml
2. Edit the job file.
   - Remove everything under metadata.* except for the following parameters:
     - metadata.name
     - metadata.namespace
     - metadata.labels.*
   - Change metadata.name and spec.template.metadata.job-name to new names.
   - Remove spec.selector and spec.template.metadata.labels.controller-uid.
   - Remove status.*.
3. Save the file.
4. Run the job.
   kubectl apply -f sg-init-job.yaml
Deploying contents in the default namespace might not be secure
A default namespace is tagged to all of the predefined pod security policies, ranging from very restrictive to least restrictive. Because the least restrictive policy takes priority over a more restrictive one, the least restrictive is applied for pods created by the content or chart deployment. This might cause a security risk if you need a higher level of security on the content or chart that you are deploying.
Timeouts and blank screens when displaying more than 80 namespaces
If a cluster has a large number of namespaces (more than 80), you might see the following issues:
- The namespace overview page might time out and display a blank screen.
- The Chart deployment configuration page might time out and not load all the namespaces in the dropdown. Only the default namespace is shown for the deployment.
Required to create cluster roles for custom pod security policies before associating them to new namespaces
If you create a customized pod security policy, you must also create a cluster role for the custom pod security policy before you can correctly associate the policy with a namespace from the web console during namespace creation. If a pod security policy exists without a cluster role, the role binding between the pod security policy cluster role and the newly created namespace is not added. When this occurs, the restrictions of the pod security policy RBAC association are not enforced.
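A minimal sketch of creating such a cluster role follows; the cluster role name and pod security policy name are illustrative.

```
# Hedged sketch: create a cluster role that grants "use" on a hypothetical custom
# pod security policy, so that the policy can be bound to newly created namespaces.
kubectl create clusterrole my-custom-psp-clusterrole \
  --verb=use \
  --resource=podsecuritypolicies \
  --resource-name=my-custom-psp
```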
Encrypting cluster data network traffic with IPsec does not work on SLES 12 SP3 operating system
strongSwan version 5.3.3 or higher is necessary to deploy IPsec mesh configuration for cluster data network traffic encryption. In SUSE Linux Enterprise Server (SLES) 12 SP3, the default strongSwan version is 5.1.3, which is not suitable for IPsec mesh configuration.
A namespace is stuck in the Terminating state
If a Kubernetes API extension is not available, the resources that are managed by the extension cannot be deleted. Failure to delete the API extension causes namespace deletion to fail.
Complete the following steps to get a description of the APIs that are not deleted:
1. Run the following command to view the namespaces that are stuck in a Terminating state:
   kubectl get namespaces
2. To find the resources that are not deleted, run the following command:
   kubectl api-resources --verbs=list --namespaced -o name | xargs -n 1 kubectl get --show-kind --show-all --ignore-not-found -n <terminating-namespace>
3. If the previous command returns the following error message, continue by running the kubectl get APIService command with the information that you received:
   unable to retrieve the complete list of server APIs: <api-resource>/<version>: the server is currently unable to handle the request
   kubectl get APIService <version>.<api-resource>
   For example, run the following command for an API service that is named custom.metrics.k8s.io/v1beta1:
   kubectl get APIService v1beta1.custom.metrics.k8s.io
4. Get a description of the API service to continue to debug your API service. Run the following command:
   kubectl describe APIService <version>.<api-resource>
5. Make sure that the issue is resolved. Run the following command to verify that your namespace can be deleted:
   kubectl get namespace
If the issue is not resolved, see A namespace is stuck in the Terminating state to manually delete a terminating namespace.
Cloning an IBM Cloud Private worker node is not supported
IBM Cloud Private does not support cloning an existing IBM Cloud Private worker node. You cannot change the host name and IP address of a node on your existing cluster.
You must add a new worker node. For more information, see Adding an IBM Cloud Private cluster node.
Some Pods not starting or log TLS handshake errors
In some cases when you are using IP-IP tunneling, some of your Pods do not start or contain log entries that indicate TLS handshake errors. If you notice either of these issues, complete the following steps to resolve the issue:
1. Run the ifconfig command or the netstat command to view the statistics of the tunnel device. The tunnel device is often named tunl0.
2. Note the changes in the TX dropped count that is displayed when you run the ifconfig command or the netstat command.
   If you use the netstat command, enter a command similar to the following command:
   netstat --interface=tunl0
   The output should be similar to the following content:
   Kernel Interface table
   Iface  MTU    RX-OK  RX-ERR RX-DRP RX-OVR TX-OK  TX-ERR TX-DRP TX-OVR Flg
   tunl0  1300   904416 0      0      0      714067 0      806    0      ORU
   If you use the ifconfig command, run a command similar to the following command:
   ifconfig tunl0
   The output should be similar to the following content:
   tunl0: flags=193  mtu 1300
          inet 10.1.125.192  netmask 255.255.255.255
          tunnel   txqueuelen 1000  (IPIP Tunnel)
          RX packets 904377  bytes 796710714 (759.8 MiB)
          RX errors 0  dropped 0  overruns 0  frame 0
          TX packets 714034  bytes 125963495 (120.1 MiB)
          TX errors 0  dropped 806  overruns 0  carrier 0  collisions 0
3. Run the command again and note the change in the TX dropped count that is displayed when you run the ifconfig command, or in the TX-DRP count that is displayed when you run the netstat command.
   If the value is continuously increasing, there is an MTU issue. To resolve it, turn on tcp_mtu_probing and reduce the MTU value of the tunnel device.
4. Run the following commands to turn on tcp_mtu_probing:
   echo 1 > /proc/sys/net/ipv4/tcp_mtu_probing
   echo 1024 > /proc/sys/net/ipv4/tcp_base_mss
5. Add the following lines to the /etc/sysctl.conf file to make the settings permanent for future system restarts:
   net.ipv4.tcp_mtu_probing = 1
   net.ipv4.tcp_base_mss = 1024
6. Complete the following steps to change the Calico IP-IP tunnel MTU after it is deployed.
   1. Run the following command to edit the Calico settings file:
      kubectl edit daemonset calico-node -n kube-system
   2. Lower the value of the FELIX_IPINIPMTU variable to reduce the MTU. The default value is 1430.
      Note: All calico-node pods are restarted to enable the new MTU setting.
7. Apply these settings to the sysctl.conf file on all of the nodes in the cluster.
Installation can fail with a helm-api setup error
Installation can fail with the following errors during the initial deployment of the helm-api chart:
stderr: 'Error: secrets "rudder-secret" already exists'
You can view these errors in the install logs in the cluster/logs directory.
The errors occur because the Kubernetes secrets for rudder and helm-api are created in a pre-install hook. Problems such as network timeouts, incorrect installation configuration, or insufficient resources on the master node can prevent the use of the secrets that were created by the pre-install hook. When the deployment fails, the installer tries to deploy the chart four more times. On each retry, it fails when it tries to re-create the secrets.
To resolve this issue, complete the following steps:
1. Run the uninstall procedure to remove all components of the installation.
2. Verify that the installation settings are correctly configured, including the config.yaml file, the hosts file, and other settings.
3. Run the installation procedure again.
The cloudctl cm node commands may not work accurately. Use kubectl instead
Certain IBM Cloud Private CLI cluster management (cloudctl cm) commands, such as cloudctl cm nodes and other node commands, may not work accurately. These commands are deprecated and will be removed in a later release. Use kubectl instead.
For example, instead of cloudctl cm nodes, use kubectl get nodes.
Cannot get secret by using kubectl command when encryption of secret data at rest is enabled
When you enable encryption of secret data at rest and use the kubectl command to get a secret, you might sometimes not be able to get the secret. You might see the following error message in kube-apiserver:
Internal error occurred: invalid padding on input
This error occurs because kube-apiserver failed to decrypt the encrypted data in etcd. For more information about the issue, see Random "invalid padding on input" errors when attempting various kubectl operations.
To resolve the issue, delete the secret and re-create it. Use the following command:
kubectl -n <namespace> delete secret <secret>
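After the deletion, re-create the secret with the data that it previously held. A minimal sketch follows; the secret name, key, and value are illustrative, and your secret might instead be created from files or a YAML definition.

```
# Hedged sketch: re-create the deleted secret as a generic secret with a literal key/value.
kubectl -n <namespace> create secret generic <secret> --from-literal=<key>=<value>
```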
For more information about encrypting secret data at rest, see Encrypting Secret Data at Rest.
Vulnerability Advisor cannot scan unsupported container images
Container images that are not supported by the Vulnerability Advisor fail the security scan.
The Security Scan column on the Container Images page in the management console displays Failed. When you select the failed container image name to view more details, zero issues are detected.