Known issues and limitations
Review the known issues for version 3.1.1.
- Logs older than what is specified in log retention policy are recreated if Filebeat is restarted
- Elasticsearch type mapping limitations
- Metering data shows large numbers for available and capped cores
- Kubernetes API Server vulnerability
- Resource quota might not update
- Container fails to start due to Docker issue
- The Key Management Service must deploy to a management node in a Linux® x86_64 platform
- Dynamic configuration limitation on Linux® on Power® (ppc64le) and IBM® Z nodes
- Cookie affinity does not work when FIPS is enabled
- Sticky sessions must be manually set on Linux® on IBM® Z and LinuxONE and Linux® on Power® (ppc64le)
- The Grafana UI cannot be opened after upgrading the monitoring service release version
- Tiller 2.7.2 does not support the upgrade or install of Kubernetes 1.8 resources
- Alerting, logging, or monitoring pages display 500 Internal Server Error
- IPv6 is not supported
- Cannot log in to the management console with an LDAP user after restarting the leading master
- Calico prefix limitation on Linux® on Power® (ppc64le) nodes
- StatefulSets remain in Terminating state after a worker node shuts down
- Synchronizing repositories might not update Helm chart contents
- Some features are not available from the new management console
- Containers fail to start or a kernel panic occurs
- The management console displays 502 Bad Gateway Error
- Enable Ingress Controller to use a new annotation prefix
- Monitoring data is not retained if you use a dynamically provisioned volume during upgrade
- Cannot restart node when using vSphere storage
- Truncated labels are displayed on the dashboard for some languages
- Helm repository names cannot contain DBCS GB18030 characters
- GlusterFS cluster becomes unusable if you configure a vSphere Cloud Provider after installing IBM Cloud Private
- A failed upgrade or rollback of IBM Cloud Private creates two release entries with different statuses
- Prometheus data source is lost during a rollback of IBM Cloud Private
- Vulnerability Advisor cross-architecture image scanning does not work with glibc version earlier than 2.22
- Container fails to operate or a kernel panic occurs
- Intermittent failure when you log in to the management console in HA clusters that use NSX-T 2.3
- Vulnerability Advisor policy resets to the default setting after you upgrade from 3.1.0 in ppc64le cluster
- Containers can crash when running IBM Cloud Private on KVM on Power guests.
- Linux kernel memory leak
- In an NSX-T environment, when you restart a master node, the management console becomes inaccessible.
- Logging ELK pods are in CrashLoopBackOff state
- Logs not working after logging pods are restarted
- Deploying contents in the default namespace might not be secure
- Timeouts and blank screens when displaying 80+ namespaces
- Required to create cluster roles for custom pod security policies before associating them to new namespaces
- Encrypting cluster data network traffic with IPsec does not work on SLES 12 SP3 operating system
- A namespace is stuck in the Terminating state
- Cloning an IBM Cloud Private worker node is not supported
- Some Pods not starting or log TLS handshake errors
- Installation can fail with a helm-api setup error
- Certain cloudctl cm commands may not work accurately. Use kubectl instead
- Cannot get secret by using kubectl command when encryption of secret data at rest is enabled
- Vulnerability Advisor cannot scan unsupported container images
Logs older than what is specified in log retention policy are recreated if Filebeat is restarted
A curator background job is deployed as part of the IBM Cloud Private logging service. To free disk space, the job runs once a day to remove old log data based on your retention settings.
If Filebeat pods are restarted, Filebeat finds all existing log files and reprocesses and reingests them, including log entries that are older than what is specified by the log retention policy. This behavior can cause older logs to be reindexed into Elasticsearch and to appear in the logging console until the curator job runs again. If this is a problem, you can manually delete indices that are older than your retention settings. For more information, see Manually removing log indices.
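If you need to remove reindexed data before the curator job runs again, the following minimal sketch shows one way to delete a single day's index. It assumes the default logstash-YYYY.MM.DD index naming and that you run the command from inside the Elasticsearch client pod; the pod name, container name, index date, and endpoint are illustrative and depend on your logging configuration.

```
# Hedged sketch: delete one day's index that is older than the retention window.
# The pod name, container name, index date, and endpoint are illustrative.
kubectl -n kube-system exec -it <logging-elk-client-pod> -c es-client -- \
  curl -s -XDELETE "http://localhost:9200/logstash-2018.11.01"
```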
Elasticsearch type mapping limitations
The IBM Cloud Private logging component uses Elasticsearch to store and index logs that are received from all the running containers in the cluster. If containers emit logs in JSON format, each field in the JSON is indexed by Elasticsearch so that queries can use the fields. However, if two containers define the same field but send different data types, Elasticsearch cannot index the field consistently. The first type that is received for a field each day sets the accepted type for the rest of the day. For example, if one container logs a numeric status field (status: 200) and another container later logs the same field as a string (status: "OK"), only values that match the first type are indexed for that day. This behavior results in two problems:
- In IBM Cloud Private version 3.1.2 and earlier, log messages with non-matching types are discarded. In IBM Cloud Private version 3.2.0 and later, the log messages are accepted but the non-matching fields are not indexed. If you run a query that uses that field, you do not find the non-matching documents. Some scenarios, primarily involving fields that are sometimes objects, can still result in discarded log messages. For more information, see Elasticsearch issue 12366.
- If the type for a field is different over several days, queries from Kibana can result in errors such as 5 of 30 shards failed. To work around this issue, complete the following steps to force Kibana to recognize the type mismatch:
  1. From the Kibana navigation menu, click Management.
  2. Select Index patterns.
  3. Click Refresh field list.
Metering data shows large numbers for available and capped cores
The metering service can report a large number when the number of available cores that are allocated to a node is not a whole-core increment. In some clusters, you might decide to reserve some number of CPU cores for system processes. This reservation can leave partial cores available for workloads on a node. When this happens, the metering service reports the available and capped processor values inflated by a factor of 1000. For instance, 7.2 cores are metered as 7200 cores. To work around this issue, reserve whole-core increments. Alternatively, contact IBM support.
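To check whether a node is affected, you can inspect its allocatable CPU. This is a minimal sketch; the node name is illustrative. A millicore value such as 7200m (7.2 cores) indicates the partial-core allocation that triggers the inflated metering numbers.

```
# Hedged check: print the allocatable CPU for a node.
# A millicore value such as "7200m" indicates a partial-core allocation.
kubectl get node <node_name> -o jsonpath='{.status.allocatable.cpu}'
```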
Kubernetes API Server vulnerability
IBM Cloud Private has a patch (icp-3.1.1-build508531) on IBM® Fix Central to address the Kubernetes security vulnerability, where the proxy request handling in the Kubernetes API Server can leave vulnerable TCP connections. For full details, see the Kubernetes kube-apiserver vulnerability issue. After you apply the patch, you do not need to redeploy either IBM Cloud Private or your Helm releases. You must reapply the patch if you replace your master node.
Resource quota might not update
You might find that the resource quota is not updating in the cluster. This problem is due to an issue in the kube-controller-manager. The workaround is to stop the kube-controller-manager leader container on the master nodes and let it restart. If high availability is configured for the cluster, you can check the kube-controller-manager logs to find the leader, as shown in the sketch after the following example. Only the leader kube-controller-manager does the work; the other controllers wait to be elected as the new leader when the current leader goes down.
For example:
# docker ps | grep hyperkube | grep controller-manager
97bccea493ea 4c7c25836910 "/hyperkube controll…" 7 days ago Up 7 days k8s_controller-manager_k8s-master-9.111.254.104_kube-system_b0fa31e0606015604c409c09a057a55c_2
To stop the leader, run the following command with the ID of the Docker process:
docker rm -f 97bccea493ea
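If you are not sure which instance is the leader, you can check each controller-manager container's logs for leader-election activity before you stop it. This is a minimal sketch; the container ID is illustrative and comes from the docker ps output shown above.

```
# Hedged sketch: list the controller-manager containers, then check a container's
# logs for leader-election messages. The leader logs that it acquired the lease.
docker ps | grep hyperkube | grep controller-manager
docker logs <container_id> 2>&1 | grep -i leader
```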
Container fails to start due to Docker issue
Installation fails during container creation due to a Docker 18.03.1 issue. If you have a subPath in the volume mount, you might receive the following error from the kubelet service, which fails to start the container:
Error: failed to start container "heketi": Error response from daemon: OCI runtime create failed: container_linux.go:348: starting container process caused "process_linux.go:402: container init caused \"rootfs_linux.go:58: mounting \\\"/var/lib/kubelet/pods/7e9cb34c-b2bf-11e8-a9eb-0050569bdc9f/volume-subpaths/heketi-db-secret/heketi/0\\\" to rootfs \\\"/var/lib/docker/overlay2/ca0a54812c6f5718559cc401d9b73fb7ebe43b2055a175ee03cdffaffada2585/merged\\\" at \\\"/var/lib/docker/overlay2/ca0a54812c6f5718559cc401d9b73fb7ebe43b2055a175ee03cdffaffada2585/merged/backupdb/heketi.db.gz\\\" caused \\\"no such file or directory\\\"\"": unknown
For more information, see the Kubernetes documentation.
To resolve this issue, delete the failed pod and try the installation again.
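A minimal sketch of that recovery step follows; the namespace and pod name are illustrative and should match the pod that reported the error.

```
# Hedged sketch: delete the failed pod so that it is re-created, then retry the installation.
kubectl -n kube-system delete pod <failed-pod-name>
```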
The Key Management Service must deploy to a management node in a Linux® x86_64 platform
The Key Management Service is deployed to the management node and is supported only on the Linux® x86_64 platform. If there is no amd64 management node in the cluster, the Key Management Service is not deployed.
Dynamic configuration limitation on Linux® on Power® (ppc64le) and IBM® Z nodes
For IBM Cloud Private version 3.1.0, the NGINX ingress controller is upgraded to version 0.16.2. Because LuaJIT is not available on IBM® Z (s390x) and Linux® on Power® (ppc64le) architectures, the NGINX controller disables the dynamic configuration features during startup. For more details, see https://github.com/kubernetes/ingress-nginx/blob/nginx-0.16.2/cmd/nginx/flags.go#L217-L224.
Cookie affinity does not work when FIPS is enabled
When Federal Information Processing Standard (FIPS) is enabled, cookie affinity does not work because nginx.ingress.kubernetes.io/session-cookie-hash can be set only to sha1/md5/index, which is not supported in FIPS mode.
Sticky sessions must be manually set on Linux® on IBM® Z and LinuxONE and Linux® on Power® (ppc64le)
Because LuaJIT is unavailable, session affinity is handled by the nginx-sticky-module-ng module. You must enable sticky sessions manually. For more information, see Cannot set the cookie for sticky sessions.
The Grafana UI cannot be opened after upgrading the monitoring service release version
If a persistent volume is enabled for the monitoring service and the Grafana password is not set during the monitoring service installation, the Grafana dashboard is not accessible after you upgrade to a later version. You see the error message {"message":"Invalid username or password"} when you try to access the Grafana dashboard. You can resolve this problem either before you upgrade the monitoring release or after you upgrade.
Pre-upgrade steps

1. Obtain the Grafana password from monitoring-grafana-secret:
   export PASSWORD=$(kubectl get -n kube-system secret/monitoring-grafana-secret -o yaml|grep password|awk -F': ' '{print $2}'|base64 -d)
2. Set the Grafana password during upgrade. If you upgrade by using the IBM Cloud Private dashboard, set the Grafana password in the configuration page during upgrade. If you upgrade by using the Helm CLI, follow these steps:
   1. Obtain the values.yaml of the existing release:
      helm get values --tls monitoring >> values.yaml
   2. Insert the Grafana password from Step 1 into the values.yaml:
      grafana:
        password: PASSWORD
   3. Run the Helm upgrade command:
      helm upgrade --tls monitoring -f values.yaml ibm-icpmonitoring-1.3.0.tar.gz

Post-upgrade steps

1. Obtain the Grafana password from monitoring-grafana-secret:
   export PASSWORD=$(kubectl get -n kube-system secret/monitoring-grafana-secret -o yaml|grep password|awk -F': ' '{print $2}'|base64 -d)
   Note: The -d option for the base64 command is used to decode the encoded password. The option might vary on different operating systems. For example, in macOS, the option is -D.
2. Obtain the Grafana pod name:
   export GRAFANA_POD=$(kubectl get pod -n kube-system|grep grafana|grep Running|awk '{split($0, a, " "); print a[1]}')
3. Reset the password for the Grafana container:
   kubectl exec -n kube-system $GRAFANA_POD -c grafana -it -- grafana-cli admin reset-admin-password --homepath "/usr/share/grafana" $PASSWORD
Note: If you resolve this problem post-upgrade, the problem might occur again after rollback. You must follow the same post-upgrade steps to fix it.
Tiller 2.7.2 does not support the upgrade or install of Kubernetes 1.8 resources
Tiller version 2.7.2 is installed with IBM Cloud Private version 3.1.1. Tiller 2.7.2 uses Kubernetes API version 1.7. You cannot install or upgrade Helm charts that use only Kubernetes version 1.8 resources.
You might encounter a Helm release upgrade error. The error message will resemble the following content:
Error: UPGRADE FAILED: failed to create patch: unable to find api field in struct Unstructured for the json field "spec"
If you encounter this error message, you must delete the release and install a new version of the chart.
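A minimal sketch of that recovery path follows, using Helm 2 commands with the --tls flag that IBM Cloud Private requires; the release name, chart, and namespace are illustrative.

```
# Hedged sketch: delete the failed release, then install the newer chart version again.
helm delete --purge <release_name> --tls
helm install <chart_name> --name <release_name> --namespace <namespace> --tls
```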
Alerting, logging, or monitoring pages display 500 Internal Server Error
To resolve this issue, complete the following steps from the master node:
1. Create an alias for the insecure kubectl API login by running the following command:
   alias kc='kubectl -n kube-system'
2. Edit the configuration map for Kibana. Run the following command:
   kc edit cm kibana-nginx-config
   Add the following updates:
   upstream kibana {
     server localhost:5602;
   }
   Change localhost to 127.0.0.1.
3. Locate and restart the Kibana pod by running the following commands:
   kc get pod | grep -i kibana
   kc delete pod <kibana-POD_ID>
4. Edit the configuration map for Grafana by running the following command:
   kc edit cm grafana-router-nginx-config
   Add the following updates:
   upstream grafana {
     server localhost:3000;
   }
   Change localhost to 127.0.0.1.
5. Locate and restart the Grafana pod by running the following commands:
   kc get pod | grep -i monitoring-grafana
   kc delete pod <monitoring-grafana-POD_ID>
6. Edit the configuration map for the Alertmanager by running the following command:
   kc edit cm alertmanager-router-nginx-config
   Add the following updates:
   upstream alertmanager {
     server localhost:9093;
   }
   Change localhost to 127.0.0.1.
7. Locate and restart the Alertmanager pod by running the following commands:
   kc get pod | grep -i monitoring-prometheus-alertmanager
   kc delete pod <monitoring-prometheus-alertmanager-POD_ID>
IPv6 is not supported
IBM Cloud Private cannot use IPv6 networks. To remove the IPv6 settings, comment out the IPv6 entries in the /etc/hosts file on each cluster node. For more information, see Configuring your cluster.
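For reference, a hedged example of what a cluster node's /etc/hosts file can look like after the IPv6 entries are commented out; the exact entries vary by distribution.

```
127.0.0.1   localhost
# ::1         localhost ip6-localhost ip6-loopback
# ff02::1     ip6-allnodes
# ff02::2     ip6-allrouters
```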
Cannot log in to the management console with an LDAP user after restarting the leading master
If you cannot log in to the management console after you restart the leading master node in a high availability cluster, take the following actions:
1. Log in to the management console with the cluster administrator credentials. The user name is admin, and the password is admin.
2. Click Menu > Manage > Identity & Access.
3. Click Edit and then click Save.

Note: LDAP users can now log in to the management console.
If the problem persists, MongoDB, MariaDB, and the auth-idp pods that depend on them might not be running. Follow these instructions to identify the cause.
1. Check whether the MongoDB and MariaDB pods are running without any errors.
   1. Use the following command to check the pod status. All pods must show the status as 1/1 Running. Check the logs, if required.
      kubectl -n kube-system get pods | grep -e mariadb -e mongodb
   2. If the pods do not show the status as 1/1 Running, restart all the pods by deleting them.
      kubectl -n kube-system delete pod -l k8s-app=mariadb
      kubectl -n kube-system delete pod -l app=icp-mongodb
      Wait for a minute or two for the pods to restart. Check the pod status by using the following command. The status must show 1/1 Running.
      kubectl -n kube-system get pods | grep -e mariadb -e mongodb
2. After the MongoDB and MariaDB pods are running, restart the auth-idp pods by deleting them.
   kubectl -n kube-system delete pod -l k8s-app=auth-idp
   Wait for a minute or two for the pods to restart. Check the pod status by using the following command. The status must show 4/4 Running.
   kubectl -n kube-system get pods | grep auth-idp
Calico prefix limitation on Linux® on Power® (ppc64le) nodes
If you install IBM Cloud Private on PowerVM Linux LPARs and your virtual Ethernet devices use the ibmveth prefix, you must set the network adapter to use Calico networking. During installation, be sure to set a calico_ip_autodetection_method parameter value in the config.yaml file. The setting resembles the following content:
calico_ip_autodetection_method: interface=<device_name>
The <device_name> parameter is the name of your network adapter. You must specify the ibmveth0 interface on each node of the cluster, including the worker nodes.
Note: If you used PowerVC to deploy your cluster node, this issue does not affect you.
StatefulSets remain in Terminating state after a worker node shuts down
If the node where a StatefulSet pod is running shuts down, the pod enters a Terminating state. You must manually delete the pod that is stuck in the Terminating state to force it to be re-created on another node.
To delete the pod, run the following command:
kubectl -n <namespace> delete pods --grace-period=0 --force <pod_name>
For more information about Kubernetes pod safety management, see Pod Safety, Consistency Guarantees, and Storage Implications in the Kubernetes community feature specs.
Synchronizing repositories might not update Helm chart contents
Synchronizing repositories takes several minutes to complete. While synchronization is in progress, there might be an error if you try to display the readme file. After synchronization completes, you can view the readme file and deploy the chart.
Some features are not available from the new management console
IBM Cloud Private 3.1.1 supports the new management console only. Some options from the previous console are not yet available. To access those functions, you must use the kubectl CLI.
Containers fail to start or a kernel panic occurs
For Red Hat Enterprise Linux (RHEL) only: Containers fail to start or a kernel panic occurs, and a no space left on device error message is displayed. This issue is a known Docker engine issue that is caused by leaking cgroups. For more information about this issue, see https://github.com/moby/moby/issues/29638 and https://github.com/kubernetes/kubernetes/issues/61937.
To fix this issue, you must restart the host.
The management console displays 502 Bad Gateway Error
The management console displays a 502 Bad Gateway Error after installing or rebooting the master node.
If you recently installed IBM Cloud Private, wait a few minutes and reload the page.
If you rebooted the master node, take the following steps:
1. Obtain the IP addresses of the icp-ds pods. From the master node, run the following command:
   kubectl get pods -o wide -n kube-system | grep "icp-ds"
   The output resembles the following text:
   icp-ds-0   1/1   Running   0   1d   10.1.231.171   10.10.25.134
   In this example, 10.1.231.171 is the IP address of the pod.
   In high availability (HA) environments, an icp-ds pod exists for each master node.
2. From the master node, ping the icp-ds pods. Check the IP address for each icp-ds pod by running the following command for each IP address:
   ping 10.1.231.171
   If the output resembles the following text, you must delete the pod:
   connect: Invalid argument
3. From the master node, delete each pod that is unresponsive by running the following command:
   kubectl delete pods icp-ds-0 -n kube-system
   In this example, icp-ds-0 is the name of the unresponsive pod.
   Important: In HA installations, you might have to delete the pod for each master node.
4. From the master node, obtain the IP address of the replacement pod or pods by running the following command:
   kubectl get pods -o wide -n kube-system | grep "icp-ds"
   The output resembles the following text:
   icp-ds-0   1/1   Running   0   1d   10.1.231.172   10.10.2
5. From the master node, ping the pods again and check the IP address for each icp-ds pod by running the following command for each IP address:
   ping 10.1.231.172
   If all icp-ds pods are responsive, you can access the IBM Cloud Private management console when the pods enter the available state.
Enable Ingress Controller to use a new annotation prefix
- The NGINX ingress annotation contains a new prefix in version 0.9.0 that is used in IBM Cloud Private 3.1.1: nginx.ingress.kubernetes.io. This change uses a flag to avoid breaking deployments that are running.
  - To avoid breaking a running NGINX ingress controller, add the --annotations-prefix=ingress.kubernetes.io flag to the NGINX ingress controller deployment. The IBM Cloud Private ingress controller accepts the flag by default.
- If you want to use the new ingress annotation, update the ingress controller by removing the --annotations-prefix=ingress.kubernetes.io flag. To remove the flag, run the following commands:
  Note: Run the following commands from the master node.
  For Linux® x86_64, run the following command:
  kubectl edit ds nginx-ingress-lb-amd64 -n kube-system
  For Linux® on Power® (ppc64le), run the following command:
  kubectl edit ds nginx-ingress-lb-ppc64le -n kube-system
  Save and exit to implement the change. The ingress controller restarts to receive the new configuration.
Monitoring data is not retained if you use a dynamically provisioned volume during upgrade
If you use a dynamically provisioned persistent volume to store monitoring data, the data is lost after you upgrade the monitoring service from 2.1.0.2 to 2.1.0.3.
Cannot restart node when using vSphere storage
When you shut down a node in an IBM Cloud Private environment that uses vSphere storage, the pods on that node move to another node in your cluster. However, the vSphere volumes that the pods use are not detached from the original node. An error might occur when you try to restart the node.
To resolve the issue, first detach the volume from the node. Then, restart the node.
Truncated labels are displayed on the dashboard for some languages
If you access the IBM Cloud Private dashboard in languages other than English from the Mozilla Firefox browser on a system that uses a Windows™ operating system, some labels might be truncated.
Helm repository names cannot contain DBCS GB18030 characters
Do not use DBCS GB18030 characters in the Helm repository name when you add the repository.
GlusterFS cluster becomes unusable if you configure a vSphere Cloud Provider after installing IBM Cloud Private
By default, the kubelet uses the IP address of the node as the node name. When you configure a vSphere Cloud Provider, kubelet uses the host name of the node as the node name. If you had your GlusterFS cluster set up during installation of IBM Cloud Private, Heketi creates a topology by using the IP address of the node.
When you configure a vSphere Cloud Provider after you install IBM Cloud Private, your GlusterFS cluster becomes unusable because the kubelet identifies nodes by their host names, but Heketi still uses IP addresses to identify the nodes.
If you plan to use both GlusterFS and a vSphere Cloud Provider in your IBM Cloud Private cluster, ensure that you set kubelet_nodename: hostname in the config.yaml file during installation.
A failed upgrade or rollback of IBM Cloud Private creates two release entries with different statuses
A failed upgrade or rollback results in two listed releases with the same name: one successful release that has not been upgraded or rolled back, and the failed upgraded or rolled back release.
These two releases with the same name are two revisions of the same release, so deleting one deletes the other. This issue of showing more than one revision of a release is a known community Helm 2.7.2 issue. For more information, see https://github.com/kubernetes/helm/issues/2941.
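If you want to confirm that the two entries are revisions of the same release before you act on them, a minimal sketch follows; the release name is illustrative.

```
# Hedged check: list all revisions of the release, including the failed one.
helm history <release_name> --tls
```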
Prometheus data source is lost during a rollback of IBM Cloud Private
When you roll back from IBM Cloud Private Version 3.1.1 to 3.1.0, the Prometheus data source in Grafana is lost. The Grafana dashboards do not display any metric.
To resolve the issue, add back the Prometheus data source by completing the steps in the Manually configure a Prometheus data source in Grafana section.
Vulnerability Advisor cross-architecture image scanning does not work with glibc version earlier than 2.22
Vulnerability Advisor (VA) now supports cross-architecture image scanning with QEMU (Quick EMUlator). You can scan Linux® on Power® (ppc64le) CPU architecture images with VA running on Linux® x86_64 nodes. Alternatively, you can scan Linux® x86_64 CPU architecture images with VA running on Linux® on Power® (ppc64le) nodes.
When scanning Linux® x86_64 images, you must use glibc version 2.22 or later. If you use a glibc version earlier than 2.22, the scan might not work when VA runs on Linux® on Power® (ppc64le) nodes. Glibc versions earlier than 2.22 make certain syscalls (time/vgetcpu/gettimeofday) by using the vsyscall mechanism. The syscall implementation attempts to access a hardcoded static address, which QEMU fails to translate while running in emulation mode.
Container fails to operate or a kernel panic occurs
The following error might occur from the IBM Cloud Private node console or kernel log:
kernel:unregister_netdevice: waiting for <eth0> to become free.
If you receive this error, the log repeatedly displays the kernel:unregister_netdevice: waiting for <eth0> to become free message and containers fail to operate. Continue to troubleshoot; if you meet both conditions, reboot the node.
See https://github.com/kubernetes/kubernetes/issues/64743 to learn about the Linux kernel bug that causes the error.
Intermittent failure when you log in to the management console in HA clusters that use NSX-T 2.3
In HA clusters that use NSX-T 2.3, you might not be able to log in to the management console. After you specify the login credentials, you are redirected to the login page. You might have to try logging in multiple times until you succeed. This issue is intermittent.
Vulnerability Advisor policy resets to default setting after upgrade from 3.1.0 in ppc64le cluster
If you enabled Vulnerability Advisor (VA) on your Linux® on Power® (ppc64le) cluster in 3.1.0, the Vulnerability Advisor policy resets to the default setting when you upgrade to 3.1.1. To fix this issue, reset the VA policy in the management console.
Containers can crash when running IBM Cloud Private on KVM on POWER guests.
If you are running IBM Cloud Private on KVM on Power guests, some containers might crash because of an issue with how Transactional Memory is handled. You can work around this issue by turning off Transactional Memory support for the KVM on Power guests, using one of the following methods:
- If you are using the QEMU emulator directly to run the virtual machine, enable the cap-htm=off option.
- If you are using the libvirt library, add the following XML attribute to the domain definition:
  <features>
    <htm state='off'/>
  </features>
  See the libvirt documentation for detailed instructions about adding this libvirt attribute.

Note: This issue is specific to KVM on Power guests and does not occur when you use POWER9 bare metal or POWER9 PowerVM LPARs.
Linux kernel memory leak
Linux kernels older than release 4.17.17 contain a bug that causes kernel memory leaks in cgroup (community link). When pods in the host are restarted multiple times, the host can run out of kernel memory. This problem causes pod start failures and hung systems.
As shown in the following example, you can check your kernel core dump file and view the core stack:
[700556.898399] Call Trace:
[700556.898406] [<ffffffff8184bdb0>] ? bit_wait+0x60/0x60
[700556.898408] [<ffffffff8184b5b5>] schedule+0x35/0x80
[700556.898411] [<ffffffff8184e746>] schedule_timeout+0x1b6/0x270
[700556.898415] [<ffffffff810f90ee>] ? ktime_get+0x3e/0xb0
[700556.898417] [<ffffffff8184bdb0>] ? bit_wait+0x60/0x60
[700556.898420] [<ffffffff8184ad24>] io_schedule_timeout+0xa4/0x110
[700556.898422] [<ffffffff8184bdcb>] bit_wait_io+0x1b/0x70
[700556.898425] [<ffffffff8184b95f>] __wait_on_bit+0x5f/0x90
[700556.898429] [<ffffffff8119200b>] wait_on_page_bit+0xcb/0xf0
[700556.898433] [<ffffffff810c6de0>] ? autoremove_wake_function+0x40/0x40
[700556.898435] [<ffffffff81192123>] __filemap_fdatawait_range+0xf3/0x160
[700556.898437] [<ffffffff811921a4>] filemap_fdatawait_range+0x14/0x30
[700556.898439] [<ffffffff8119414f>] filemap_write_and_wait_range+0x3f/0x70
[700556.898444] [<ffffffff8129af08>] ext4_sync_file+0x108/0x350
[700556.898447] [<ffffffff812486de>] vfs_fsync_range+0x4e/0xb0
[700556.898449] [<ffffffff8124879d>] do_fsync+0x3d/0x70
[700556.898451] [<ffffffff81248a63>] SyS_fdatasync+0x13/0x20
[700556.898453] [<ffffffff8184f788>] entry_SYSCALL_64_fastpath+0x1c/0xbb
[700599.233973] mptscsih: ioc0: attempting task abort! (sc=ffff880fd344e100)
To work around the failures, you can restart the host. However, you might encounter the problem again. To avoid the problem, upgrade your Linux kernel to release 4.17.17 or later, which contains fixes for this kernel bug.
See Changing the cgroup driver to systemd on Red Hat Enterprise Linux on the IBM Cloud Private troubleshooting page for more information.
In an NSX-T environment, when you restart a master node, the management console becomes inaccessible.
In an NSX-T environment, when you restart a master node, the management console is inaccessible even though all the service pods are in a good state. This issue occurs because the iptables NAT rules that enable host port and pod communication through the host IP are not persistent. NSX-T does not support host ports, and IBM Cloud Private uses a host port for the management console.
To resolve the issue, run the following commands on all the master nodes. Use the network CIDR that you specified in the /<installation_directory>/cluster/config.yaml file.
iptables -t nat -N ICP-NSXT
iptables -t nat -A POSTROUTING -j ICP-NSXT
iptables -t nat -A ICP-NSXT ! -s <network_cidr> -d <network_cidr> -j MASQUERADE
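For example, assuming a hypothetical pod network CIDR of 10.1.0.0/16 (substitute the network_cidr value from your own config.yaml), the commands look like the following sketch:

```
# Hedged example with an illustrative CIDR; substitute your configured network_cidr.
iptables -t nat -N ICP-NSXT
iptables -t nat -A POSTROUTING -j ICP-NSXT
iptables -t nat -A ICP-NSXT ! -s 10.1.0.0/16 -d 10.1.0.0/16 -j MASQUERADE
```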
Logging ELK pods are in CrashLoopBackOff state
Logging ELK pods continue to appear in CrashLoopBackOff state after upgrading to the current version and increasing memory.
This is a known issue in Elasticsearch 5.5.1.
Note: If you have more than one data pod, repeat steps 1-8 for each pod. For example, logging-elk-data-0, logging-elk-data-1, or logging-elk-data-2.
Complete the following steps to resolve this issue.
1. Check the log to find the problematic file that contains the permission issue.
   java.io.IOException: failed to write in data directory [/usr/share/elasticsearch/data/nodes/0/indices/dT4Nc7gvRLCjUqZQ0rIUDA/0/translog] write permission is required
2. Get the IP address of the management node where the logging-elk-data-1 pod is running.
   kubectl -n kube-system get pods -o wide | grep logging-elk-data-1
3. Use SSH to log in to the management node.
4. Navigate to the /var/lib/icp/logging/elk-data directory.
   cd /var/lib/icp/logging/elk-data
5. Find all .es_temp_file files.
   find ./ -name "*.es_temp_file"
6. Delete all *.es_temp_file files that you found in step 5.
   rm -rf *.es_temp_file
7. Delete the old logging-elk-data-1 pod.
   kubectl -n kube-system delete pods logging-elk-data-1
8. Wait 3-5 minutes for the new logging-elk-data-1 pod to restart.
   kubectl -n kube-system get pods -o wide | grep logging-elk-data-1
Logs not working after logging pods are restarted
You might encounter the following problems:
- The Kibana web UI shows the Elasticsearch health status as red.
- The Elasticsearch client pod log messages indicate that Search Guard is not initialized. The same error repeats every few seconds. The messages resemble the following text:
  [2018-11-08T20:43:54,380][ERROR][c.f.s.a.BackendRegistry ] Not yet initialized (you may need to run sgadmin)
  [2018-11-08T20:43:54,487][ERROR][c.f.s.a.BackendRegistry ] Not yet initialized (you may need to run sgadmin)
  [2018-11-08T20:43:54,488][ERROR][c.f.s.a.BackendRegistry ] Not yet initialized (you may need to run sgadmin)
- If Vulnerability Advisor (VA) is installed, an error message that resembles the following text appears in your VA logs:
  2018-10-31 07:25:12,083 ERROR 229 <module>: Error: TransportError(503, u'Search Guard not initialized (SG11). See https://github.com/floragunncom/search-guard-docs/blob/master/sgadmin.md', None)
To resolve this issue, complete the following steps to run a Search Guard initialization job:
1. Save the existing Search Guard initialization job to a file.
   kubectl get job.batch/logging-elk-elasticsearch-tls-init -n kube-system -o yaml > sg-init-job.yaml
2. Edit the job file.
   - Remove everything under metadata.* except for the following parameters:
     - metadata.name
     - metadata.namespace
     - metadata.labels.*
   - Change metadata.name and spec.template.metadata.job-name to new names.
   - Remove spec.selector and spec.template.metadata.labels.controller-uid.
   - Remove status.*.
3. Save the file.
4. Run the job.
   kubectl apply -f sg-init-job.yaml
Deploying contents in the default namespace might not be secure
A default namespace is tagged to all of the predefined pod security policies, ranging from very restrictive to least restrictive. Because the least restrictive policy takes priority over a more restrictive one, the least restrictive is applied for pods created by the content or chart deployment. This might cause a security risk if you need a higher level of security on the content or chart that you are deploying.
Timeouts and blank screens when displaying more than 80 namespaces
If a cluster has a large number of namespaces (more than 80), you might see the following issues:
- The namespace overview page might time out and display a blank screen.
- The Chart deployment configuration page might time out and not load all the namespaces in the dropdown. Only the default namespace is shown for the deployment.
Required to create cluster roles for custom pod security policies before associating them to new namespaces
If you create a customized pod security policy, you must also create a cluster role for the custom pod security policy before you can correctly associate the policy with a namespace from the web console during namespace creation. If a pod security policy exists without a cluster role, the role binding between the pod security policy cluster role and the newly created namespace is not added. When this occurs, the restrictions of the pod security policy RBAC association are not enforced.
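A minimal sketch of creating such a cluster role follows; the cluster role name and pod security policy name are illustrative.

```
# Hedged sketch: create a cluster role that grants "use" on a hypothetical custom
# pod security policy, so that the policy can be bound to newly created namespaces.
kubectl create clusterrole my-custom-psp-clusterrole \
  --verb=use \
  --resource=podsecuritypolicies \
  --resource-name=my-custom-psp
```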
Encrypting cluster data network traffic with IPsec does not work on SLES 12 SP3 operating system
strongSwan version 5.3.3 or higher is necessary to deploy IPsec mesh configuration for cluster data network traffic encryption. In SUSE Linux Enterprise Server (SLES) 12 SP3, the default strongSwan version is 5.1.3, which is not suitable for IPsec mesh configuration.
A namespace is stuck in the Terminating state
If a Kubernetes API extension is not available, the resources that are managed by the extension cannot be deleted. Failure to delete the API extension causes namespace deletion to fail.
Complete the following steps to get a description of the APIs that are not deleted:
1. Run the following command to view the namespaces that are stuck in a Terminating state:
   kubectl get namespaces
2. To find the resources that are not deleted, run the following command:
   kubectl api-resources --verbs=list --namespaced -o name | xargs -n 1 kubectl get --show-kind --show-all --ignore-not-found -n <terminating-namespace>
3. If the previous command returns the following error message, continue by running the kubectl get APIService command with the information that you received:
   unable to retrieve the complete list of server APIs: <api-resource>/<version>: the server is currently unable to handle the request
   kubectl get APIService <version>.<api-resource>
   For example, run the following command for an API service that is named custom.metrics.k8s.io/v1beta1:
   kubectl get APIService v1beta1.custom.metrics.k8s.io
4. Get a description of the API service to continue to debug your API service. Run the following command:
   kubectl describe APIService <version>.<api-resource>
5. Make sure that the issue is resolved. Run the following command to verify that your namespace can be deleted:
   kubectl get namespace
If the issue is not resolved, see A namespace is stuck in the Terminating state to manually delete a terminating namespace.
Cloning an IBM Cloud Private worker node is not supported
IBM Cloud Private does not support cloning an existing IBM Cloud Private worker node. You cannot change the host name and IP address of a node on your existing cluster.
You must add a new worker node. For more information, see Adding an IBM Cloud Private cluster node.
Some Pods not starting or log TLS handshake errors
In some cases when you are using IP-IP tunneling, some of your Pods do not start or contain log entries that indicate TLS handshake errors. If you notice either of these issues, complete the following steps to resolve the issue:
1. Run the ifconfig command or the netstat command to view the statistics of the tunnel device. The tunnel device is often named tunl0.
2. Note the changes in the TX dropped count that is displayed when you run the ifconfig command or the netstat command.
   If you use the netstat command, enter a command similar to the following command:
   netstat --interface=tunl0
   The output should be similar to the following content:
   Kernel Interface table
   Iface  MTU    RX-OK  RX-ERR RX-DRP RX-OVR TX-OK  TX-ERR TX-DRP TX-OVR Flg
   tunl0  1300   904416 0      0      0      714067 0      806    0      ORU
   If you use the ifconfig command, run a command similar to the following command:
   ifconfig tunl0
   The output should be similar to the following content:
   tunl0: flags=193  mtu 1300
          inet 10.1.125.192  netmask 255.255.255.255
          tunnel   txqueuelen 1000  (IPIP Tunnel)
          RX packets 904377  bytes 796710714 (759.8 MiB)
          RX errors 0  dropped 0  overruns 0  frame 0
          TX packets 714034  bytes 125963495 (120.1 MiB)
          TX errors 0  dropped 806  overruns 0  carrier 0  collisions 0
3. Run the command again and note the change in the TX dropped count that is displayed when you run the ifconfig command, or in the TX-DRP count that is displayed when you run the netstat command.
   If the value is continuously increasing, there is an MTU issue. To resolve it, turn on tcp_mtu_probing and reduce the MTU value of the tunnel device.
4. Run the following commands to turn on tcp_mtu_probing:
   echo 1 > /proc/sys/net/ipv4/tcp_mtu_probing
   echo 1024 > /proc/sys/net/ipv4/tcp_base_mss
5. Add the following lines to the /etc/sysctl.conf file to make the settings permanent for future system restarts:
   net.ipv4.tcp_mtu_probing = 1
   net.ipv4.tcp_base_mss = 1024
6. Complete the following steps to change the Calico IP-IP tunnel MTU after it is deployed.
   1. Run the following command to edit the Calico settings file:
      kubectl edit daemonset calico-node -n kube-system
   2. Lower the value of the FELIX_IPINIPMTU variable to reduce the MTU. The default value is 1430.
      Note: All calico-node pods are restarted to enable the new MTU setting.
7. Apply these settings to the sysctl.conf file on all of the nodes in the cluster.
Installation can fail with a helm-api setup error
Installation can fail with the following errors during the initial deployment of the helm-api chart:
stderr: 'Error: secrets "rudder-secret" already exists'
You can view these errors in the install logs in the cluster/logs directory.
The errors occur because the Kubernetes secrets for rudder and helm-api are created in a pre-install hook. Problems such as network timeouts, incorrect installation configuration, or insufficient resources on the master node can prevent the use of the secrets that were created by the pre-install hook. When the deployment fails, the installer tries to deploy the chart four more times. On each retry, it fails when it tries to re-create the secrets.
To resolve this issue, complete the following steps:
1. Run the uninstall procedure to remove all components of the installation.
2. Verify that the installation settings are correctly configured, including the config.yaml file, the hosts file, and other settings.
3. Run the installation procedure again.
The cloudctl cm node commands may not work accurately. Use kubectl instead
Certain IBM Cloud Private CLI cluster management (cloudctl cm) commands, such as cloudctl cm nodes and other node commands, may not work accurately. These commands are deprecated and will be removed in a later release. Use kubectl instead.
For example, instead of cloudctl cm nodes, use kubectl get nodes.
Cannot get secret by using kubectl command when encryption of secret data at rest is enabled
When you enable encryption of secret data at rest and use the kubectl command to get a secret, you might sometimes not be able to get the secret. You might see the following error message in kube-apiserver:
Internal error occurred: invalid padding on input
This error occurs because kube-apiserver failed to decrypt the encrypted data in etcd. For more information about the issue, see Random "invalid padding on input" errors when attempting various kubectl operations.
To resolve the issue, delete the secret and re-create it. Use the following command:
kubectl -n <namespace> delete secret <secret>
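After the deletion, re-create the secret with the data that it previously held. A minimal sketch follows; the secret name, key, and value are illustrative, and your secret might instead be created from files or a YAML definition.

```
# Hedged sketch: re-create the deleted secret as a generic secret with a literal key/value.
kubectl -n <namespace> create secret generic <secret> --from-literal=<key>=<value>
```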
For more information about encrypting secret data at rest, see Encrypting Secret Data at Rest.
Vulnerability Advisor cannot scan unsupported container images
Container images that are not supported by the Vulnerability Advisor fail the security scan.
The Security Scan column on the Container Images page in the management console displays Failed. When you select the failed container image name to view more details, zero issues are detected.