Troubleshooting Secure Service Container for IBM Cloud Private

When you run into problems with Secure Service Container for IBM Cloud Private, you can refer to the following information:

Some known issues

You can refer to the following information if you have problems using the Secure Service Container for IBM Cloud Private offering.

Secure Service Container for IBM Cloud Private command line tool failed when configuring IBM Cloud Private nodes

When you run the Secure Service Container for IBM Cloud Private command line tool to create IBM Cloud Private nodes, you might encounter failures that prevent the procedure from continuing.

To resolve this issue, you can take the following steps on the x86 or Linux on Z server:

  1. Remove the config/cluster-status.yaml file manually.
  2. Run the uninstall command to clean up the stale entries from the failed installation. Both steps are combined in the example after this list.
    docker run --network=host --rm -it -v $(pwd)/config:/ssc4icp-cli-installer/config ibmzcontainers/ssc4icp-cli-installer:1.1.0.3 uninstall
    
  3. Run the command line tool again as instructed in the Creating IBM Cloud Private nodes by using the command line tool topic.
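
For example, assuming that you run the commands from the installation directory that contains the config directory, steps 1 and 2 look like the following:

# Step 1: remove the stale cluster status file.
rm -f config/cluster-status.yaml

# Step 2: clean up the entries that are left over from the failed installation.
docker run --network=host --rm -it -v $(pwd)/config:/ssc4icp-cli-installer/config \
  ibmzcontainers/ssc4icp-cli-installer:1.1.0.3 uninstall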

IBM Cloud Private installation failed with the host unresolved error

During the IBM Cloud Private installation, you might encounter the following error message:

TASK [check : Validating Hostname is resolvable] *******************************
fatal: [192.168.60.3]: FAILED! => {"changed": false, "msg": "Please configure your hostname to resolve to an externally reachable IP"}

This error occurs because the hostname of each IBM Cloud Private node cannot be resolved during the IBM Cloud Private installation. Log in to each IBM Cloud Private node by using the SSH utility and update the /etc/hosts file by following the instructions in Configuring your cluster.
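
For example, the /etc/hosts file on each node might contain entries like the following. The IP addresses and hostnames here are illustrative; use the values from your own cluster configuration.

127.0.0.1       localhost
192.168.60.3    worker-15001
192.168.60.4    proxy-15001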

HTTP request error when accessing the IBM Cloud Private console right after the installation

If you access the IBM Cloud Private cluster console immediately after the IBM Cloud Private installation completes, you might see an error like the following in your browser:

Error making request: Error: getaddrinfo EAI_AGAIN platform-identity-provider:4300 POST http://platform-identity-provider:4300/v1/auth/token HTTP/1.1 Accept: application/json Content-Type: application/json {"client_id":"34ddcf35f1ef7f42a23678feb8b96e8b","client_secret":
"56dbb5f1a355082692fca4b91a7a8eea","code":"WAAvWJXE8sEVUWYhERlwedeC3xA0Jv",
"redirect_uri":"https://10.152.150.205:8443/auth/liberty/callback","grant_type":"authorization_code","scope":"openid email profile"} Error: getaddrinfo EAI_AGAIN platform-identity-provider:4300

This error occurs because the UI services of IBM Cloud Private are not yet fully up and running; they take some time to be ready. Wait a few minutes after the IBM Cloud Private installation completes, and then try accessing the console again.
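
If you prefer to check the state of the services instead of waiting, one way to do so, assuming that kubectl is configured on the master node, is to watch for the identity pods that appear in the error message:

# Wait until the platform-identity pods are in the Running state.
kubectl -n kube-system get pods | grep platform-identity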

"500 internal server error" when accessing the IBM Cloud Private master node

After reinstalling an IBM Cloud Private cluster and then attempting to access the IBM Cloud Private master node using port 8443, the error message "500 internal server error" might be displayed in your browser.

To resolve this issue, you can try one of the following options as a workaround.

An IBM Cloud Private node on the recycled Secure Service Container partition cannot ping all of the other IBM Cloud Private nodes in the IBM Cloud Private cluster using its network interface

After a Secure Service Container partition recycle (including a CEC recycle), you might notice that any IBM Cloud Private node on this Secure Service Container partition cannot connect to other IBM Cloud Private nodes by using the IP addresses.

To resolve this issue, you must restart each IBM Cloud Private node on this Secure Service Container partition with the following command once the Secure Service Container partition is reactivated and back online.

shutdown -r now
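
After a node comes back up, you can verify the connectivity from that node, for example:

# Replace 192.168.60.3 with the IP address of another node in the cluster.
ping -c 3 192.168.60.3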

Worker or proxy nodes stop responding after the cluster has been running smoothly for a while or restarted

After the IBM Cloud Private cluster has been running for a while, or after you restart the cluster nodes for some reason, you might notice that the nodes on the Secure Service Container partition stop responding.

To resolve the issue, you can restart the cluster nodes on each Secure Service Container partition by using REST APIs, as shown in the following steps and in the combined script sketch after them. See the Secure Service Container for IBM Cloud Private System APIs for a full list of REST API endpoints.

  1. Generate the access token to the Secure Service Container by using the following command.

    curl --request POST --url https://<appliance_IP>/api/com.ibm.zaci.system/api-tokens \
    -H 'accept: application/vnd.ibm.zaci.payload+json' -H 'cache-control: no-cache' \
    -H 'content-type: application/vnd.ibm.zaci.payload+json;version=1.0' \
    -H 'zaci-api: com.ibm.zaci.system/1.0' --insecure \
    --data '{ "kind" : "request", "parameters" : { "user" : "<master_id>", "password" : "<master_id_password>" } }'
    

    Where:

    • appliance_IP is the Secure Service Container IP address.
    • master_id is the Master user ID in the image profile (standard mode system) or the partition definition (DPM-enabled system) for the Secure Service Container partition.
    • master_id_password is the Master password in the same profile or definition for the partition.
  2. Restart each cluster node on the Secure Service Container partition by using the following command.

    curl -X POST https://<appliance_IP>/api/com.ibm.zaas/containers/<node_name>/restart \
    -H 'accept: application/vnd.ibm.zaci.payload+json' -H 'cache-control: no-cache' \
    -H 'content-type: application/vnd.ibm.zaci.payload+json;version=1.0' \
    -H 'zaci-api: com.ibm.zaci.system/1.0' \
    -H "authorization: Bearer <TOKEN>" --insecure
    

    Where:

    • appliance_IP is the Secure Service Container IP address.
    • TOKEN is the access token from the previous step.
    • node_name is the name of the worker or proxy node that you have to restart. You can get the node name from the config/cluster-configuration.yaml file on the x86 or Linux on Z server. For example, worker-15001 is the name of one worker node.
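
For illustration, the two steps can be combined into a small shell script. This is a minimal sketch: it assumes that the access token is returned in a parameters.token field of the JSON response and that the jq utility is installed, so adjust the field path and the placeholder values for your environment.

#!/bin/bash
# Illustrative values; replace them with your own.
APPLIANCE_IP=192.168.60.10          # Secure Service Container IP address
MASTER_ID='<master_id>'             # Master user ID
MASTER_PW='<master_id_password>'    # Master password
NODE_NAME=worker-15001              # node name from config/cluster-configuration.yaml

# Step 1: request an access token (assumes the token is at .parameters.token).
TOKEN=$(curl --silent --request POST "https://${APPLIANCE_IP}/api/com.ibm.zaci.system/api-tokens" \
  -H 'accept: application/vnd.ibm.zaci.payload+json' -H 'cache-control: no-cache' \
  -H 'content-type: application/vnd.ibm.zaci.payload+json;version=1.0' \
  -H 'zaci-api: com.ibm.zaci.system/1.0' --insecure \
  --data "{ \"kind\" : \"request\", \"parameters\" : { \"user\" : \"${MASTER_ID}\", \"password\" : \"${MASTER_PW}\" } }" \
  | jq -r '.parameters.token')

# Step 2: restart the node by using the token.
curl --request POST "https://${APPLIANCE_IP}/api/com.ibm.zaas/containers/${NODE_NAME}/restart" \
  -H 'accept: application/vnd.ibm.zaci.payload+json' -H 'cache-control: no-cache' \
  -H 'content-type: application/vnd.ibm.zaci.payload+json;version=1.0' \
  -H 'zaci-api: com.ibm.zaci.system/1.0' \
  -H "authorization: Bearer ${TOKEN}" --insecure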

Gateway on the proxy node is not configured automatically after the CLI installation

After the Secure Service Container for IBM Cloud Private CLI installation completes, you might notice that the gateway on the proxy node is not added even though the gateway was specified in the config/ssc4icp-config.yaml file. For example,

...
proxyIPConfig:
   - ipaddress: "172.16.0.4"
     subnet: "172.16.0.0/24"
     gateway: "172.16.0.1"
     parent: "vxlan0f300.1121"
...

To resolve the issue, you can manually add the gateway into the routing table on the proxy node by using the following command:

route add -net <target_subnet> gw <gateway_IP_address>

Where:

  • target_subnet is the destination IP subnet that you want to establish connectivity with, for example, the subnet where the client laptop is located. The value is 172.16.0.0/24 in the ssc4icp-config.yaml example file.
  • gateway_IP_address is the gateway that is currently accessible from the proxy node and provides this routing connectivity. The value is 172.16.0.1 in the ssc4icp-config.yaml example file.
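
Using the values from the ssc4icp-config.yaml example above, the command is:

route add -net 172.16.0.0/24 gw 172.16.0.1

On systems that provide iproute2 instead of the legacy route utility, the equivalent command is ip route add 172.16.0.0/24 via 172.16.0.1.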

DNS server loopback issue during the IBM Cloud Private installation

During the installation of IBM Cloud Private, you might encounter the following error message when the IBM Cloud Private installer validates the DNS server.

fatal: [x.x.x.x] => A loopback IP is used in your DNS server configuration. For more details, see https://ibm.biz/dns-fails.

The problem might be caused by a loopback IP (127.0.0.1 or 127.0.1.1) being used as the DNS server, or by a missing /etc/resolv.conf file on the cluster node that is specified in the error message.

To resolve the issue, you can set the master node IP addresses as the name servers in the /etc/resolv.conf file on the x86 or Linux on Z server. For example,

...
nameserver 192.168.15.240
nameserver 10.152.151.100
# nameserver 127.0.0.53
...

False error messages about the storage requirements when installing IBM Cloud Private

When you install IBM Cloud Private, you might see the following error messages about the disk space validation.

TASK [check : Validating /var directory disk space on worker, proxy and custom nodes] **********************************************************************************
fatal: [192.168.19.252]: FAILED! => changed=true
cmd: |
disk=$(df -lh -BG --output=avail /var | sed '1d' | grep -oP '\d+')
[[ $disk -ge 110 ]] || (echo "/var directory available disk space ${disk}GB, it should be greater than or equal to 110 GB" 1>&2 && exit 1)
...
stdout_lines: <omitted>
...ignoring

If you allocated sufficient disk space in the ssc4icp-config.yaml file when you created the cluster nodes, the message can be safely ignored because this is a known issue of IBM Cloud Private.
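
To confirm that the node actually has enough space, you can run the same check that the installer uses; the following command is taken from the validation task in the error output above.

df -lh -BG --output=avail /var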

The Catalog page is empty after the IBM Cloud Private cluster is started

After you log in to the IBM Cloud Private console, you might notice that the Catalog page is empty and an Error loading charts message pops up. If you check the logs of the DNS pod by using the command kubectl logs <dns_pod_name> -n kube-system, where <dns_pod_name> is the name of the DNS pod, you might see the following errors in the logs.

2018/12/04 15:19:20 [ERROR] 2 ... A: unreachable backend: read udp 127.0.0.1:40613->127.0.0.53:53: i/o timeout
2018/12/04 15:19:20 [ERROR] 2 ... A: unreachable backend: read udp 127.0.0.1:54930->127.0.0.53:53: i/o timeout
2018/12/04 15:19:20 [ERROR] 2 ... A: unreachable backend: read udp 127.0.0.1:54521->127.0.0.53:53: i/o timeout
2018/12/04 15:19:20 [ERROR] 2 ... A:  unreachable backend: read udp 127.0.0.1:45653->127.0.0.53:53: i/o timeout
2018/12/04 15:19:20 [ERROR] 2 ... A:  unreachable backend: read udp 127.0.0.1:60436->127.0.0.53:53: i/o timeout

The problem might be caused by DNS names in the Kubernetes cluster not being resolved correctly.

To resolve the issue, try the following steps on the x86 or Linux on Z server.

  1. Open the /etc/systemd/system/kubelet.service file, and add the --resolv-conf=/run/systemd/resolve/resolv.conf \ line to the kubelet arguments in the file (see the sketch after these steps).
  2. Restart the docker and kubelet services.
  3. Run the command kubectl get pods to ensure all the pods are in the Running state, and then log into the IBM Cloud Private console.
  4. Click the Helm Repositories option in the Manage menu, and then click the Sync repositories button.
  5. Go to the Catalog page to check if all the helm charts are available.
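
For reference, here is a trimmed sketch of the kubelet.service file after step 1. The binary path and the other arguments are illustrative and differ from system to system; only the --resolv-conf line is the actual addition.

[Service]
ExecStart=/opt/kubernetes/hyperkube kubelet \
  --resolv-conf=/run/systemd/resolve/resolv.conf \
  ...

After you edit the unit file, the restart in step 2 can be done as follows:

systemctl daemon-reload
systemctl restart docker kubelet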

Actions to perform after the restart of the Secure Service Container for IBM Cloud Private components

The Secure Service Container for IBM Cloud Private offering consists of different components, and you might need to perform specific actions on each type of component after the component is restarted.

"502 Bad Gateway" when accessing the application on IBM Cloud Private v3.1.2

After you deploy an application on IBM Cloud Private version 3.1.2, you might notice the "502 Bad Gateway" error message when accessing the application. If you check the logs, you might see error messages like the following from the nginx-ingress-controller pod.

2019/03/27 07:11:29 [error] 58#58: *48168 upstream prematurely closed connection while reading response header from upstream, client: 127.0.0.1, server: _, request: "GET /favicon.ico HTTP/2.0", upstream: "http://10.1.125.134:9443/favicon.ico", host: "9.20.36.107", referrer: "https://9.20.36.107/"

The problem is caused by the new annotation prefix nginx.ingress.kubernetes.io, which is introduced in NGINX ingress version 0.9.0, a part of IBM Cloud Private version 3.1.2. You can refer to Enable Ingress Controller to use a new annotation prefix for the details.

To resolve the problem, try the following steps:

  1. Apply the workaround as described in Enable Ingress Controller to use a new annotation prefix.

  2. Check the ingress resource of the application to ensure that the nginx.ingress.kubernetes.io/backend-protocol: HTTPS annotation is present by using the following command.

    kubectl get ingress oldone-ibm-open-liberty -o yaml
    
  3. If the annotation is not displayed in the command result, add the annotation to the ingress resource by using the following command. A sketch of the annotated resource follows these steps.
    kubectl edit ingress oldone-ibm-open-liberty -o yaml
    
  4. Access the application again.
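
For orientation, here is a trimmed sketch of the annotated ingress resource. The resource name oldone-ibm-open-liberty comes from the example commands above; the other fields are illustrative.

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: oldone-ibm-open-liberty
  annotations:
    nginx.ingress.kubernetes.io/backend-protocol: "HTTPS"
spec:
  ...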

OCI runtime error after a GlusterFS node is restarted

After you restart one of the GlusterFS nodes, you might encounter the OCI runtime error with the glusterfs daemon, for example, when you run the command kubectl -n kube-system exec storage-glusterfs-glusterfs-daemonset-8n9g6 -- ll /var/lib/glusterd.

The error message might be similar to the following:

OCI runtime exec failed: exec failed: container_linux.go:348: starting container process caused "exec: \"ll\": executable file not found in $PATH": unknown
command terminated with exit code 126

The problem is caused by missing configurations in the glusterfs daemonset.

To resolve the problem, complete the following tasks on the master node server.

  1. Retrieve the glusterfs daemonset name.
    kubectl get daemonset --all-namespaces | grep glusterfs
    
  2. Open the daemonset editor by using the glusterfs daemonset name. For example,
    kubectl edit daemonset <gluster_daemonset_name> -n kube-system
    
  3. Apply the following changes for the daemonset.
    1. Update the livenessProbe and readinessProbe sections with the following configuration.
      failureThreshold: 50
      initialDelaySeconds: 40
      periodSeconds: 25
      successThreshold: 1
      timeoutSeconds: 3
      
    2. Add the following lines into the volumeMounts section.
      - mountPath: /lib/modules
        name: kernel-modules
        readOnly: true
      - mountPath: /etc/ssl
        name: glusterfs-ssl
        readOnly: true
      
    3. Add the following lines into the volumes section.
      - name: glusterfs-ssl
        hostPath:
          path: "/etc/ssl"
      - name: kernel-modules
        hostPath:
          path: "/lib/modules"
      
  4. Save the changes and exit.

    The changes are applied automatically. By monitoring the output of the command kubectl -n kube-system get po -o wide | grep glusterfs, you can see the glusterfs daemonset pods terminating and coming up gradually.