Administering Kubernetes

Provides a quick overview of commonly used commands to troubleshoot a cluster with direct application to IBM Financial Crimes Insight for Watson, Private. For in-depth Kubernetes information, see Kubernetes product information.

Examining the current state

To find out the current state of the cluster, the following kubectl commands are available:

Examining node health

A node is the Kubernetes term for either a single virtual machine or a physical server. Kubernetes has a master node that orchestrates the other nodes, which are referred to as worker nodes. The worker nodes contain pods, and the pods run the Docker containers. All user access is through the master node; thus, the URL to access FCI includes the fully qualified host name of the master node.
Use the following command to find out the health of the nodes:
kubectl get nodes

Command output shows all nodes in the cluster, and the current state of the nodes:

NAME                                 STATUS    AGE       VERSION
fcikuber-mst.rtp.raleigh.ibm.com     Ready     18h       v1.7.3
fcinode92.rtp.raleigh.ibm.com        Ready     18h       v1.7.3
fcinode93.rtp.raleigh.ibm.com        Ready     18h       v1.7.3
fcinode94.rtp.raleigh.ibm.com        Ready     18h       v1.7.3
fcinode95.rtp.raleigh.ibm.com        Ready     18h       v1.7.3

Output that specifies a status other than Ready indicates a problem. Nodes can be taken in and out of service, and the status reflects those changes. Resource pressure on a node is also reflected in the status. For example, OutOfDisk indicates that the file system on the worker node is full. Kubernetes begins moving pods off the node until the situation is fixed and the status of the node returns to Ready.

When the Kubernetes cluster is initially created, nodes start in NotReady state. After the node is ready to accept jobs, the status automatically moves to Ready state.
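Because a healthy cluster shows every node as Ready, a quick filter can surface only the problem nodes. The following is a minimal sketch, assuming kubectl is on the PATH and configured for the cluster:

```shell
# Print any node whose STATUS column is not Ready; no output means all
# nodes are healthy. A cordoned node (Ready,SchedulingDisabled) is also printed.
kubectl get nodes --no-headers | awk '$2 != "Ready" { print $1, $2 }'
```

This is convenient in a cron job or a watch loop, because the empty-output case makes "all clear" easy to detect.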

To get more details about a node, enter the following command:

kubectl describe node fcinode95.rtp.raleigh.ibm.com

The output of the command provides details about the particular node:

Name:                   fcinode95.rtp.raleigh.ibm.com
Role:
Labels:                 beta.kubernetes.io/arch=amd64
                        beta.kubernetes.io/os=linux
                        kubernetes.io/hostname=fcinode95.rtp.raleigh.ibm.com
Annotations:            flannel.alpha.coreos.com/backend-data={"VtepMAC":"f6:e3:ed:e5:12:91"}
                        flannel.alpha.coreos.com/backend-type=vxlan
                        flannel.alpha.coreos.com/kube-subnet-manager=true
                        flannel.alpha.coreos.com/public-ip=9.37.30.95
                        node.alpha.kubernetes.io/ttl=0
                        volumes.kubernetes.io/controller-managed-attach-detach=true
Taints:                 <none>
CreationTimestamp:      Thu, 28 Sep 2017 14:54:02 -0400
Conditions:
  Type                  Status  LastHeartbeatTime                       LastTransitionTime                      Reason                          Message
  ----                  ------  -----------------                       ------------------                      ------                          -------
  OutOfDisk             False   Fri, 29 Sep 2017 09:56:01 -0400         Thu, 28 Sep 2017 14:54:02 -0400         KubeletHasSufficientDisk        kubelet has sufficient disk space available
  MemoryPressure        False   Fri, 29 Sep 2017 09:56:01 -0400         Thu, 28 Sep 2017 14:54:02 -0400         KubeletHasSufficientMemory      kubelet has sufficient memory available
  DiskPressure          False   Fri, 29 Sep 2017 09:56:01 -0400         Thu, 28 Sep 2017 14:54:02 -0400         KubeletHasNoDiskPressure        kubelet has no disk pressure
  Ready                 True    Fri, 29 Sep 2017 09:56:01 -0400         Thu, 28 Sep 2017 14:54:23 -0400         KubeletReady                    kubelet is posting ready status
Addresses:
  InternalIP:   9.37.30.95
  Hostname:     fcinode95.rtp.raleigh.ibm.com
Capacity:
 cpu:           4
 memory:        8175660Ki
 pods:          110
Allocatable:
 cpu:           4
 memory:        8073260Ki
 pods:          110
System Info:
 Machine ID:                    cd344ed1a09e4df484ef7248fc04281a
 System UUID:                   4234C0FE-FE90-4C08-4D87-4C031A9D7331
 Boot ID:                       b4f3b1f5-ddd6-461b-807d-82af46a20f4c
 Kernel Version:                3.10.0-514.26.2.el7.x86_64
 OS Image:                      CentOS Linux 7 (Core)
 Operating System:              linux
 Architecture:                  amd64
 Container Runtime Version:     docker://Unknown
 Kubelet Version:               v1.7.3
 Kube-Proxy Version:            v1.7.3
PodCIDR:                        10.244.2.0/24
ExternalID:                     fcinode95.rtp.raleigh.ibm.com
Non-terminated Pods:            (4 in total)
  Namespace                     Name                                           CPU Requests     CPU Limits      Memory Requests Memory Limits
  ---------                     ----                                           ------------     ----------      --------------- -------------
  default                       db2-202245354-rl7gr                            0 (0%)           0 (0%)          0 (0%)          0 (0%)
  kube-system                   kube-flannel-ds-h9c9z                          0 (0%)           0 (0%)          0 (0%)          0 (0%)
  kube-system                   kube-proxy-knjh2                               0 (0%)           0 (0%)          0 (0%)          0 (0%)
  kube-system                   monitoring-grafana-1219411114-9bts0            0 (0%)           0 (0%)          0 (0%)          0 (0%)
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  CPU Requests  CPU Limits      Memory Requests Memory Limits
  ------------  ----------      --------------- -------------
  0 (0%)        0 (0%)          0 (0%)          0 (0%)
Events:         <none>
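When only the node conditions are of interest, the describe output above can be trimmed with a standard text filter. A sketch, using the example node name from above (substitute your own):

```shell
# Print the block from the Conditions: heading up to and including the
# Addresses: line, which follows it in the describe output.
kubectl describe node fcinode95.rtp.raleigh.ibm.com | sed -n '/^Conditions:/,/^Addresses:/p'
```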

Examining pod health

FCI applications are deployed into pods in the default namespace. For FCI, a pod is simply a single Docker container.

The following command provides a list of pods that are currently installed and their status:

kubectl get pods

Output is similar to the following:

NAME                         READY     STATUS    RESTARTS   AGE
analytics-2179514482-2kf6k   1/1       Running   0          18h
db2-202245354-rl7gr          1/1       Running   0          18h
mq-2553988797-6gmm4          1/1       Running   0          18h
solution-150650449-n1nnz     1/1       Running   0          18h

The READY and STATUS fields are important. Pods go through a lifecycle, and some of the steps are lengthy. Until a pod has a STATUS of Running and a READY field of at least 1/1, the application is not up. Kubernetes monitors pod status and does not route traffic to pods that are not Ready.
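The READY check above can be automated by comparing the two halves of the n/n value. A minimal sketch:

```shell
# List pods whose READY column is short of its total (for example 0/1);
# Kubernetes does not route traffic to these pods.
kubectl get pods --no-headers | awk '{ split($2, r, "/"); if (r[1] != r[2]) print $1, $2, $3 }'
```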

Situations occur where you might need to enter a running Docker container to perform command line operations. For example, during problem determination, you might want to see if a process is running (ps -ef), or you might turn on tracing for a software component, such as WebSphere Liberty. The syntax for entering a Docker container (pod) is as follows:
kubectl exec -it name_of_pod -- /bin/bash
For example, to enter the analytics Docker container, log in as root on the master node and enter the following command:
kubectl exec -it analytics-2179514482-2kf6k -- /bin/bash

Then, while in the container, enter commands at the Linux prompt to troubleshoot the specific product that is running in the container.

You can also run a command while logged in to the master node to debug containers without entering the container. For example, you can use the following syntax to run any command inside a container:
 
kubectl exec -it name_of_pod -- /bin/bash -c "any_linux_string_of_commands"
For example, to see if WebSphere MQ processes are running in the WebSphere MQ pod, enter this command:
kubectl exec -it mq-2553988797-6gmm4 -- /bin/bash -c "ps -ef | grep amq"
Output similar to the following is displayed:
mqm 213 0 0 Feb09 ? 00:00:22 /opt/mqm/bin/amqzxma0 -m FCIQM -u mqm
mqm 221 213 0 Feb09 ? 00:00:03 /opt/mqm/bin/amqzfuma -m FCIQM
mqm 226 213 0 Feb09 ? 00:00:14 /opt/mqm/bin/amqzmuc0 -m FCIQM
mqm 245 213 0 Feb09 ? 00:00:54 /opt/mqm/bin/amqzmur0 -m FCIQM
........
Stopping and starting the Liberty solution server

You might notice that the Liberty solution server is not running (0/1 is displayed in the READY column of the kubectl get pods output):

kubectl get pods

NAME                             READY     STATUS    RESTARTS   AGE
fci-analytics-3232108353-8g3rt   1/1       Running   0          53d
fci-messaging-409855742-ngzw5    1/1       Running   0          53d
fci-primaryds-3918237058-p1d7v   1/1       Running   0          53d
fci-solution-1268297780-v0jpp    0/1       Running   0          53d

Use the following commands to get the status of, stop, and start the Liberty solution server.

  • To get server status and return the server's process ID:
    kubectl exec -it fci-solution-1268297780-v0jpp -- /opt/ibm/wlp/bin/server status solutionServer
  • To stop the Liberty solution server:
    kubectl exec -it fci-solution-1268297780-v0jpp -- /opt/ibm/wlp/bin/server stop solutionServer
    Note: To see if the Liberty solution server stopped, change to the logs directory to view the messages.log and console.log files:
    cd /fci-exports/fci-solution/servers/solutionServer/logs
    If the Liberty solution server did not stop, go to the Kubernetes container and locate the Liberty Java process. Enter a kill -9 command on the Java process using the PID obtained from the server status solutionServer command. For example:
    kubectl exec -it fci-solution-1268297780-v0jpp -- /bin/bash
    wlpadmin@fci-solution-1268297780-v0jpp:/$ kill -9 11306
    To see if the Java process for the Liberty solution server stopped:
    wlpadmin@fci-solution-1268297780-v0jpp:/$ ps -ef | grep java
  • To start the Liberty solution server:
    kubectl exec -it fci-solution-1268297780-v0jpp -- /opt/ibm/wlp/bin/server start solutionServer --clean
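The three commands above differ only in the server action, so a small wrapper can reduce typing during problem determination. This is a convenience sketch, not part of the product; the pod name is the one from the example output, so substitute your own:

```shell
#!/bin/sh
# Hypothetical helper that runs a Liberty server command inside the solution pod.
POD="fci-solution-1268297780-v0jpp"

liberty() {
  # $1 is one of: status, stop, start
  kubectl exec -it "$POD" -- /opt/ibm/wlp/bin/server "$1" solutionServer
}
```

For example, liberty status then liberty stop. Note that the product's documented start command also passes --clean, which this sketch omits.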
Notes about specific pods:
  • The WebSphere MQ pod is in Init:0/1 state while files that are needed by the analytics and solution pods are copied into the WebSphere MQ persistent volume.
  • Both the analytics and solution pods show 0/1 in the READY column with a STATUS of Running while the WebSphere Liberty server starts. Kubernetes monitors both servers until they respond to browser traffic before declaring the pods Ready. Several factors can affect the start time of the pods. From the time the install.sh script is run until the platform is ready for a user to log in is approximately ten minutes.
  • Kubernetes also monitors the availability of the WebSphere MQ pod as part of the liveness probe of the analytics and solution pods. If the WebSphere MQ pod becomes unavailable for more than 5 seconds, the analytics and solution pods are stopped and restarted. While WebSphere MQ, analytics, and the FCI solution restart, the system is unavailable.
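The startup window described above can be waited out with a simple poll. A rough sketch, assuming kubectl is configured on the master node:

```shell
# Loop until every pod in the default namespace is fully Ready.
# The awk program exits nonzero as soon as any READY column is short of its total.
until kubectl get pods --no-headers | awk '{ split($2, r, "/"); if (r[1] != r[2]) exit 1 }'; do
  echo "waiting for pods to become Ready..."
  sleep 10
done
```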
To get more information about a specific pod, use the describe pod command:
kubectl describe pod solution-150650449-n1nnz

It is important to review the events section to see if the pod is in a normal starting state, or if something went wrong and needs to be addressed.

Name:           solution-150650449-n1nnz
Namespace:      default
Node:           fcinode93.rtp.raleigh.ibm.com/9.37.30.93
Start Time:     Thu, 28 Sep 2017 15:08:21 -0400
Labels:         app=solution
                pod-template-hash=150650449
Annotations:    kubernetes.io/created-by={"kind":"SerializedReference","apiVersion":"v1","reference":{"kind":"ReplicaSet","namespace":"default","name":"solution-150650449","uid":"64bd7878-a480-11e7-8963-005056b446a4"...
Status:         Running
IP:             10.244.3.5
Created By:     ReplicaSet/solution-150650449
Controlled By:  ReplicaSet/solution-150650449
Init Containers:
  init-mqservice:
    Container ID:       docker://eb3459ab60f0dd7855dcb7b6ba2957328dc7dadec32a3c1424b8650d8b91fa60
    Image:              giantswarm/tiny-tools
    Image ID:           docker-pullable://giantswarm/tiny-tools@sha256:8e6739b0083c8d67e0ad8aef98c60cd84881698e87bd23c102b1f782894c7bfe
    Port:               <none>
    Command:
      fish
      -c
      echo "waiting for fci-messaging..."; while true; set endpoints (curl -s --cacert /var/run/secrets/kubernetes.io/serviceaccount/ca.crt --header "Authorization: Bearer "(cat /var/run/secrets/kubernetes.io/serviceaccount/token) https://kubernetes.default.svc/api/v1/namespaces/default/endpoints/fci-messaging); echo $endpoints | jq "."; if test (echo $endpoints | jq -r ".subsets[]?.addresses // [] | length") -gt 0; exit 0; end; echo "waiting...";sleep 1; end
    Args:
      default
      fci-messaging
    State:              Terminated
      Reason:           Completed
      Exit Code:        0
      Started:          Thu, 28 Sep 2017 15:08:22 -0400
      Finished:         Thu, 28 Sep 2017 15:11:50 -0400
    Ready:              True
    Restart Count:      0
    Environment:        <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from init-container-serviceaccount-token-l73n3 (ro)
  init-db2service:
    Container ID:       docker://cb27cf5f3e550a5858e3f6482142394a4a2b68f025ce6c473365865511c97853
    Image:              giantswarm/tiny-tools
    Image ID:           docker-pullable://giantswarm/tiny-tools@sha256:8e6739b0083c8d67e0ad8aef98c60cd84881698e87bd23c102b1f782894c7bfe
    Port:               <none>
    Command:
      fish
      -c
      echo "waiting for fci-primaryds..."; while true; set endpoints (curl -s --cacert /var/run/secrets/kubernetes.io/serviceaccount/ca.crt --header "Authorization: Bearer "(cat /var/run/secrets/kubernetes.io/serviceaccount/token) https://kubernetes.default.svc/api/v1/namespaces/default/endpoints/fci-primaryds); echo $endpoints | jq "."; if test (echo $endpoints | jq -r ".subsets[]?.addresses // [] | length") -gt 0; exit 0; end; echo "waiting...";sleep 1; end
    Args:
      default
      fci-primaryds
    State:              Terminated
      Reason:           Completed
      Exit Code:        0
      Started:          Thu, 28 Sep 2017 15:11:51 -0400
      Finished:         Thu, 28 Sep 2017 15:11:51 -0400
    Ready:              True
    Restart Count:      0
    Environment:        <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from init-container-serviceaccount-token-l73n3 (ro)
Containers:
  solution:
    Container ID:       docker://49506f18f0758c361a3e74c5d2f3d2761e17a3fda4ba116f6cfe2d0986bbf807
    Image:              pltdockerrel.rtp.raleigh.ibm.com:5000/ibmcom/fci-solution:1.0.1
    Image ID:           docker-pullable://pltdockerrel.rtp.raleigh.ibm.com:5000/ibmcom/fci-solution@sha256:d7c6696330ef7518caf750faa2394b498ca1fdcd1ff2f6c8c95b08a298f4f8cb
    Ports:              9080/TCP, 9443/TCP
    Command:
      /fci-solution/solution-kube-start.sh
    State:              Running
      Started:          Thu, 28 Sep 2017 15:11:53 -0400
    Ready:              True
    Restart Count:      0
    Liveness:           exec [/fci-solution/solution-kube-live.sh] delay=5s timeout=1s period=5s #success=1 #failure=3
    Readiness:          http-get https://:9443/console delay=5s timeout=1s period=5s #success=1 #failure=3
    Environment Variables from:
      platform-config   ConfigMap       Optional: false
    Environment:        <none>
    Mounts:
      /fci-shared from shared-log-persistent-storage (rw)
      /fci-solution from solution-log-persistent-storage (rw)
      /opt/mqm from solution-mq-persistent-storage (rw)
      /var/mqm from solution-mq-persistent-storage (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from init-container-serviceaccount-token-l73n3 (ro)
Conditions:
  Type          Status
  Initialized   True
  Ready         True
  PodScheduled  True
Volumes:
  solution-log-persistent-storage:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  solution-log-claim
    ReadOnly:   false
  solution-mq-persistent-storage:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  mq-log-claim
    ReadOnly:   false
  shared-log-persistent-storage:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  shared-liberty-claim
    ReadOnly:   false
  init-container-serviceaccount-token-l73n3:
    Type:       Secret (a volume populated by a Secret)
    SecretName: init-container-serviceaccount-token-l73n3
    Optional:   false
QoS Class:      BestEffort
Node-Selectors: <none>
Tolerations:    node.alpha.kubernetes.io/notReady:NoExecute for 300s
                node.alpha.kubernetes.io/unreachable:NoExecute for 300s
Events:         <none>

The following are common event messages:

  • Failed to pull image "…" 
    Indicates that the container image cannot be pulled from the FCI Docker registry. Further details in the message indicate the nature of the problem. Verify the FCI Docker registry address carefully to make sure that it matches the actual configured FCI Docker registry. Error messages that are related to SSL handshaking indicate that the /etc/docker/daemon.json file for insecure registries was not updated properly.
  • Unable to mount volumes for pod
    This message was observed when the FCI was uninstalled and then reinstalled. To protect the data on the persistent volume, the volumes are configured to go to RETAIN state after the persistent volume claim is deleted. This action allows the data to be backed up before the volume is assigned to a new container. To make the volume available again, you must delete the PV record and then re-create it.
  • Readiness probe failed
    All of the pods in the platform have readiness probes associated with them that keep the pods from being marked Ready before they are able to process requests. The nature of the readiness probe varies by pod, but in all cases a failing readiness probe prevents the pod from becoming Ready. In some cases, the probe continues to fail even after the application itself is up.
    The following kubectl command provides access to a shell prompt inside the running container, where you can troubleshoot the readiness probe. Common causes are permissions problems, a file missing from the persistent volume, or a genuine problem with the process, in which case the readiness probe is working correctly. For example:
    kubectl -it exec pod_name -- /bin/bash
  • test: Missing argument at index 2

    This issue was observed intermittently on some clusters. According to Kubernetes issue #51881, a conflict exists between Kubernetes and Docker networking policies. To resolve this issue, run iptables -P FORWARD ACCEPT on every node in the cluster.
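The "Unable to mount volumes" case above calls for deleting and re-creating the PV record. A sketch of the sequence; the PV name and YAML file here are hypothetical, and the data on the volume should be backed up first:

```shell
# Make a Retained persistent volume reusable after a reinstall.
kubectl get pv                           # find the volume left in Released state
kubectl delete pv solution-log-pv        # delete the stale PV record (hypothetical name)
kubectl create -f solution-log-pv.yaml   # re-create the PV from its original definition
```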

Kubernetes system pods are in their own namespace, kube-system. To examine namespaces and the status of system pods, the following commands are available:

kubectl get namespaces
kubectl -n kube-system get pods

Output of kubectl get namespaces is similar to the following:

NAME          STATUS    AGE
default       Active    19h
kube-public   Active    19h
kube-system   Active    19h

Output of kubectl -n kube-system get pods is similar to the following:

NAME                                               READY     STATUS    RESTARTS   AGE
calico-etcd-5970v                                  1/1       Running   2          89d
calico-kube-controllers-2305157444-42wg1           1/1       Running   3          89d
calico-node-b9ld2                                  2/2       Running   6          89d
default-http-backend-654799587-2trr5               1/1       Running   2          89d
etcd-fciva103u01.fyre.ibm.com                      1/1       Running   2          89d
kube-apiserver-fciva103u01.fyre.ibm.com            1/1       Running   2          89d
kube-controller-manager-fciva103u01.fyre.ibm.com   1/1       Running   3          89d
kube-dns-2168611686-5l73b                          3/3       Running   6          89d
kube-proxy-hstlr                                   1/1       Running   2          89d
kube-scheduler-fciva103u01.fyre.ibm.com            1/1       Running   3          89d
nginx-ingress-controller-1824274788-wmf71          1/1       Running   2          89d
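The same READY check used for application pods applies to system pods. A sketch that also flags unusually high restart counts (the threshold of 10 is an arbitrary choice, not a product value):

```shell
# Flag any kube-system pod that is not fully Ready or has restarted
# more than 10 times.
kubectl -n kube-system get pods --no-headers | awk '{ split($2, r, "/"); if (r[1] != r[2] || $4 + 0 > 10) print $1, $2, $4 }'
```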