Troubleshooting for Operational Decision Manager on Certified Kubernetes

The first step in troubleshooting an Operational Decision Manager deployment on Certified Kubernetes is triage. In most cases, the problem is likely to be one of four things: your pods, your replicas, your services, or your database.

About this task

On Kubernetes, Operational Decision Manager services are run in pods.

A pod is a group of one or more containers with shared storage and network. For Operational Decision Manager, each pod wraps a single Docker container.
Replica sets are used by deployments as a mechanism to create, delete, and update pods. Ordinarily, you do not have to manage the replica sets that deployments create, because deployments own and manage them. You can specify how many pods to run concurrently by setting .spec.replicas in the deployment.
Services balance the load across a set of pods.
The database is where the data is persisted.

Procedure

The Kubernetes command-line tool, kubectl, allows you to run commands against Kubernetes clusters.

Display the status of the named release.
```
helm status RELEASE_NAME
```
List all of the pods that run in your cluster.
To target the pods in a specific namespace, use the namespace flag.
```
$ kubectl get pods -n NAMESPACE
```
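As a rough aid for reading that listing, the following sketch filters it down to the pods that deserve a closer look. It assumes the default table output of kubectl get pods, in which READY is the second column and STATUS is the third. The function name pods_needing_attention is hypothetical and is not part of kubectl or of Operational Decision Manager.

```shell
# Hypothetical helper: read the default table output of `kubectl get pods`
# on standard input and print "NAME READY STATUS" for each pod that is
# either not in the Running state or is Running but not fully Ready.
# Pods in the Completed state (finished jobs) are skipped.
pods_needing_attention() {
  awk 'NR > 1 && $3 != "Completed" {
         split($2, ready, "/")            # READY looks like "0/1"
         if ($3 != "Running" || ready[1] != ready[2]) print $1, $2, $3
       }'
}

# Usage (NAMESPACE is a placeholder, as in the command above):
#   kubectl get pods -n NAMESPACE | pods_needing_attention
```

An empty result suggests that every pod is either Running and Ready, or Completed.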
Check the current state and recent events of your pods to see whether they are all running.
You can get more targeted information on the pod by running the kubectl get and describe commands.
```
$ kubectl get pod POD_NAME --output=yaml
$ kubectl describe pod POD_NAME
```
1. If the pods are not created, check the status of the replica sets and inspect the events for one of them by running:
```
$ kubectl get replicaset -n NAMESPACE
$ kubectl describe replicaset -n NAMESPACE REPLICASET_NAME
```
  If you find an error similar to the following one:
```
unable to  validate against any security context constraint:  [spec.containers[0].securityContext.securityContext.runAsUser:
Invalid  value: 1001: must be in the ranges: [1000140000, 1000149999]]
```
  Check that the security context constraints (SCC) are correctly granted to the service account that is used in your deployment. For details, see Preparing to install Operational Decision Manager.
2. If a pod is stuck in Pending, look at the output of the kubectl describe command.
  Find the messages from the scheduler about why it cannot schedule your pod. The most likely reason is that you do not have enough resources. If the supply of CPU or memory in your cluster is exhausted, you need to delete pods, adjust resource requests, or add new nodes to your cluster.
3. If a pod is stuck in the Waiting state or Init:ImagePullBackOff, look at the output of the kubectl describe command.
  The most common cause of Waiting pods is a failure to pull the image. If you installed a release from the command line, check that the name of the image is correct and that you pushed the image to the repository, and then try to pull the image manually.
4. If a pod is stuck in the Running state and does not change to Ready after a while, the health check might take longer than the readiness and liveness timeout values allow. In that case, increase the timeout values. When you edit an existing deployment, another pod is created with the new timeout values.
  
  Note: If a Decision Center pod is stuck in Running and does not change to Ready after a while, the database configuration might be broken and the pod never starts. To resolve the problem, clean the database and increase the readinessProbe values. By default, the readiness probe times out after approximately 5 minutes, so if the cluster is slow or there is network or database latency, double the value.
5. If a job or pod named <RELEASE_NAME>-odm-usage-metering-xxxx is in an error state, check the logs to verify that the Usage Metering Service is correctly installed and running.
  Run the following command to inspect the logs:
```
kubectl logs -n ${NAMESPACE} -l app.kubernetes.io/name=ibm-usage-metering-instance --tail=100
```
  For more information about how to install the Usage Metering Service, see Installing the Usage Metering Service.
6. If a pod is Evicted, it might have reached the ephemeral storage limit. Check the logs for the following error message:
```
Pod ephemeral local storage usage exceeds the total limit of containers 500Mi
```
  To fix the issue, increase the resources.limits.ephemeral-storage value of the corresponding component. See Production configuration parameters to find the default values.
7. If a pod fails or is otherwise unhealthy, look at the logs of the current pod.
```
$ kubectl logs POD_NAME 
```
  If your pod previously failed, add the --previous argument to access the logs of the previous container instance.
```
$ kubectl logs --previous POD_NAME 
```
  Or you can run commands inside that pod with exec.
```
$ kubectl exec POD_NAME -- CMD ARG1 ARG2 ... ARGN
```
  Note: The double-dash symbol -- is used to separate the arguments that you want to pass to the command from the kubectl arguments.
  For example, to get a shell to the running pod, run the following command:
```
$ kubectl exec -ti POD_NAME -- /bin/bash 
```
  In your shell, list the root directory and use other commands to view the configuration. For example:
```
bash-4.4$ ls 
bash-4.4$ cat /config/server.xml
bash-4.4$ cat /config/datasource.xml
bash-4.4$ cat /proc/mounts 
bash-4.4$ cat /proc/1/maps
```
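The numbered checks above can be summarized as a small lookup from pod status to the next thing to inspect. This is only a sketch: the function name suggest_next_check and the wording of the hints are illustrative, and the ErrImagePull and CrashLoopBackOff statuses are standard Kubernetes statuses added here for completeness; they are not named in the steps above.

```shell
# Hypothetical helper: map a pod status (the STATUS column of
# `kubectl get pods`) to the check suggested by the steps above.
suggest_next_check() {
  case "$1" in
    Pending)
      echo "describe the pod: read scheduler events, check CPU and memory" ;;
    *ImagePullBackOff|*ErrImagePull|Waiting)
      echo "describe the pod: verify the image name and registry access" ;;
    Evicted)
      echo "check the ephemeral storage limit" ;;
    Error|CrashLoopBackOff)
      echo "read the logs, including --previous" ;;
    Running)
      echo "if not Ready: review readiness and liveness timeouts" ;;
    *)
      echo "describe the pod and read the logs" ;;
  esac
}

# Usage:
#   suggest_next_check Init:ImagePullBackOff
```

The lookup does not replace the kubectl describe and kubectl logs output. It only points to the step that is most likely to apply.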
Check whether the services are working correctly.
Verify the endpoints for your services by running the get endpoints command.
```
$ kubectl get endpoints SERVICE_NAME
```
For every service, an endpoint resource is made available. Each ODM for production component runs in a pod in a separate service, so the number of endpoints is expected to be 1 per service.

To list the pods that match the labels of a specific service, enter the following command.
```
$ kubectl get pods --selector=release=mycompany-dev1,run=mycompany-dev1-odm-decisionrunner
```
If the list of pods matches expectations and your endpoint is still empty, it is possible that the ports are not exposed. If your service specifies a port but the pod does not list that port, the port is not added to the endpoints list. Verify that the port of the pod matches the port of the service.
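One way to compare the two sides is to read the target port of the service and the container ports of the pod with the jsonpath output of kubectl, and then check that the first appears among the second. The sketch below assumes that the service uses a numeric targetPort. If the service refers to a named port, compare the port names instead. The function name port_is_exposed is hypothetical.

```shell
# Hypothetical helper: succeed if the wanted port ($1) appears among
# the remaining arguments (the container ports that the pod lists).
port_is_exposed() {
  wanted_port=$1
  shift
  for listed_port in "$@"; do
    [ "$listed_port" = "$wanted_port" ] && return 0
  done
  return 1
}

# Usage (SERVICE_NAME and POD_NAME are placeholders):
#   target=$(kubectl get service SERVICE_NAME -o jsonpath='{.spec.ports[0].targetPort}')
#   ports=$(kubectl get pod POD_NAME -o jsonpath='{.spec.containers[*].ports[*].containerPort}')
#   port_is_exposed "$target" $ports || echo "port $target is not listed by the pod"
```

In the usage lines, $ports is left unquoted on purpose so that each port becomes a separate argument.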
Check whether the internal database container is up and running.
If an internal database is configured, an init container checks whether a persistent volume is available before it deploys the Operational Decision Manager containers.
If the database is down, the probable causes are:
- There is no available persistent volume.
- There is not enough memory for the database.
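To check the first cause, you can list the persistent volume claims in the namespace and look for claims that are not bound. The following sketch assumes the default table output of kubectl get pvc, in which STATUS is the second column. The function name unbound_claims is hypothetical.

```shell
# Hypothetical helper: read the default table output of `kubectl get pvc`
# on standard input and print "NAME STATUS" for each claim that is not Bound.
unbound_claims() {
  awk 'NR > 1 && $2 != "Bound" { print $1, $2 }'
}

# Usage (NAMESPACE is a placeholder):
#   kubectl get pvc -n NAMESPACE | unbound_claims
```

A claim that stays in the Pending state usually means that no matching persistent volume is available.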

Results

If nothing looks wrong in your configuration and you continue to get no response when you try to access your service, see Debug Services.

What to do next

If the Operational Decision Manager deployment is working correctly but the product is not working as you expect, inspect the Operational Decision Manager logs. If needed, change the logging levels to get more detail on the suspected problem.

Operational Decision Manager runs on Liberty profile, which uses a unified logging component for handling messages. The logging component also provides First Failure Data Capture (FFDC) services, and unifies the messages that are written to System.out, System.err, and java.util.logging with other messages. The logging component is controlled through the server configuration.

You customize the logging properties by adding logging elements to a server configuration file and then creating a Kubernetes configMap to apply the configuration. The configuration of the log level uses the following format:

component = level

Where component is the component for which to set a log level, and level is one of the valid logger levels. For more information, see Customizing log levels.
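As an illustration of that format, the following sketch writes a Liberty logging element whose traceSpecification attribute carries the component = level pairs, with multiple pairs separated by colons. The function name write_logging_config, the file name logging.xml, the configMap name, and the component com.example.component are all hypothetical. How the deployment references the configMap is described in Customizing log levels and is not shown here.

```shell
# Hypothetical helper: write a Liberty server configuration fragment.
# $1: trace specification in component=level form (for example "*=info")
# $2: path of the file to write
write_logging_config() {
  cat > "$2" <<EOF
<server>
  <logging traceSpecification="$1"/>
</server>
EOF
}

# Usage (all names are illustrative assumptions):
#   write_logging_config "*=info:com.example.component=fine" logging.xml
#   kubectl create configmap my-odm-logging-configmap --from-file=logging.xml -n NAMESPACE
```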

For tips on resolving issues with LDAP and OpenID, see Troubleshooting LDAP and Troubleshooting OpenID.