Troubleshooting

The first step in troubleshooting an Operational Decision Manager instance on Certified Kubernetes is triage. In most cases, the problem lies in one of four areas: your pods, your replication, your service, or your database.

About this task

On Kubernetes, Operational Decision Manager services are run in pods.

  • A pod is a wrapper around one or more containers. In Operational Decision Manager, each pod typically runs a single Docker container.
  • ReplicaSets are used by Deployments as a mechanism to create, delete, and update pods. Ordinarily, you do not have to worry about managing the ReplicaSets that Deployments create, because Deployments own and manage them. You can specify how many pods to run concurrently by setting .spec.replicas on the Deployment, which the Helm chart typically exposes as a replicaCount parameter (see the sketch after this list).
  • Services balance the load across a set of pods.
  • The database is where the data is persisted.
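
  The following fragment is a minimal sketch of how a replica count can be set in the Helm chart values. The component key (decisionServerRuntime) is an assumption and depends on which component you scale, so treat the names as illustrative only.

    # Illustrative values excerpt; the component key is an assumption
    decisionServerRuntime:
      replicaCount: 2

  You can check the current replica counts at any time with kubectl get deployments -n <NAMESPACE>.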

Procedure

  1. List all of the pods and container images that are running in your cluster.

    To target the pods in a specific namespace, use the namespace flag.

    kubectl get pods -n <NAMESPACE>
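
    If you also want to see which container images those pods are running, one option is to extract them with a jsonpath expression; this is a sketch, and you can scope it to a single namespace instead of all namespaces.

    kubectl get pods --all-namespaces -o jsonpath="{.items[*].spec.containers[*].image}"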
  2. Check the current state and recent events of your pods to see whether they are all running.
    Browse to your pods by using the kubectl command line tool.
    kubectl describe pods POD_NAME

    Where POD_NAME is optional and selects a specific pod. If you did not use the default namespace, add the -n NAMESPACE_NAME parameter.

    1. If a pod is stuck in Pending, look at the output of the kubectl describe command.
      Find the messages from the scheduler about why it cannot schedule your pod. The most likely reason is that you do not have enough resources: the supply of CPU or memory in your cluster might be exhausted. In this case, you need to delete pods, adjust resource requests, or add new nodes to your cluster.
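      For example, you can check how much allocatable CPU and memory remains on each node; the kubectl top command is a sketch that assumes the metrics server is installed in your cluster.
      kubectl describe nodes
      kubectl top nodes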
    2. If a pod is stuck in the Waiting state or Init:ImagePullBackOff, look at the output of the kubectl describe command.
      The most common cause of Waiting pods is a failure to pull the image. If you installed from the command line, check that the name of the image is correct, check that you pushed the image to the repository, and try to pull the image manually.
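      For example, you can read the exact image reference from the pod specification and then try to pull it manually; the registry and tag placeholders are illustrative.
      kubectl get pod POD_NAME -o jsonpath="{.spec.containers[*].image}"
      docker pull <REGISTRY>/<IMAGE>:<TAG>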
    3. If a pod fails or is otherwise unhealthy, look at the logs of the current pod.
      kubectl logs POD_NAME 
      If your pod previously failed, add the --previous argument to access its logs.
      kubectl logs --previous POD_NAME 
      Or you can run commands inside that pod with exec.
      kubectl exec POD_NAME -- CMD ARG1 ARG2 ... ARGN

      For example, to get a shell to the running pod.

      kubectl exec -ti POD_NAME -- /bin/bash 

      In your shell, list the root directory and use other commands to view the configuration.

      root@shell:/# ls 
      root@shell:/# cat /logs/messages.log
      root@shell:/# cat /config/server.xml
      root@shell:/# cat /config/datasource.xml
      root@shell:/# cat /proc/mounts 
      root@shell:/# cat /proc/1/maps 
      root@shell:/# apt-get update 
      root@shell:/# apt-get install -y tcpdump 
      root@shell:/# tcpdump 
      root@shell:/# apt-get install -y lsof 
      root@shell:/# lsof 
      root@shell:/# apt-get install -y procps 
      root@shell:/# ps aux
    4. If a pod is stuck in the Running state and does not become Ready after a while, the health check might be taking longer than the readiness and liveness timeout values. When you edit an existing deployment to increase the timeout values, another pod is created with the new values.
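      To see the probe settings that a pod currently uses, you can inspect its description or the deployment YAML; this is a sketch, and the parameters for changing the timeouts depend on your chart or custom resource.
      kubectl describe pod POD_NAME | grep -i -E -A 3 'liveness|readiness'
      kubectl get deployment DEPLOYMENT_NAME -o yaml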
  3. If pods cannot be created to replace removed pods, use describe rs to look at the events that are related to the ReplicaSets.
    kubectl describe rs
  4. Check whether the services are working correctly.
    1. Verify the endpoints for your services by running the get endpoints command.
      kubectl get endpoints SERVICE_NAME

      For every service, an Endpoints resource is made available. Each ODM for production component runs in its own pod behind a separate service, so you can expect one endpoint per service.

      To list the pods that are behind a specific service, enter the following command with the appropriate label selector.

      kubectl get pods --selector=release=mycompany-dev1,run=mycompany-dev1-odm-decisionrunner

      If the list of pods matches expectations and your endpoint is still empty, it is possible that the ports are not exposed. If your service specifies a port but the pod does not list that port, the port is not added to the endpoints list. Verify that the port of the pod matches the port of the service.
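
      To compare the two, you can look at the port and targetPort values in the service and the containerPort values in the pod; this is a sketch.
      kubectl get service SERVICE_NAME -o yaml
      kubectl get pod POD_NAME -o jsonpath="{.spec.containers[*].ports}"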

    2. Get more targeted information on the pod by running the kubectl get and describe commands.
      kubectl get pod POD_NAME --output=yaml
      kubectl describe pod POD_NAME

      Look for the restart count, which might indicate a problem.
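
      If you need only the restart count, you can extract it directly with a jsonpath expression; this sketch assumes a single container in the pod.
      kubectl get pod POD_NAME -o jsonpath="{.status.containerStatuses[0].restartCount}"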

  5. Check whether the internal database container is up and running.

    If an internal database is configured, an init container checks whether a persistent volume is available before it deploys the Operational Decision Manager containers.

    If the database is down, the cause is probably one of two things. The commands after this list show how to check both.
    • The persistent volume is not available.
    • The database does not have enough memory.
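
    For example, the following commands show the status of the persistent volume claims and the resource usage of the database pod; this is a sketch, and kubectl top assumes that the metrics server is installed. Substitute your own claim and pod names.

    kubectl get pvc -n <NAMESPACE>
    kubectl describe pvc <PVC_NAME> -n <NAMESPACE>
    kubectl top pod <DB_POD_NAME> -n <NAMESPACE>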
  6.  New in 19.0.3  With an operator, you can set the debug configuration parameter to add more tracing at any time.

    In the odm_configuration part of the custom resource YAML file, set the value of the debug parameter to true.

    odm_configuration:
      debug: true

    To apply the modified custom resource (CR) file, run the following command.

    kubectl apply -f <modifiedCrWithDebugFlag>

    To view the trace in the operator pods, run the following command.

    kubectl logs <OperatorPodId> 

Results

If nothing looks wrong in your configuration and you continue to get no response when you try to access your service, see Debug Services.

What to do next

If the Operational Decision Manager instance is working correctly but the application is not behaving as you expect, inspect the Operational Decision Manager logs. If needed, change the logging levels to get more detail on the suspected problem.

Operational Decision Manager runs on the Liberty profile, which uses a unified logging component for handling messages. The logging component also provides First Failure Data Capture (FFDC) services, and it unifies the messages that are written to System.out, System.err, and java.util.logging with other messages. The logging component is controlled through the server configuration.

You customize the logging properties by adding logging elements to a server configuration file and then creating a Kubernetes configMap to apply to the configuration. The configuration of the log level uses the following format:

<component> = <level>

Where <component> is the component for which to set a log level, and <level> is one of the valid logger levels. For more information, see Customizing log levels.
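
For example, a minimal sketch of a logging element that raises the level for one component might look as follows, together with a command that packages it into a ConfigMap. The logger name com.ibm.rules.*, the file name logging.xml, and the ConfigMap name are illustrative assumptions; see Customizing log levels for the exact names that your deployment expects.

<logging traceSpecification="*=info:com.ibm.rules.*=fine" />

kubectl create configmap my-odm-logging --from-file=logging.xml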