The first step in troubleshooting an Operational Decision
Manager instance on Certified Kubernetes is triage. In
most cases, the problem lies in one of four areas: your pods, your replicas, your service, or
your database.
About this task
On Kubernetes, Operational Decision
Manager services are run in
pods.
- A pod is a wrapper around a single Docker container.
- Replicas are used by Deployments as a mechanism to create, delete, and update pods. Ordinarily,
you do not have to worry about managing the replicas that deployments create. Deployments own and
manage their replicas. You can specify how many pods to run concurrently by setting
.spec.replicas.
- Services balance the load across a set of pods.
- The database is where the data is persisted.
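To illustrate how these objects relate, the following minimal Deployment manifest sets .spec.replicas to run two pods. It is a sketch only; the names and image are placeholders, not the actual Operational Decision Manager resources.

```yaml
# Minimal sketch of a Deployment; all names and the image are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-decisionserver
spec:
  replicas: 2                      # number of pods to run concurrently
  selector:
    matchLabels:
      app: example-decisionserver
  template:
    metadata:
      labels:
        app: example-decisionserver
    spec:
      containers:
      - name: server
        image: example/decisionserver:1.0
```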
Procedure
-
List the pods that are running in your cluster.
To target the pods in a specific namespace, use the -n (namespace) flag.
kubectl get pods -n <NAMESPACE>
-
Check the current state and recent events of your pods to see whether they are all
running.
Inspect the pods by using the kubectl command-line tool.
kubectl describe pods POD_NAME
Where POD_NAME is optional and selects a specific pod. If you did not use the default
namespace, add the -n NAMESPACE_NAME parameter.
-
If a pod is stuck in Pending, look at the output of the
kubectl describe command.
Find the messages from the scheduler about why it cannot schedule your pod. The most likely
reason is that you do not have enough resources: the supply of CPU or memory in your cluster might
be exhausted. In this case, you need to delete pods, adjust resource requests, or add new nodes to
your cluster.
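For example, lowering the resource requests in the container specification can allow the scheduler to place the pod on an existing node. The following values are illustrative only; size them for your own workload.

```yaml
# Illustrative resource requests and limits for a container specification.
# These numbers are examples, not recommended values.
resources:
  requests:
    cpu: "500m"       # half a CPU core
    memory: "1Gi"
  limits:
    cpu: "2"
    memory: "2Gi"
```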
-
If a pod is stuck in the Waiting state or in
Init:ImagePullBackOff, look at the output of the
kubectl describe command.
The most common cause of Waiting pods is a failure to pull the
image. If you installed from the command line, check that the name of the image is correct, check
that you pushed the image to the repository, and try to pull the image manually.
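If the image is in a private registry, the pod specification must also reference a pull secret. The following sketch uses placeholder names; substitute your own registry, image, and secret.

```yaml
# Placeholder image name and secret name; substitute your own values.
spec:
  containers:
  - name: server
    image: registry.example.com/odm/decisionserver:1.0
  imagePullSecrets:
  - name: registry-credentials    # secret that holds the registry login
```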
-
If a pod fails or is otherwise unhealthy, look at the logs of the current pod.
kubectl logs POD_NAME
If your pod previously failed, add the --previous flag to access the logs of the failed
container.
kubectl logs --previous POD_NAME
Alternatively, you can run commands inside the pod with exec.
kubectl exec POD_NAME -- CMD ARG1 ARG2 ... ARGN
For example, to get a shell to the running pod.
kubectl exec -ti POD_NAME -- /bin/bash
In your shell, list the root directory and use other commands to view the configuration.
root@shell:/# ls
root@shell:/# cat /logs/messages.logs
root@shell:/# cat /config/server.xml
root@shell:/# cat /config/datasource.xml
root@shell:/# cat /proc/mounts
root@shell:/# cat /proc/1/maps
root@shell:/# apt-get update
root@shell:/# apt-get install -y tcpdump
root@shell:/# tcpdump
root@shell:/# apt-get install -y lsof
root@shell:/# lsof
root@shell:/# apt-get install -y procps
root@shell:/# ps aux
-
If a pod is stuck in Running and does not turn to
Ready after a while, the health check might take longer
than the readiness and liveness timeout values. When you edit an existing deployment, another pod is
created with the new timeout values.
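The timeout values are set on the probes in the container specification. The following sketch shows where to increase them; the path, port, and timings are assumptions for illustration, not recommended values.

```yaml
# Example probe settings; the path, port, and numbers are placeholders.
readinessProbe:
  httpGet:
    path: /health            # assumed health endpoint
    port: 9080               # assumed container port
  initialDelaySeconds: 60    # wait before the first check
  timeoutSeconds: 10         # how long each check may take
livenessProbe:
  httpGet:
    path: /health
    port: 9080
  initialDelaySeconds: 120
  timeoutSeconds: 10
```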
-
If pods cannot be created to replace removed pods, use the
kubectl describe rs command to look at
the events that are related to the replica set.
-
Check whether the services are working correctly.
-
Verify the endpoints for your services by running the
get endpoints
command.
kubectl get endpoints SERVICE_NAME
For every service, an endpoint resource is made available. Each ODM for production component runs
in a pod behind a separate service, so the number of endpoints is expected to be one per service.
To get information on a specified pod, enter the following command.
kubectl get pods --selector=release=mycompany-dev1,run=mycompany-dev1-odm-decisionrunner
If the list of pods matches expectations but your endpoint is still empty, it is possible that
the ports are not exposed. If your service specifies a port, but the pod does not list that port,
then the port is not added to the endpoints list. Verify that the port of the pod matches the port
of the service.
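The check is that the targetPort of the service matches a containerPort declared by the pod. A sketch with placeholder names and ports:

```yaml
# Service side (placeholder values):
kind: Service
spec:
  ports:
  - port: 80
    targetPort: 9080       # must match a containerPort on the pod
---
# Pod side (placeholder values):
kind: Pod
spec:
  containers:
  - name: server
    ports:
    - containerPort: 9080  # must match the targetPort of the service
```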
-
Get more targeted information on the pod by running the
kubectl get and kubectl describe commands.
kubectl get pod POD_NAME --output=yaml
kubectl describe pod POD_NAME
Look for the restart count, which might indicate a problem.
- Check whether the internal database container is up and running.
If an internal database is configured, an init container checks whether a
persistent volume is available before it deploys the Operational Decision
Manager containers.
If the database is down, the cause is probably one of two things.
- The persistent volume is not available.
- The database does not have enough memory.
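A persistent volume can be used only if an available volume satisfies the claim. The following sketch of a claim (the name and size are placeholders) shows the fields that must be satisfiable by an existing persistent volume; a claim that stays in Pending indicates that no matching volume is available.

```yaml
# Placeholder claim; the storage size and access mode must be
# satisfiable by an existing persistent volume in the cluster.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-odm-db-pvc
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi
```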
-
New in 19.0.3: With an operator, you can
set the debug configuration parameter to add more tracing at any time.
In the odm_configuration section of the custom resource YAML file, set the
value of the debug parameter to true.
odm_configuration:
debug: true
To apply the modified custom resource (CR) file, run the following command.
kubectl apply -f <modifiedCrWithDebugFlag>
To view the trace in the operator pods, run the following command.
kubectl logs <OperatorPodId>
Results
If nothing looks wrong in your configuration and you continue to get no response when you try to
access your service, see Debug Services.
What to do next
If the Operational Decision
Manager instance is working
correctly, but the application is not working as you expect, inspect the Operational Decision
Manager logs. If needed, change the logging levels
to get more detail on the suspected problem.
Operational Decision
Manager runs on the Liberty profile, which
uses a unified logging component for handling messages. The logging component also provides First
Failure Data Capture (FFDC) services, and unifies the messages that are written to
System.out
, System.err
, and java.util.logging
with other messages. The logging component is controlled through the server configuration.
You customize the logging properties by adding logging elements to a server configuration file
and then creating a Kubernetes configMap to apply the configuration. The
configuration of the log level uses the following format:
<component> = <level>
Where <component> is the component for which to set a log level, and
<level> is one of the valid logger levels. For more information, see Customizing log levels.
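For example, a configMap carrying a Liberty logging element might look like the following sketch. The configMap name, the data key, and the component name are placeholders; the traceSpecification attribute uses the <component>=<level> format, with multiple entries separated by colons.

```yaml
# Placeholder configMap; adapt the name, key, and component names.
apiVersion: v1
kind: ConfigMap
metadata:
  name: example-odm-logging
data:
  logging.xml: |
    <server>
      <!-- Trace com.example.component at FINE, everything else at INFO -->
      <logging traceSpecification="*=info:com.example.component=fine"/>
    </server>
```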