The first step in troubleshooting an Operational Decision Manager deployment on Certified
Kubernetes is triage. In most cases, the problem is likely to be one of four things: your pods, your
replicas, your services, or your database.
About this task
On Kubernetes, Operational Decision Manager services run in pods.
- A pod is a group of one or more containers with shared storage and network. For Operational Decision Manager, each pod wraps a single Docker container.
- Replicas are used by Deployments as a mechanism to create, delete, and update pods. Ordinarily, you do not have to manage the replicas that deployments create, because deployments own and manage their replicas. You can specify how many pods to run concurrently by setting .spec.replicas.
- Services balance the load across a set of pods.
- The database is where the data is persisted.
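As a quick check of the replica setting described above, the following sketch compares the desired and ready replica counts of one deployment. The function name replica_status and its arguments are illustrative assumptions and are not part of Operational Decision Manager; only standard kubectl get options are used.

```shell
# Hypothetical helper: print desired versus ready replicas for a deployment.
# $1 = deployment name, $2 = namespace
replica_status() {
  desired=$(kubectl get deployment "$1" -n "$2" -o jsonpath='{.spec.replicas}')
  # .status.readyReplicas is omitted by Kubernetes when no replica is ready,
  # so an empty result is reported as 0.
  ready=$(kubectl get deployment "$1" -n "$2" -o jsonpath='{.status.readyReplicas}')
  echo "desired=${desired:-0} ready=${ready:-0}"
}

# Usage (names are placeholders):
#   replica_status DEPLOYMENT_NAME NAMESPACE
```

If the ready count stays below the desired count, continue with the pod checks in the procedure.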
Procedure
The Kubernetes command-line tool, kubectl, allows you to run commands against Kubernetes clusters.
- Display the status of the named release.
-
List all of the pods that run in your cluster.
To target the pods in a specific namespace, use the namespace flag.
$ kubectl get pods -n NAMESPACE
-
Check the current state and recent events of your pods to see whether they are all running.
You can get more targeted information on a pod by running the kubectl get and describe commands.
$ kubectl get pod POD_NAME --output=yaml
$ kubectl describe pod POD_NAME
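To spot the pods that need a closer look, you can filter the pod list. The following sketch assumes the default column layout of kubectl get pods (NAME, READY, STATUS, RESTARTS, AGE); the function name not_ready_pods is a hypothetical helper, not a kubectl feature.

```shell
# Hypothetical helper: read `kubectl get pods --no-headers` output on stdin
# and print name, status, and ready count for pods that are not fully ready.
not_ready_pods() {
  awk '{
    split($2, r, "/")                 # READY column, for example 0/1
    if ($3 == "Completed") next       # finished jobs are not a problem
    if ($3 != "Running" || r[1] != r[2]) print $1, $3, $2
  }'
}

# Usage:
#   kubectl get pods -n NAMESPACE --no-headers | not_ready_pods
```

Each line of output names a pod to inspect with kubectl describe pod.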
-
If the pods are not created, check the status of the replica sets and inspect the events for
one of them by running:
$ kubectl get replicaset -n NAMESPACE
$ kubectl describe replicaset -n NAMESPACE REPLICASET_NAME
If you find an error similar to the following one:
unable to validate against any security context constraint: [spec.containers[0].securityContext.securityContext.runAsUser:
Invalid value: 1001: must be in the ranges: [1000140000, 1000149999]]
In that case, check that the SecurityContextConstraints (SCC) are correctly granted to the serviceAccount that your deployment uses. For details, see Preparing to install Operational Decision Manager.
-
If a pod is stuck in Pending, look at the output of the kubectl describe command.
Find the messages from the scheduler about why it cannot schedule your pod. The most likely reason is that you do not have enough resources. The supply of CPU or memory in your cluster might be exhausted. In this case, you need to delete pods, adjust resource requests, or add new nodes to your cluster.
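The scheduler messages can also be read directly from the event list. The following sketch is one possible way to do that, assuming the scheduler reports the usual FailedScheduling reason; the function name scheduling_events is a hypothetical helper.

```shell
# Hypothetical helper: show only the scheduling failures for one pod.
# $1 = pod name, $2 = namespace
scheduling_events() {
  kubectl get events -n "$2" \
    --field-selector "involvedObject.name=$1,reason=FailedScheduling"
}

# Usage:
#   scheduling_events POD_NAME NAMESPACE
# To compare requests with node capacity, the "Allocated resources" section of
#   kubectl describe nodes
# is a reasonable next place to look.
```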
-
If a pod is stuck in the Waiting state or Init:ImagePullBackOff, look at the output of the kubectl describe command.
The most common cause of Waiting pods is a failure to pull the image. If you installed a release from the command line, check that the name of the image is correct, check that you pushed the image to the repository, and try to pull the image manually.
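To check the image names that a pod actually references, you can read them from the pod specification. The following sketch lists the images of both init containers and regular containers, one per line; the function name pod_images is a hypothetical helper, and the pull command in the usage comment assumes that a container client such as docker is available on your workstation.

```shell
# Hypothetical helper: list every image that a pod references.
# $1 = pod name, $2 = namespace
pod_images() {
  kubectl get pod "$1" -n "$2" \
    -o jsonpath='{.spec.initContainers[*].image} {.spec.containers[*].image}' |
    tr -s ' ' '\n' | sed '/^$/d'   # one image per line, no empty lines
}

# Usage:
#   pod_images POD_NAME NAMESPACE
# Then try to pull each image with your container client, for example:
#   docker pull IMAGE_NAME
```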
-
If a pod is stuck in the Running state and does not change to the Ready state after a while, the health check might take longer than the readiness and liveness timeout values. When you edit an existing deployment, another pod is created with the new timeout values.
Note: If a Decision Center pod is stuck in Running and does not change to Ready after a while, the database configuration might be broken and the pod never starts. To resolve the problem, clean the database and increase the readinessProbe values. By default, the readiness probe times out after approximately 5 minutes, so if the cluster is slow or there is network or database latency, double that value.
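As a rough aid for choosing new probe values, the time that a probe allows before it is considered failed can be approximated as initialDelaySeconds + periodSeconds x failureThreshold (ignoring timeoutSeconds). The following sketch computes that estimate; the function name probe_window and the sample numbers are illustrative assumptions, not the product defaults.

```shell
# Hypothetical helper: approximate probe window in seconds.
# $1 = initialDelaySeconds, $2 = periodSeconds, $3 = failureThreshold
probe_window() {
  echo $(( $1 + $2 * $3 ))
}

# Example with made-up values: 30 + 10 * 27 = 300 seconds (about 5 minutes).
#   probe_window 30 10 27
# Doubling failureThreshold roughly doubles the window:
#   probe_window 30 10 54
# To read the values currently set on a deployment:
#   kubectl get deployment DEPLOYMENT_NAME -n NAMESPACE \
#     -o jsonpath='{.spec.template.spec.containers[0].readinessProbe}'
```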
- If a job or pod named <RELEASE_NAME>-odm-usage-metering-xxxx is in an error state, check the logs to verify that the Usage Metering Service is correctly installed and running. Run the following command to inspect the logs:
$ kubectl logs -n ${NAMESPACE} -l app.kubernetes.io/name=ibm-usage-metering-instance --tail=100
For more information about how to install the Usage Metering Service, see Installing the Usage Metering Service.
- If a pod is Evicted, it might have reached the ephemeral
storage limit. Check the logs for the following error message:
Pod ephemeral local storage usage exceeds the total limit of containers 500Mi
To fix the issue, increase the resources.limits.ephemeral-storage value of the corresponding component. See Production configuration parameters to find the default values.
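Kubernetes also records the eviction reason on the pod itself, so one way to confirm the cause is to read the status fields of the evicted pod. The following is a minimal sketch; the function name eviction_message is a hypothetical helper.

```shell
# Hypothetical helper: print the recorded reason and message of a pod.
# $1 = pod name, $2 = namespace
eviction_message() {
  kubectl get pod "$1" -n "$2" \
    -o jsonpath='{.status.reason}: {.status.message}'
}

# Usage:
#   eviction_message POD_NAME NAMESPACE
# For an evicted pod, the reason is expected to be Evicted and the message
# states which limit was exceeded.
```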
-
If a pod fails or is otherwise unhealthy, look at the logs of the current pod.
$ kubectl logs POD_NAME
If your pod previously failed, add the --previous flag to access the logs of the previous container instance.
$ kubectl logs --previous POD_NAME
Alternatively, you can run commands inside that pod with exec.
$ kubectl exec POD_NAME -- CMD ARG1 ARG2 ... ARGN
Note: The double-dash symbol -- separates the arguments that you want to pass to the command from the kubectl arguments.
For example, to get a shell in the running pod:
$ kubectl exec -ti POD_NAME -- /bin/bash
In your shell, list the root directory and use other commands to view the configuration. For
example:
bash-4.4$ ls
bash-4.4$ cat /config/server.xml
bash-4.4$ cat /config/datasource.xml
bash-4.4$ cat /proc/mounts
bash-4.4$ cat /proc/1/maps
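When you need to keep the logs for later analysis, you can save both the current and the previous logs to files. The following sketch shows one way to do that; the function name collect_logs and the file naming are illustrative assumptions.

```shell
# Hypothetical helper: save current and previous logs of a pod to files.
# $1 = pod name, $2 = namespace, $3 = output directory (default: current)
collect_logs() {
  pod=$1; ns=$2; dir=${3:-.}
  kubectl logs "$pod" -n "$ns" > "$dir/$pod.log"
  # --previous fails when the container never restarted; in that case the
  # empty file is removed instead of being kept.
  kubectl logs --previous "$pod" -n "$ns" > "$dir/$pod.previous.log" 2>/dev/null ||
    rm -f "$dir/$pod.previous.log"
}

# Usage:
#   collect_logs POD_NAME NAMESPACE /tmp
```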
-
Check whether the services are working correctly.
Verify the endpoints for your services by running the get endpoints command.
$ kubectl get endpoints SERVICE_NAME
For every service, an endpoint resource is made available. Each ODM for production component runs in a pod in a separate service, so the number of endpoints is expected to be one per service.
To get information on a specified pod, enter the following command.
$ kubectl get pods --selector=release=mycompany-dev1,run=mycompany-dev1-odm-decisionrunner
If the list of pods matches expectations and your endpoint is still empty, it is possible that the ports are not exposed. If your service specifies a port but the pod does not list that port, the port is not added to the endpoints list. Verify that the port of the pod matches the port of the service.
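The endpoint and port checks above can be scripted. The following sketch tests whether a service has any endpoint address, and the usage comment shows standard jsonpath queries for comparing the ports; the function names endpoint_ips and has_endpoints are hypothetical helpers.

```shell
# Hypothetical helper: print the pod IP addresses behind a service.
# $1 = service name, $2 = namespace
endpoint_ips() {
  kubectl get endpoints "$1" -n "$2" \
    -o jsonpath='{.subsets[*].addresses[*].ip}'
}

# Hypothetical helper: succeed only if the service has at least one address.
has_endpoints() {
  [ -n "$(endpoint_ips "$1" "$2")" ]
}

# Usage:
#   has_endpoints SERVICE_NAME NAMESPACE || echo "no endpoints"
# To compare the ports on both sides:
#   kubectl get service SERVICE_NAME -n NAMESPACE \
#     -o jsonpath='{.spec.ports[*].targetPort}'
#   kubectl get pod POD_NAME -n NAMESPACE \
#     -o jsonpath='{.spec.containers[*].ports[*].containerPort}'
```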
- Check whether the internal database container is up and running.
If an internal database is configured, an init container checks whether a
persistent volume is available before it deploys the Operational Decision Manager containers.
If the database is down, the probable causes are:
- There is no available persistent volume.
- There is not enough memory for the database.
Results
If nothing looks wrong in your configuration and you continue to get no response when you try to access your service, see Debug Services.
What to do next
If the Operational Decision Manager deployment is working correctly but the product is not working as you expect, inspect the Operational Decision Manager logs. If needed, change the logging levels to get more detail on the suspected problem.
Operational Decision Manager runs on Liberty profile, which
uses a unified logging component for handling messages. The logging component also provides First
Failure Data Capture (FFDC) services, and unifies the messages that are written to
System.out, System.err, and java.util.logging
with other messages. The logging component is controlled through the server configuration.
You customize the logging properties by adding logging elements to a server configuration file and then creating a Kubernetes configMap to apply the configuration. The configuration of the log level uses the following format:
component = level
Where component is the component for which to set a log level, and
level is one of the valid logger levels. For more information, see Customizing log levels.
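As an illustration of the component = level format, the following sketch writes a small Liberty configuration file with a logging element and shows how a configMap might be created from it. The function name write_logging_xml, the file name logging.xml, the configMap name, and the component name are all assumptions for the example; see Customizing log levels for the names that your deployment expects.

```shell
# Hypothetical helper: write a Liberty logging element to a file.
# $1 = trace specification, entries in component=level form joined by ":"
# $2 = output file (default: logging.xml)
write_logging_xml() {
  spec=$1; file=${2:-logging.xml}
  cat > "$file" <<EOF
<server>
  <logging traceSpecification="$spec"/>
</server>
EOF
}

# Usage (component and configMap names are placeholders):
#   write_logging_xml '*=info:com.example.component=fine'
#   kubectl create configmap my-odm-logging-configmap \
#     --from-file=logging.xml -n NAMESPACE
```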
For tips on resolving issues with LDAP and OpenID, see Troubleshooting LDAP and Troubleshooting OpenID.