IBM Support

Watson AI Services: Watson Discovery - Elasticsearch pod is in CrashLoopBackOff

Troubleshooting


Problem

The Watson Discovery system is unstable. Various pods continue to restart. The Elasticsearch pod is in CrashLoopBackOff.

Symptom

The Elasticsearch pod is in CrashLoopBackOff.

Cause

In the <helm-release-name>-watson-discovery-elastic-0-elastic.log:

2020-03-11 14:36:39,608 main ERROR Recovering from StringBuilderEncoder.encode('[2020-03-11T14:36:39,591][ERROR][o.e.b.Bootstrap          ] [disco-watson-discovery-elastic-0] [{}] Exception
java.lang.IllegalStateException: Failed to create node environment
	at org.elasticsearch.node.Node.<init>(Node.java:299) ~[elasticsearch-6.5.2.jar:6.5.2]
	at org.elasticsearch.node.Node.<init>(Node.java:265) ~[elasticsearch-6.5.2.jar:6.5.2]
	at org.elasticsearch.bootstrap.Bootstrap$5.<init>(Bootstrap.java:212) ~[elasticsearch-6.5.2.jar:6.5.2]
	at org.elasticsearch.bootstrap.Bootstrap.setup(Bootstrap.java:212) ~[elasticsearch-6.5.2.jar:6.5.2]
	at org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:333) [elasticsearch-6.5.2.jar:6.5.2]
	at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:136) [elasticsearch-6.5.2.jar:6.5.2]
	at org.elasticsearch.bootstrap.Elasticsearch.execute(Elasticsearch.java:127) [elasticsearch-6.5.2.jar:6.5.2]
	at org.elasticsearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:86) [elasticsearch-6.5.2.jar:6.5.2]
	at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:124) [elasticsearch-cli-6.5.2.jar:6.5.2]
	at org.elasticsearch.cli.Command.main(Command.java:90) [elasticsearch-cli-6.5.2.jar:6.5.2]
	at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:93) [elasticsearch-6.5.2.jar:6.5.2]
	at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:86) [elasticsearch-6.5.2.jar:6.5.2]
Caused by: java.io.IOException: No space left on device

This is often caused by OOMs (out-of-memory errors) within the elastic pod, which generate heap dumps in the /wexdata/logs/dump directory and fill up the Elasticsearch data PVC.
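To confirm this is what is happening, it can help to check how full the data volume is and how much of it the dump directory occupies. A minimal sketch, using the pod name from the log above (substitute your own release name) and assuming an oc CLI logged in to the cluster:

```shell
# Skip gracefully where the oc CLI or a cluster login is unavailable.
command -v oc >/dev/null 2>&1 && oc whoami >/dev/null 2>&1 || { echo "oc not available or not logged in; skipping"; exit 0; }

# Pod name taken from the log above; adjust to your helm release name.
POD=disco-watson-discovery-elastic-0

# Overall usage of the data volume -- 100% here matches the
# "No space left on device" error in the stack trace.
oc exec -t "$POD" -- df -h /wexdata

# Size of the heap-dump directory specifically.
oc exec -t "$POD" -- du -sh /wexdata/logs/dump
```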

Environment

Discovery 2.1.1
CP4D 2.5

Resolving The Problem

1) To alleviate the disk-space error caused by the accumulated OOM heap dumps, you must delete the contents of /wexdata/logs/dump on the Elasticsearch node.

Issuing the following may allow elastic to start, provided the command can run while the pod is up, before it crashes again.

oc exec -t disco-watson-discovery-elastic-0 -- rm /wexdata/logs/dump/*
Note: this one-liner has been seen NOT to work. Because oc exec runs the command directly, without a shell inside the container, the wildcard is never expanded and rm receives the literal string. The error is:
rm: cannot remove ‘/wexdata/logs/dump/*’: No such file or directory
command terminated with exit code 1
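The failure is a consequence of how exec works: with no shell inside the container, the asterisk is never expanded. A minimal local illustration of the same behavior, using a hypothetical /tmp/dump-demo directory (no cluster needed):

```shell
# Create a throwaway directory with two fake heap dumps.
mkdir -p /tmp/dump-demo && touch /tmp/dump-demo/heap1.hprof /tmp/dump-demo/heap2.hprof

# Without a shell, rm receives the literal string and fails -- this mirrors
# the "No such file or directory" error from the oc exec one-liner.
rm '/tmp/dump-demo/*' 2>/dev/null || echo "literal glob: rm failed as expected"

# With a shell, the glob expands and the files are removed.
sh -c 'rm -rf /tmp/dump-demo/*'
ls -A /tmp/dump-demo   # prints nothing: the directory is now empty
```

Accordingly, wrapping the rm in a shell inside the container should let the wildcard expand (the transcript below shows bash is present in the image): oc exec -t disco-watson-discovery-elastic-0 -- bash -c 'rm -rf /wexdata/logs/dump/*'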
It can also be accomplished with the following sequence:
# Log into the (elastic search) container's bash shell
[root@wd2-master-1 ~]# kubectl exec -it zen-wd-watson-discovery-elastic-0 bash
# change directories to /wexdata/logs/dump

[dadmin@zen-wd-watson-discovery-elastic-0 /]$ cd /wexdata/logs/dump/
# Issue the rm command
[dadmin@zen-wd-watson-discovery-elastic-0 dump]$ rm -rf *
# Confirm the directory is empty
[dadmin@zen-wd-watson-discovery-elastic-0 dump]$ ls
[dadmin@zen-wd-watson-discovery-elastic-0 dump]$
This process will hopefully allow the pod to start; however, it does nothing to address the underlying cause.
2) Increase the size of ES_MAX_HEAP

Given that there are OOMs, we might want to proactively increase the memory for this pod.

To do so, issue:

oc patch statefulset $(oc get statefulsets -l app.kubernetes.io/component=elastic -o jsonpath='{.items[0].metadata.name}') -p \
  '{"spec":{"template":{"spec":{"containers":[{"name":"elastic","env":[{"name":"ES_MAX_HEAP","value":"4g"}],"resources":{"limits":{"cpu":"99","memory":"8Gi"},"requests":{"cpu":"750m","memory":"6Gi"}}}]}}}}'
Where:
  • ES_MAX_HEAP - Sets the maximum JVM heap space used by the Elasticsearch process. This must be less than resources.limits.memory, and should generally be about 1/2 - 2/3 of the limit value
  • resources.limits.memory - Sets the maximum memory OpenShift/Kubernetes will give the pod
  • resources.requests.memory - This is the guaranteed amount of memory OpenShift/Kubernetes reserves for the pod. It MUST be less than or equal to the limit, and should be greater than ES_MAX_HEAP
All three settings are needed to alleviate the OOM conditions.
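To confirm the patch took effect, the same label selector can be used to read the settings back. A sketch, assuming an oc CLI logged in to the cluster and that elastic is the first (or only) container in the statefulset:

```shell
# Skip gracefully where the oc CLI or a cluster login is unavailable.
command -v oc >/dev/null 2>&1 && oc whoami >/dev/null 2>&1 || { echo "oc not available or not logged in; skipping"; exit 0; }

# Same selector as the patch command above.
STS=$(oc get statefulsets -l app.kubernetes.io/component=elastic -o jsonpath='{.items[0].metadata.name}')

# List the environment on the elastic container; ES_MAX_HEAP should read 4g.
oc set env statefulset/"$STS" --list --containers=elastic

# Read back the resource requests/limits (assumes elastic is the first container).
oc get statefulset "$STS" -o jsonpath='{.spec.template.spec.containers[0].resources}'; echo
```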
3) Add an additional elastic pod
In addition to the above steps, we might need or want to add another elastic pod (that is, scale to 2 elastic replicas).
Assuming there are multiple collections, this helps distribute the indices across the pods, so less memory is needed per pod.

To add an additional elastic pod

oc scale sts <helm-release-name>-watson-discovery-elastic --replicas=2
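After scaling, it may help to confirm that the second replica comes up and stays ready. A sketch, reusing the app.kubernetes.io/component=elastic label from the patch step rather than the <helm-release-name> placeholder:

```shell
# Skip gracefully where the oc CLI or a cluster login is unavailable.
command -v oc >/dev/null 2>&1 && oc whoami >/dev/null 2>&1 || { echo "oc not available or not logged in; skipping"; exit 0; }

# List the elastic pods; both replicas should eventually reach Running/Ready.
oc get pods -l app.kubernetes.io/component=elastic

# Report how many replicas the statefulset considers ready (expect 2).
oc get sts -l app.kubernetes.io/component=elastic -o jsonpath='{.items[0].status.readyReplicas}'; echo
```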

Document Location

Worldwide

[{"Business Unit":{"code":"BU053","label":"Cloud & Data Platform"},"Product":{"code":"SSCLA6","label":"Watson Discovery"},"Component":"elastic search","Platform":[{"code":"PF016","label":"Linux"}],"Version":"2.1.1","Edition":"","Line of Business":{"code":"","label":""}}]

Product Synonym

WD

Document Information

Modified date:
13 March 2020

UID

ibm15737749