IBM Support

Watson AI Services: Watson Discovery - Elasticsearch pod is in CrashLoopBackOff

Troubleshooting


Problem

The Watson Discovery system is unstable. Various pods continue to restart. The Elasticsearch pod is in CrashLoopBackOff.

Symptom

The Elasticsearch pod is in CrashLoopBackOff.

Cause

In the <helm-release-name>-watson-discovery-elastic-0-elastic.log:

2020-03-11 14:36:39,608 main ERROR Recovering from StringBuilderEncoder.encode('[2020-03-11T14:36:39,591][ERROR][o.e.b.Bootstrap          ] [disco-watson-discovery-elastic-0] [{}] Exception
java.lang.IllegalStateException: Failed to create node environment
	at org.elasticsearch.node.Node.<init>(Node.java:299) ~[elasticsearch-6.5.2.jar:6.5.2]
	at org.elasticsearch.node.Node.<init>(Node.java:265) ~[elasticsearch-6.5.2.jar:6.5.2]
	at org.elasticsearch.bootstrap.Bootstrap$5.<init>(Bootstrap.java:212) ~[elasticsearch-6.5.2.jar:6.5.2]
	at org.elasticsearch.bootstrap.Bootstrap.setup(Bootstrap.java:212) ~[elasticsearch-6.5.2.jar:6.5.2]
	at org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:333) [elasticsearch-6.5.2.jar:6.5.2]
	at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:136) [elasticsearch-6.5.2.jar:6.5.2]
	at org.elasticsearch.bootstrap.Elasticsearch.execute(Elasticsearch.java:127) [elasticsearch-6.5.2.jar:6.5.2]
	at org.elasticsearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:86) [elasticsearch-6.5.2.jar:6.5.2]
	at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:124) [elasticsearch-cli-6.5.2.jar:6.5.2]
	at org.elasticsearch.cli.Command.main(Command.java:90) [elasticsearch-cli-6.5.2.jar:6.5.2]
	at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:93) [elasticsearch-6.5.2.jar:6.5.2]
	at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:86) [elasticsearch-6.5.2.jar:6.5.2]
Caused by: java.io.IOException: No space left on device

This is often caused by OOMs (out-of-memory errors) within the elastic pod, which generate heap dumps in the /wexdata/logs/dump directory and fill up the Elasticsearch data PVC.
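To confirm this is what is happening, it can help to check how full the data volume is and how much of it the dump directory occupies. A minimal sketch, using the pod name from the log above (substitute your own release name) and assuming an oc CLI logged in to the cluster:

```shell
# Skip gracefully where the oc CLI or a cluster login is unavailable.
command -v oc >/dev/null 2>&1 && oc whoami >/dev/null 2>&1 || { echo "oc not available or not logged in; skipping"; exit 0; }

# Pod name taken from the log above; adjust to your helm release name.
POD=disco-watson-discovery-elastic-0

# Overall usage of the data volume -- 100% here matches the
# "No space left on device" error in the stack trace.
oc exec -t "$POD" -- df -h /wexdata

# Size of the heap-dump directory specifically.
oc exec -t "$POD" -- du -sh /wexdata/logs/dump
```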

Environment

Discovery 2.1.1
CP4D 2.5

Resolving The Problem

1) To alleviate the disk-space error caused by the accumulated OOM heap dumps, you must delete the contents of /wexdata/logs/dump on the Elasticsearch node.

Issuing the following may allow elastic to start, provided the command can run while the pod is up, before it crashes again.

oc exec -t disco-watson-discovery-elastic-0 -- rm /wexdata/logs/dump/*
Note: this one-liner has been seen NOT to work. Because oc exec runs the command directly, without a shell inside the container, the wildcard is never expanded and rm receives the literal string. The error is:
rm: cannot remove ‘/wexdata/logs/dump/*’: No such file or directory
command terminated with exit code 1
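The failure is a consequence of how exec works: with no shell inside the container, the asterisk is never expanded. A minimal local illustration of the same behavior, using a hypothetical /tmp/dump-demo directory (no cluster needed):

```shell
# Create a throwaway directory with two fake heap dumps.
mkdir -p /tmp/dump-demo && touch /tmp/dump-demo/heap1.hprof /tmp/dump-demo/heap2.hprof

# Without a shell, rm receives the literal string and fails -- this mirrors
# the "No such file or directory" error from the oc exec one-liner.
rm '/tmp/dump-demo/*' 2>/dev/null || echo "literal glob: rm failed as expected"

# With a shell, the glob expands and the files are removed.
sh -c 'rm -rf /tmp/dump-demo/*'
ls -A /tmp/dump-demo   # prints nothing: the directory is now empty
```

Accordingly, wrapping the rm in a shell inside the container should let the wildcard expand (the transcript below shows bash is present in the image): oc exec -t disco-watson-discovery-elastic-0 -- bash -c 'rm -rf /wexdata/logs/dump/*'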
It can also be accomplished with the following sequence:
# Log into the (elastic search) container's bash shell
[root@wd2-master-1 ~]# kubectl exec -it zen-wd-watson-discovery-elastic-0 bash
# change directories to /wexdata/logs/dump

[dadmin@zen-wd-watson-discovery-elastic-0 /]$ cd /wexdata/logs/dump/
# Issue the rm command
[dadmin@zen-wd-watson-discovery-elastic-0 dump]$ rm -rf *
# Confirm the directory is empty
[dadmin@zen-wd-watson-discovery-elastic-0 dump]$ ls
[dadmin@zen-wd-watson-discovery-elastic-0 dump]$
This process will hopefully allow the pod to start; however, it does nothing to address the underlying cause.
2) Increase the size of ES_MAX_HEAP

Given that there are OOMs, we might want to proactively increase the memory for this pod.

To do so, issue:

oc patch statefulset $(oc get statefulsets -l app.kubernetes.io/component=elastic -o jsonpath='{.items[0].metadata.name}') -p \
  '{"spec":{"template":{"spec":{"containers":[{"name":"elastic","env":[{"name":"ES_MAX_HEAP","value":"4g"}],"resources":{"limits":{"cpu":"99","memory":"8Gi"},"requests":{"cpu":"750m","memory":"6Gi"}}}]}}}}'
Where:
  • ES_MAX_HEAP - Sets the maximum JVM heap space used by the Elasticsearch process. This must be less than resources.limits.memory, and should generally be about 1/2 - 2/3 of the limit value
  • resources.limits.memory - Sets the maximum memory OpenShift/Kubernetes will give the pod
  • resources.requests.memory - This is the guaranteed amount of memory OpenShift/Kubernetes reserves for the pod. It MUST be less than or equal to the limit, and should be greater than ES_MAX_HEAP
All three settings are needed to alleviate the OOM conditions.
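To confirm the patch took effect, the same label selector can be used to read the settings back. A sketch, assuming an oc CLI logged in to the cluster and that elastic is the first (or only) container in the statefulset:

```shell
# Skip gracefully where the oc CLI or a cluster login is unavailable.
command -v oc >/dev/null 2>&1 && oc whoami >/dev/null 2>&1 || { echo "oc not available or not logged in; skipping"; exit 0; }

# Same selector as the patch command above.
STS=$(oc get statefulsets -l app.kubernetes.io/component=elastic -o jsonpath='{.items[0].metadata.name}')

# List the environment on the elastic container; ES_MAX_HEAP should read 4g.
oc set env statefulset/"$STS" --list --containers=elastic

# Read back the resource requests/limits (assumes elastic is the first container).
oc get statefulset "$STS" -o jsonpath='{.spec.template.spec.containers[0].resources}'; echo
```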
3) Add an additional elastic pod
In addition to the above steps, we might need or want to add another elastic pod (that is, scale to 2 elastic replicas).
Assuming there are multiple collections, this helps distribute the indices across the pods, so less memory is needed per pod.

To add an additional elastic pod

oc scale sts <helm-release-name>-watson-discovery-elastic --replicas=2
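After scaling, it may help to confirm that the second replica comes up and stays ready. A sketch, reusing the app.kubernetes.io/component=elastic label from the patch step rather than the <helm-release-name> placeholder:

```shell
# Skip gracefully where the oc CLI or a cluster login is unavailable.
command -v oc >/dev/null 2>&1 && oc whoami >/dev/null 2>&1 || { echo "oc not available or not logged in; skipping"; exit 0; }

# List the elastic pods; both replicas should eventually reach Running/Ready.
oc get pods -l app.kubernetes.io/component=elastic

# Report how many replicas the statefulset considers ready (expect 2).
oc get sts -l app.kubernetes.io/component=elastic -o jsonpath='{.items[0].status.readyReplicas}'; echo
```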

Document Location

Worldwide

[{"Business Unit":{"code":"BU053","label":"Cloud & Data Platform"},"Product":{"code":"SSCLA6","label":"Watson Discovery"},"Component":"elastic search","Platform":[{"code":"PF016","label":"Linux"}],"Version":"2.1.1","Edition":"","Line of Business":{"code":"","label":""}}]

Product Synonym

WD

Document Information

Modified date:
13 March 2020

UID

ibm15737749