IBM Support

Watson Studio Local - fluentd process killed by OOM killer

Troubleshooting


Problem

The fluentd process in the pods gets killed by the Linux OOM killer. The problem cannot be resolved by increasing the amount of RAM on the worker nodes, and when the OOM occurs the entire node crashes.

Below is a sample of the dmesg messages that can be found:

[Thu Jul 16 06:30:54 2020] Memory cgroup out of memory: Kill process 34835 (fluentd) score 1000 or sacrifice child
[Thu Jul 16 06:30:54 2020] Killed process 34835 (fluentd) total-vm:103616kB, anon-rss:52992kB, file-rss:10304kB, shmem-rss:0kB
[Thu Jul 16 06:30:54 2020] oom_reaper: reaped process 34835 (fluentd), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
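
To confirm that the fluentd container is hitting its cgroup memory limit rather than exhausting node memory, compare its current usage against its configured limit. The commands below are a minimal sketch; they assume the fluentd DaemonSet runs in the sysibm-adm namespace (as shown in the resolution below) and that cluster metrics are available for kubectl top:

# Show the configured memory requests/limits on the fluentd DaemonSet
kubectl -n sysibm-adm describe ds fluentd-es-ds

# Show current memory usage of the fluentd pods (requires metrics to be available)
kubectl -n sysibm-adm top pods | grep fluentd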

Resolving The Problem

Adding more memory to the worker nodes will not fix this. The OOM message indicates that the fluentd container breached its cgroup memory limit, so the best approach is to increase the memory limit on the fluentd DaemonSet:
kubectl edit ds -n sysibm-adm fluentd-es-ds 

And change:

      - name: fluentd-elasticsearch
        resources:
          limits:
            memory: 200Mi
          requests:
            cpu: 100m
            memory: 200Mi
        securityContext:
          privileged: true
          procMount: Default

to:

        resources:
          limits:
            memory: 400Mi
By default the limit is set to 200Mi. Change it to 400Mi and see if that stops the OOM; if not, change it to 800Mi and try again.
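
As an alternative to editing the DaemonSet interactively, the same change can be applied with kubectl patch. This is a minimal sketch that assumes the container is named fluentd-elasticsearch, as in the snippet above; adjust the memory value (400Mi or 800Mi) as needed:

kubectl -n sysibm-adm patch ds fluentd-es-ds --type=strategic \
  -p '{"spec":{"template":{"spec":{"containers":[{"name":"fluentd-elasticsearch","resources":{"limits":{"memory":"400Mi"}}}]}}}}'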
After making the above change, the fluentd pods need to be restarted.
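
Updating the DaemonSet pod template normally triggers a rolling restart on its own; if the fluentd pods do not restart automatically, they can be deleted so the DaemonSet controller recreates them with the new limit. A minimal sketch that avoids assuming a specific pod label:

# Delete the existing fluentd pods; the DaemonSet recreates them with the new memory limit
kubectl -n sysibm-adm get pods -o name | grep fluentd | xargs kubectl -n sysibm-adm delete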

Document Location

Worldwide

[{"Business Unit":{"code":"BU059","label":"IBM Software w\/o TPS"},"Product":{"code":"SSHGWL","label":"IBM Watson Studio Local"},"ARM Category":[{"code":"a8m0z000000bmvTAAQ","label":"Admin->Node admin"}],"ARM Case Number":"TS003883207","Platform":[{"code":"PF025","label":"Platform Independent"}],"Version":"All Version(s)","Line of Business":{"code":"LOB10","label":"Data and AI"}}]

Document Information

Modified date:
29 July 2020

UID

ibm16254295