Troubleshooting
Problem
The fluentd process in the pods gets killed by the Linux OOM killer. The problem cannot be resolved by increasing the amount of RAM on the worker nodes, and when the OOM kill occurs the entire node crashes.
Below is a sample of the dmesg messages that can be found:
[Thu Jul 16 06:30:54 2020] Memory cgroup out of memory: Kill process 34835 (fluentd) score 1000 or sacrifice child
[Thu Jul 16 06:30:54 2020] Killed process 34835 (fluentd) total-vm:103616kB, anon-rss:52992kB, file-rss:10304kB, shmem-rss:0kB
[Thu Jul 16 06:30:54 2020] oom_reaper: reaped process 34835 (fluentd), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
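Before changing anything, it can help to confirm that the container's own memory limit is being hit rather than the node running out of physical RAM. The commands below are a minimal sketch, assuming the pods are created by the fluentd-es-ds DaemonSet in the sysibm-adm namespace described later in this document, and that metrics-server is installed so that kubectl top works; <pod-name> is a placeholder to fill in:
# List the fluentd pods created by the DaemonSet
kubectl -n sysibm-adm get pods | grep fluentd-es-ds
# Show the configured memory limit of one pod (replace <pod-name>)
kubectl -n sysibm-adm describe pod <pod-name> | grep -A 3 Limits
# Show current memory usage of that pod (requires metrics-server)
kubectl -n sysibm-adm top pod <pod-name>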
Resolving The Problem
This kind of message will not be fixed by adding RAM to the node. The "Memory cgroup out of memory" message indicates that the fluentd container exceeded its configured memory limit, not that the node ran out of physical memory. The best approach is therefore to increase the memory limit on the fluentd DaemonSet:
kubectl edit ds -n sysibm-adm fluentd-es-ds
And change:
        name: fluentd-elasticsearch
        resources:
          limits:
            memory: 200Mi
          requests:
            cpu: 100m
            memory: 200Mi
        securityContext:
          privileged: true
          procMount: Default
to
        resources:
          limits:
            memory: 400Mi
By default the limit is set to 200Mi. Change it to 400Mi and check whether the OOM kills stop. If they do not, increase it to 800Mi and try again.
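If editing the DaemonSet interactively is not convenient, the same change can be applied with kubectl patch. This is a sketch under the assumption that the container in the DaemonSet is named fluentd-elasticsearch, as shown in the snippet above; adjust the value (400Mi or 800Mi) as needed:
# Strategic merge patch: containers are merged by name, so only the memory limit is changed
kubectl -n sysibm-adm patch daemonset fluentd-es-ds --type strategic \
  -p '{"spec":{"template":{"spec":{"containers":[{"name":"fluentd-elasticsearch","resources":{"limits":{"memory":"400Mi"}}}]}}}}'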
After making the above change, the fluentd pods need to be restarted.
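The restart can be done by letting the DaemonSet recreate its pods. Two common approaches, sketched here (kubectl rollout restart requires kubectl 1.15 or later; <pod-name> is a placeholder):
# Option 1: trigger a rolling restart of the DaemonSet
kubectl -n sysibm-adm rollout restart daemonset/fluentd-es-ds
# Option 2: delete the existing pods; the DaemonSet controller recreates them with the new limit
kubectl -n sysibm-adm get pods | grep fluentd-es-ds
kubectl -n sysibm-adm delete pod <pod-name>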
Document Location
Worldwide
[{"Business Unit":{"code":"BU059","label":"IBM Software w\/o TPS"},"Product":{"code":"SSHGWL","label":"IBM Watson Studio Local"},"ARM Category":[{"code":"a8m0z000000bmvTAAQ","label":"Admin->Node admin"}],"ARM Case Number":"TS003883207","Platform":[{"code":"PF025","label":"Platform Independent"}],"Version":"All Version(s)","Line of Business":{"code":"LOB10","label":"Data and AI"}}]
Document Information
Modified date:
29 July 2020
UID
ibm16254295