Troubleshooting
Problem
The "eauth" process is consuming a large amount of memory on the LSF master host.
Symptom
You observe that a long-running "eauth" process has consumed a very large amount of memory on the LSF master host, for example:
$ top
top - 13:18:07 up 41 days, 9:11, 1 user, load average: 6.19, 7.62, 8.51
Tasks: 475 total, 4 running, 470 sleeping, 0 stopped, 1 zombie
Cpu(s): 47.6%us, 2.7%sy, 0.0%ni, 48.0%id, 0.4%wa, 0.0%hi, 1.4%si, 0.0%st
Mem: 268435456k total, 67108864k used, 201326592k free, 524288k buffers
Swap: 67108864k total, 0k used, 67108864k free, 8388608k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1234 root 20 0 16.7g 16g 700 S 0.3 6.3 251:21.57 eauth
4001 root 20 0 4214m 3.1g 6772 R 88.0 1.2 41905:00 mbatchd
29328 lsfadmin 20 0 3415m 3.2g 3740 R 60.4 1.3 48:43.12 mbschd
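When comparing readings like the one above over time, it can help to convert top's memory columns to a single unit. The following is a minimal sketch, assuming top's default scaling, in which a bare number in VIRT/RES/SHR is KiB and the m/g suffixes mean MiB/GiB. The function names are hypothetical and are not part of LSF.

```python
def top_mem_to_kib(value: str) -> float:
    """Convert a top memory field such as '16g', '4214m' or '700' to KiB.

    Assumes top's default scaling: a bare number is KiB, 'm' is MiB,
    'g' is GiB and 't' is TiB.
    """
    multipliers = {"k": 1, "m": 1024, "g": 1024 ** 2, "t": 1024 ** 3}
    value = value.strip().lower()
    if value and value[-1] in multipliers:
        return float(value[:-1]) * multipliers[value[-1]]
    return float(value)


def rss_by_command(top_lines):
    """Map each COMMAND in top's process rows to its RES value in KiB."""
    result = {}
    for line in top_lines:
        fields = line.split()
        # Process rows have 12 fields and start with a numeric PID;
        # the "PID USER PR ..." header row is skipped by the isdigit() check.
        if len(fields) == 12 and fields[0].isdigit():
            result[fields[11]] = top_mem_to_kib(fields[5])  # RES column
    return result


rows = [
    "PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND",
    "1234 root 20 0 16.7g 16g 700 S 0.3 6.3 251:21.57 eauth",
    "4001 root 20 0 4214m 3.1g 6772 R 88.0 1.2 41905:00 mbatchd",
]
print(rss_by_command(rows))
```

For instance, the eauth RES value of 16g is 16,777,216 KiB, about 6.25% of the 268435456k total, which top rounds to the 6.3 shown in the %MEM column.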
Cause
The "eauth" binary has a memory leak that is triggered when a large number of jobs are submitted to the cluster. In that situation, the memory used by "eauth" grows continuously until "badmin mbdrestart" is run. That command kills the old "eauth" process, and a new "eauth" process is started when the next job is submitted. The new process uses little memory at first, but its memory grows again whenever a large number of jobs are submitted to the cluster.
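On a Linux master host, the growth described above can also be tracked without top by reading each process's resident set size from /proc. The following is a minimal sketch, assuming Linux procfs (/proc/<pid>/comm and the VmRSS line of /proc/<pid>/status). The function names and the 1 GiB growth threshold are hypothetical choices for illustration, not part of LSF.

```python
import os


def read_vmrss_kib(status_path):
    """Return VmRSS in KiB from a /proc/<pid>/status file, or None."""
    with open(status_path) as f:
        for line in f:
            if line.startswith("VmRSS:"):
                # Line format: "VmRSS:   16777216 kB"
                return int(line.split()[1])
    # Kernel threads have no VmRSS line.
    return None


def eauth_rss_kib(proc_root="/proc", name="eauth"):
    """Map PID -> resident memory (KiB) for every process named `name`."""
    usage = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "comm")) as f:
                if f.read().strip() != name:
                    continue
            rss = read_vmrss_kib(os.path.join(proc_root, entry, "status"))
        except OSError:
            # The process exited between listdir() and open(); skip it.
            continue
        if rss is not None:
            usage[int(entry)] = rss
    return usage


def looks_like_leak(samples_kib, min_growth_kib=1024 * 1024):
    """Heuristic: samples never shrink and grow by at least min_growth_kib."""
    if len(samples_kib) < 2:
        return False
    never_shrinks = all(b >= a for a, b in zip(samples_kib, samples_kib[1:]))
    return never_shrinks and samples_kib[-1] - samples_kib[0] >= min_growth_kib
```

Sampling eauth_rss_kib() periodically and passing one PID's readings to looks_like_leak() flags the steady growth described above. After "badmin mbdrestart", the new "eauth" process has a different PID, so its samples start a new series.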
Resolving The Problem
A fix pack that resolves this issue is available. Download fix pack lsf-9.1.3-build409434 from IBM Support Fix Central, then follow the instructions in the fix pack's Readme file to install it in your cluster.
Document Information
Modified date: 17 June 2018
UID: isg3T1023949