Memory and swap limit enforcement based on the Linux cgroup memory subsystem
All LSF job processes are controlled by the Linux cgroup system so that cgroup memory and swap limits cannot be exceeded. These limits are enforced on a per-job and per-host basis, not per task, on Red Hat Enterprise Linux (RHEL) 6.2 or later and SUSE Linux Enterprise Server 11 SP2 or later. LSF enforces memory and swap limits for jobs by periodically collecting job usage and comparing it with the limits set by users.
LSF can impose strict host-level memory and swap limits on systems that support Linux cgroup v1 or cgroup v2. Different LSF hosts in the cluster can use different versions of cgroup as long as each individual host runs only one version. For example, hostA can use cgroup v1 and hostB can use cgroup v2. If both versions of cgroup are enabled on a host, you must disable one of them.
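To confirm which cgroup version a particular host is running, you can check the filesystem type mounted at /sys/fs/cgroup. This is a general Linux technique, not an LSF command, and the `cgroup_version` helper name is illustrative:

```shell
#!/bin/sh
# Map the filesystem type mounted at /sys/fs/cgroup to the cgroup version.
# A pure cgroup v2 host mounts cgroup2fs there; a cgroup v1 host mounts a
# tmpfs with the individual controllers below it.
cgroup_version() {
  case "$1" in
    cgroup2fs) echo "v2" ;;
    tmpfs)     echo "v1" ;;
    *)         echo "unknown" ;;
  esac
}

# On a live host: cgroup_version "$(stat -fc %T /sys/fs/cgroup/)"
cgroup_version cgroup2fs   # prints v2
cgroup_version tmpfs       # prints v1
```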
To enable memory enforcement through the Linux cgroup memory subsystem, configure the LSB_RESOURCE_ENFORCE="memory" parameter in the lsf.conf file.
If the host OS is Red Hat Enterprise Linux 6.3 or later, cgroup memory limits are enforced, and LSF is notified to terminate the job when the limits are exceeded. Additional notification is provided to users through specific termination reasons that are displayed by bhist -l.
To change (for example, reduce) the cgroup memory limits for running jobs, use bmod -M to change a job's memory limit and bmod -v to change its swap limit. When you request a change to a job's memory or swap limits, LSF modifies the job's resource requirement to accommodate your request and modifies the job's cgroup limit settings. If the OS rejects the cgroup memory or swap limit modification, LSF posts a message to the job to indicate that the cgroup is not changed. After the cgroup limit changes, the OS can adjust the job's memory or swap allocation. As a best practice, do not decrease the cgroup memory or swap limit to less than what your application uses.
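For example, the limits of a running job might be reduced as follows (the job ID 12345 is illustrative, and units follow the MB values used in the bsub examples in this section):

```
bmod -M 80 12345    # reduce the job's memory limit to 80 MB
bmod -v 40 12345    # reduce the job's swap limit to 40 MB
```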
Setting LSB_RESOURCE_ENFORCE="memory" automatically sets the following parameters:
- LSF_PROCESS_TRACKING=Y
- LSF_LINUX_CGROUP_ACCT=Y
LSF_LINUX_CGROUP_ACCT=Y turns on cgroup accounting, which provides more accurate memory and swap consumption data for memory and swap enforcement checking. LSF_PROCESS_TRACKING=Y enables LSF to kill jobs cleanly after memory and swap limits are exceeded.
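Taken together, a minimal lsf.conf fragment for cgroup-based memory enforcement might look like the following. The tracking and accounting parameters are set automatically when memory enforcement is enabled; they are shown here only for clarity:

```
# lsf.conf
LSB_RESOURCE_ENFORCE="memory"
# Set automatically when LSB_RESOURCE_ENFORCE="memory" is configured:
LSF_PROCESS_TRACKING=Y
LSF_LINUX_CGROUP_ACCT=Y
```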
Example
Submit a parallel job with 3 tasks and a memory limit of 100 MB, with span[ptile=2] so that 2 tasks can run on one host and 1 task can run on another host:
bsub -n 3 -M 100 -R "span[ptile=2]" blaunch ./mem_eater
The application mem_eater keeps increasing the memory usage.
LSF kills the job at any point in time that it consumes more than 200 MB total memory on hosta or more than 100 MB total memory on hostb. For example, if at any time 2 tasks run on hosta and 1 task runs on hostb, the job is killed only if the total memory consumed by the 2 tasks on hosta exceeds 200 MB or the memory consumed by the task on hostb exceeds 100 MB.
LSF does not support per task memory enforcement for cgroups. For example, if one of the tasks on hosta consumes 150 MB memory and the other task consumes only 10 MB, the job is not killed because, at that point in time, the total memory that is consumed by the job on hosta is only 160 MB.
Memory enforcement does not apply to accumulated memory usage. For example, suppose two tasks consume a maximum of 250 MB in total on hosta: the maximum memory rusage of task1 on hosta is 150 MB and the maximum memory rusage of task2 on hosta is 100 MB, but these peaks never occur at the same time. At any given time the two tasks consume less than 200 MB, so the job is not killed. The job would be killed only if, at a specific point in time, the two tasks consume more than 200 MB on hosta.
Swap limits are enforced in combination with memory limits. For example, for the following job submission:
bsub -M 100 -v 50 ./mem_eater
After the application uses more than 100 MB of memory, the cgroup will start to use swap for the job process. The job is not killed until the application reaches 150 MB memory usage (100 MB memory + 50 MB swap).
The following job specifies only a swap limit:
bsub -v 50 ./mem_eater
Because no memory limit is specified, LSF considers the memory limit to be the same as the swap limit. The job is killed when it reaches 50 MB combined memory and swap usage.
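The combined threshold at which the job is killed can be sketched as follows. Values are in MB, the helper name is illustrative, and this mirrors only the two cases described above, not LSF's internal code:

```shell
#!/bin/sh
# Combined memory + swap threshold at which the cgroup kills the job.
# Args: memory limit (-M) in MB, or "" if not set; swap limit (-v) in MB.
kill_threshold() {
  mem=$1; swap=$2
  if [ -z "$mem" ]; then
    # No -M: the memory limit is treated as equal to the swap limit,
    # so the job is killed at the swap limit alone.
    echo "$swap"
  else
    # Both -M and -v: killed at memory limit + swap limit.
    echo $((mem + swap))
  fi
}

kill_threshold 100 50   # prints 150 (bsub -M 100 -v 50)
kill_threshold ""  50   # prints 50  (bsub -v 50)
```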
Host-based memory and swap limit enforcement by Linux cgroups
When the LSB_RESOURCE_ENFORCE="memory" parameter is configured in the lsf.conf file, and memory and swap limits are specified for the job (at the job level with -M and -v, or in lsb.queues or lsb.applications with MEMLIMIT and SWAPLIMIT), the limits are calculated and enforced as a multiple of the number of tasks running on the execution host.
The bsub -hl option enables job-level host-based memory and swap limit enforcement regardless of the number of tasks running on the execution host. The LSB_RESOURCE_ENFORCE="memory" parameter must be specified in lsf.conf for host-based memory and swap limit enforcement with the -hl option to take effect.
If no memory or swap limit is specified for the job (that is, no merged limit from the job, queue, or application profile), or if the LSB_RESOURCE_ENFORCE="memory" parameter is not specified, no host-based memory limit is set for the job. The -hl option applies only to memory and swap limits; it does not apply to any other resource usage limits.
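The per-host limit that results can be sketched as follows. Values are in MB and the helper name is illustrative; this restates the rule above rather than LSF's internal code:

```shell
#!/bin/sh
# Per-host memory limit enforced through the cgroup.
# Args: per-task memory limit (MB), number of tasks on the host,
#       whether bsub -hl was used (y/n).
per_host_limit() {
  limit=$1; ntasks=$2; hl=$3
  if [ "$hl" = "y" ]; then
    # bsub -hl: host-based limit, independent of the task count.
    echo "$limit"
  else
    # Default: the limit is multiplied by the tasks on the host.
    echo $((limit * ntasks))
  fi
}

per_host_limit 100 2 n   # prints 200 (the span[ptile=2] example above)
per_host_limit 100 2 y   # prints 100 (with bsub -hl)
```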
Limitations and known issues
- For parallel jobs, cgroup limits are only enforced for jobs that are launched through the LSF blaunch framework. Parallel jobs that are launched through LSF PAM/taskstarter are not supported.
- On RHEL 6.2, LSF cannot receive notification from the cgroup that memory and swap limits are exceeded. When job memory and swap limits are exceeded, LSF cannot guarantee that the job is killed. On RHEL 6.3, LSF does receive notification and kills the job.
- On RHEL 6.2, a multithreaded application becomes a zombie process if the application is killed by the cgroup due to memory enforcement. As a result, LSF cannot collect the user application's exit status, LSF processes hang, and LSF never detects that the job exited, so the job appears to run forever.