MPI jobs on Linux are often affected by the memory-locked (memlock) resource limit. The default setting of 64 KB for this limit will cause MPI jobs inside LSF to fail. Here is one example of a failed Intel MPI job running inside LSF.
The Intel MPI job failed with the error:
DAPL startup: RLIMIT_MEMLOCK too small
MPI startup(): dapl fabric is not available and fallback fabric is not enabled
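Before submitting a job, you can check the memlock limit that a shell on the host (and therefore any process it launches) would inherit. A value of 64 here corresponds to the 64 KB default that triggers the DAPL failure above:

```shell
# Show the current per-process memory-locked limit (reported in KB);
# "64" means the restrictive default, "unlimited" means the limit is lifted.
ulimit -l

# Soft and hard limits can differ; show each explicitly.
ulimit -S -l
ulimit -H -l
```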
Setting proper resource limits, such as the memory limit, stack limit, and memory-locked limit, is important for a job to run to completion. LSF allows resource limits to be configured for LSF jobs, but if you do not configure them, the default value for most resource limits inside LSF is "unlimited". Besides specifying resource limits at job submission or through LSF configuration files (see the man pages for lsb.queues, lsb.applications, bsub/bmod, etc.), "bsub -ul" provides a way to carry the resource limit settings from the job submission host to the job execution host.
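As a sketch of the two submission-time options just described (the job script name is a placeholder, and the limit values are only illustrative):

```shell
# Option 1: set explicit per-process limits at submission time
# (-M: memory limit, -S: stack limit; LSF interprets these in KB by default).
bsub -M 4096 -S 8192 ./my_mpi_job.sh

# Option 2: adjust limits in the submission shell, then let "bsub -ul"
# carry the submission host's current ulimit settings to the execution host.
ulimit -s 16384
bsub -ul ./my_mpi_job.sh
```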
However, some resource limits cannot be adjusted through LSF configuration or job submission, and the memory-locked limit is one of them. For those limits, the actual setting for an LSF job is inherited from the LSF service process, sbatchd or res.
To make MPI jobs run properly inside LSF, LSF 10.1 includes an enhancement that sets the memory-locked limit to "unlimited": LSF 10.1 adds "LimitMEMLOCK=infinity" to the LSF service unit file (/usr/lib/systemd/system/lsfd.service) under systemd. This way, when the LSF services start after system boot (or reboot), the sbatchd and res processes have the memory-locked limit set to "unlimited", so MPI jobs launched through sbatchd or res inherit the "unlimited" memory-locked limit.
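On older LSF versions, or if the shipped unit file has been customized, the same effect can be achieved with a systemd drop-in rather than editing the unit file directly. A minimal sketch, assuming the service is named lsfd.service as above:

```shell
# Create a drop-in override that raises the memlock limit for the LSF services.
mkdir -p /etc/systemd/system/lsfd.service.d
cat > /etc/systemd/system/lsfd.service.d/memlock.conf <<'EOF'
[Service]
LimitMEMLOCK=infinity
EOF

# Reload systemd and restart the LSF services so sbatchd/res pick up the limit.
systemctl daemon-reload
systemctl restart lsfd
```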
If the LSF services are managed through System V init.d instead, you can manually edit /etc/init.d/lsf and add "ulimit -l unlimited" at the beginning of the start_daemon() function.
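The init.d change can look like the following sketch (the actual body of start_daemon() varies by LSF version, so only the added first line is meaningful here):

```shell
# In /etc/init.d/lsf: raise the memlock limit before the daemons start,
# so sbatchd/res (and the jobs they launch) inherit "unlimited".
start_daemon() {
    ulimit -l unlimited   # added line: remove the 64 KB memlock cap
    # ... original start_daemon() body follows unchanged ...
}
```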
More information about setting resource limits for LSF jobs can be found at https://www.ibm.com/support/knowledgecenter/SSWRJV_10.1.0/lsf_admin/resource_usage_limits_specify.html.
Document Information
Modified date:
01 October 2019
UID
ibm10741127