Configuring the LSF integration on Cray Linux
Set configuration parameters for the LSF integration on Cray Linux.
Procedure
- Modify $LSF_ENVDIR/lsf.conf.
Some of the following parameters may have been added by the LSF installation; a consolidated example follows this list:
- LSB_SHAREDIR=/ufs/lsfhpc/work - A shared file system that is accessible by root and the LSF administrator on both management hosts and Cray Linux login/service nodes.
- LSF_LOGDIR=/ufs/lsfhpc/log - A shared file system that is accessible by root and the LSF administrator on both management hosts and Cray Linux login/service nodes.
- LSF_LIVE_CONFDIR=/ufs/lsfhpc/work/<cluster_name>/live_confdir - A shared file system that is accessible by root and the LSF administrator on both management hosts and Cray Linux login/service nodes.
- LSB_RLA_PORT=21787 - A unique port number.
- LSB_SHORT_HOSTLIST=1
- LSF_ENABLE_EXTSCHEDULER=Y
- LSB_SUB_COMMANDNAME=Y
- LSF_CRAY_PS_CLIENT=/usr/bin/apbasil
- LSF_LIMSIM_PLUGIN="liblimsim_craylinux"
- LSF_CRAYLINUX_FRONT_NODES="nid00060 nid00062" - A list of Cray Linux login/service nodes with LSF daemons started and running.
- LSF_CRAYLINUX_FRONT_NODES_POLL_INTERVAL=120 - Interval, in seconds, at which the management host LIM polls RLA to query compute node status and configuration information. The default value is 120 seconds; any value less than 120 seconds is reset to the default.
- LSB_MIG2PEND=1
- LSF_CRAY_RUR_DIR=/ufs/lsfhpc/work/<cluster_name>/craylinux/<cray_machine_name>/rur - Location of the RUR data files, which must be on a shared file system that is accessible from any potential first execution host. The RUR data file for jobs that are submitted by all users is named rur.output. Job-specific RUR data files are named rur.<jobID>. The default value is LSF_SHARED_DIR/<cluster_name>/craylinux/<cray_machine_name>/rur.
You can use the %U special character to represent the home directory of the user that submitted the job. For example, if you specify LSF_CRAY_RUR_DIR=%U/.rur, and userA and userB submitted jobs, the RUR data files are located in /home/userA/.rur for userA and /home/userB/.rur for userB.
- LSF_CRAY_RUR_PROLOG_PATH=<path_to_rur_prologue.py> - File path to the RUR prolog script file. The default value is /opt/cray/rur/default/bin/rur_prologue.py. Note: LSF runs the prolog script file with the -j <jobID> option, so the prolog script file must support the -j option with the job ID as its argument.
- LSF_CRAY_RUR_EPILOG_PATH=<path_to_rur_epilogue.py> - File path to the RUR epilog script file. The default value is /opt/cray/rur/default/bin/rur_epilogue.py. Note: LSF runs the epilog script file with the -j <jobID> option, so the epilog script file must support the -j option with the job ID as its argument.
- From a Cray login node, run the $LSF_BINDIR/genVnodeConf command.
This command generates a list of compute nodes in BATCH mode. Add the generated compute nodes to the HOST section in $LSF_ENVDIR/lsf.cluster.<cluster_name>; a short sketch of this workflow follows the listing.
HOSTNAME   model   type   server   r1m   mem   swp   RESOURCES
nid00038   !       !      1        3.5   ()    ()    (craylinux vnode)
nid00039   !       !      1        3.5   ()    ()    (craylinux vnode)
nid00040   !       !      1        3.5   ()    ()    (craylinux vnode)
nid00041   !       !      1        3.5   ()    ()    (craylinux vnode)
nid00042   !       !      1        3.5   ()    ()    (craylinux vnode gpu)
nid00043   !       !      1        3.5   ()    ()    (craylinux vnode gpu)
nid00044   !       !      1        3.5   ()    ()    (craylinux vnode)
nid00045   !       !      1        3.5   ()    ()    (craylinux vnode)
nid00046   !       !      1        3.5   ()    ()    (craylinux vnode)
nid00047   !       !      1        3.5   ()    ()    (craylinux vnode)
nid00048   !       !      1        3.5   ()    ()    (craylinux vnode)
nid00049   !       !      1        3.5   ()    ()    (craylinux vnode)
nid00050   !       !      1        3.5   ()    ()    (craylinux vnode)
nid00051   !       !      1        3.5   ()    ()    (craylinux vnode)
nid00052   !       !      1        3.5   ()    ()    (craylinux vnode gpu)
nid00053   !       !      1        3.5   ()    ()    (craylinux vnode gpu)
nid00054   !       !      1        3.5   ()    ()    (craylinux vnode)
nid00055   !       !      1        3.5   ()    ()    (craylinux vnode)
nid00056   !       !      1        3.5   ()    ()    (craylinux vnode)
nid00057   !       !      1        3.5   ()    ()    (craylinux vnode)
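The following sketch shows one way to capture and apply the generated list, assuming genVnodeConf writes the host lines to standard output as in the listing above; the temporary file name is only an illustration.
# Run on a Cray login node and save the generated host list for review
$LSF_BINDIR/genVnodeConf > /tmp/vnode_hosts.txt
# After reviewing the list, paste the host lines into the
# Begin Host ... End Host section of $LSF_ENVDIR/lsf.cluster.<cluster_name>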
- Configure $LSF_ENVDIR/hosts.
Make sure that the IP addresses of the compute nodes do not conflict with any IP address that is already in use; a quick duplicate check is sketched after the listing.
cat $LSF_ENVDIR/hosts
10.128.0.34     nid00033 c0-0c1s0n3 sdb001 sdb002
10.128.0.61     nid00060 c0-0c1s1n0 login login1 castor-p2
10.128.0.36     nid00035 c0-0c1s1n3
10.128.0.59     nid00058 c0-0c1s2n0
10.128.0.38     nid00037 c0-0c1s2n3
10.128.0.57     nid00056 c0-0c1s3n0
10.128.0.58     nid00057 c0-0c1s3n1
10.128.0.39     nid00038 c0-0c1s3n2
10.128.0.40     nid00039 c0-0c1s3n3
10.128.0.55     nid00054 c0-0c1s4n0
10.128.0.56     nid00055 c0-0c1s4n1
10.128.0.41     nid00040 c0-0c1s4n2
10.128.0.42     nid00041 c0-0c1s4n3
10.128.0.53     nid00052 c0-0c1s5n0
10.128.0.54     nid00053 c0-0c1s5n1
10.128.0.43     nid00042 c0-0c1s5n2
10.128.0.44     nid00043 c0-0c1s5n3
10.128.0.51     nid00050 c0-0c1s6n0
10.128.0.52     nid00051 c0-0c1s6n1
10.128.0.45     nid00044 c0-0c1s6n2
10.128.0.46     nid00045 c0-0c1s6n3
10.128.0.49     nid00048 c0-0c1s7n0
10.128.0.50     nid00049 c0-0c1s7n1
10.128.0.47     nid00046 c0-0c1s7n2
10.128.0.48     nid00047 c0-0c1s7n3
10.131.255.251  sdb sdb-p2 syslog ufs
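As a quick sanity check, the following sketch lists any IP address that appears more than once in the LSF hosts file; it does not detect conflicts with addresses used elsewhere on the network.
# Print duplicate IP addresses (first column) in $LSF_ENVDIR/hosts
awk '{print $1}' $LSF_ENVDIR/hosts | sort | uniq -d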
- Modify $LSF_ENVDIR/lsbatch/<cluster_name>/configdir/lsb.hosts.
Make sure to set a large number in the MXJ column for the Cray Linux login and service nodes that are also LSF server hosts. The number should be larger than the total number of PEs.
Begin Host
HOST_NAME   MXJ    r1m   pg   ls   tmp   DISPATCH_WINDOW   # Keywords
nid00060    9999   ()    ()   ()   ()    ()                # Example
nid00062    9999   ()    ()   ()   ()    ()                # Example
default     !      ()    ()   ()   ()    ()                # Example
End Host
- Modify $LSF_ENVDIR/lsbatch/<cluster_name>/configdir/lsb.queues.
- JOB_CONTROLS and RERUNNABLE are required.
- Comment out all loadSched/loadStop lines.
- DEFAULT_EXTSCHED and MANDATORY_EXTSCHED are optional.
- To run CCM jobs, you must get the pre-execution and post-execution binary files from Cray. Refer to Cray documentation to find these files.
Begin Queue
QUEUE_NAME   = normal
PRIORITY     = 30
NICE         = 20
PREEMPTION   = PREEMPTABLE
JOB_CONTROLS = SUSPEND[bmig $LSB_BATCH_JID]
RERUNNABLE   = Y
#RUN_WINDOW  = 5:19:00-1:8:30 20:00-8:30
#r1m = 0.7/2.0   # loadSched/loadStop
#r15m = 1.0/2.5
#pg = 4.0/8
#ut = 0.2
#io = 50/240
#CPULIMIT  = 180/hostA   # 3 hours of hostA
#FILELIMIT = 20000
#DATALIMIT = 20000       # jobs data segment limit
#CORELIMIT = 20000
#TASKLIMIT = 5           # job task limit
#USERS = all             # users who can submit jobs to this queue
#HOSTS = all             # hosts on which jobs in this queue can run
#PRE_EXEC  = /usr/local/lsf/misc/testq_pre >> /tmp/pre.out
#POST_EXEC = /usr/local/lsf/misc/testq_post |grep -v "Hey"
#REQUEUE_EXIT_VALUES = 55 34 78
#APS_PRIORITY = WEIGHT[[RSRC, 10.0] [MEM, 20.0] [PROC, 2.5] [QPRIORITY, 2.0]] \
#LIMIT[[RSRC, 3.5] [QPRIORITY, 5.5]] \
#GRACE_PERIOD[[QPRIORITY, 200s] [MEM, 10m] [PROC, 2h]]
DESCRIPTION = For normal low priority jobs, running only if hosts are lightly loaded.
End Queue

Begin Queue
QUEUE_NAME   = owners
PRIORITY     = 43
JOB_CONTROLS = SUSPEND[bmig $LSB_BATCH_JID]
RERUNNABLE   = YES
PREEMPTION   = PREEMPTIVE
NICE         = 10
#RUN_WINDOW  = 5:19:00-1:8:30 20:00-8:30
r1m = 1.2/2.6
#r15m = 1.0/2.6
#r15s = 1.0/2.6
pg  = 4/15
io  = 30/200
swp = 4/1
tmp = 1/0
#CPULIMIT  = 24:0/hostA  # 24 hours of hostA
#FILELIMIT = 20000
#DATALIMIT = 20000       # jobs data segment limit
#CORELIMIT = 20000
#TASKLIMIT = 5           # job task limit
#USERS = user1 user2
#HOSTS = hostA hostB
#ADMINISTRATORS = user1 user2
#PRE_EXEC  = /usr/local/lsf/misc/testq_pre >> /tmp/pre.out
#POST_EXEC = /usr/local/lsf/misc/testq_post |grep -v "Hey"
#REQUEUE_EXIT_VALUES = 55 34 78
DESCRIPTION = For owners of some machines, only users listed in the HOSTS \
section can submit jobs to this queue.
End Queue
- Modify $LSF_ENVDIR/lsf.shared.
Make sure that the following Boolean resources are defined in the RESOURCE section; a sketch of the surrounding section follows the list:
vnode       Boolean   ()   ()   (sim node)
gpu         Boolean   ()   ()   (gpu)
frontnode   Boolean   ()   ()   (login/service node)
craylinux   Boolean   ()   ()   (Cray XT/XE MPI)
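For context, the following sketch shows where these lines typically sit inside the Resource section of lsf.shared; the header row is the standard one, so adjust the layout to match your existing file.
Begin Resource
RESOURCENAME   TYPE      INTERVAL   INCREASING   DESCRIPTION            # Keywords
vnode          Boolean   ()         ()           (sim node)
gpu            Boolean   ()         ()           (gpu)
frontnode      Boolean   ()         ()           (login/service node)
craylinux      Boolean   ()         ()           (Cray XT/XE MPI)
End Resource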
- By default, LSF_CRAY_RUR_ACCOUNTING=Y is enabled so that LSF works with Cray Resource Utilization Reporting (RUR). If RUR is not installed in your environment, disable it by setting LSF_CRAY_RUR_ACCOUNTING=N in lsf.conf.
- Modify /etc/opt/cray/rur/rur.conf.
Disable the default prolog and epilog scripts by commenting out the following lines in the apsys section:
apsys
  # prologPath - location of the executable file to be run before application
  # prologPath /usr/local/adm/sbin/prolog
  # epilogPath - location of the executable file to be run after application
  # epilogPath /usr/local/adm/sbin/epilog
  # prologTimeout - time in seconds before prolog is aborted as "hung"
  # prologTimeout 10
  # epilogTimeout - time in seconds before epilog is aborted as "hung"
  # epilogTimeout 10
  # prologPath /opt/cray/rur/default/bin/rur_prologue.py
  # epilogPath /opt/cray/rur/default/bin/rur_epilogue.py
  # prologTimeout 100
  # epilogTimeout 100
/apsys
- Modify /etc/opt/cray/alps/alps.conf.
Disable the default prolog and epilog scripts by commenting out the following lines in the apsys section:
apsys
  # prologPath - location of the executable file to be run before application
  # prologPath /usr/local/adm/sbin/prolog
  # epilogPath - location of the executable file to be run after application
  # epilogPath /usr/local/adm/sbin/epilog
  # prologTimeout - time in seconds before prolog is aborted as "hung"
  # prologTimeout 10
  # epilogTimeout - time in seconds before epilog is aborted as "hung"
  # epilogTimeout 10
  # prologPath /opt/cray/rur/default/bin/rur_prologue.py
  # epilogPath /opt/cray/rur/default/bin/rur_epilogue.py
  # prologTimeout 100
  # epilogTimeout 100
/apsys
- Restart the alps daemon on the login nodes to apply the changes to the alps.conf and rur.conf files.
/etc/init.d/alps restart
- Use the service command to start and stop the LSF services as needed; a quick verification sketch follows the commands.
- service LSF-HPC start
- service LSF-HPC stop
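After the daemons are started, the following sketch is a quick sanity check using standard LSF commands, assuming the LSF environment is already sourced on the host where you run them.
lsid      # confirms the cluster name and the management host
lshosts   # compute nodes should report the craylinux and vnode resources
bhosts    # login/service nodes and compute nodes should show status ok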