We are using LoadL 3.4.1 on a 40-node p5-575 1600 cluster (each node with 16 P5+ procs) and we are trying to handle the following type of job:
An MPMD job with 5 different POE binaries, each of which is also an OMP parallel code, BUT with a different number of OMP threads per binary.
We are submitting this as an MPMD POE job type, but we have no (known to us) way of specifying to LL the different number of OMP threads for each different MPI code, since the ConsumableCpus( N ) is a single entry in the LL script.
How can one specify to LL the correct ConsumableCpus( ) for each specific POE binary without either oversubscribing the system (e.g. by using the min ConsumableCpus() for all POE binaries) or over-allocating resources (by specifying the max ConsumableCpus())?
We also have a 32 processor p690 and I was wondering how the same type of LL jobs can be properly specified there.
1 reply Latest Post - 2008-02-19T02:56:23Z by SystemAdmin
Pinned topic LoadLeveler jobs with hybrid POE and OMP tasks
SystemAdmin (accepted answer)
Re: LoadLeveler jobs with hybrid POE and OMP tasks, 2008-02-19T02:56:23Z, in response to michael-t

Michael,
The ConsumableCpus statement applies to each task of the parallel job that the Job Command File (JCF) represents. For example, if you have 2 POE tasks on a node and ConsumableCpus is set to 2 in the JCF, a total of 4 CPUs are allocated for this job on that node.
Because it is a single per-task value for the whole job step, the ConsumableCpus statement cannot be used to specify a different number of CPUs for each of the 5 binaries in a single job step.
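To make the per-task accounting concrete, here is a minimal JCF fragment matching the example above. The node and task counts are just the illustrative values from this reply, not a recommendation for your cluster.

```sh
# @ job_type       = parallel
# @ node           = 1
# @ tasks_per_node = 2
# ConsumableCpus is charged once per task, not once per job or per node:
# 2 tasks on the node x ConsumableCpus(2) = 4 CPUs reserved on that node.
# @ resources      = ConsumableCpus(2)
# @ queue
```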
Can the 5 different binaries in your example be submitted as 5 different job steps in a single job? For each job step you can specify a different number of CPUs. Co-scheduling can help ensure that the job steps get scheduled to run at the same time. One possible problem is that each step then runs as its own parallel job, so if one binary needs to communicate with another, it may not know where or how to contact it.
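A rough sketch of what such a multi-step JCF might look like is below, showing only two of the five steps. The step names, binary names (`./binA`, `./binB`), task counts and thread counts are all hypothetical placeholders. It also assumes that the `coschedule` keyword is available at your LoadLeveler level and with your scheduler configuration, so please check the Using and Administering guide before relying on it.

```sh
# Step 1: hypothetical binary binA, assumed to run 4 OMP threads per MPI task
# @ step_name      = stepA
# @ job_type       = parallel
# @ node           = 1
# @ tasks_per_node = 2
# @ resources      = ConsumableCpus(4)
# @ environment    = OMP_NUM_THREADS=4
# @ executable     = /usr/bin/poe
# @ arguments      = ./binA
# @ coschedule     = yes
# @ queue

# Step 2: hypothetical binary binB, assumed to run 2 OMP threads per MPI task
# @ step_name      = stepB
# @ job_type       = parallel
# @ node           = 1
# @ tasks_per_node = 4
# @ resources      = ConsumableCpus(2)
# @ environment    = OMP_NUM_THREADS=2
# @ executable     = /usr/bin/poe
# @ arguments      = ./binB
# @ coschedule     = yes
# @ queue
```

Each step carries its own ConsumableCpus value, so neither the minimum nor the maximum thread count has to be applied to every binary. The trade-off is the one mentioned above: the steps are separate POE jobs, so the binaries no longer share a single MPMD job.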