Explicit Resource File (ERF) format
Purpose
The Explicit Resource File (ERF) is a formatted file that is read or written as directed by the jsrun command options --erf_input and --erf_output. The ERF format allows users to express the following:
- Regular and arbitrary rank placement, binding, and ordering of processes in a job step.
- Placement and binding in terms of SMT threads (PUs or processing units).
- Accessible resources in terms of SMT threads, GPUs, and memory.
- Rank placement and accessible PUs regardless of hardware SMT levels.
- Rank layout of both SPMD and MPMD applications.
- Hosts by logical or actual hostname.
- Ranges of hostnames, cpus, gpus, and memory.
- Comments
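A typical workflow is to let jsrun record the layout it chose and then replay or hand-edit that layout; a sketch using the two options named above (the executable name a.out is a placeholder):

```shell
# Record the resource layout of a job step to an ERF file
jsrun --erf_output layout.erf ./a.out
# Launch a later job step with the placement recorded in the file
jsrun --erf_input layout.erf ./a.out
```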
Syntax
The file is composed of a preamble that allows for the specification of the following symbols, which influence the entirety of the job step:
- launch_distribution: VALUE: The possible values are the same as those of the jsrun --launch_distribution option. The value of the jsrun command-line option of the same name can be used to override the value in the file. If unspecified, the default value of the jsrun --launch_distribution option is used.
- overlapping-rs: VALUE: The possible values are error, warn, and allow. When JSM determines that there are resources that are members of two or more resource sets, the value of overlapping-rs is used to determine how to proceed.
  - error - Issue an error when detected. Fail the launch (default).
  - warn - Issue a warning when detected. Allow the launch.
  - allow - No message is issued when detected. Allow the launch.
- oversubscribe-cpu: VALUE: The possible values are error, warn, and allow. When JSM determines that multiple ranks are assigned to a single CPU, the value of oversubscribe-cpu is used to determine how to proceed.
  - error - Issue an error when detected. Fail the launch (default).
  - warn - Issue a warning when detected. Allow the launch.
  - allow - No message is issued when detected. Allow the launch.
- oversubscribe-gpu: VALUE: The possible values are error, warn, and allow. When JSM determines that multiple ranks are assigned to a single GPU, the value of oversubscribe-gpu is used to determine how to proceed.
  - error - Issue an error when detected. Fail the launch (default).
  - warn - Issue a warning when detected. Allow the launch.
  - allow - No message is issued when detected. Allow the launch.
- oversubscribe-mem: VALUE: The possible values are error, warn, and allow. When JSM determines that multiple ranks are assigned to the same range of memory, the value of oversubscribe-mem is used to determine how to proceed.
  - error - Issue an error when detected. Fail the launch (default).
  - warn - Issue a warning when detected. Allow the launch.
  - allow - No message is issued when detected. Allow the launch.
- cpu_index_using: VALUE: The possible values are physical and logical. Indicates whether the CPU specifications in the file are based on physical or logical identifiers.
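Taken together, a preamble using these symbols might look like the following sketch (the particular values are chosen only for illustration):

```
launch_distribution : packed
overlapping-rs : warn
oversubscribe-cpu : error
oversubscribe-gpu : allow
oversubscribe-mem : allow
cpu_index_using : logical
```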
The preamble can also contain information about the applications that are used in the job step:
app <#>: COMMAND LINE
The number must start with 0 and increment for each application specified. The command line is the full command line of the application. For job steps with a single application, the
command line can be specified on the jsrun command line rather than using the app syntax within the ERF file.
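For an MPMD job step, the preamble might therefore contain numbered app lines such as the following sketch (the binary names and arguments are placeholders):

```
app 0 : ./manager.out --verbose
app 1 : ./worker.out --input data.txt
```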
The body of the ERF file contains specifications of resource sets and the ranks to execute within each resource set. The specification of the resource set also explicitly specifies which of the CPUs in the resource set each rank should be bound to. Therefore, each line is a rank specification followed by a resource set specification:
RANK_SPECIFICATION : RESOURCE_SET_SPECIFICATION [: APPLICATION SPECIFICATION ]
The RANK_SPECIFICATION is one of the following two formats.
Specific rank specification
rank: RANGE
RANGE values can be specified as follows:
- # : a single value. Example: 1. Brackets are optional for single values.
- #-# : a range (always increasing). Example: 1-5, meaning 1,2,3,4,5.
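Both RANGE forms, and comma-separated combinations of them, can be expanded with a few lines of code; the following is an illustrative sketch (expand_range is not part of the JSM tooling), assuming well-formed input:

```python
def expand_range(spec: str) -> list[int]:
    """Expand an ERF RANGE token list: '1' -> [1], '1-5' -> [1, 2, 3, 4, 5]."""
    values = []
    for part in spec.split(","):  # rank lists may combine tokens, e.g. "0,5,3"
        if "-" in part:
            lo, hi = (int(x) for x in part.split("-"))
            if hi < lo:  # ERF ranges are always increasing
                raise ValueError("range must be increasing: " + part)
            values.extend(range(lo, hi + 1))
        else:
            values.append(int(part))
    return values
```

For example, expand_range("1-5") returns [1, 2, 3, 4, 5], and expand_range("0,5,3") preserves the listed order as [0, 5, 3].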
Process per resource set specification
NUMBER_OF_RANKS
NUMBER_OF_RANKS must be a whole number greater than 0. Ranks are assigned according to the launch distribution. For more information, see the launch_distribution description.
The RESOURCE_SET_SPECIFICATION is a bracketed list of key-value pairs:
{ key1 : value1 ; key2 : value2 ; key3 : value3 ; <etc> }
The key-value pairs must minimally contain the keys hosts and cpu. The permissible keys are:
- hosts: Specifies the hosts, as either a list of hostnames or a RANGE of host identifiers; the two notations cannot be mixed within a single ERF. A RANGE is specified using the same notation used to specify specific ranks. ID 0 refers to the launch node, which might not be a usable node in the allocation; the first compute node has ID 1. The order of the hosts specified is preserved in the process layout. When ranks are specified using the specific rank specification, only one host can be specified per line and wildcards cannot be used for hosts. For the process per resource set specification, a "*" can be used as the hostname; the wildcard creates one instance of the resource set on each host available in the allocation. The wildcard cannot be combined with specific host IDs on a single line. Multiple lines can contain a host wildcard specifier, which allows the ranks specified on different lines to be applied to all hosts in the system.
- cpu: Specifies groups of CPUs. Each rank is mapped to a group in round-robin fashion and bound to the CPUs in that group. There can be more groups than there are ranks; this can be used to add CPUs to a resource set that are not used by any of the ranks in the resource set. All of the groups of CPUs are contained in the resource set and are visible to all ranks of that resource set. The CPUs are specified using either physical or logical IDs, as determined by the cpu_index_using value. CPU groups can be expressed as an explicit range or using a ":" depth specifier, optionally surrounded with brackets "{}". For example, {1,3,5-7} indicates CPUs 1,3,5,6,7, and {0:3} indicates CPU 0 followed by the next two valid IDs on the system, which can differ depending on the SMT level of the machine. These can be mixed in a set of values (for example, {1,4-5},{10:7} is permitted), but not within a single value (for example, {1,10:7,20-22} is not permitted). If the given SMT number is invalid, an error is returned. The wildcard "*" can be used to express "all SMTs" on the node. A wildcard must be used as a single value in a list; it cannot be combined with other numbers within that single value (for example, {*,1} is not permitted), but it can be used in a list of values (for example, *,{1,2},3).
- gpu: Specifies a group of GPUs. Each rank has access to the GPUs specified. GPUs are always specified using logical IDs. If the given GPU number is invalid, an error is returned. The wildcard "*" can be used to express "all GPUs" on the node. A wildcard must be used as a single value in a list; it cannot be combined with other numbers within that single value (for example, {*,1} is not permitted).
- mem: Specifies groups of memory. As with the cpu specification, ranks are assigned to each group of memory in round-robin fashion. Memory is specified using a range of addresses. The wildcard "*" can be used to express "all memory" on the node (the default). The wildcard must be used as a single value in a list and cannot be combined with other numbers within that single value (for example, {*,0-4194303} is not permitted).
The APPLICATION SPECIFICATION indicates the command line that is used to start the ranks associated with the resource sets specified on the line. The format is
app <#>
There must be a matching app <#> specification in the preamble, with the exception of app 0, which can be specified either in the preamble or on the jsrun command line.
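The ":" depth specifier in cpu groups resolves against the valid SMT IDs of the node, so its expansion is hardware dependent. Under the simplifying assumption that every ID from the start onward is present and consecutive, {start:depth} expands as in this sketch (expand_depth is an illustrative name, not part of the JSM tooling):

```python
def expand_depth(start: int, depth: int) -> list[int]:
    """Expand an ERF '{start:depth}' CPU group, assuming every ID from
    start onward is valid. On real hardware JSM skips invalid IDs, so at
    reduced SMT levels {0:3} may cover different CPUs than shown here."""
    return list(range(start, start + depth))
```

For example, expand_depth(32, 3) yields [32, 33, 34], which matches the SMT 32,33,34 grouping produced by {32:3} in the examples that follow.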
Examples
- On all hosts, launch 6 PPN: 1 resource set per host containing the union of all SMTs, GPUs 0-5, and all memory. The processes are bound to the CPU groupings in the order specified.
# RS 0 : SMTs 0-5,16-18,20,32-34,64-66,96,97,128-130
# Node Local Rank 0 bound in RS 0 to SMT 32,33,34
# Node Local Rank 1 bound in RS 0 to SMT 0,1,2,3,4,5
# Node Local Rank 2 bound in RS 0 to SMT 64,65,66
# Node Local Rank 3 bound in RS 0 to SMT 128,129,130
# Node Local Rank 4 bound in RS 0 to SMT 96,97
# Node Local Rank 5 bound in RS 0 to SMT 16,17,18,20
6 : {host: * ; cpu: {32:3},{0:5},{64:3},{128:3},{96:2},{16-18,20} ; gpu: {0-5}}
- On all hosts, launch 5 PPN in 3 resource sets per host.
# RS 0 : SMTs 32-34,0-5 GPUs 0,1 Memory All
# RS 1 : SMTs 64-66,128-130 GPUs 2,3 Memory All
# RS 2 : SMTs 96,97 GPUs 4,5 Memory All
# Node Local Rank 0 bound in RS 0 to SMT 32,33,34
# Node Local Rank 1 bound in RS 0 to SMT 0,1,2,3,4,5
# Node Local Rank 2 bound in RS 1 to SMT 64,65,66
# Node Local Rank 3 bound in RS 1 to SMT 128,129,130
# Node Local Rank 4 bound in RS 2 to SMT 96,97
2 : {host: * ; cpu: {32:3},{0:5} ; gpu: {0,1} }
2 : {host: * ; cpu: {64:3},{128:3} ; gpu: {2,3} }
1 : {host: * ; cpu: {96,97} ; gpu: {4,5} }
- On all hosts, launch 2 PPN, each in 1 resource set containing all resources, but with each process bound to a subset of the resources. Notice the unused cpu group for the remaining SMTs.
# RS 0 : SMTs 0-175 GPUs 0-5 Memory All
# Node Local Rank 0 bound in RS 0 to SMT 0-7
# Node Local Rank 1 bound in RS 0 to SMT 8-15
2 : {host: * ; cpu: {0-7},{8-15},{16-175} ; gpu: {0-5} }
- The following output is an example of a manager/worker layout, with the manager using 1 core as rank 0 on the first node and compute-intensive workers:
# RS 0 : SMTs 0-3 GPUs None Memory All
# RS 1 : SMTs 4-99 GPUs 0-5 Memory All
# Rank 0 Host 0 bound in RS 0 to SMT 0-3
# Rank 1-6 Host 0 bound in RS 1 to sets of 4 SMT-4 cores
# Rank 1 bound to 4:16 , Rank 2 bound to 20:16 , ...
# Rank 7-12 Host 1 bound in RS 1 to sets of 4 SMT-4 cores
launch_distribution : packed
1 : {host: 0 ; cpu: {0:4} }
6 : {host: 0 ; cpu: {4:16},{20:16},{36:16},{52:16},{68:16},{84:16} ; gpu: {0-5}}
6 : {host: 1 ; cpu: {4:16},{20:16},{36:16},{52:16},{68:16},{84:16} ; gpu: {0-5}}
- The following output is an example of an alternative to the scenario shown in step 4:
launch_distribution : packed
1 : {host: 0 ; cpu: {0:4} }
6 : {host: 0,1 ; cpu: {4:16},{20:16},{36:16},{52:16},{68:16},{84:16} ; gpu: {0-5}}
- The following output is an example of reordering the rank assignment shown in step 5:
# Rank 0 Host 0 bound in RS 0 to SMT 0-3
# Rank 1-6 Host 1 bound in RS 1 to sets of 4 SMT-4 cores
# Rank 7-12 Host 0 bound in RS 1 to sets of 4 SMT-4 cores
launch_distribution : packed
1 : {host: 0 ; cpu: {0:4} }
6 : {host: 1,0 ; cpu: {4:16},{20:16},{36:16},{52:16},{68:16},{84:16} ; gpu: {0-5}}
- The following output is a rank ordering example:
# RS 0 : SMTs 0-15 GPUs 0,1 Memory All
# RS 1 : SMTs 0-31 GPUs 0-3 Memory All
# jsrun --launch_distribution packed
# Rank 0 Host 0 bound in RS 0 to SMT 0-7
# Rank 1 Host 0 bound in RS 0 to SMT 8-15
# Rank 2 Host 1 bound in RS 1 to SMT 0-15
# Rank 3 Host 1 bound in RS 1 to SMT 16-31
# jsrun --launch_distribution cyclic
# Rank 0 Host 0 bound in RS 0 to SMT 0-7
# Rank 2 Host 0 bound in RS 0 to SMT 8-15
# Rank 1 Host 1 bound in RS 1 to SMT 0-15
# Rank 3 Host 1 bound in RS 1 to SMT 16-31
2 : {host: 0 ; cpu: {0:8},{8:8} ; gpu: {0,1} }
2 : {host: 1 ; cpu: {0:16},{16:16} ; gpu: {0,1,2,3} }
- The following output is an MPMD specification:
# Rank 0 Host 0 bound in RS 0 to SMT 0-3 running App 0
# Rank 1-5 Host 1 bound in RS 1 to sets of 4 SMT-4 cores running App 1
# Rank 6-10 Host 0 bound in RS 1 to sets of 4 SMT-4 cores running App 2
launch_distribution : packed
app 0 : ./a.out --manager
app 1 : ./a.out --worker --file bucketA.txt
app 2 : ./a.out --worker --file bucketB.txt
1 : {host: 0 ; cpu: {0:4} } : app 0
5 : {host: 1 ; cpu: {4:16},{20:16},{36:16},{52:16},{68:16} ; gpu: * } : app 1
5 : {host: 0 ; cpu: {4:16},{20:16},{36:16},{52:16},{68:16} ; gpu: * } : app 2
- The following output is a "Specific Ordering" example:
# RS 0 : SMTs 0-23 GPUs 0,1 Memory All
# RS 1 : SMTs 0-47 GPUs 0-3 Memory All
# RS 2 : SMTs 32-47 GPUs 2,3 Memory All
# Rank 0 Host 0 bound in RS 0 to SMT 0-7
# Rank 1 Host 1 bound in RS 1 to SMT 16-31
# Rank 2 Host 1 bound in RS 1 to SMT 0-15
# Rank 3 Host 0 bound in RS 0 to SMT 16-23
# Rank 4 Host 1 bound in RS 1 to SMT 32-47
# Rank 5 Host 0 bound in RS 0 to SMT 8-15
# Rank 6 Host 0 bound in RS 2 to SMT 32-39
rank: 0,5,3 : {host: 0 ; cpu: {0:8},{8:8},{16:8} ; gpu: {0,1} }
rank: 2,1,4 : {host: 1 ; cpu: {0:16},{16:16},{32:16} ; gpu: {0,1,2,3} }
rank: 6 : {host: 0 ; cpu: {32:8},{40:8} ; gpu: {2,3} }
- The following output is a "Specific Ordering" example with some overlapping and overfull resource sets:
# Note that "memory: *" is not required, since it is the default behavior.
# It is shown here for completeness.
# Skip over any missing SMTs without warning
skip-missing-cpus : allow
# Allow overlapping resource sets without warning
overlapping-rs : allow
# Allow for oversubscription of CPU
oversubscribe-cpu : allow
# Resource sets:
# RS 0 : Host 0 : SMTs 0-175 GPUs All Memory All
# RS 1 : Host 0 : SMTs 0-175 GPUs All Memory All (overlaps with RS 0)
# RS 2 : Host 1 : SMTs 0-175 GPUs All Memory All
# RS 3 : Host 2 : SMTs 0-175 GPUs All Memory All
# RS 4 : Host 2 : SMTs 0-175 GPUs 0,1 Memory All (overlaps with RS 3)
# Rank 0 Host 0 bound in RS 0 to SMT 0-87
# Rank 1 Host 0 bound in RS 0 to SMT 0-87 (oversubscribes CPU)
# Rank 2 Host 0 bound in RS 1 to SMT 88-175
# Rank 3 Host 0 bound in RS 1 to SMT 88-175 (oversubscribes CPU)
# Rank 4-7 Host 1 bound in RS 2 to SMT 0-175 (effectively unbound)
# Rank 8 Host 2 bound in RS 3 to SMT 0-175 (effectively unbound)
# Rank 9 Host 2 bound in RS 4 to SMT 0-87
rank: 0,1 : {host: 0 ; cpu: {0-87},{0-87},* ; gpu: * ; memory: *}
rank: 2,3 : {host: 0 ; cpu: {88-175},{88-175},* ; gpu: *}
rank: 4-7 : {host: 1 ; cpu: * ; gpu: * }
rank: 8 : {host: 2 ; cpu: * ; gpu: {0,1},* }
rank: 9 : {host: 2 ; cpu: {0-87},* ; gpu: {0,1} }
- The following output is a "Specific Ordering" example with "full" resource sets and effectively unbound processes, which is useful for sites that want to place and order processes with JSM but apply their own binding on each node:
# Skip over any missing SMTs without warning
skip-missing-cpus : allow
# Allow overlapping resource sets without warning
overlapping-rs : allow
# Allow for oversubscription of CPU
oversubscribe-cpu : allow
rank: 0 : {host: 0 ; cpu: * ; gpu : * ; memory : *}
rank: 1 : {host: 1 ; cpu: * ; gpu : * ; memory : *}
rank: 2 : {host: 2 ; cpu: * ; gpu : * ; memory : *}
rank: 3 : {host: 3 ; cpu: * ; gpu : * ; memory : *}
rank: 4 : {host: 1 ; cpu: * ; gpu : * ; memory : *}
rank: 5 : {host: 2 ; cpu: * ; gpu : * ; memory : *}
rank: 6 : {host: 3 ; cpu: * ; gpu : * ; memory : *}
rank: 7 : {host: 1 ; cpu: * ; gpu : * ; memory : *}
rank: 8 : {host: 2 ; cpu: * ; gpu : * ; memory : *}
rank: 9 : {host: 3 ; cpu: * ; gpu : * ; memory : *}
The following example is a shorter version of the scenario shown in step 11:
# Skip over any missing SMTs without warning skip-missing-cpus : allow
# Allow overlapping resource sets without warning overlapping-rs : allow
# Allow for oversubscription of CPU oversubscribe-cpu : allow
rank: 0 : {host: 0 ; cpu: * ; gpu : * ; memory : *}
rank: 1,4,7 : {host: 1 ; cpu: * ; gpu : * ; memory : *}
rank: 2,5,8 : {host: 2 ; cpu: * ; gpu : * ; memory : *}
rank: 3,6,9 : {host: 3 ; cpu: * ; gpu : * ; memory : *}
See also
jsrun(1)
Parent topic: Job Step Manager commands