--map-by unit option

When using the --map-by unit option, unit can be any of the following values:
  • hwthread
  • core
  • L1cache
  • L2cache
  • L3cache
  • socket
  • numa
  • board
  • node
--map-by unit is the most basic of the mapping policies. It makes process assignments by iterating over the specified unit until the process count reaches the number of available slots.
The following example shows the output (in verbose mode) of the --map-by unit option, where core is the specified unit.
% mpirun -host hostA:4,hostB:2 -map-by core ...
R0  hostA  [BB/../../../../../../..][../../../../../../../..]
R1  hostA  [../BB/../../../../../..][../../../../../../../..]
R2  hostA  [../../BB/../../../../..][../../../../../../../..]
R3  hostA  [../../../BB/../../../..][../../../../../../../..]
R4  hostB  [BB/../../../../../../..][../../../../../../../..]
R5  hostB  [../BB/../../../../../..][../../../../../../../..]

This is sometimes called a packed or latency binding because it tends to produce the fastest communication between ranks.
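
The placement in the preceding example can be thought of as a simple round-robin walk over the cores of each host until that host's slots are consumed. The following sketch is not Open MPI source; it simply assumes the two-socket, eight-cores-per-socket hosts and the hostA:4,hostB:2 slot counts used above, and reproduces the R0 through R5 assignments:

#include <stdio.h>

int main(void) {
    const char *hosts[] = { "hostA", "hostB" };
    const int   slots[] = { 4, 2 };       /* from -host hostA:4,hostB:2 */
    const int   cores_per_socket = 8;     /* assumed: 2 sockets x 8 cores per host */
    int rank = 0;

    for (int h = 0; h < 2; h++) {
        /* -map-by core: each new process on a host takes the next core */
        for (int local = 0; local < slots[h]; local++, rank++) {
            int socket = local / cores_per_socket;
            int core   = local % cores_per_socket;
            printf("R%d  %s  socket %d, core %d\n", rank, hosts[h], socket, core);
        }
    }
    return 0;
}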

The following example shows the output (in verbose mode) of the --map-by unit option, where socket is the specified unit.
% mpirun -host hostA:4,hostB:2 -map-by socket ...
R0  hostA  [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..]
R1  hostA  [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB]
R2  hostA  [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..]
R3  hostA  [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB]
R4  hostB  [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..]
R5  hostB  [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] 

In the preceding examples, -host hostA:4,hostB:2 indicates that the cluster has six slots (spaces in which a process can run). Each rank consumes one slot, and processes are assigned hardware elements by iterating over the specified unit until the available slots are consumed.

The ordering in these examples is implicitly by core and by socket, respectively, so cores and sockets are iterated over for each rank assignment. The binding is also implicitly core and socket, respectively, so the final binding is to the same element that was chosen by the mapping.

When options such as the ranking unit and binding unit are not explicitly specified, the -display-devel-map option can be used to display the implicit selections. In the preceding examples, the -display-devel-map output includes the following, respectively:
Mapping policy: BYCORE    Ranking policy: CORE    Binding policy: CORE:IF-SUPPORTED
Mapping policy: BYSOCKET  Ranking policy: SOCKET  Binding policy: SOCKET:IF-SUPPORTED

If no mapping or binding options are specified, Open MPI defaults to --map-by socket for jobs with more than two ranks. This produces the interleaved ordering shown in the preceding socket example.

Note: IBM Spectrum® MPI enables binding by default when using the orted tree to launch jobs. The default binding for a node that is less than or fully subscribed is --map-by socket. In this case, users might see improved latency by using either the -aff latency or --map-by core option.

A natural hardware ordering can be created by specifying a smaller unit over which to iterate for ranking. For example:
% mpirun -host hostA:4,hostB:2 -map-by socket -rank-by core ...
R0  hostA  [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..]
R1  hostA  [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..]
R2  hostA  [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB]
R3  hostA  [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB]
R4  hostB  [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..]
R5  hostB  [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB]
A common binding pattern involves binding to cores, but spanning those core assignments over all of the available sockets. For example:
% mpirun -host hostA:4,hostB:2 -map-by socket -rank-by core -bind-to core ...
R0  hostA  [BB/../../../../../../..][../../../../../../../..]
R1  hostA  [../BB/../../../../../..][../../../../../../../..]
R2  hostA  [../../../../../../../..][BB/../../../../../../..]
R3  hostA  [../../../../../../../..][../BB/../../../../../..]
R4  hostB  [BB/../../../../../../..][../../../../../../../..]
R5  hostB  [../../../../../../../..][BB/../../../../../../..]
In this example, the final binding unit is smaller than the hardware selection that was made in the mapping step. As a result, the cores within the socket are iterated over for the ranks on the same socket. When the mapping unit and the binding unit differ, the -display-devel-map output can be used to display the mapped locale from which the binding was taken. For example, at rank 0, the -display-devel-map output includes:
Locale:  [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..]
Binding: [BB/../../../../../../..][../../../../../../../..]

A possible purpose for this binding is to use all the available hardware resources such as cache and memory bandwidth. This is sometimes called a bandwidth binding, and is a good starting point for overall application performance. The amount of cache and memory bandwidth is maximized, and the ranks are ordered so that close ranks by index are near each other in the hardware as much as possible while still spanning the available sockets.
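
The interplay of the three steps in this bandwidth pattern can be sketched in the same spirit as the earlier by-core sketch (again, this is not Open MPI source and assumes the same two-socket hosts and hostA:4,hostB:2 slot counts). Each host's processes are first dealt out round-robin across the sockets, then numbered by core index so that consecutive ranks share a socket, and finally each rank is bound to one core:

#include <stdio.h>

int main(void) {
    const char *hosts[] = { "hostA", "hostB" };
    const int   slots[] = { 4, 2 };   /* from -host hostA:4,hostB:2 */
    const int   nsockets = 2;         /* assumed: 2 sockets per host */
    int rank = 0;

    for (int h = 0; h < 2; h++) {
        int per_socket[2] = { 0, 0 };

        /* -map-by socket: deal this host's processes round-robin across sockets */
        for (int p = 0; p < slots[h]; p++)
            per_socket[p % nsockets]++;

        /* -rank-by core: number the ranks by core index, socket by socket;
           -bind-to core: each rank is bound to a single core within its socket */
        for (int s = 0; s < nsockets; s++)
            for (int c = 0; c < per_socket[s]; c++, rank++)
                printf("R%d  %s  socket %d, core %d\n", rank, hosts[h], s, c);
    }
    return 0;
}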

On the hardware used in these examples, socket and numa are the same. On some hardware it may be desirable to iterate the process placement over the NUMA nodes instead of over the sockets. In this case, -map-by numa can be used. For example:
% mpirun -host hostA:4,hostB:2 -map-by numa -rank-by core -bind-to core ...
R0  hostA  [BB/../../../../../../..][../../../../../../../..]
R1  hostA  [../BB/../../../../../..][../../../../../../../..]
R2  hostA  [../../../../../../../..][BB/../../../../../../..]
R3  hostA  [../../../../../../../..][../BB/../../../../../..]
R4  hostB  [BB/../../../../../../..][../../../../../../../..]
R5  hostB  [../../../../../../../..][BB/../../../../../../..]
Note: In Open MPI's terminology, numa refers to a NUMA node within a host, while node refers to the whole host.
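
Whether socket and numa coincide on a particular machine can be checked with hwloc, the topology library that Open MPI itself uses. The following is a minimal sketch, assuming the hwloc 2.x headers and library are installed (compile with, for example, cc -o check_topo check_topo.c -lhwloc; the file name is illustrative):

#include <stdio.h>
#include <hwloc.h>

int main(void) {
    hwloc_topology_t topo;

    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    /* Count packages (sockets) and NUMA nodes in the local topology. */
    int sockets = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_PACKAGE);
    int numas   = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_NUMANODE);
    printf("sockets: %d  NUMA nodes: %d\n", sockets, numas);

    hwloc_topology_destroy(topo);
    return 0;
}

On the hardware used in these examples the two counts are equal, which is why -map-by socket and -map-by numa produce the same placement.
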
In the following example, the mapping iterates over hosts (nodes) for process assignments. The ranking unit is also implicitly node, so the rank ordering alternates between the hosts as well. However, the binding unit defaults to the smaller socket element and, similar to the preceding bandwidth example, it iterates over sockets for consecutive ranks that received the same node at the mapping step. For example:
% mpirun -host hostA:4,hostB:2 -map-by node ...
R0  hostA  [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..]
R1  hostB  [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..]
R2  hostA  [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB]
R3  hostB  [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB]
R4  hostA  [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..]
R5  hostA  [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB]
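
Regardless of which mapping is chosen, the final placement can also be confirmed from inside the application by having each rank report the CPUs in its own affinity mask. The following is a minimal, Linux-specific sketch (the output format is illustrative, and sched_getaffinity is assumed to be available); launching it with any of the preceding mpirun command lines should show the same placement that the verbose mapping output reports:

#define _GNU_SOURCE
#include <stdio.h>
#include <sched.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, len;
    char host[MPI_MAX_PROCESSOR_NAME];
    cpu_set_t mask;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(host, &len);

    /* Read this process's CPU affinity mask as set by the launcher. */
    CPU_ZERO(&mask);
    sched_getaffinity(0, sizeof(mask), &mask);

    printf("R%d  %s  bound to CPUs:", rank, host);
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
        if (CPU_ISSET(cpu, &mask))
            printf(" %d", cpu);
    printf("\n");

    MPI_Finalize();
    return 0;
}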