
Running containerized applications

A container is a mechanism that bundles an application and its execution environment into a sharable image. The image can be shared between execution environments and users, which leads to improved application usability, reproducibility, and portability. Container runtimes leverage Linux® cgroups, namespaces, and capabilities to provide an isolated execution environment for the containerized application. Containers can natively access devices such as network devices and GPUs if the container runtime has the necessary level of support.

Spectrum MPI currently supports the Singularity container runtime.

There are two different container modes that are commonly used in HPC environments:

  1. Rank contained mode: In this mode, there is one container per application process (for example, MPI rank). If there are multiple application processes assigned to a node, then there are multiple container instances on that node. The Spectrum MPI runtime from the host system (outside of the container) is used to launch the container runtime with the application container image for each process on each allocated compute node.
    1.  The rank contained mode requires that the *Spectrum MPI* runtime on the host system outside of the container is compatible with the *Spectrum MPI* installed inside the container image.
    2.  In this mode, it is recommended to bind mount the *Spectrum MPI* installation on the host system over the top of the *Spectrum MPI* inside the container. This method is the easiest way to guarantee compatibility across the container boundary.
    3.  The *Spectrum MPI* version inside the container must be at version 10.3.1.0 or later.
    
  2. Fully contained mode: In this mode, there is one container per allocated compute node. If there are multiple application processes assigned to that node, then all processes run in the same container instance. The Spectrum MPI runtime from inside the container is used to launch the application container image on each compute node.
    1.  The fully contained mode does not require *Spectrum MPI* to be installed on the host system since the runtime components are part of the container image.
    2.  This mode provides users with a more portable image and execution environment since there are fewer external dependencies to the container image.
    
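The bind-mount recommendation for rank contained mode can be sketched with Singularity's -B option. The host installation path shown (/opt/ibm/spectrum_mpi) is an assumption and must match the actual install location on your system, and the same path must be valid as a mount point inside the image:

```shell
# Mount the host Spectrum MPI over the copy inside the image (paths assumed)
export MPIRUN_CONTAINER_CMD="singularity exec --nv -B /opt/ibm/spectrum_mpi:/opt/ibm/spectrum_mpi myapp.sif"
mpirun --container rank ./a.out arg1 arg2
```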

Spectrum MPI provides options to support both containerization modes. The --container option takes a comma-separated list of directives to pick the mode and control the launch environment. Additionally, Spectrum MPI provides enhancements to choose from multiple Spectrum MPI installs within the container and to customize the containerized environment through special environment variables.

If the container requires the use of Mellanox InfiniBand or NVIDIA GPUs, the user must ensure that the container image contains versions of the Mellanox MOFED and NVIDIA CUDA user-space libraries that are compatible with those installed on the host system. Container images cannot contain the kernel modules that these libraries require. As such, the container image must contain a user-space library version that is compatible with the host's kernel modules.
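A user can sanity-check this compatibility by querying the versions on both sides of the container boundary. The following is a hedged sketch; it assumes that the ofed_info and nvcc tools are present both on the host and inside the image:

```shell
# Host versions
ofed_info -s
nvcc --version

# Container versions (tool availability inside the image is an assumption)
singularity exec myapp.sif ofed_info -s
singularity exec --nv myapp.sif nvcc --version
```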

The Spectrum MPI runtime requires the user to set the MPIRUN_CONTAINER_CMD environment variable before it calls mpirun with any of the --container options. The MPIRUN_CONTAINER_CMD environment variable specifies the container runtime command to use when launching the container image. This command might be a direct, parameterized call to the container runtime or a script that then calls the container runtime. Spectrum MPI prefixes the binary to be launched with the string specified by this environment variable.
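For example, MPIRUN_CONTAINER_CMD might point to a small wrapper script rather than a direct runtime invocation. The following is a sketch; the script name and image path are assumptions:

```shell
$ cat launch-container.sh
#!/bin/bash
# Spectrum MPI appends the assistant script and the application command to the
# string in MPIRUN_CONTAINER_CMD, so they arrive here as "$@".
exec singularity exec --nv /images/myapp.sif "$@"

$ export MPIRUN_CONTAINER_CMD="$PWD/launch-container.sh"
```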

The MPIRUN_CONTAINER_OPTIONS environment variable can be used instead of passing the --container option to the mpirun command. The value set in the MPIRUN_CONTAINER_OPTIONS environment variable is the same string that the user would pass to the --container option.
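For example, assuming a rank contained launch, the following two invocations are equivalent:

```shell
export MPIRUN_CONTAINER_CMD="singularity exec --nv myapp.sif"

# Option on the command line ...
mpirun --container rank ./a.out arg1 arg2

# ... or the same string in the environment variable
export MPIRUN_CONTAINER_OPTIONS="rank"
mpirun ./a.out arg1 arg2
```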

Running containers in a rank contained mode

The mpirun command's --container rank option tells the Spectrum MPI runtime that the application is to be launched in a rank contained mode. The user must set the MPIRUN_CONTAINER_CMD environment variable, which tells Spectrum MPI how to activate the container runtime for each process in this application launch.

Spectrum MPI uses an assistant script inside the container to help negotiate container runtime options and environment settings across the container boundary. For example, suppose that you launch the application in rank contained mode by running the following commands:

export MPIRUN_CONTAINER_CMD="singularity exec --nv myapp.sif"
mpirun --container rank ./a.out arg1 arg2

Each MPI rank is executed as:

singularity exec --nv myapp.sif $MPI_ROOT/container/bin/incontainer.pl ./a.out arg1 arg2

If the user needs to replace or alter the assistant script, they can pass their own script by using the --container assist:<path> option. The path that you provide must be valid inside the container image. By default, the assistant script is set to the value seen in the previous example. Note that the --container option takes a comma-separated list of values. For example, a user might specify the following commands to launch a rank contained containerized application with a custom assistant script:

export MPIRUN_CONTAINER_CMD="singularity exec --nv myapp.sif"

mpirun --container rank,assist:/examples/helper.py ./a.out arg1 arg2

then each MPI rank is executed as:

singularity exec --nv myapp.sif /examples/helper.py ./a.out arg1 arg2

The MPI_ROOT variable points to the Spectrum MPI installation inside the container image. It is recommended that MPI_ROOT be set as an environment variable inside the container image. However, if it is not set, or the user wants to use a different MPI_ROOT inside the container, they can use the --container root:<path> option with the mpirun command to set this environment variable. The path that you provide must be valid inside the container image.
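For example, to select a specific Spectrum MPI installation inside the image (the path shown is an assumed install location, not a fixed default):

```shell
export MPIRUN_CONTAINER_CMD="singularity exec --nv myapp.sif"
mpirun --container rank,root:/opt/ibm/spectrum_mpi ./a.out arg1 arg2
```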

Note: Users should not start containers without using the --container options and the MPIRUN_CONTAINER_CMD environment variable, because Spectrum MPI requires the assistant script to correctly set up the container environment.

Running containers in a fully contained mode

The --container all and --container orted options to the mpirun command tell the Spectrum MPI runtime that the application is to be launched in a fully contained mode. The user must set the MPIRUN_CONTAINER_CMD environment variable, which tells Spectrum MPI how to activate the container runtime to set up the execution environment. No assistant script is used in the orted mode, so the --container assist option is ignored in that mode. The assistant script is used in the all mode before relaunching the mpirun process.

The --container all option to the mpirun command causes mpirun to start the container image locally and reexecute mpirun from within that container instance. On each allocated compute node, the contained mpirun launches a container instance that runs the Spectrum MPI daemon (orted) from inside that container image, essentially wrapping the remote daemons in container instances. The application processes are launched in the same container instance as the Spectrum MPI runtime.

For example:

export MPIRUN_CONTAINER_CMD="singularity exec --nv myapp.sif"

mpirun --container all ./a.out arg1 arg2

then the following is executed instead:

singularity exec --nv myapp.sif mpirun ./a.out arg1 arg2

Additional environment variables meaningful to mpirun are set to tell mpirun how to launch the remote daemons in their own private container instances.

The --container orted option to mpirun is similar to the all variant, except that it assumes that mpirun is already executing from within a container instance. In this mode, mpirun does not reexecute itself, but it sets up the environment to place the Spectrum MPI daemons inside container instances on the compute nodes. No container assistant script is used in the orted mode. As such, the assist and root options, and the SMPI_CONTAINERENV_ prefixed environment variables, have no impact in the orted mode. The orted mode is helpful if the user needs to pre-process or post-process data from inside the same container instance as mpirun, or run multiple mpirun invocations from within the same container instance.

For example, a user can create a script:

$ cat run-test.sh
#!/bin/bash
export MPIRUN_CONTAINER_CMD="singularity exec --nv myapp.sif"
./pre-process.py data.in
mpirun --container orted ./a.out arg1 arg2
./post-process.py data.out

The user can run the script by invoking the container runtime around this script:

singularity exec --nv myapp.sif ./run-test.sh

In this example, the user is creating the container around mpirun, and mpirun is in charge of creating the container around the Spectrum MPI daemons (orted) and application processes.

Customizing environment variables for the container environment

Depending on the container runtime, some environment variables might not transfer across the container boundary. Spectrum MPI allows users to prefix the environment variables that they need moved across the container boundary.

Adding the prefix SMPI_CONTAINERENV_ to an environment variable passes that environment variable into the container without the prefix. Adding the SMPI_CONTAINERENV_ prefix is helpful when you need to pass an environment variable that the container runtime would otherwise strip from the environment (for example, LD_PRELOAD). To propagate the environment variables to the remote nodes, each environment variable must have the SMPI_CONTAINERENV_ prefix and must be listed with the -x option on the mpirun command line.

For example:

export MPIRUN_CONTAINER_CMD="singularity exec --nv myapp.sif"
export SMPI_CONTAINERENV_FOO=bar
mpirun --container rank -x SMPI_CONTAINERENV_FOO ./a.out

This sets the environment variable FOO to the value bar inside the container. If the environment variable exists in the environment of the container instance, then it is replaced with this value.

Adding the prefix SMPI_CONTAINERENV_PREPEND_ to an environment variable prepends its value to an existing environment variable inside the container. Adding the SMPI_CONTAINERENV_PREPEND_ prefix is helpful if you need to extend a default environment variable (for example, PATH).

For example:

export MPIRUN_CONTAINER_CMD="singularity exec --nv myapp.sif"
export SMPI_CONTAINERENV_PREPEND_PATH="/examples/bin"
mpirun --container rank -x SMPI_CONTAINERENV_PREPEND_PATH ./a.out

This prepends /examples/bin to the PATH environment variable inside the container instance. A colon (:) separator is added after the value if the environment variable already exists in the container. If the environment variable does not exist in the container, then it is set to this value. Note that Spectrum MPI places additional items in the PATH and LD_LIBRARY_PATH environment variables before any values specified by this mechanism.

Adding the prefix SMPI_CONTAINERENV_APPEND_ to an environment variable appends its value to an existing environment variable inside the container. Adding the SMPI_CONTAINERENV_APPEND_ prefix is helpful if you need to extend a default environment variable (for example, PATH).

For example:

export MPIRUN_CONTAINER_CMD="singularity exec --nv myapp.sif"
export SMPI_CONTAINERENV_APPEND_PATH="/examples/bin"
mpirun --container rank -x SMPI_CONTAINERENV_APPEND_PATH ./a.out

This sample code appends /examples/bin to the PATH environment variable inside the container instance. A colon (:) separator is added before the value if the environment variable already exists in the container. If the environment variable does not exist in the container, then it is set to this value.

Parent topic: Running applications
Top level index: IBM Spectrum MPI