Configuring GPUDirect Storage for IBM Storage Scale

After IBM Storage Scale is installed, the GPUDirect Storage (GDS) feature can be enabled by running the command mmchconfig verbsGPUDirectStorage=yes. This requires that IBM Storage Scale is stopped on all nodes.
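
A possible command sequence on an existing cluster is sketched below; it assumes that a cluster-wide outage is acceptable, because mmshutdown -a stops IBM Storage Scale on all nodes:

# Stop IBM Storage Scale on all nodes
mmshutdown -a
# Enable GPUDirect Storage support
mmchconfig verbsGPUDirectStorage=yes
# Start IBM Storage Scale again on all nodes
mmstartup -a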

You also need to set the following configuration options by using the mmchconfig command on the GDS clients and storage servers (see the example after this list):

  • minReleaseLevel must be 5.1.2 or later.
  • verbsRdma=enable
  • verbsRdmaSend=yes
  • verbsPorts. The values must be consistent with the rdma_dev_addr_list parameter in /etc/cufile.json; that is, the IP addresses assigned to rdma_dev_addr_list in /etc/cufile.json must belong to RDMA devices that are listed in the IBM Storage Scale configuration variable verbsPorts. For more information, see the Configuring RDMA ports on the GPU client section below.
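
The following sketch shows how these options could be set with mmchconfig. The device names in verbsPorts are placeholders that must match the RDMA adapters in your cluster, and minReleaseLevel is not set directly: it is raised by running mmchconfig release=LATEST after all nodes run the required code level. Depending on the option, the daemon may need to be restarted for the change to take effect.

# Raise minReleaseLevel to the installed code level (must result in 5.1.2 or later)
mmchconfig release=LATEST
# Enable RDMA and RDMA send support
mmchconfig verbsRdma=enable
mmchconfig verbsRdmaSend=yes
# RDMA ports to use; device names are placeholders
mmchconfig verbsPorts="mlx5_4/1 mlx5_5/1"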

Configuring virtual fabrics

The RDMA subsystem within IBM Storage Scale supports virtual fabrics to control how RDMA ports on NSD clients and NSD servers communicate with each other through queue pairs. Only RDMA ports on the same virtual fabric communicate with each other. This feature makes it possible to use GDS on setups with multiple separate InfiniBand fabrics.

The following example shows a virtual fabric definition with the virtual fabrics 0 and 1:
verbsPorts mlx5_4/1/0 mlx5_5/1/0 mlx5_10/1/1 mlx5_11/1/1
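
Each entry has the form device/port/fabric: in this example, the ports mlx5_4/1 and mlx5_5/1 belong to virtual fabric 0, and the ports mlx5_10/1 and mlx5_11/1 belong to virtual fabric 1. A sketch of setting this value with mmchconfig, using illustrative device names, looks as follows:

# Assign the first two ports to virtual fabric 0 and the other two ports to virtual fabric 1
mmchconfig verbsPorts="mlx5_4/1/0 mlx5_5/1/0 mlx5_10/1/1 mlx5_11/1/1"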

For more information on Virtual fabrics, see the topic mmchconfig command.

Virtual fabric numbers used on GDS clients must also be used on all NSD servers. Otherwise, an error occurs when an NSD server cannot reach the GPU client within its virtual fabric. No special configuration changes are required within the NVIDIA GDS software stack for virtual fabrics. All IP addresses configured in /etc/cufile.json for the key rdma_dev_addr_list must be reachable by the NSD servers, and the verbsPorts configuration variable must be set accordingly. If a GDS I/O operation goes through an RDMA port that is not listed in verbsPorts, it results in an I/O error and an error message is logged in the IBM Storage Scale log file. The verbsPorts syntax remains unchanged.

All NSD servers must have RDMA ports in all virtual fabrics that are configured on the NSD clients that perform I/O through GDS. For example, if the RDMA ports on the GDS clients are configured to use virtual fabric numbers 1, 2, 3, and 4, then RDMA ports with the same four virtual fabric numbers must be configured on the NSD servers. When a GDS client submits a GDS request through an RDMA port on virtual fabric number 4, but the NSD server does not have an RDMA port on virtual fabric number 4, the request fails and results in an I/O error in the GDS application. An error message is also recorded in the IBM Storage Scale log file.
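
A minimal sketch of matching fabric assignments is shown below. The node class names gdsClients and nsdServers and the device names are illustrative only:

# GDS clients: one port in each of the virtual fabrics 1 to 4
mmchconfig verbsPorts="mlx5_0/1/1 mlx5_1/1/2 mlx5_2/1/3 mlx5_3/1/4" -N gdsClients
# NSD servers: must also provide ports in the virtual fabrics 1 to 4
mmchconfig verbsPorts="mlx5_0/1/1 mlx5_0/2/2 mlx5_1/1/3 mlx5_1/2/4" -N nsdServers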

Configuring CUDA

The configuration file ("/etc/cufile.json") for CUDA and the GDS driver can be found on each GDS client.
Note: This topic describes the configuration that is necessary for IBM Storage Scale. For an in-depth discussion of these configuration options, see Installing GDS.
You need to update the following configuration settings in this file:
rdma_dev_addr_list
Defines the RDMA devices to be used as a list of IP addresses. The IP addresses (RoCE or IP over IB) specified must be consistent with the values that are set for the verbsPorts parameter on the GDS clients. For more information, see the Configuring RDMA ports on the GPU client section below.
rdma_load_balancing_policy
Specifies the load-balancing policy for RDMA memory registration. If the GDS client is a DGX, the following values must be set:
  • RoundRobin: For storage Network Interface Cards (NIC).
  • RoundRobinMaxMin: For compute NICs.

The default value is RoundRobin. For more information on DGX, see https://www.nvidia.com/en-us/data-center/dgx-systems/.

rdma_access_mask
Enables relaxed ordering. Set the value to 0x1f.
"logging"."level"
Defines the log level. Set the value to ERROR or WARN unless debug output is required. Log levels such as DEBUG and TRACE impact performance.
use_poll_mode
Switches the NVIDIA driver between asynchronous and synchronous I/O modes. Set the value to false when configuring GDS for IBM Storage Scale.
gds_write_support
For accelerated writes, the following key/value pair must be added in the file system-specific section ("fs": "gpfs"):
"gpfs": {                          
                  "gds_write_support": true                
}
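
Putting these settings together, a minimal sketch of the relevant parts of /etc/cufile.json might look as follows. It assumes the layout of the default file that ships with CUDA, where the RDMA and poll-mode keys are located in the "properties" section; the IP addresses are placeholders and the hexadecimal access mask is written as a string:

{
    "logging": {
        "level": "ERROR"
    },
    "properties": {
        "use_poll_mode": false,
        "rdma_dev_addr_list": ["192.168.1.20", "192.168.1.21"],
        "rdma_load_balancing_policy": "RoundRobin",
        "rdma_access_mask": "0x1f"
    },
    "fs": {
        "gpfs": {
            "gds_write_support": true
        }
    }
}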

Configuring RoCE

To enable RoCE for GDS, enable the general RoCE support for IBM Storage Scale. No special configuration settings are needed to enable the RoCE support for GDS.

To enable generic RoCE support, all RoCE adapter ports must have a proper IP configuration and these ports must be listed in the verbsPorts configuration variable. In addition, the verbsRdmaCm configuration variable must be enabled. This setting enables the RDMA Connection Manager, which is a prerequisite for using RoCE.
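
A sketch of the corresponding settings, with placeholder device names for the RoCE adapter ports, looks as follows:

# RoCE adapter ports; device names are placeholders and the ports need a valid IP configuration
mmchconfig verbsPorts="mlx5_2/1 mlx5_3/1"
# Enable the RDMA Connection Manager, a prerequisite for RoCE
mmchconfig verbsRdmaCm=enable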

For more information, see Highly Efficient Data Access with RoCE on IBM Elastic Storage® Systems and IBM Spectrum® Scale.

To configure the CUDA software stack, the rdma_dev_addr_list key in /etc/cufile.json must contain all or some of the IP addresses of the RoCE ports configured in the IBM Storage Scale verbsPorts configuration variable.

Configuring RDMA ports on the GPU client

Configuring the RDMA ports requires the following steps; a verification sketch follows the procedure:
  1. Configuration of the RDMA ports to be used by IBM Storage Scale and GDS on the GPU client machine.
    Specify the RDMA ports to be used in the verbsPorts config option, for example:
    root:~# mmlsconfig verbsports
    verbsPorts mlx5_4/1 mlx5_5/1 mlx5_10/1 mlx5_11/1

    These are the ports that IBM Storage Scale uses, and they can also be used by GDS. GDS can use all of these ports but does not have to.

  2. Determine the IP addresses for the RDMA ports.
    Determine the device names by using the ibdev2netdev command:
    # ibdev2netdev
    mlx5_4 port 1 ==> enp97s0f0 (Up)
    mlx5_5 port 1 ==> enp97s0f1 (Up)
    mlx5_10 port 1 ==> enp225s0f0 (Up)
    mlx5_11 port 1 ==> enp225s0f1 (Up)
    List the IP addresses assigned by using the ip command:
    # ip -br -4 a
    enp97s0f0        UP             192.168.1.20/24
    enp97s0f1        UP             192.168.1.21/24
    enp225s0f0       UP             192.168.1.22/24
    enp225s0f1       UP             192.168.1.23/24
  3. Use these IP addresses in the /etc/cufile.json configuration file:

    The config file /etc/cufile.json has an entry for the RDMA device address list called rdma_dev_addr_list.

    Set it to all or some of the IP addresses found in the previous step.

    "rdma_dev_addr_list": ["192.168.1.20", "192.168.1.21", "192.168.1.22", "192.168.1.23"],