Configuring GPUDirect Storage for IBM Storage Scale
After IBM Storage Scale is installed, the GPUDirect Storage (GDS) feature can be enabled by running the command mmchconfig verbsGPUDirectStorage=yes. This requires that IBM Storage Scale is stopped on all nodes.
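For illustration, a minimal sketch of that sequence is shown below; mmshutdown and mmstartup are used here to stop and start the daemon on all nodes, and your environment might require additional options:
# mmshutdown -a
# mmchconfig verbsGPUDirectStorage=yes
# mmstartup -a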
You also need to set the following configuration options by using the mmchconfig command on the GDS clients and storage servers:
- minReleaseLevel
- Must be 5.1.2 or later.
- verbsRdma=enable
- verbsRdmaSend=yes
- verbsPorts
- The values must be consistent with the value of the rdma_dev_addr_list parameter in /etc/cufile.json, that is, the IP addresses assigned to rdma_dev_addr_list in /etc/cufile.json need to be assigned to the RDMA devices listed in the IBM Storage Scale configuration variable verbsPorts. For more information, see the Configuring RDMA ports on the GPU client section below.
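As a sketch of how these prerequisites might be applied, the RDMA device names (mlx5_0/1, mlx5_1/1) and the node class gdsNodes are placeholders for your environment; minReleaseLevel is only checked here, because it is raised through the normal release-update procedure rather than set directly:
# mmchconfig verbsRdma=enable,verbsRdmaSend=yes -N gdsNodes
# mmchconfig verbsPorts="mlx5_0/1 mlx5_1/1" -N gdsNodes
# mmlsconfig minReleaseLevel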
Configuring virtual fabrics
The RDMA subsystem within IBM Storage Scale supports virtual fabrics to control how RDMA ports on NSD clients and NSD servers communicate with each other through Queue Pairs. Only RDMA ports on the same virtual fabric communicate with each other. With this feature, it is possible to use GDS on setups with multiple separate InfiniBand fabrics. The virtual fabric number is specified as an optional third component of each verbsPorts entry (device/port/fabric), for example:
verbsPorts mlx5_4/1/0 mlx5_5/1/0 mlx5_10/1/1 mlx5_11/1/1
For more information on virtual fabrics, see the topic mmchconfig command.
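For illustration, the example value above could be applied with the mmchconfig command; restrict it with -N if only a subset of nodes uses these ports:
# mmchconfig verbsPorts="mlx5_4/1/0 mlx5_5/1/0 mlx5_10/1/1 mlx5_11/1/1"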
Virtual fabric numbers used on GDS clients must be used on all NSD servers. Otherwise, an error is thrown if an NSD server cannot reach the GPU client within its virtual fabric. No special configuration changes within the NVIDIA GDS software stack are required for virtual fabrics. All IP addresses configured in /etc/cufile.json for the key rdma_dev_addr_list must be reachable by the NSD servers, and the verbsPorts configuration variable needs to be set accordingly. If GDS I/O operations go through an RDMA port that is not listed in verbsPorts, the result is an I/O error and an error message is logged in the IBM Storage Scale log file. The verbsPorts syntax remains unchanged.
All NSD servers must have RDMA ports in all virtual fabrics that are configured on the NSD clients that perform I/O through GDS. For example, if the RDMA ports on the GDS clients are configured to use virtual fabric numbers 1, 2, 3, and 4, then RDMA ports with the same four virtual fabric numbers must be configured on the NSD servers. When a GDS client submits a GDS request through an RDMA port on virtual fabric number 4, but the NSD server does not have an RDMA port on virtual fabric number 4, the request fails and results in an I/O error in the GDS application. An error message is also recorded in the IBM Storage Scale log file.
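As a sketch of a matching configuration with two virtual fabrics, where the device names and node classes (gdsClients, nsdServers) are placeholders, the GDS clients and every NSD server each expose ports on fabrics 1 and 2:
# mmchconfig verbsPorts="mlx5_0/1/1 mlx5_1/1/2" -N gdsClients
# mmchconfig verbsPorts="mlx5_2/1/1 mlx5_3/1/2" -N nsdServers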
Configuring CUDA
The following keys in the CUDA configuration file /etc/cufile.json are relevant for GDS with IBM Storage Scale:
- rdma_dev_addr_list
- Defines the RDMA devices to be used as a list of IP addresses. The IP addresses (RoCE or IP over IB) specified must be consistent with the values that are set for the verbsPorts parameter on the GDS clients. For more information, see the Configuring RDMA ports on the GPU client section below.
- rdma_load_balancing_policy
- Specifies the load-balancing policy for RDMA memory registration. If the GDS client is a DGX, the following values must be set:
  - RoundRobin: For storage Network Interface Cards (NICs).
  - RoundRobinMaxMin: For compute NICs.
  The default value is RoundRobin. For more information on DGX, see https://www.nvidia.com/en-us/data-center/dgx-systems/.
- rdma_access_mask
- Enables relaxed ordering. Set the value 0x1f.
- "logging"."level"
- Defines the log level. Set the value ERROR or WARN unless debug output is required. Setting log levels such as DEBUG and TRACE impacts performance.
- use_poll_mode
- Switches the NVIDIA driver between asynchronous and synchronous I/O modes. Set the value false for configuring GDS for IBM Storage Scale.
- gds_write_support
- For accelerated writes, the following key/value pair has to be added in the file system-specific section ("fs": "gpfs"):
  "gpfs": { "gds_write_support": true }
Configuring RoCE
To enable RoCE for GDS, enable the general RoCE support for IBM Storage Scale. No special configuration settings are needed to enable the RoCE support for GDS.
To enable generic RoCE support, all RoCE adapter ports must have a proper IP configuration and these ports must be listed in the verbsPorts configuration variable. In addition, the verbsRdmaCm configuration variable must be enabled. This setting enables the RDMA Connection Manager, which is a prerequisite for using RoCE.
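As a sketch, the RDMA Connection Manager can be enabled with mmchconfig; apply it to the appropriate nodes for your environment:
# mmchconfig verbsRdmaCm=enable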
For more information, see Highly Efficient Data Access with RoCE on IBM Elastic Storage® Systems and IBM Spectrum® Scale.
To configure the CUDA software stack, the key rdma_dev_addr_list in the configuration file /etc/cufile.json must contain all or a subset of the IP addresses of the RoCE ports that are configured in the IBM Storage Scale verbsPorts configuration variable.
Configuring RDMA ports on the GPU client
- Configure the RDMA ports to be used by IBM Storage Scale and GDS on the GPU client machine. Specify the RDMA ports to be used in the verbsPorts config option, for example:
root:~# mmlsconfig verbsports
verbsPorts mlx5_4/1 mlx5_5/1 mlx5_10/1 mlx5_11/1
These are the ports used by IBM Storage Scale and they can also be used by GDS. GDS can use all ports but does not have to.
- Determine the IP addresses for the RDMA ports. Determine the device names by using the ibdev2netdev command:
# ibdev2netdev
mlx5_4 port 1 ==> enp97s0f0 (Up)
mlx5_5 port 1 ==> enp97s0f1 (Up)
mlx5_10 port 1 ==> enp225s0f0 (Up)
mlx5_11 port 1 ==> enp225s0f1 (Up)
List the IP addresses assigned by using the ip command:
# ip -br -4 a
enp97s0f0 UP 192.168.1.20/24
enp97s0f1 UP 192.168.1.21/24
enp225s0f0 UP 192.168.1.22/24
enp225s0f1 UP 192.168.1.23/24
- Use these IP addresses in the config file /etc/cufile.json:
The config file /etc/cufile.json has an entry for the RDMA device address list called rdma_dev_addr_list. This should be set to all or some of the IP addresses found in the previous step.
"rdma_dev_addr_list": ["192.168.1.20", "192.168.1.21", "192.168.1.22", "192.168.1.23"],