Configuring Mellanox Memory Translation Table (MTT) for GPFS RDMA VERBS Operation

You need to configure the Mellanox Memory Translation Table (MTT) with the correct size, based on the GPFS pagepool size, for GPFS RDMA or Mellanox InfiniBand RDMA (VERBS) operation.

How GPFS pagepool size affects Mellanox InfiniBand RDMA (VERBS) configuration

Improperly configuring the Mellanox MTT can lead to the following problems:

  • Excessive logging of RDMA-related errors in the IBM Storage Scale log file.
  • Shutdown of the GPFS daemon due to memory limitations. This can result in the loss of NSD access if this occurs on an NSD server node.

To avoid these problems, take the size of the GPFS pagepool into account when you configure the Mellanox MTT. For more information, see GPFS and memory.

Mellanox variables

The Mellanox mlx4_core driver module has the following two parameters, log_num_mtt and log_mtts_per_seg, which control its MTT size and therefore define the amount of memory that can be registered by the GPFS daemon. Both parameters are specified as exponents of 2.

  • log_num_mtt defines the number of translation segments that are used.
  • log_mtts_per_seg defines the number of entries per translation segment.

Each MTT entry maps a single page, whose size is defined by the hardware architecture, to the mlx4_core driver. For example, setting log_num_mtt to 20 results in 1,048,576 segments (2 to the power of 20), and setting log_mtts_per_seg to 3 results in 8 entries per segment (2 to the power of 3). These parameters are set on the mlx4_core options line in the /etc/modprobe.conf file, or on a line at the end of the /etc/modprobe.d/mlx4_core.conf file, depending on your version of Linux®. Here is an example of how the parameters can be set in those files:

options mlx4_core log_num_mtt=23 log_mtts_per_seg=0
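
On distributions that read /etc/modprobe.d/, the line can be appended with a command such as the following sketch (the file name matches the example above; the new values take effect only when the mlx4_core module is next loaded, for example after a reboot):

echo "options mlx4_core log_num_mtt=23 log_mtts_per_seg=0" >> /etc/modprobe.d/mlx4_core.conf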

To check the current configuration of the mlx4_core driver, use the following commands:

# cat /sys/module/mlx4_core/parameters/log_num_mtt
23
# cat /sys/module/mlx4_core/parameters/log_mtts_per_seg
0
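
The amount of memory that these settings allow to be registered can be estimated from the values shown above and the system page size. The following sketch assumes that the mlx4_core module is loaded, so that the sysfs files shown above exist, and applies the formula that is used in the example at the end of this topic:

log_num_mtt=$(cat /sys/module/mlx4_core/parameters/log_num_mtt)
log_mtts_per_seg=$(cat /sys/module/mlx4_core/parameters/log_mtts_per_seg)
page_size=$(getconf PAGESIZE)
# 2^log_num_mtt X 2^log_mtts_per_seg X page size
echo $(( (1 << log_num_mtt) * (1 << log_mtts_per_seg) * page_size ))
# For the values shown above with a 4096-byte page size, this prints 34359738368 (32 GB).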

GPFS pagepool mapping

When the GPFS daemon starts and the verbsRdma parameter is enabled, GPFS attempts to register the pagepool with the mlx4_core driver. Because GPFS registers the pagepool twice, the values of the Mellanox parameters must allow the mappable memory to be at least twice the size of the GPFS pagepool. If the pagepool size is not a power of 2, it is rounded up to the next power of 2, and this rounded-up size is used when the pagepool is registered with the mlx4_core driver. If the attempt to map the GPFS pagepool to the mlx4_core driver fails, the GPFS daemon shuts down and logs messages similar to the following:

VERBS RDMA Shutdown because pagepool could not be registered to Infiniband.
VERBS RDMA Try increasing Infiniband device MTTs or reducing pagepool size.
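
For planning purposes, the smallest log_num_mtt value that satisfies this requirement for a given pagepool size can be estimated with shell arithmetic. The following sketch is illustrative only; the pagepool size and the log_mtts_per_seg value are assumptions that you replace with your own settings. Because the mappable size is always a power of 2 multiplied by the page size, comparing it against twice the pagepool also covers the round-up to a power of 2 that is described above.

pagepool_bytes=$(( 32 * 1024 * 1024 * 1024 ))   # assumed GPFS pagepool size (32 GB)
page_size=$(getconf PAGESIZE)                   # 4096 on x86, 65536 on ppc64
log_mtts_per_seg=0                              # assumed driver setting

# Find the smallest log_num_mtt such that
# 2^log_num_mtt X 2^log_mtts_per_seg X page size >= 2 X pagepool
log_num_mtt=0
while [ $(( (1 << log_num_mtt) * (1 << log_mtts_per_seg) * page_size )) -lt $(( 2 * pagepool_bytes )) ]
do
    log_num_mtt=$(( log_num_mtt + 1 ))
done
echo "minimum log_num_mtt: $log_num_mtt"        # 24 with 4 K pages, 20 with 64 K pages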

Example to support a GPFS pagepool of 32 GB

If the GPFS pagepool is set to 32 GB, then the RDMA mapping for this pagepool must be at least 64 GB. In addition to the two Mellanox configuration variables described previously, you need to know the page size that is used by the architecture on which IBM Storage Scale is running.
Note: The x86 architecture uses a page size of 4096 bytes (4 K) and the Power® architecture (ppc64) uses a page size of 65536 bytes (64 K). Here are the mappings for each architecture for a GPFS pagepool of 32 GB.

x86:

log_num_mtt=24
log_mtts_per_seg=0
page size 4 K (4096 bytes)

2^log_num_mtt X 2^log_mtts_per_seg X page size
2^24 X 2^0 X 4096
16,777,216 X 1 X 4096 = 68,719,476,736 (64 GB)

ppc64:

log_num_mtt=20
log_mtts_per_seg=0
page size 64 K (65,536 bytes)

2^log_num_mtt X 2^log_mtts_per_seg X page size
2^20 X 2^0 X 65,536
1,048,576 X 1 X 65,536 = 68,719,476,736 (64 GB)
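
Both mappings can be checked with shell arithmetic; the values are taken from the example above, and only the echo commands are new:

echo $(( (1 << 24) * (1 << 0) * 4096 ))    # x86:   68719476736 (64 GB)
echo $(( (1 << 20) * (1 << 0) * 65536 ))   # ppc64: 68719476736 (64 GB)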