Configuring Mellanox Memory Translation Table (MTT) for GPFS RDMA VERBS Operation
You need to configure the Mellanox Memory Translation Table (MTT) with a size that accommodates the GPFS pagepool before using GPFS RDMA over Mellanox InfiniBand (VERBS).
How GPFS pagepool size affects Mellanox InfiniBand RDMA (VERBS) configuration
Improperly configuring the Mellanox MTT can lead to the following problems:
- Excessive logging of RDMA-related errors in the IBM Storage Scale log file.
- Shutdown of the GPFS daemon due to memory limitations, which can result in the loss of NSD access if it occurs on an NSD server node.
Mellanox variables
The Mellanox mlx4_core driver module has two parameters, log_num_mtt and log_mtts_per_seg, that control its MTT size and therefore define the amount of memory that can be registered by the GPFS daemon. Both parameters are expressed as powers of 2.
- log_num_mtt defines the number of translation segments that are used.
- log_mtts_per_seg defines the number of entries per translation segment.
Each MTT entry maps a single page, as defined by the hardware architecture, to the mlx4_core driver. For example, setting log_num_mtt to 20 yields 1,048,576 segments, which is 2 to the power of 20, and setting log_mtts_per_seg to 3 yields 8 entries per segment, which is 2 to the power of 3. These parameters are set on the mlx4_core options line in the /etc/modprobe.conf file, or on a line at the end of the /etc/modprobe.d/mlx4_core.conf file, depending on your version of Linux®. Here is an example of how the parameters can be set in those files:
options mlx4_core log_num_mtt=23 log_mtts_per_seg=0
To check the configuration of the mlx4_core driver, use the following commands:
# cat /sys/module/mlx4_core/parameters/log_num_mtt
23
# cat /sys/module/mlx4_core/parameters/log_mtts_per_seg
0
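The total amount of memory that can be registered follows directly from these two parameters and the hardware page size. The following shell lines are a minimal sketch of that calculation; they assume the mlx4_core module is loaded and that getconf PAGESIZE reports the page size used by the adapter.
# Registerable memory = 2^log_num_mtt x 2^log_mtts_per_seg x page size
num_mtt=$(cat /sys/module/mlx4_core/parameters/log_num_mtt)
mtts_per_seg=$(cat /sys/module/mlx4_core/parameters/log_mtts_per_seg)
page_size=$(getconf PAGESIZE)
echo $(( (1 << num_mtt) * (1 << mtts_per_seg) * page_size ))   # size in bytes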
GPFS pagepool mapping
When the GPFS daemon starts with the verbsRdma parameter enabled, GPFS attempts to register the pagepool with the mlx4_core driver. Because GPFS registers the pagepool twice, the values of the Mellanox parameters must allow the mappable memory to be at least twice the size of the GPFS pagepool. If the pagepool size is not a power of 2, it is rounded up to the next power of 2, and this rounded-up size is used when registering the pagepool with the mlx4_core driver. If the attempt to map the GPFS pagepool to the mlx4_core driver fails, the GPFS daemon shuts down and logs messages similar to the following:
VERBS RDMA Shutdown because pagepool could not be registered to Infiniband.
VERBS RDMA Try increasing Infiniband device MTTs or reducing pagepool size.
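To choose values for a given pagepool, round the pagepool size up to the next power of 2, double it, and then solve the formula above for log_num_mtt. The following sketch, which fixes log_mtts_per_seg at 0, shows one hypothetical way to compute the setting; the 32 GB pagepool used as input is an example value.
# Example input: a 32 GB pagepool.
pagepool_bytes=$((32 * 1024 * 1024 * 1024))
page_size=$(getconf PAGESIZE)

# Round the pagepool up to the next power of 2, then double it,
# because GPFS registers the pagepool twice.
needed=1
while [ "$needed" -lt "$pagepool_bytes" ]; do needed=$((needed * 2)); done
needed=$((needed * 2))

# With log_mtts_per_seg=0, each segment maps one page, so 2^log_num_mtt
# must cover needed / page_size pages.
log_num_mtt=0
while [ $((1 << log_num_mtt)) -lt $((needed / page_size)) ]; do
    log_num_mtt=$((log_num_mtt + 1))
done
echo "log_num_mtt=$log_num_mtt log_mtts_per_seg=0"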
Example to support a GPFS pagepool of 32 GB
x86:
log_num_mtt=24
log_mtts_per_seg=0
page size 4 KB
2^log_num_mtt X 2^log_mtts_per_seg X page size
2^24 X 2^0 X 4,096
16,777,216 X 1 X 4,096 = 68,719,476,736 (64 GB)
ppc64:
log_num_mtt=20
log_mtts_per_seg=0
page size 64 KB
2^log_num_mtt X 2^log_mtts_per_seg X page size
2^20 X 2^0 X 65,536
1,048,576 X 1 X 65,536 = 68,719,476,736 (64 GB)
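On a running node, you can check whether the configured MTT capacity is at least twice the pagepool. The following sketch assumes that the pagepool is already a power of 2 and that mmlsconfig pagepool reports a single cluster-wide value such as pagepool 32G; the output parsing shown is a simplified assumption.
# Memory that mlx4_core can register, from the live parameters.
num_mtt=$(cat /sys/module/mlx4_core/parameters/log_num_mtt)
mtts_per_seg=$(cat /sys/module/mlx4_core/parameters/log_mtts_per_seg)
capacity=$(( (1 << num_mtt) * (1 << mtts_per_seg) * $(getconf PAGESIZE) ))

# Pagepool size in bytes, converted from a suffix such as 32G.
pagepool=$(mmlsconfig pagepool | awk '{print $2}' | numfmt --from=iec)

if [ "$capacity" -ge $((2 * pagepool)) ]; then
    echo "OK: MTT capacity of $capacity bytes covers twice the pagepool"
else
    echo "WARNING: MTT capacity of $capacity bytes is less than twice the pagepool"
fi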