Using RDMA with pagepool larger than 8GB

Updated 1/15/14, 3:19 PM by ScottGPFS

If you receive an RDMA error similar to this

Wed Apr 20 09:53:38 CEST 2011: runmmfs starting
Removing old /var/adm/ras/mmfs.log.* files:
Unloading modules from /lib/modules/2.6.18-194.el5/extra
Loading modules from /lib/modules/2.6.18-194.el5/extra
Module                  Size  Used by
mmfs26               1656104  0
mmfslinux             322632  1 mmfs26
tracedev               67020  2 mmfs26,mmfslinux
Wed Apr 20 09:53:39.976 2011: mmfsd initializing. {Version:   Built: Feb 15 2011 11:25:39} ...
Wed Apr 20 09:53:49.918 2011: OpenSSL library loaded and initialized.
Wed Apr 20 09:53:51.710 2011: VERBS RDMA starting.
Wed Apr 20 09:53:51.716 2011: VERBS RDMA library (version >= 1.1) loaded and initialized.
Wed Apr 20 09:53:54.268 2011: VERBS RDMA ibv_reg_mr err 12 device mlx4_0 addr 0x4000000000 len 8388608 KB. Try increasing device MTTs.
Wed Apr 20 09:53:56.748 2011: VERBS RDMA ibv_reg_mr err 12 device mlx4_1 addr 0x4000000000 len 8388608 KB. Try increasing device MTTs.
Wed Apr 20 09:53:56.749 2011: VERBS RDMA library unloaded.
Wed Apr 20 09:53:56.748 2011: VERBS RDMA failed to start.

when starting GPFS, you most likely need to increase the value of the mlx4_core configuration parameter log_mtts_per_seg.

When GPFS starts with Mellanox InfiniBand RDMA (VERBS) enabled, it registers all of the memory defined by pagepool with the RDMA (VERBS) driver. In fact it registers it twice, so it actually maps 2x the memory defined by the pagepool parameter. By default the mlx4 driver can map about 32GiB of memory, which equates to a GPFS pagepool setting of just under 16GiB.

To check the current configuration of the mlx4 driver, look at:

# more /sys/module/mlx4_core/parameters/log_num_mtt
# more /sys/module/mlx4_core/parameters/log_mtts_per_seg

The default value of log_num_mtt is 20, and the default value of log_mtts_per_seg is 3. Both are exponents:

log_num_mtt = 20 - used as 2^log_num_mtt, so 2^20 = 1Mi MTT segments
log_mtts_per_seg = 3 - used as 2^log_mtts_per_seg, so 2^3 = 8 entries per segment

So with this configuration (1Mi segments * 8 entries * 4K page size = 32GiB), 32GiB is the maximum amount of memory that can be registered with InfiniBand based on the MTT resources configured for mlx4_core. Since GPFS registers twice the value of pagepool, and some MTT space is used elsewhere, the maximum pagepool you can use with the default settings is somewhere just below 16GiB.

The formula to compute the maximum value of pagepool when using RDMA is:

2^log_num_mtt x 2^log_mtts_per_seg x PAGE_SIZE > ( 2 x pagepool )
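The formula can be checked on a live node with a short shell sketch. It reads the mlx4_core parameter files if present, falling back to the documented defaults of 20 and 3 otherwise, and takes PAGE_SIZE from getconf:

```shell
# Read the mlx4_core MTT parameters; fall back to the documented
# defaults (20 and 3) if the module is not loaded on this host.
mtt_file=/sys/module/mlx4_core/parameters/log_num_mtt
seg_file=/sys/module/mlx4_core/parameters/log_mtts_per_seg
log_num_mtt=$( [ -r "$mtt_file" ] && cat "$mtt_file" || echo 20 )
log_mtts_per_seg=$( [ -r "$seg_file" ] && cat "$seg_file" || echo 3 )
page_size=$(getconf PAGE_SIZE)   # typically 4096

# 2^log_num_mtt segments x 2^log_mtts_per_seg entries x PAGE_SIZE bytes
max_reg=$(( (1 << log_num_mtt) * (1 << log_mtts_per_seg) * page_size ))

# GPFS registers the pagepool twice, so pagepool must stay below half.
max_pagepool=$(( max_reg / 2 ))

echo "max registerable memory: $(( max_reg >> 30 )) GiB"
echo "max pagepool (approx):   $(( max_pagepool >> 30 )) GiB"
```

With the default parameters and 4K pages this reports 32 GiB of registerable memory and a pagepool limit just under 16 GiB, matching the numbers above.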

You can increase the maximum amount of memory that can be registered by increasing the value of log_mtts_per_seg. For example, to support a pagepool of 24GiB you increase log_mtts_per_seg to 4.

log_num_mtt = 20
log_mtts_per_seg = 4

This equates to a maximum of 64GiB of memory mapped.

2^20 segments x 2^4 entries x 4K page size = 64GiB
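The 24GiB example can be verified with the same arithmetic (the 24GiB target and 4K page size are the assumptions from the text):

```shell
# Hypothetical target from the text: a 24 GiB pagepool with
# log_num_mtt=20 and log_mtts_per_seg=4, assuming 4 KiB pages.
max_reg=$(( (1 << 20) * (1 << 4) * 4096 ))   # 2^36 bytes = 64 GiB
pagepool=$(( 24 << 30 ))                     # 24 GiB

# GPFS registers the pagepool twice: 48 GiB < 64 GiB, so it fits.
if [ $(( 2 * pagepool )) -lt "$max_reg" ]; then
    echo "24 GiB pagepool fits"
fi
```

Note that 2 x 24GiB = 48GiB of registrations stays under the 64GiB limit, with headroom left for other MTT consumers.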

These parameters are set on the mlx4_core module in /etc/modprobe.conf, or by placing the line at the end of /etc/modprobe.d/mlx4_core.conf, depending on your version of Linux:

options mlx4_core log_num_mtt=20 log_mtts_per_seg=4
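A minimal sketch of persisting the setting on a modprobe.d-based distribution follows; the file path is an assumption, and the driver must be reloaded (or the node rebooted) before the change takes effect. Reloading mlx4_core interrupts InfiniBand traffic, so do this in a maintenance window:

```shell
# Assumed path; some distributions use /etc/modprobe.conf instead.
echo "options mlx4_core log_num_mtt=20 log_mtts_per_seg=4" \
    >> /etc/modprobe.d/mlx4_core.conf

# Reload the driver (disruptive) or reboot, then verify:
#   modprobe -r mlx4_core && modprobe mlx4_core
#   cat /sys/module/mlx4_core/parameters/log_mtts_per_seg
```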

These parameters as described above are generally sufficient for GPFS pagepool usage. RDMA may also be used by other software, especially MPI stacks, and it may be necessary to tune MTTs for larger memory allocations, where an application may want to transfer most of the system's memory via RDMA. The optimal choice of log_num_mtt and log_mtts_per_seg for a given memory size may vary with application characteristics. A good discussion of this appeared on the Open MPI development mailing list:
For convenience, the key section of that discussion is quoted below:

So we're talking about log_num_mtt and log_mtts_per_seg, which are parameters that control the memory translation table (MTT).

The MTT has segments, and each segment has entries. Each entry can hold one translation, which means that it can let you register one page. log_num_mtt controls the number of MTT segments (logarithmic scale); log_mtts_per_seg controls the number of entries per segment.

Each memory registration uses either a whole segment, or multiples of segments. You can't have two separate memory registrations in the same segment, even if there are unused entries in the segment. So what do we get? MTT fragmentation.

Larger segments - more internal fragmentation, but fewer segments used per registration. Smaller segments - less fragmentation, but more segments per registration.

Every application is different, so YMMV. I don't have any extensive research to back my statement, but I've been told that sometimes smaller segments have a benefit. You can try both ways and see if there is a difference. There's a big chance you won't see any.

As for 2x physical memory: because of MTT internal fragmentation, you need the MTT to have more entries than there are physical pages in memory. 2x seems enough.

-- YK