Using RDMA with pagepool larger than 8GB
If you receive an RDMA error similar to this
when starting GPFS, you most likely need to increase the value of the mlx4_core module parameter log_mtts_per_seg.
When GPFS starts with Mellanox InfiniBand RDMA (VERBS) enabled, it maps all of the memory defined by pagepool into the RDMA (VERBS) driver. In fact, it maps it twice, so it actually maps 2x the memory defined by the pagepool parameter. By default the mlx4 driver can map about 32GiB of memory, which equates to a GPFS pagepool setting of just under 16GiB.
The default value of log_num_mtt is 20, and the default value of log_mtts_per_seg is 3.
With this configuration (2^20 segments * 2^3 entries per segment * 4KiB per page = 32GiB), 32GiB is the maximum memory that can be registered to InfiniBand based on the MTT resources configured for mlx4_core. Since GPFS registers twice the value of pagepool and some other MTT space is used elsewhere, the maximum pagepool you can use with the default settings is somewhere just below 16GiB.
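The arithmetic above can be reproduced with a short Python sketch. The function name max_registerable_bytes and the 4KiB page size default are assumptions made for illustration; they are not part of GPFS or the mlx4 driver.

```python
PAGE_SIZE = 4096  # assumed 4KiB pages
GIB = 1 << 30

def max_registerable_bytes(log_num_mtt, log_mtts_per_seg, page_size=PAGE_SIZE):
    """Upper bound on registerable memory implied by the MTT size (hypothetical helper)."""
    segments = 1 << log_num_mtt                # number of MTT segments
    entries_per_segment = 1 << log_mtts_per_seg  # one entry maps one page
    return segments * entries_per_segment * page_size

default_limit = max_registerable_bytes(20, 3)
print(default_limit // GIB)       # 32
# GPFS maps pagepool twice, so pagepool must stay below half of the limit.
print(default_limit // 2 // GIB)  # 16
```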
You can increase the maximum amount of memory supported by increasing the value of log_mtts_per_seg. For example, to support a pagepool of 24GiB, increase log_mtts_per_seg to 4.
This raises the maximum amount of memory that can be mapped to 64GiB.
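As a rough sketch of how to pick the value for a given pagepool, the following Python helper searches for the smallest log_mtts_per_seg whose limit is strictly larger than twice the pagepool. The helper name, the 4KiB page size, and the upper bound of 7 for log_mtts_per_seg are assumptions; check the limits reported for your driver version before relying on them.

```python
GIB = 1 << 30

def min_log_mtts_per_seg(pagepool_bytes, log_num_mtt=20, page_size=4096,
                         start=3, max_log=7):
    """Smallest log_mtts_per_seg that leaves room for 2x pagepool (hypothetical helper).

    max_log=7 is an assumed upper bound, not taken from the text above.
    """
    needed = 2 * pagepool_bytes  # GPFS maps pagepool twice
    for log_mtts_per_seg in range(start, max_log + 1):
        limit = (1 << log_num_mtt) * (1 << log_mtts_per_seg) * page_size
        # Strictly less than, because some MTT space is used elsewhere.
        if needed < limit:
            return log_mtts_per_seg
    raise ValueError("pagepool too large for the assumed parameter range")

print(min_log_mtts_per_seg(24 * GIB))  # 4
```

With these assumptions a 16GiB pagepool also needs the value 4, because 2 x 16GiB would consume the entire default 32GiB limit.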
These parameters are set on the mlx4_core module, either in /etc/modprobe.conf or in a line placed at the end of the /etc/modprobe.d/mlx4_core.conf file, depending on your version of Linux.
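The exact line is not shown here, so the following is only a sketch of what it would typically look like, using the standard modprobe "options <module> <name>=<value>" syntax. The helper mlx4_core_options_line is hypothetical, and the value 4 is simply the example from above.

```python
def mlx4_core_options_line(**params):
    """Render a modprobe 'options' line for mlx4_core (hypothetical helper)."""
    settings = " ".join(f"{name}={value}" for name, value in params.items())
    return f"options mlx4_core {settings}"

# Example for a 24GiB pagepool, keeping log_num_mtt at its default.
print(mlx4_core_options_line(log_mtts_per_seg=4))
# options mlx4_core log_mtts_per_seg=4
```

Module parameters are read when the module is loaded, so a change of this kind normally takes effect only after the driver is reloaded or the node is rebooted.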
So we're talking about log_num_mtt and log_mtts_per_seg, which are parameters
that control the memory translation table (MTT).
MTT has segments, each segment has entries. Each entry can hold one translation,
which means that it can let you register one page.
log_num_mtt controls the number of MTT segments (logarithmic scale), log_mtts_per_seg
controls the number of entries per segment (also on a logarithmic scale).
Each memory registration uses either a whole segment or a multiple of segments.
You can't have two separate memory registrations in the same segment, even if
there are unused entries in the segment.
So what do we get? MTT fragmentation.
Larger segments - more internal fragmentation, but fewer segments used per registration.
Smaller segments - less fragmentation, but more segments per registration.
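To make the trade-off concrete, here is a small Python sketch that counts the segments and the unused entries for a single registration. The helper mtt_usage and the 9-page registration are made up for illustration; the model assumes only what is stated above, namely that a registration occupies whole segments.

```python
def mtt_usage(pages, log_mtts_per_seg):
    """Return (segments used, entries wasted) for one registration (hypothetical helper)."""
    entries_per_seg = 1 << log_mtts_per_seg
    segments = -(-pages // entries_per_seg)      # ceiling division
    wasted = segments * entries_per_seg - pages  # entries no other registration can use
    return segments, wasted

# One registration of 9 pages under three segment sizes.
for log_mtts_per_seg in (1, 3, 5):
    segments, wasted = mtt_usage(9, log_mtts_per_seg)
    print(log_mtts_per_seg, segments, wasted)
```

For the 9-page case this gives 5 segments with 1 wasted entry at log_mtts_per_seg=1, 2 segments with 7 wasted entries at 3, and 1 segment with 23 wasted entries at 5.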
Every application is different, so YMMV. I don't have any extensive research to back
my statement, but I've been told that sometimes smaller segments have a benefit.
You can try both ways and see if there is a difference. There's a big chance you won't see any.
As for 2x physical memory: because of MTT internal fragmentation, you need the MTT to
have more entries than there are physical pages in the memory. 2x seems enough.
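As a sketch of that rule of thumb, the following Python helper computes the smallest log_num_mtt for which the MTT has at least 2x as many entries as there are physical pages. The helper name, the 4KiB page size, and the example memory sizes are assumptions for illustration only.

```python
GIB = 1 << 30

def min_log_num_mtt(phys_bytes, log_mtts_per_seg=3, page_size=4096, factor=2):
    """Smallest log_num_mtt giving factor x physical pages as MTT entries (hypothetical helper)."""
    pages = -(-phys_bytes // page_size)  # physical pages, rounded up
    entries_needed = factor * pages
    entries_per_seg = 1 << log_mtts_per_seg
    segments_needed = -(-entries_needed // entries_per_seg)
    # (n - 1).bit_length() is ceil(log2(n)) for n >= 1.
    return (segments_needed - 1).bit_length()

print(min_log_num_mtt(64 * GIB))                      # 22
print(min_log_num_mtt(64 * GIB, log_mtts_per_seg=4))  # 21
```

The same coverage can be reached by raising either parameter, which is why the choice between them comes down to the fragmentation trade-off described above.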