RDMA tuning

This topic describes how to tune RDMA attributes to avoid problems in configurations that use InfiniBand.

Settings for IBM Storage Scale 5.0.x and later

Registering the page pool to InfiniBand
If the GPFS daemon cannot register the page pool to InfiniBand, it fails with the following mmfs log messages:
VERBS RDMA Shutdown because pagepool could not be registered to Infiniband.
VERBS RDMA Try increasing Infiniband device MTTs or reducing pagepool size.
To resolve this problem, try adjusting the following mlx4_core module parameters for the Memory Translation Tables (MTTs), as shown in the example after these steps. This adjustment does not apply to mlx5_core parameters.
  1. Set log_mtts_per_seg to 0, which is the recommended value.
  2. Increase the value of log_num_mtt.
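For example, on Linux systems where mlx4_core options are set through a modprobe configuration file, the adjustment might look like the following sketch. The file name and the log_num_mtt value of 24 are illustrative only; as a rough guide, the memory that can be registered is approximately PAGE_SIZE x 2^log_num_mtt x 2^log_mtts_per_seg, so choose a value large enough to cover the page pool. The change takes effect after the mlx4_core module is reloaded or the node is restarted.

  # Example contents of /etc/modprobe.d/mlx4_core.conf (file name is illustrative)
  options mlx4_core log_mtts_per_seg=0 log_num_mtt=24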

Enabling verbsRdmaSend
The verbsRdmaSend attribute of the mmchconfig command enables or disables the use of InfiniBand RDMA rather than TCP for most GPFS daemon-to-daemon communications. When the attribute is disabled, only data transfers between an NSD server and an NSD client are eligible for RDMA. When the attribute is enabled, the GPFS daemon uses InfiniBand RDMA connections for daemon-to-daemon communications only with nodes that are at IBM Storage Scale 5.0.0 or later. For more information, see mmchconfig command.
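As a minimal illustration, the attribute can be enabled cluster-wide with mmchconfig and the current setting can be displayed with mmlsconfig; verify the exact syntax against the command reference for your release:

  mmchconfig verbsRdmaSend=yes
  mmlsconfig verbsRdmaSend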

Settings for IBM Storage Scale 4.2.3.x

Registering the page pool to InfiniBand
Follow the instructions for registering the page pool to InfiniBand in Settings for IBM Storage Scale 5.0.x and later earlier in this topic.

Enabling verbsRdmaSend
Read the discussion of setting verbsRdmaSend in Settings for IBM Storage Scale 5.0.x and later earlier in this topic. For 4.2.3.x, be aware of the following points:
  • Do not enable verbsRdmaSend in clusters with more than 500 nodes.
  • Disable verbsRdmaSend if either of the following types of error appears in the mmfs log:
    • Out of memory errors
    • InfiniBand error IBV_WC_RNR_RETRY_EXC_ERR
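If either type of error appears, the attribute can be turned off again. The following is shown only as an illustration; check the 4.2.3 command reference for the exact syntax:

  mmchconfig verbsRdmaSend=no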

Setting scatterBufferSize in very large clusters (> 2100 nodes)
The scatterBufferSize attribute of the mmchconfig command has a default value of 32768, which provides good performance under most conditions. However, if the CPU use on the NSD I/O servers is high and client I/O is lower than expected, increasing the value of scatterBufferSize might improve performance. Try the following settings:
  • For Mellanox FDR 10 InfiniBand: 131072.
  • For Mellanox FDR 14 InfiniBand: 262144.
This attribute is not described in regular IBM Storage Scale documentation.
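For example, on a Mellanox FDR 14 fabric the adjustment might be applied as follows; the value shown is the FDR 14 suggestion from the preceding list:

  mmchconfig scatterBufferSize=262144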

Setting verbsRdmasPerNode in large clusters (> 100 nodes)
The verbsRdmasPerNode attribute of the mmchconfig command sets the maximum number of RDMA data transfer requests that can be active at the same time on a single node. The default value is 1000. If the cluster is large (more than 100 nodes), the suggested value is the same as the value that is set for the nsdMaxWorkerThreads attribute.

This attribute is supported only in IBM Storage Scale version 4.2.x. For more information, see mmchconfig command in Command reference for IBM Storage Scale 4.2.3.
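As an illustration, the current nsdMaxWorkerThreads setting can be displayed with mmlsconfig and verbsRdmasPerNode can then be set to match it. The value 512 below is only a placeholder for whatever value mmlsconfig reports on your cluster:

  mmlsconfig nsdMaxWorkerThreads
  mmchconfig verbsRdmasPerNode=512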

Suggested CPU tuning for Sandy Bridge processors

For Intel Sandy Bridge processors, if RDMA performance is lower than expected, ensure that the C-states that reduce CPU voltage are disabled on the affected nodes.
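How C-states are disabled depends on the platform: typically this is done in the system BIOS or UEFI setup, or by limiting C-states at the kernel level. As one example on Linux, the idle states currently in use can be inspected with cpupower, and deep C-states can be limited with kernel boot parameters such as the following; verify the exact parameters for your kernel and distribution:

  cpupower idle-info                                # display the C-states that are currently enabled
  intel_idle.max_cstate=1 processor.max_cstate=1    # example kernel boot parameters that limit deep C-states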