APAR status
Closed as program error.
Error description
When client X recreates all memory registrations to server Y, this may cause RDMA errors for all the other NSD servers that have RDMA in progress to client X because other NSD servers are using the "old" memory registrations that are no longer valid.
Reported in: Spectrum Scale 4.2.3.2 on CentOS Linux 7
Known Impact: Deadlock
Error Message such as:
VERBS RDMA async event IBV_EVENT_QP_ACCESS_ERR on mlx5_0 qp 0x7ffd98019a98.
VERBS RDMA closed connection to <ip addr> (hostname) on mlx5_0 port 1 fabnum 0 error 733 index 10
VERBS RDMA closed connection to <ip addr> (hostname) on mlx5_0 port 1 fabnum 0 error 733 index 27
Local fix
Problem summary
Excessive RDMA errors are being logged.
When an NSD client node sends, or attempts to send, an nsdMsgReadExt or nsdMsgWriteExt to an NSD server that uses RDMA, the tscomm layer on the NSD client node may return an error if the message is not sent, or the NSD server may return an error. How the NSD client node handles the error depends on the error code.
If the NSD server initiated the RDMA and the RDMA fails, for example with error IBV_WC_RETRY_EXC_ERR, the NSD server replies with E_RDMA. The NSD client node processes the E_RDMA error by recreating the RDMA connection. This works as designed and does not cause any RDMA message to be logged beyond the NSD client node and the NSD server node.
If the error code is E_IO or E_NOT_NSD_SERVER, the NSD server is known not to have attempted an RDMA read or write. The NSD client node does not destroy the RDMA connection, no RDMA error is logged, and the NSD client node retries the network I/O over TCP.
The NSD client node processes any other error by recreating all memory registrations that map the client pagepool to InfiniBand. Recreating all memory registrations has a rippling effect throughout the cluster with respect to RDMA errors being logged. When NSD client node X recreates all memory registrations to NSD server Y, this may cause RDMA errors for all the other NSD servers that have RDMA in progress to NSD client node X, because those NSD servers are using the "old" memory registrations that are no longer valid. This is why RDMA errors to various NSD servers are logged at about the same time in the mmfs.log.
Memory registrations are recreated for errors other than E_RDMA, E_IO, and E_NOT_NSD_SERVER (with E_NODEV, E_DISK_UNAVAIL, and E_MSGSIZE soon to be added to that list) because it must be guaranteed that the RDMA operation requested by the NSD client node is no longer possible after the error is processed. This prevents remote nodes from performing RDMA to the pagepool when it is not expected. However, recreating all memory registrations is a brute force way to accomplish this and causes excessive RDMA errors to be logged.
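The error handling described above can be summarized as a small decision function. The following Python sketch is illustrative only and is not the product code: the names NsdError, ClientAction, client_action, and the E_OTHER placeholder are assumptions made for this example, and the error codes that are "soon to be added" to the exception list are deliberately not modeled.

```python
from enum import Enum, auto


class NsdError(Enum):
    """Error codes named in this APAR, plus a hypothetical catch-all."""
    E_RDMA = auto()
    E_IO = auto()
    E_NOT_NSD_SERVER = auto()
    E_OTHER = auto()  # hypothetical stand-in for "any other error"


class ClientAction(Enum):
    """Hypothetical labels for the three client-side reactions."""
    RECREATE_CONNECTION = auto()         # only client and server log RDMA errors
    RETRY_OVER_TCP = auto()              # connection kept, nothing logged
    RECREATE_ALL_REGISTRATIONS = auto()  # brute force, ripples through the cluster


def client_action(err: NsdError) -> ClientAction:
    """Map an error seen by the NSD client to its pre-fix reaction."""
    if err is NsdError.E_RDMA:
        # The server attempted the RDMA and it failed.
        return ClientAction.RECREATE_CONNECTION
    if err in (NsdError.E_IO, NsdError.E_NOT_NSD_SERVER):
        # The server never attempted an RDMA read or write.
        return ClientAction.RETRY_OVER_TCP
    # Anything else: make sure the requested RDMA can no longer happen.
    return ClientAction.RECREATE_ALL_REGISTRATIONS


if __name__ == "__main__":
    for err in NsdError:
        print(err.name, "->", client_action(err).name)
```

The last branch is the one this APAR changes: it is the only reaction whose side effects reach NSD servers other than the one that returned the error.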
Problem conclusion
Instead of recreating all memory registrations, the same guarantee can be achieved by simply transitioning the QP for the RDMA connection to state ERR. This affects only the local node and the remote node in terms of logging RDMA errors. As part of the fix, the error IBV_EVENT_QP_ACCESS_ERR is no longer logged by default. Logging it is not necessary, because the error is a by-product of breaking the RDMA connection.
An undocumented boolean configuration option, verbsRdmaEnableDeregMemBypass, was added to control the code that transitions the QP to state ERR instead of recreating all memory registrations. It is enabled by default.
An undocumented boolean configuration option, verbsRdmaEnableVerboseLogging, was added to control verbose RDMA logging. It is disabled by default. When enabled, the error IBV_EVENT_QP_ACCESS_ERR is logged.
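To illustrate why the two approaches differ in blast radius, the following Python sketch uses a toy model of one NSD client and the servers with RDMA in progress to it. It is a simplification under stated assumptions, not the product implementation: the ClientModel class, its "epoch" counter standing in for the set of memory registrations, and all method names are hypothetical.

```python
class ClientModel:
    """Toy model: one NSD client, several servers with RDMA in progress to it."""

    def __init__(self, servers):
        # Assumption: a single counter represents the client's current
        # set of memory registrations for the pagepool.
        self.epoch = 0
        # Registration epoch each server is still using for in-flight RDMA.
        self.server_epoch = {s: 0 for s in servers}
        # One QP per server connection; False means the QP is usable.
        self.qp_in_error = {s: False for s in servers}

    def recreate_all_registrations(self):
        """Old behavior: every previously handed-out registration goes stale."""
        self.epoch += 1

    def move_qp_to_error(self, server):
        """New behavior: break only the connection to the given server."""
        self.qp_in_error[server] = True

    def servers_with_failing_rdma(self):
        """Servers whose in-flight RDMA to this client would now fail."""
        return sorted(
            s for s in self.server_epoch
            if self.qp_in_error[s] or self.server_epoch[s] != self.epoch
        )


servers = ["Y", "A", "B"]

old = ClientModel(servers)
old.recreate_all_registrations()
print(old.servers_with_failing_rdma())  # ['A', 'B', 'Y']

new = ClientModel(servers)
new.move_qp_to_error("Y")
print(new.servers_with_failing_rdma())  # ['Y']
```

In this model, recreating the registrations invalidates what every server holds, while moving one QP to the error state affects only the connection to server Y, which matches the reduced logging described above.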
Temporary fix
Comments
APAR Information
APAR number
IJ01910
Reported component name
SPECTRUM SCALE
Reported component ID
5725Q01LX
Reported release
423
Status
CLOSED PER
PE
NoPE
HIPER
NoHIPER
Special Attention
NoSpecatt / Xsystem
Submitted date
2017-11-15
Closed date
2018-02-05
Last modified date
2018-02-05
APAR is sysrouted FROM one or more of the following:
APAR is sysrouted TO one or more of the following:
Fix information
Fixed component name
SPECTRUM SCALE
Fixed component ID
5725Q01LX
Applicable component levels
[{"Business Unit":{"code":"BU058","label":"IBM Infrastructure w\/TPS"},"Product":{"code":"STXKQY","label":"IBM Spectrum Scale"},"Component":"","ARM Category":[],"Platform":[{"code":"PF025","label":"Platform Independent"}],"Version":"423","Edition":"","Line of Business":{"code":"LOB26","label":"Storage"}},{"Business Unit":{"code":"BU058","label":"IBM Infrastructure w\/TPS"},"Product":{"code":"SSFKCN","label":"General Parallel File System"},"Component":"","ARM Category":[],"Platform":[{"code":"PF025","label":"Platform Independent"}],"Version":"423","Edition":"","Line of Business":{"code":"","label":""}}]
Document Information
Modified date:
05 February 2018