IJ01910: EXCESSIVE RDMA ERRORS LOGGED

APAR status

Closed as program error.

Error description

When client X recreates all memory registrations to server
Y, this may cause RDMA errors for all the other NSD servers
that have RDMA in progress to client X because other NSD
servers are using the "old" memory registrations that are
no longer valid.

Reported in:
Spectrum Scale 4.2.3.2 on CentOS Linux 7

Known Impact:
Deadlock

Error Message such as:
VERBS RDMA async event IBV_EVENT_QP_ACCESS_ERR on mlx5_0
qp 0x7ffd98019a98.
VERBS RDMA closed connection to <ip addr> (hostname) on
mlx5_0 port 1 fabnum 0 error 733 index 10
VERBS RDMA closed connection to <ip addr> (hostname) on
mlx5_0 port 1 fabnum 0 error 733 index 27
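
To check whether these messages cluster across several NSD
servers, the "closed connection" lines can be tallied per
host and error code. The following is a minimal sketch, not
an IBM tool. It assumes each message occupies a single line
in mmfs.log and follows the layout of the samples above; the
addresses and host names (10.0.0.5, nsd01, and so on) are
made up for illustration, and any timestamp prefix is simply
ignored by the search.

```python
import re
from collections import Counter

# Pattern modeled on the sample messages in this APAR. The exact layout of
# real mmfs.log lines (timestamp prefix, address format) is an assumption.
CLOSED_RE = re.compile(
    r"VERBS RDMA closed connection to (?P<addr>\S+) \((?P<host>[^)]*)\) "
    r"on (?P<dev>\S+) port (?P<port>\d+) fabnum (?P<fabnum>\d+) "
    r"error (?P<error>\d+) index (?P<index>\d+)"
)


def summarize_closed_connections(lines):
    """Count 'closed connection' messages per (host, error code)."""
    counts = Counter()
    for line in lines:
        match = CLOSED_RE.search(line)
        if match:
            counts[(match.group("host"), int(match.group("error")))] += 1
    return counts


# Hypothetical sample lines; hosts and addresses are invented.
sample = [
    "VERBS RDMA closed connection to 10.0.0.5 (nsd01) on mlx5_0 "
    "port 1 fabnum 0 error 733 index 10",
    "VERBS RDMA closed connection to 10.0.0.6 (nsd02) on mlx5_0 "
    "port 1 fabnum 0 error 733 index 27",
    "VERBS RDMA async event IBV_EVENT_QP_ACCESS_ERR on mlx5_0 "
    "qp 0x7ffd98019a98.",
]

summary = summarize_closed_connections(sample)
for (host, error), count in sorted(summary.items()):
    print(f"{host}: error {error} seen {count} time(s)")
```

Many distinct hosts reporting the same error at about the
same time is consistent with the behavior described in this
APAR, although it does not prove that cause on its own.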

Local fix

Problem summary

Excessive RDMA errors are being logged.  When an NSD
client node sends, or attempts to send, an nsdMsgReadExt
or nsdMsgWriteExt message to an NSD server that uses RDMA,
the tscomm layer on the NSD client node may return an error
if the message is not sent, or the NSD server may return an
error.

If the NSD server initiated the RDMA and the RDMA fails,
for example with error IBV_WC_RETRY_EXC_ERR, the NSD
server replies with E_RDMA.  The NSD client node
processes the E_RDMA error by recreating the RDMA
connection.  This works as designed and does not cause any
RDMA message to be logged beyond the NSD client node and
the NSD server node.

If the error code is E_IO or E_NOT_NSD_SERVER, we know the
NSD server did not attempt an RDMA read or write.  The NSD
client node does not destroy the RDMA connection, no RDMA
error is logged, and the NSD client node retries the
network I/O over TCP.

The NSD client node will process any other error by
recreating all memory registrations mapping the client
pagepool to InfiniBand.  When the NSD client node recreates
all memory registrations, this will have a rippling effect
throughout the cluster with respect to RDMA errors being
logged.  When NSD client node X recreates all memory
registrations to NSD server Y, this may cause RDMA errors
for all the other NSD servers that have RDMA in progress to
NSD client node X because other NSD servers are using the
"old" memory registrations that are no longer valid.  This
is why you see RDMA errors logged at about the same time
in the mmfs.log to various NSD servers.
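
The three cases above (before the fix) can be summarized as
a small decision function. This is an illustrative sketch
only: the error names come from this APAR, but the function,
the Recovery enum, and the use of strings for error codes
are hypothetical and do not reflect the actual GPFS source.

```python
from enum import Enum


class Recovery(Enum):
    """Hypothetical labels for the client-side reactions described above."""
    RECREATE_CONNECTION = "recreate the RDMA connection to that server"
    RETRY_OVER_TCP = "keep the RDMA connection and retry over TCP"
    RECREATE_ALL_REGISTRATIONS = "recreate all pagepool memory registrations"


def pre_fix_recovery(error_code: str) -> Recovery:
    """Map an error code to the NSD client reaction described in the text."""
    if error_code == "E_RDMA":
        # The server attempted RDMA and it failed: only this connection
        # is rebuilt, so only the two nodes involved log anything.
        return Recovery.RECREATE_CONNECTION
    if error_code in ("E_IO", "E_NOT_NSD_SERVER"):
        # The server never attempted RDMA: nothing is torn down.
        return Recovery.RETRY_OVER_TCP
    # Any other error: the brute-force path that affects every server
    # with RDMA in progress to this client.
    return Recovery.RECREATE_ALL_REGISTRATIONS


# "E_SOMETHING_ELSE" is a made-up placeholder for any other error code.
for code in ("E_RDMA", "E_IO", "E_NOT_NSD_SERVER", "E_SOMETHING_ELSE"):
    print(f"{code}: {pre_fix_recovery(code).value}")
```

Only the final branch has cluster-wide side effects, which
is why it is the one changed by this APAR.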

Memory registrations are recreated for every error other
than E_RDMA, E_IO, and E_NOT_NSD_SERVER (E_NODEV,
E_DISK_UNAVAIL, and E_MSGSIZE are soon to be added to that
list) because we have to guarantee that the RDMA operation
requested by the NSD client node will not be possible after
the error is processed.  This avoids remote nodes
performing RDMA to the pagepool when we do not expect it.
However, recreating all memory registrations is a brute
force way to accomplish this and causes excessive RDMA
errors to be logged.

Problem conclusion

Instead of recreating all memory registrations, we can
accomplish the same result by simply transitioning the QP
for the RDMA connection to state ERR.  In terms of logging
RDMA errors, this affects only the local node and the
remote node.
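
The difference in scope between the two approaches can be
shown with a toy model. This is a deliberately simplified
sketch, not GPFS or verbs code: the ToyClient class, its
single "generation" counter standing in for the memory
registrations, and the server names are all assumptions
made for illustration, and every server is assumed to have
RDMA in progress to the client.

```python
class ToyClient:
    """Toy NSD client: one registration generation, one QP per server."""

    def __init__(self, servers):
        self.generation = 0
        # Generation of the registrations each server currently relies on.
        self.server_generation = {name: 0 for name in servers}
        # True while the QP for that server's connection is usable.
        self.qp_usable = {name: True for name in servers}

    def recreate_all_registrations(self):
        # Old behavior: every server is left holding stale registrations.
        self.generation += 1

    def set_qp_error(self, server):
        # New behavior: only the QP for one connection is put in ERR state.
        self.qp_usable[server] = False

    def servers_seeing_errors(self):
        return sorted(
            name
            for name in self.qp_usable
            if not self.qp_usable[name]
            or self.server_generation[name] != self.generation
        )


servers = ["Y", "A", "B"]  # hypothetical NSD servers with RDMA in progress

old_behavior = ToyClient(servers)
old_behavior.recreate_all_registrations()  # error came from server Y
old_blast = old_behavior.servers_seeing_errors()

new_behavior = ToyClient(servers)
new_behavior.set_qp_error("Y")
new_blast = new_behavior.servers_seeing_errors()

print("recreate all registrations affects:", old_blast)
print("QP to state ERR affects:", new_blast)
```

In the model, the old path leaves all three servers with
stale registrations, while the new path touches only the
connection to server Y.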

As part of the fix, the error IBV_EVENT_QP_ACCESS_ERR
is no longer logged.  Logging it is not necessary because
the error is a by-product of breaking the RDMA
connection.

An undocumented boolean configuration option,
verbsRdmaEnableDeregMemBypass, was added.  It controls
whether the code transitions the QP to state ERR instead
of recreating all memory registrations.  It is enabled by
default.

An undocumented boolean configuration option,
verbsRdmaEnableVerboseLogging, was added.  It controls
verbose RDMA logging and is disabled by default.  When it
is enabled, the error IBV_EVENT_QP_ACCESS_ERR is logged.

Temporary fix

Comments

APAR Information

APAR number: IJ01910
Reported component name: SPECTRUM SCALE
Reported component ID: 5725Q01LX
Reported release: 423
Status: CLOSED PER
PE: NoPE
HIPER: NoHIPER
Special Attention: NoSpecatt / Xsystem
Submitted date: 2017-11-15
Closed date: 2018-02-05
Last modified date: 2018-02-05

APAR is sysrouted FROM one or more of the following:
APAR is sysrouted TO one or more of the following:

IJ22052

Fix information

Fixed component name: SPECTRUM SCALE
Fixed component ID: 5725Q01LX

Applicable component levels

IBM Spectrum Scale (STXKQY), version 423, Platform Independent
General Parallel File System (SSFKCN), version 423, Platform Independent

Document Information

Modified date:
05 February 2018
