IBM Support

IJ01910: EXCESSIVE RDMA ERRORS LOGGED

 

APAR status

  • Closed as program error.

Error description

  • When client X recreates all memory registrations to server
    Y, this may cause RDMA errors for all the other NSD servers
    that have RDMA in progress to client X, because the other
    NSD servers are using the "old" memory registrations that
    are no longer valid.
    
    Reported in:
    Spectrum Scale 4.2.3.2 on CentOS Linux 7
    
    Known Impact:
    Deadlock
    
    Error messages such as:
    VERBS RDMA async event IBV_EVENT_QP_ACCESS_ERR on mlx5_0
    qp 0x7ffd98019a98.
    VERBS RDMA closed connection to <ip addr> (hostname) on
    mlx5_0 port 1 fabnum 0 error 733 index 10
    VERBS RDMA closed connection to <ip addr> (hostname) on
    mlx5_0 port 1 fabnum 0 error 733 index 27
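
To see how many of these messages point at each peer, the "closed connection" lines can be tallied with a small script. The following is a minimal sketch, assuming each message sits on one line in the form quoted above. The function name, the sample addresses, and the sample hostnames are invented for illustration, and real log lines may carry a prefix that is not shown here.

```python
import re
from collections import Counter

# Pattern for the "closed connection" message in the form quoted above.
# search() is used below so that any leading prefix on a log line is tolerated.
CLOSED_RE = re.compile(
    r"VERBS RDMA closed connection to (?P<addr>\S+) \((?P<host>[^)]*)\) "
    r"on (?P<device>\S+) port (?P<port>\d+) fabnum (?P<fabnum>\d+) "
    r"error (?P<error>\d+) index (?P<index>\d+)"
)


def count_closed_by_peer(lines):
    """Count 'closed connection' messages per (hostname, error code) pair."""
    counts = Counter()
    for line in lines:
        match = CLOSED_RE.search(line)
        if match:
            counts[(match.group("host"), int(match.group("error")))] += 1
    return counts


# Hypothetical sample lines: the addresses and hostnames are invented.
SAMPLE = [
    "VERBS RDMA closed connection to 192.0.2.11 (nsd01) on mlx5_0 "
    "port 1 fabnum 0 error 733 index 10",
    "VERBS RDMA closed connection to 192.0.2.12 (nsd02) on mlx5_0 "
    "port 1 fabnum 0 error 733 index 27",
    "VERBS RDMA async event IBV_EVENT_QP_ACCESS_ERR on mlx5_0 "
    "qp 0x7ffd98019a98.",
]
```

Calling count_closed_by_peer(SAMPLE) counts the two "closed connection" lines and ignores the async event line.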
    

Local fix

Problem summary

  • Excessive RDMA errors are being logged.  When an NSD
    client node attempts to send, or sends, an nsdMsgReadExt
    or nsdMsgWriteExt message to an NSD server that uses
    RDMA, the tscomm layer on the NSD client node may return
    an error if the message is not sent, or the NSD server
    may return an error.
    
    If the NSD server initiated the RDMA and the RDMA fails,
    for example with error IBV_WC_RETRY_EXC_ERR, the NSD
    server replies with E_RDMA.  The NSD client node
    processes the E_RDMA error by recreating the RDMA
    connection.  This works as designed and does not cause
    any RDMA message to be logged beyond the NSD client node
    and the NSD server node.
    
    If the error code is E_IO or E_NOT_NSD_SERVER, we know the
    NSD server did not attempt an RDMA read or write.  The NSD
    client node does not destroy the RDMA connection, no RDMA
    error is logged, and the NSD client node retries the
    network I/O over TCP.
    
    The NSD client node processes any other error by
    recreating all memory registrations that map the client
    pagepool to InfiniBand.  Recreating all memory
    registrations causes RDMA errors to ripple through the
    cluster.  When NSD client node X recreates all memory
    registrations to NSD server Y, this may cause RDMA errors
    for all the other NSD servers that have RDMA in progress to
    NSD client node X, because the other NSD servers are using
    the "old" memory registrations that are no longer valid.
    This is why RDMA errors to various NSD servers are logged
    at about the same time in the mmfs.log.
    
    Memory registrations are recreated for errors other than
    E_RDMA, E_IO, and E_NOT_NSD_SERVER (with E_NODEV,
    E_DISK_UNAVAIL, and E_MSGSIZE soon to be added to that
    list) because we have to guarantee that the RDMA operation
    requested by the NSD client node is no longer possible
    after the error is processed.  This prevents remote nodes
    from performing RDMA to the pagepool when we do not expect
    it.  However, recreating all memory registrations is a
    brute-force way to accomplish this and causes excessive
    RDMA errors to be logged.
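
The error handling described above can be summarized as a small decision table. The following is a minimal sketch of the behavior before the fix, not the product code: the Action enum, both function names, and the string representation of the error codes are assumptions made for illustration.

```python
from enum import Enum, auto


class Action(Enum):
    """Hypothetical labels for the three client reactions described above."""
    RECREATE_CONNECTION = auto()
    RETRY_OVER_TCP = auto()
    RECREATE_ALL_REGISTRATIONS = auto()


def client_action(error_code: str) -> Action:
    """Map an error code to the client reaction, per the summary above."""
    if error_code == "E_RDMA":
        # The server attempted the RDMA and it failed.
        return Action.RECREATE_CONNECTION
    if error_code in ("E_IO", "E_NOT_NSD_SERVER"):
        # The server never attempted an RDMA read or write.
        return Action.RETRY_OVER_TCP
    # Any other error: the brute-force path that this APAR addresses.
    return Action.RECREATE_ALL_REGISTRATIONS


def other_servers_affected(action: Action, servers_with_rdma_in_progress):
    """Return the other NSD servers that may log RDMA errors as a side effect.

    Only the brute-force path invalidates registrations that other servers
    are still using; the other two paths stay between client and server.
    """
    if action is Action.RECREATE_ALL_REGISTRATIONS:
        return set(servers_with_rdma_in_progress)
    return set()
```

For example, an error code outside the three named ones, handled while two other servers have RDMA in progress to the client, yields both of those servers as possibly affected, whereas E_RDMA yields none.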
    

Problem conclusion

  • Instead of recreating all memory registrations, we can
    achieve the same guarantee by transitioning the QP for
    the RDMA connection to state ERR.  Only the local node
    and the remote node are then affected in terms of
    logging RDMA errors.
    
    As part of the fix, the error IBV_EVENT_QP_ACCESS_ERR
    will no longer be logged.  Logging it is not necessary,
    because the error is a by-product of breaking the RDMA
    connection.
    
    An undocumented boolean configuration option,
    verbsRdmaEnableDeregMemBypass, was added.  It controls
    whether the QP is transitioned to state ERR instead of
    all memory registrations being recreated.  It is enabled
    by default.
    
    An undocumented boolean configuration option,
    verbsRdmaEnableVerboseLogging, was added.  It controls
    verbose RDMA logging and is disabled by default.  When
    it is enabled, the error IBV_EVENT_QP_ACCESS_ERR is
    logged.
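
The effect of the two options can be modelled as follows. This is a minimal sketch under stated assumptions: the RdmaConfig class, the function names, and the returned labels are invented for illustration, and only the option names and their defaults come from the text above. With libibverbs, a QP is normally moved to the error state by calling ibv_modify_qp with the state IBV_QPS_ERR; whether the fix does exactly that is not stated in this APAR.

```python
from dataclasses import dataclass


@dataclass
class RdmaConfig:
    """Hypothetical container for the two options and their stated defaults."""
    verbsRdmaEnableDeregMemBypass: bool = True   # enabled by default
    verbsRdmaEnableVerboseLogging: bool = False  # disabled by default


def action_for_other_error(cfg: RdmaConfig) -> str:
    """Reaction to an error that is neither E_RDMA nor a retry-over-TCP code."""
    if cfg.verbsRdmaEnableDeregMemBypass:
        # New behavior: break only the affected connection.
        return "transition_qp_to_err"
    # Old behavior: invalidate every registration of the client pagepool.
    return "recreate_all_memory_registrations"


def logs_qp_access_err(cfg: RdmaConfig) -> bool:
    """IBV_EVENT_QP_ACCESS_ERR is logged only with verbose logging enabled."""
    return cfg.verbsRdmaEnableVerboseLogging


def nodes_logging_rdma_errors(cfg, client, server, others_in_progress):
    """Upper bound on the nodes that log RDMA errors for one such error.

    Assumption: on the old path the other servers are included because
    they "may" see errors; they are not guaranteed to.
    """
    nodes = {client, server}
    if not cfg.verbsRdmaEnableDeregMemBypass:
        nodes |= set(others_in_progress)
    return nodes
```

With the defaults, nodes_logging_rdma_errors(RdmaConfig(), "X", "Y", ["Z1", "Z2"]) contains only X and Y, which matches the statement that only the local node and the remote node are affected.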
    

Temporary fix

Comments

APAR Information

  • APAR number

    IJ01910

  • Reported component name

    SPECTRUM SCALE

  • Reported component ID

    5725Q01LX

  • Reported release

    423

  • Status

    CLOSED PER

  • PE

    NoPE

  • HIPER

    NoHIPER

  • Special Attention

    NoSpecatt / Xsystem

  • Submitted date

    2017-11-15

  • Closed date

    2018-02-05

  • Last modified date

    2018-02-05

  • APAR is sysrouted FROM one or more of the following:

  • APAR is sysrouted TO one or more of the following:

    IJ22052

Fix information

  • Fixed component name

    SPECTRUM SCALE

  • Fixed component ID

    5725Q01LX

Applicable component levels


Document Information

Modified date:
05 February 2018