SystemAdmin
SystemAdmin
2092 Posts

waiting for RDMA XXXX

‏2012-11-26T15:07:32Z |
Hi,

We ran into an issue related to the number of RDMAs.
One of our NSD servers shows waiters such as:
  • waiting for RDMA read DTO completion
  • waiting for RDMA write DTO completion
  • waiting for conn rdmas < conn maxrdmas

Attached is the output of mmfsadm dump waiters, captured when we decided to shut down GPFS on that NSD server.
Does anyone know what the above messages mean, especially "read DTO completion" and "write DTO completion"?

The file system worked fine after we shut down GPFS on that NSD server.
Does this imply a problem with the IB fabric?

Many thanks,
Updated on 2012-11-26T23:20:02Z by SystemAdmin
  • SystemAdmin
    SystemAdmin
    2092 Posts

    Re: waiting for RDMA XXXX

    ‏2012-11-26T17:39:51Z  
    (not an IBMer)

    RDMA DTO completion means RDMA Data Transfer Operation completion.

    This seems to be the key line:

    0x7F0C90003450 waiting 59291.928042724 seconds, Msg handler ccMsgGroupLeave: on ThCond 0x1253ED0 (0x1253ED0) (NsdServerCondVar), reason 'waiting for NSD active I/O queue to empty'

    I'm thinking you had a storage subsystem problem, so reads/writes to that NSD server were all hung. If you have hanging I/O (to disk), then the NSD server won't process reads/writes to/from disk, so all of the RDMAs sent from clients will hang and never get processed, hence the 'waiting for conn rdmas < conn maxrdmas' waiters. I'd bet if you ran mmfsadm dump verbs, you'd see you were at the verbsRdmasPerConnection limit...
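
To find the longest-running waiter in a large dump, such as the key line quoted above, a small script can rank waiters by wait time and tally them by reason. The following is a minimal sketch, not an IBM tool. It assumes every waiter line has the same shape as the one quoted in this thread (thread address, `waiting N seconds`, then `reason '...'`); the function names `parse_waiters`, `longest_waiter`, and `count_by_reason` are hypothetical.

```python
import re
from collections import Counter

# Assumption: waiter lines look like the one quoted in this thread, i.e.
#   0x<thread> waiting <N> seconds, <description>, reason '<text>'
# Other GPFS releases may format them differently.
WAITER_RE = re.compile(
    r"(?P<thread>0x[0-9A-Fa-f]+)\s+waiting\s+(?P<seconds>\d+(?:\.\d+)?)\s+seconds,"
    r".*?\breason\s+'(?P<reason>[^']*)'"
)

# The only sample used here is the waiter line quoted in the thread.
SAMPLE = (
    "0x7F0C90003450 waiting 59291.928042724 seconds, Msg handler ccMsgGroupLeave: "
    "on ThCond 0x1253ED0 (0x1253ED0) (NsdServerCondVar), "
    "reason 'waiting for NSD active I/O queue to empty'"
)


def parse_waiters(text):
    """Return a list of (seconds, thread, reason) tuples, one per matching line."""
    waiters = []
    for line in text.splitlines():
        match = WAITER_RE.search(line)
        if match:
            waiters.append(
                (
                    float(match.group("seconds")),
                    match.group("thread"),
                    match.group("reason"),
                )
            )
    return waiters


def longest_waiter(waiters):
    """Return the waiter with the largest wait time, or None if the list is empty."""
    return max(waiters, key=lambda w: w[0], default=None)


def count_by_reason(waiters):
    """Count how many waiters share each reason string."""
    return Counter(reason for _, _, reason in waiters)


if __name__ == "__main__":
    parsed = parse_waiters(SAMPLE)
    print("longest waiter:", longest_waiter(parsed))
    for reason, count in count_by_reason(parsed).most_common():
        print(count, reason)
```

Run against a full saved dump instead of `SAMPLE`, the oldest waiter is usually the place to start reading. A large count under 'waiting for conn rdmas < conn maxrdmas' alongside an old 'waiting for NSD active I/O queue to empty' waiter would be consistent with the hung-disk-I/O explanation above, though it does not prove it.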

    Just my .02...
  • SystemAdmin
    SystemAdmin
    2092 Posts

    Re: waiting for RDMA XXXX

    ‏2012-11-26T23:20:02Z  
    Hi,
    Thank you so much for the very helpful information.
    Unfortunately, I didn't capture the output of mmfsadm dump verbs.
    # I didn't know about this option.

    It looks like there was no problem on the storage side, but we will check again.
    We will also run mmfsadm dump verbs if we hit the same issue again.

    Many thanks,