wjnadia
4 Posts

A lot of Long waiters

2012-08-13T17:59:21Z
I recently upgraded GPFS from 3.4.0-* to 3.5.0-2, and I have had a problem since the upgrade.
Whenever I run a filesystem-related GPFS command such as mmlsfs, mmdf, mmlsdisk, or mmlsmount, it hangs and never returns a result.
Despite this, I can still access my filesystems normally (I tested I/O with iozone).
While the commands were hanging, "mmdiag --waiters" on the GPFS servers showed a lot of long waiters, as shown below.

I have also attached the trace report file.

Any comments would be appreciated.
Thanks in advance.

root@pgfs03 ~# mmdiag --waiters

=== mmdiag: waiters ===
0x11F69450 waiting 154.066201000 seconds, CommandMsgHandlerThread: on ThCond 0x11B8B298 (0x11B8B298) (MsgRecordCondvar), reason 'RPC wait' for sgmMsgSGClientCmd on node 134.75.117.111 <c0n1>
0x11F67940 waiting 207.235842000 seconds, CommandMsgHandlerThread: on ThCond 0x11F68E38 (0x11F68E38) (MsgRecordCondvar), reason 'RPC wait' for sgmMsgSGClientCmd on node 134.75.117.111 <c0n1>
0x11F65E30 waiting 225.726267000 seconds, CommandMsgHandlerThread: on ThCond 0x11F42288 (0x11F42288) (MsgRecordCondvar), reason 'RPC wait' for sgmMsgSGClientCmd on node 134.75.117.112 <c0n2>
0x11F649A0 waiting 297.386174000 seconds, CommandMsgHandlerThread: on ThCond 0x11BC7C38 (0x11BC7C38) (MsgRecordCondvar), reason 'RPC wait' for sgmMsgSGClientCmd on node 134.75.117.111 <c0n1>
0x11F63510 waiting 317.415957000 seconds, CommandMsgHandlerThread: on ThCond 0x11C17D18 (0x11C17D18) (MsgRecordCondvar), reason 'RPC wait' for sgmMsgSGClientCmd on node 134.75.117.111 <c0n1>
0x11B95EE0 waiting 337.443457000 seconds, CommandMsgHandlerThread: on ThCond 0x11B96BA8 (0x11B96BA8) (MsgRecordCondvar), reason 'RPC wait' for sgmMsgSGClientCmd on node 134.75.117.111 <c0n1>
0x11B94B90 waiting 344.940083000 seconds, CommandMsgHandlerThread: on ThCond 0x11C29398 (0x11C29398) (MsgRecordCondvar), reason 'RPC wait' for sgmMsgSGClientCmd on node 134.75.117.112 <c0n2>
0x11F6C400 waiting 377.510938000 seconds, CommandMsgHandlerThread: on ThCond 0x11F207E8 (0x11F207E8) (MsgRecordCondvar), reason 'RPC wait' for sgmMsgSGClientCmd on node 134.75.117.111 <c0n1>
0x11F6BFD0 waiting 397.536584000 seconds, CommandMsgHandlerThread: on ThCond 0x11E24848 (0x11E24848) (MsgRecordCondvar), reason 'RPC wait' for sgmMsgSGClientCmd on node 134.75.117.111 <c0n1>
0x11F6AAC0 waiting 457.569541000 seconds, CommandMsgHandlerThread: on ThCond 0x11BFF4E8 (0x11BFF4E8) (MsgRecordCondvar), reason 'RPC wait' for sgmMsgSGClientCmd on node 134.75.117.111 <c0n1>
0x11F46FC0 waiting 477.591149000 seconds, CommandMsgHandlerThread: on ThCond 0x11BFEE68 (0x11BFEE68) (MsgRecordCondvar), reason 'RPC wait' for sgmMsgSGClientCmd on node 134.75.117.111 <c0n1>
0x11E71D90 waiting 497.626887000 seconds, CommandMsgHandlerThread: on ThCond 0x2AAAAC06FE08 (0x2AAAAC06FE08) (MsgRecordCondvar), reason 'RPC wait' for sgmMsgSGClientCmd on node 134.75.117.112 <c0n2>
0x11E70740 waiting 517.653430000 seconds, CommandMsgHandlerThread: on ThCond 0x11F1E038 (0x11F1E038) (MsgRecordCondvar), reason 'RPC wait' for sgmMsgSGClientCmd on node 134.75.117.112 <c0n2>
0x11E6F2B0 waiting 534.134402000 seconds, CommandMsgHandlerThread: on ThCond 0x11F1E9D8 (0x11F1E9D8) (MsgRecordCondvar), reason 'RPC wait' for sgmMsgSGClientCmd on node 134.75.117.111 <c0n1>
0x11E6DE20 waiting 537.683927000 seconds, CommandMsgHandlerThread: on ThCond 0x2AAAAC06EA88 (0x2AAAAC06EA88) (MsgRecordCondvar), reason 'RPC wait' for sgmMsgSGClientCmd on node 134.75.117.112 <c0n2>
0x11E6C990 waiting 557.716852000 seconds, CommandMsgHandlerThread: on ThCond 0x11BF32A8 (0x11BF32A8) (MsgRecordCondvar), reason 'RPC wait' for sgmMsgSGClientCmd on node 134.75.117.112 <c0n2>
0x11E6B500 waiting 565.565454000 seconds, CommandMsgHandlerThread: on ThCond 0x2AAAAC1D90A8 (0x2AAAAC1D90A8) (MsgRecordCondvar), reason 'RPC wait' for sgmMsgSGClientCmd on node 134.75.117.112 <c0n2>
0x11F43BF0 waiting 577.746488000 seconds, CommandMsgHandlerThread: on ThCond 0x11BF25A8 (0x11BF25A8) (MsgRecordCondvar), reason 'RPC wait' for sgmMsgSGClientCmd on node 134.75.117.112 <c0n2>
0x11F428A0 waiting 597.774051000 seconds, CommandMsgHandlerThread: on ThCond 0x11BF1B38 (0x11BF1B38) (MsgRecordCondvar), reason 'RPC wait' for sgmMsgSGClientCmd on node 134.75.117.112 <c0n2>
0x11F3EEF0 waiting 677.825917000 seconds, CommandMsgHandlerThread: on ThCond 0x11BCA4B8 (0x11BCA4B8) (MsgRecordCondvar), reason 'RPC wait' for sgmMsgSGClientCmd on node 134.75.117.112 <c0n2>
0x11F3EAC0 waiting 657.804716000 seconds, CommandMsgHandlerThread: on ThCond 0x11C11308 (0x11C11308) (MsgRecordCondvar), reason 'RPC wait' for sgmMsgSGClientCmd on node 134.75.117.112 <c0n2>
0x11F1B6E0 waiting 754.457596000 seconds, CommandMsgHandlerThread: on ThCond 0x11E72348 (0x11E72348) (MsgRecordCondvar), reason 'RPC wait' for sgmMsgSGClientCmd on node 134.75.117.111 <c0n1>
0x11F18C40 waiting 873.207752000 seconds, CommandMsgHandlerThread: on ThCond 0x11C076E8 (0x11C076E8) (MsgRecordCondvar), reason 'RPC wait' for sgmMsgSGClientCmd on node 134.75.117.111 <c0n1>
0x11EF7400 waiting 893.574536000 seconds, CommandMsgHandlerThread: on ThCond 0x11C06268 (0x11C06268) (MsgRecordCondvar), reason 'RPC wait' for sgmMsgSGClientCmd on node 134.75.117.112 <c0n2>
0x2AAAB4275A30 waiting 945.294371000 seconds, CommandMsgHandlerThread: on ThCond 0x11BFD228 (0x11BFD228) (MsgRecordCondvar), reason 'RPC wait' for sgmMsgSGClientCmd on node 134.75.117.112 <c0n2>
0x11F092C0 waiting 1200.773254000 seconds, CommandMsgHandlerThread: on ThCond 0x11C10C88 (0x11C10C88) (MsgRecordCondvar), reason 'RPC wait' for sgmMsgSGClientCmd on node 134.75.117.111 <c0n1>
0x11F1A250 waiting 1233.938454000 seconds, CommandMsgHandlerThread: on ThCond 0x11BC9E38 (0x11BC9E38) (MsgRecordCondvar), reason 'RPC wait' for sgmMsgSGClientCmd on node 134.75.117.112 <c0n2>
0x11F20EC0 waiting 1545.437977000 seconds, CommandMsgHandlerThread: on ThCond 0x2AAAAC06F788 (0x2AAAAC06F788) (MsgRecordCondvar), reason 'RPC wait' for sgmMsgSGClientCmd on node 134.75.117.112 <c0n2>
0x11F20D20 waiting 1354.421550000 seconds, CommandMsgHandlerThread: on ThCond 0x11BBC9A8 (0x11BBC9A8) (MsgRecordCondvar), reason 'RPC wait' for sgmMsgSGClientCmd on node 134.75.117.111 <c0n1>
0x2AAAB02DD3E0 waiting 1553.008087000 seconds, CommandMsgHandlerThread: on ThCond 0x2AAAAC06F108 (0x2AAAAC06F108) (MsgRecordCondvar), reason 'RPC wait' for sgmMsgSGClientCmd on node 134.75.117.112 <c0n2>
0x2AAAB0316E60 waiting 1530.983588000 seconds, CommandMsgHandlerThread: on ThCond 0x11BFC528 (0x11BFC528) (MsgRecordCondvar), reason 'RPC wait' for sgmMsgSGClientCmd on node 134.75.117.111 <c0n1>
0x11F388B0 waiting 1866.900324000 seconds, CommandMsgHandlerThread: on ThCond 0x11BBD6A8 (0x11BBD6A8) (MsgRecordCondvar), reason 'RPC wait' for sgmMsgSGClientCmd on node 134.75.117.111 <c0n1>
0x11E22030 waiting 1887.672970000 seconds, CommandMsgHandlerThread: on ThCond 0x11F37C18 (0x11F37C18) (MsgRecordCondvar), reason 'RPC wait' for sgmMsgSGClientCmd on node 134.75.117.112 <c0n2>
0x2AAAB031E5C0 waiting 1954.511265000 seconds, CommandMsgHandlerThread: on ThCond 0x11BFFB68 (0x11BFFB68) (MsgRecordCondvar), reason 'RPC wait' for sgmMsgSGClientCmd on node 134.75.117.111 <c0n1>
0x11E59500 waiting 2145.294488000 seconds, CommandMsgHandlerThread: on ThCond 0x11BF2C28 (0x11BF2C28) (MsgRecordCondvar), reason 'RPC wait' for sgmMsgSGClientCmd on node 134.75.117.112 <c0n2>
0x11E57FC0 waiting 2199.833393000 seconds, CommandMsgHandlerThread: on ThCond 0x11BEF478 (0x11BEF478) (MsgRecordCondvar), reason 'RPC wait' for sgmMsgSGClientCmd on node 134.75.117.111 <c0n1>
0x11F2E730 waiting 2230.001788000 seconds, CommandMsgHandlerThread: on ThCond 0x11C25418 (0x11C25418) (MsgRecordCondvar), reason 'RPC wait' for sgmMsgSGClientCmd on node 134.75.117.112 <c0n2>
0x11F34CC0 waiting 2534.493363000 seconds, CommandMsgHandlerThread: on ThCond 0x11C0FF88 (0x11C0FF88) (MsgRecordCondvar), reason 'RPC wait' for sgmMsgSGClientCmd on node 134.75.117.111 <c0n1>
0x11F349E0 waiting 2554.475242000 seconds, CommandMsgHandlerThread: on ThCond 0x11BCB1B8 (0x11BCB1B8) (MsgRecordCondvar), reason 'RPC wait' for sgmMsgSGClientCmd on node 134.75.117.111 <c0n1>
0x11E62E80 waiting 3254.335517000 seconds, CommandMsgHandlerThread: on ThCond 0x11C08A68 (0x11C08A68) (MsgRecordCondvar), reason 'RPC wait' for sgmMsgSGClientCmd on node 134.75.117.111 <c0n1>
0x11F16AE0 waiting 2896.972635000 seconds, CommandMsgHandlerThread: on ThCond 0x11C172A8 (0x11C172A8) (MsgRecordCondvar), reason 'RPC wait' for sgmMsgSGClientCmd on node 134.75.117.112 <c0n2>
0x11EED630 waiting 2869.960157000 seconds, CommandMsgHandlerThread: on ThCond 0x11BC3858 (0x11BC3858) (MsgRecordCondvar), reason 'RPC wait' for sgmMsgSGClientCmd on node 134.75.117.111 <c0n1>
0x2AAAB002AC40 waiting 3284.577555000 seconds, CommandMsgHandlerThread: on ThCond 0x11C083E8 (0x11C083E8) (MsgRecordCondvar), reason 'RPC wait' for sgmMsgSGClientCmd on node 134.75.117.112 <c0n2>
0x2AAAAC0C41C0 waiting 3718.009912000 seconds, FsckClientReaperThread: on ThCond 0x11E29B58 (0x11E29B58) (FsckReaperCondvar), reason 'Waiting to reap fsck pointer'
0x2AAAAC0C0A20 waiting 3154.304491000 seconds, CommandMsgHandlerThread: on ThCond 0x11BBD028 (0x11BBD028) (MsgRecordCondvar), reason 'RPC wait' for sgmMsgSGClientCmd on node 134.75.117.111 <c0n1>
0x2AAAAC0C0730 waiting 2567.803986000 seconds, CommandMsgHandlerThread: on ThCond 0x11BCAB38 (0x11BCAB38) (MsgRecordCondvar), reason 'RPC wait' for sgmMsgSGClientCmd on node 134.75.117.112 <c0n2>
0x2AAAAC0BCD40 waiting 3718.009838000 seconds, FsckStaticThread: on ThCond 0x11E29BB0 (0x11E29BB0) (FsckStaticThreadCondvar), reason 'Waiting for static fsck work'
0x2AAAAC0A0330 waiting 2745.465012000 seconds, CommandMsgHandlerThread: on ThCond 0x11BC75B8 (0x11BC75B8) (MsgRecordCondvar), reason 'RPC wait' for sgmMsgSGClientCmd on node 134.75.117.112 <c0n2>
0x11C04ED0 waiting 357.482287000 seconds, CommandMsgHandlerThread: on ThCond 0x11BC5878 (0x11BC5878) (MsgRecordCondvar), reason 'RPC wait' for sgmMsgSGClientCmd on node 134.75.117.111 <c0n1>
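
A listing this long is easier to read once it is grouped by the node each RPC is waiting on. The following is a minimal sketch, not an IBM tool: it assumes waiter lines look exactly like the ones above, and the names `WAITER_RE`, `parse_waiters`, `summarize_by_node`, and `SAMPLE` are made up for illustration. The sample lines are copied from the output above.

```python
import re
from collections import defaultdict

# Pattern inferred from the waiter lines in this post (an assumption, not a
# documented format). The "for <rpc> on node <ip>" tail is optional because
# some waiters (e.g. the Fsck threads) have no remote target.
WAITER_RE = re.compile(
    r"^(?P<thread_addr>0x[0-9A-Fa-f]+) waiting (?P<seconds>\d+\.\d+) seconds, "
    r"(?P<thread>\w+): .*?reason '(?P<reason>[^']*)'"
    r"(?: for (?P<rpc>\w+) on node (?P<node>\S+))?"
)

def parse_waiters(text):
    """Return one dict per waiter line; non-waiter lines are skipped."""
    waiters = []
    for line in text.splitlines():
        m = WAITER_RE.match(line.strip())
        if m:
            w = m.groupdict()
            w["seconds"] = float(w["seconds"])
            waiters.append(w)
    return waiters

def summarize_by_node(waiters):
    """Count waiters and track the longest wait per target node."""
    summary = defaultdict(lambda: {"count": 0, "longest": 0.0})
    for w in waiters:
        key = w["node"] or "(local)"  # waiters with no remote node
        summary[key]["count"] += 1
        summary[key]["longest"] = max(summary[key]["longest"], w["seconds"])
    return dict(summary)

# Three lines taken verbatim from the output above.
SAMPLE = """\
=== mmdiag: waiters ===
0x11F69450 waiting 154.066201000 seconds, CommandMsgHandlerThread: on ThCond 0x11B8B298 (0x11B8B298) (MsgRecordCondvar), reason 'RPC wait' for sgmMsgSGClientCmd on node 134.75.117.111 <c0n1>
0x11F65E30 waiting 225.726267000 seconds, CommandMsgHandlerThread: on ThCond 0x11F42288 (0x11F42288) (MsgRecordCondvar), reason 'RPC wait' for sgmMsgSGClientCmd on node 134.75.117.112 <c0n2>
0x2AAAAC0C41C0 waiting 3718.009912000 seconds, FsckClientReaperThread: on ThCond 0x11E29B58 (0x11E29B58) (FsckReaperCondvar), reason 'Waiting to reap fsck pointer'
"""

for node, stats in summarize_by_node(parse_waiters(SAMPLE)).items():
    print(node, stats["count"], round(stats["longest"], 1))
```

Run against the full listing, a summary like this would show how the RPC waits split between 134.75.117.111 and 134.75.117.112, which may help decide which node to inspect first.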
  • vpaul
    79 Posts

    Re: A lot of Long waiters

    ‏2012-08-13T19:02:19Z  
    To better understand this distributed deadlock, please capture the following output:

    mmdsh -N all /usr/lpp/mmfs/bin/mmdiag --waiters

    Thanks.
  • wjnadia
    4 Posts

    Re: A lot of Long waiters

    ‏2012-08-14T00:13:29Z  
    I attached the output file generated by "pdsh -w pgfs01,03,13 'mmdiag --waiters'".
  • vpaul
    79 Posts

    Re: A lot of Long waiters

    ‏2012-08-14T18:53:06Z  
    What OS is this on? What are the minReleaseLevel and filesystem versions? You can get these by running "mmlsconfig" and "mmlsfs $fsname -V".

    There are thousands of cascaded waiters, with wait times of up to 6-7 hours. Would it be possible for you to reboot this cluster, check for waiters the first time they appear, and then capture the snap data? If the waiters do appear again, it would be best to open a service PMR.

    Thanks.
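
To catch waiters soon after they first appear, the collected per-node output can be filtered by age. The sketch below is only an illustration under stated assumptions: it expects each line to carry a `host: ` prefix, as pdsh adds to its output (mmdsh output is assumed here to look similar), and the 300-second threshold, the names `PREFIXED_RE`, `long_waiters`, and `PDSH_SAMPLE` are arbitrary choices. The sample reuses waiter lines from the first post, with the `pgfs03: ` prefix added by hand.

```python
import re

# Assumed line shape: "<host>: 0x... waiting <seconds> seconds, <Thread>: ..."
PREFIXED_RE = re.compile(
    r"^(?P<host>[^:\s]+):\s*(?P<addr>0x[0-9A-Fa-f]+) waiting "
    r"(?P<seconds>\d+(?:\.\d+)?) seconds, (?P<thread>\w+):"
)

def long_waiters(text, threshold_s=300.0):
    """Return (seconds, host, thread) tuples at or above the threshold,
    oldest waiter first."""
    hits = []
    for line in text.splitlines():
        m = PREFIXED_RE.match(line)
        if m and float(m["seconds"]) >= threshold_s:
            hits.append((float(m["seconds"]), m["host"], m["thread"]))
    return sorted(hits, reverse=True)

# Waiter lines from the first post; the host prefix is added for illustration.
PDSH_SAMPLE = """\
pgfs03: === mmdiag: waiters ===
pgfs03: 0x11F69450 waiting 154.066201000 seconds, CommandMsgHandlerThread: on ThCond 0x11B8B298 (0x11B8B298) (MsgRecordCondvar), reason 'RPC wait' for sgmMsgSGClientCmd on node 134.75.117.111 <c0n1>
pgfs03: 0x11F6C400 waiting 377.510938000 seconds, CommandMsgHandlerThread: on ThCond 0x11F207E8 (0x11F207E8) (MsgRecordCondvar), reason 'RPC wait' for sgmMsgSGClientCmd on node 134.75.117.111 <c0n1>
pgfs03: 0x2AAAAC0C41C0 waiting 3718.009912000 seconds, FsckClientReaperThread: on ThCond 0x11E29B58 (0x11E29B58) (FsckReaperCondvar), reason 'Waiting to reap fsck pointer'
"""

for seconds, host, thread in long_waiters(PDSH_SAMPLE):
    print(f"{host}: {thread} waiting {seconds:.0f}s")
```

Running a filter like this periodically after the reboot and lowering the threshold would flag new waiters early, at which point the snap data can be captured.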
  • Tucks
    78 Posts

    Re: A lot of Long waiters

    ‏2012-08-16T14:35:25Z  
    Is it possible to correlate a waiter to a file?
  • vpaul
    79 Posts

    Re: A lot of Long waiters

    ‏2012-08-16T19:27:08Z  
    Hello,

    Though it may be possible to correlate waiters to a particular file, the process is complex and cannot be done via this forum. Please contact IBM service so that all necessary data can be collected for troubleshooting this issue.

    Thanks.