Topic
  • 10 replies
  • Latest Post - 2012-12-14T14:46:58Z by dlmcnabb

chr78
132 Posts

Pinned topic assert in 3.4.0.17

2012-12-11T15:13:47Z
Is this a known assert? Thanks.

Tue Dec 11 16:10:16.245 2012: logAssertFailed: actualSubblocks > 0
Tue Dec 11 16:10:16.246 2012: return code 0, reason code 0, log record tag 0
Tue Dec 11 16:10:17.606 2012: *** Assert exp(actualSubblocks > 0) in line 423 of file /project/sprelhin/build/rhins017a/src/avs/fs/mmfs/ts/classes/basic/suballoc.C
Tue Dec 11 16:10:17.607 2012: *** Traceback:
Tue Dec 11 16:10:17.611 2012: 2:0x849883 logAssertFailed.8496D0 + 0x1B3
Tue Dec 11 16:10:17.612 2012: 3:0xF1ED58 BaseSuballocator::subAlloc(int*, unsigned int, SuballocData*, int*, int*).F1E7F0 + 0x568
Tue Dec 11 16:10:17.611 2012: 4:0x5AF5D8 AllocSegment::alloc(int, int*, unsigned int, int*, long long*).5AF230 + 0x3A8
Tue Dec 11 16:10:17.612 2012: 5:0x6363C0 AllocRegion::alloc(int, int*, unsigned int, DiskAddr96*, long long*, unsigned int*).636210 + 0x1B0
Tue Dec 11 16:10:17.611 2012: 6:0x5A1BC7 SGAllocMap::allocLocal(unsigned int, int, int*, unsigned int, unsigned int, unsigned int, DiskAddr96*, long long*).5A0F10 + 0xCB7
Tue Dec 11 16:10:17.612 2012: 7:0x5A3080 SGAllocMap::allocLocalReplica(SGDescCache*, unsigned int, int, fsDiskAddr const&, int, int*, unsigned int, unsigned int, DiskAddr96*, long long*, AllocCursor**).5A2A30 + 0x650
Tue Dec 11 16:10:17.611 2012: 8:0x5A5BA5 SGAllocMap::allocRemote(SGDescCache*, AMTime const&, unsigned int, int, AllocMask, int, int, unsigned int, unsigned int, fsDiskAddr const&, DiskAddr96*, int, AllocBatchInfo*).5A5870 + 0x335
Tue Dec 11 16:10:17.612 2012: 9:0x7B0484 SGAllocSet::allocRemote(int, AMTime const&, unsigned int, int, AllocMask, int, int, unsigned int, unsigned int, fsDiskAddr const&, DiskAddr96*, int, AllocBatchInfo*).7B0190 + 0x2F4
Tue Dec 11 16:10:17.611 2012: 10:0x7B4802 AllocClientHandleMsg(RpcContext*, char*).7B4000 + 0x802
Tue Dec 11 16:10:17.612 2012: 11:0x85A22E ServiceRegister::handleMsg(RpcContext*, char*, Errno&).85A140 + 0xEE
Tue Dec 11 16:10:17.611 2012: 12:0x8566DF tscHandleMsg(RpcContext*, MsgDataBuf*).856280 + 0x45F
Tue Dec 11 16:10:17.612 2012: 13:0x86540A RcvWorker::RcvMain().865270 + 0x19A
Tue Dec 11 16:10:17.613 2012: 14:0x865480 RcvWorker::thread(void*).865440 + 0x40
Tue Dec 11 16:10:17.612 2012: 15:0x54D653 Thread::callBody(Thread*).54D540 + 0x113
Tue Dec 11 16:10:17.613 2012: 16:0x54502D Thread::callBodyWrapper(Thread*).544FA0 + 0x8D
Tue Dec 11 16:10:17.612 2012: 17:0x2B732082F7B6 start_thread + 0xE6
Tue Dec 11 16:10:17.613 2012: 18:0x2B732127CC6D clone + 0x6D
mmfsd: /project/sprelhin/build/rhins017a/src/avs/fs/mmfs/ts/classes/basic/suballoc.C:423: void logAssertFailed(unsigned int, const char*, unsigned int, int, int, unsigned int, const char*, const char*): Assertion `actualSubblocks > 0' failed.
Tue Dec 11 16:10:17.612 2012: Signal 6 at location 0x2B73211D7B55 in process 5528, link reg 0xFFFFFFFFFFFFFFFF.
Tue Dec 11 16:10:17.613 2012: rax 0x0000000000000000 rbx 0x00002B73212DC5E0
Tue Dec 11 16:10:17.612 2012: rcx 0xFFFFFFFFFFFFFFFF rdx 0x0000000000000006
Tue Dec 11 16:10:17.613 2012: rsp 0x00002B732B3F10E8 rbp 0x0000000000F8C9C0
Tue Dec 11 16:10:17.612 2012: rsi 0x0000000000001A29 rdi 0x0000000000001598
Tue Dec 11 16:10:17.613 2012: r8 0x00000000FFFFFFFF r9 0x00002B7321514E40
Tue Dec 11 16:10:17.612 2012: r10 0x0000000000000008 r11 0x0000000000000202
Tue Dec 11 16:10:17.613 2012: r12 0x00007FFFE93B6E6D r13 0x00002B73212DC5E0
Tue Dec 11 16:10:17.612 2012: r14 0x00000000010153C4 r15 0x00000000000001A7
Tue Dec 11 16:10:17.613 2012: rip 0x00002B73211D7B55 eflags 0x0000000000000202
Tue Dec 11 16:10:17.612 2012: csgsfs 0x0000000000000033 err 0x0000000000000000
Tue Dec 11 16:10:17.613 2012: trapno 0x0000000000000000 oldmsk 0x0000000010017807
Tue Dec 11 16:10:17.612 2012: cr2 0x0000000000000000
Tue Dec 11 16:10:18.838 2012: Traceback:
Tue Dec 11 16:10:18.850 2012: 0:00002B73211D7B55 raise.2B73211D7B20 + 35
Tue Dec 11 16:10:18.851 2012: 1:00002B73211D9131 abort + 181
Tue Dec 11 16:10:18.850 2012: 2:00002B73211D0A10 __assert_fail.2B73211D0920 + F0
Tue Dec 11 16:10:18.851 2012: 3:0000000000849867 logAssertFailed.8496D0 + 197
Tue Dec 11 16:10:18.850 2012: 4:0000000000F1ED58 BaseSuballocator::subAlloc(int*, unsigned int, SuballocData*, int*, int*).F1E7F0 + 568
Tue Dec 11 16:10:18.851 2012: 5:00000000005AF5D8 AllocSegment::alloc(int, int*, unsigned int, int*, long long*).5AF230 + 3A8
Tue Dec 11 16:10:18.850 2012: 6:00000000006363C0 AllocRegion::alloc(int, int*, unsigned int, DiskAddr96*, long long*, unsigned int*).636210 + 1B0
Tue Dec 11 16:10:18.851 2012: 7:00000000005A1BC7 SGAllocMap::allocLocal(unsigned int, int, int*, unsigned int, unsigned int, unsigned int, DiskAddr96*, long long*).5A0F10 + CB7
Tue Dec 11 16:10:18.850 2012: 8:00000000005A3080 SGAllocMap::allocLocalReplica(SGDescCache*, unsigned int, int, fsDiskAddr const&, int, int*, unsigned int, unsigned int, DiskAddr96*, long long*, AllocCursor**).5A2A30 + 650
Tue Dec 11 16:10:18.851 2012: 9:00000000005A5BA5 SGAllocMap::allocRemote(SGDescCache*, AMTime const&, unsigned int, int, AllocMask, int, int, unsigned int, unsigned int, fsDiskAddr const&, DiskAddr96*, int, AllocBatchInfo*).5A5870 + 335
Tue Dec 11 16:10:18.850 2012: 10:00000000007B0484 SGAllocSet::allocRemote(int, AMTime const&, unsigned int, int, AllocMask, int, int, unsigned int, unsigned int, fsDiskAddr const&, DiskAddr96*, int, AllocBatchInfo*).7B0190 + 2F4
Tue Dec 11 16:10:18.851 2012: 11:00000000007B4802 AllocClientHandleMsg(RpcContext*, char*).7B4000 + 802
Tue Dec 11 16:10:18.852 2012: 12:000000000085A22E ServiceRegister::handleMsg(RpcContext*, char*, Errno&).85A140 + EE
Tue Dec 11 16:10:18.851 2012: 13:00000000008566DF tscHandleMsg(RpcContext*, MsgDataBuf*).856280 + 45F
Tue Dec 11 16:10:18.852 2012: 14:000000000086540A RcvWorker::RcvMain().865270 + 19A
Tue Dec 11 16:10:18.851 2012: 15:0000000000865480 RcvWorker::thread(void*).865440 + 40
Tue Dec 11 16:10:18.852 2012: 16:000000000054D653 Thread::callBody(Thread*).54D540 + 113
Tue Dec 11 16:10:18.851 2012: 17:000000000054502D Thread::callBodyWrapper(Thread*).544FA0 + 8D
Tue Dec 11 16:10:18.852 2012: 18:00002B732082F7B6 start_thread + E6
Tue Dec 11 16:10:18.897 2012: Directory /tmp/mmfs does not exist. Not creating internaldump
Tue Dec 11 16:10:18.940 2012: Signal 6 at location 0x2B732124AE0D in process 5528, link reg 0xFFFFFFFFFFFFFFFF.
Tue Dec 11 16:10:18.941 2012: mmfsd is shutting down.
Tue Dec 11 16:10:18.942 2012: Reason for shutdown: Signal handler entered
Tue Dec 11 16:10:19 CET 2012: mmcommon mmfsdown invoked. Subsystem: mmfs Status: active
Updated on 2012-12-14T14:46:58Z by dlmcnabb
  • chr78
    132 Posts

    Re: assert in 3.4.0.17

    2012-12-11T17:12:58Z
    Additional info: I had seen plenty of these this afternoon. Could this be connected to
    a remote filesystem that got quite (really) full?
  • sberman
    61 Posts

    Re: assert in 3.4.0.17

    2012-12-13T14:30:32Z
    • chr78
    • 2012-12-11T17:12:58Z
    Additional info: I had seen plenty of these this afternoon. Could this be connected to
    a remote filesystem that got quite (really) full?
    I posed your question to the developers who most recently worked on this assert, and here is their answer (edited for readability):

    I think this is the same problem fixed in APAR IV22131, PMR 62437.999.866, which is included in GPFS 3.4.0.15+. If I remember correctly, the problem is with the msg handler on the home cluster that owns the filesystem. One needs to make sure all nodes on the home cluster are running with this fix. The releases where it is included are GPFS 3.4.0.15 and all of GPFS 3.5.
  • chr78
    132 Posts

    Re: assert in 3.4.0.17

    2012-12-13T15:35:03Z
    • sberman
    • 2012-12-13T14:30:32Z
    I posed your question to the developers who most recently worked on this assert, and here is their answer (edited for readability):

    I think this is the same problem fixed in APAR IV22131, PMR 62437.999.866, which is included in GPFS 3.4.0.15+. If I remember correctly, the problem is with the msg handler on the home cluster that owns the filesystem. One needs to make sure all nodes on the home cluster are running with this fix. The releases where it is included are GPFS 3.4.0.15 and all of GPFS 3.5.
    Thanks, Steven! There are indeed two nodes left in the home cluster running 3.4.0.14 efix4.

    Cheers.
  • VincenzoVagnoni
    112 Posts

    Re: assert in 3.4.0.17

    2012-12-13T21:56:22Z
    I am getting this assert as well. Schematically, the setup is the following. We have three clusters, say A, B and C. Clusters B and C are mounting remotely a filesystem owned by cluster A. Cluster B has GPFS 3.4.0-3, whereas cluster C has 3.4.0-17. Do I interpret correctly that the assert should disappear if we upgrade all the nodes in cluster B to 3.4.0-17 as well? Thanks.
  • sxiao
    36 Posts

    Re: assert in 3.4.0.17

    2012-12-13T22:14:58Z
    I am getting this assert as well. Schematically, the setup is the following. We have three clusters, say A, B and C. Clusters B and C are mounting remotely a filesystem owned by cluster A. Cluster B has GPFS 3.4.0-3, whereas cluster C has 3.4.0-17. Do I interpret correctly that the assert should disappear if we upgrade all the nodes in cluster B to 3.4.0-17 as well? Thanks.
    You need to upgrade all the nodes in cluster A to 3.4.0-15 or later.
  • VincenzoVagnoni
    112 Posts

    Re: assert in 3.4.0.17

    2012-12-13T22:25:20Z
    Unfortunately this is not easy, as we have not only cluster A, but A1, A2, A3, A4, ... so upgrading all the clusters is a major undertaking. By the way, looking at the change log of 3.4.0-15 I read:

    Fix allocation message handler to prevent a GPFS daemon assert. The assert could happen when a filesystem is being used by more than one remote cluster.

    Does this mean that the assert disappears if we have just one remote cluster mounting the filesystem?
  • sxiao
    36 Posts

    Re: assert in 3.4.0.17

    2012-12-13T23:19:59Z
    Unfortunately this is not easy, as we have not only cluster A, but A1, A2, A3, A4, ... so upgrading all the clusters is a major undertaking. By the way, looking at the change log of 3.4.0-15 I read:

    Fix allocation message handler to prevent a GPFS daemon assert. The assert could happen when a filesystem is being used by more than one remote cluster.

    Does this mean that the assert disappears if we have just one remote cluster mounting the filesystem?
    Yes, the assert will disappear with just one remote cluster mounting the filesystem.
  • dlmcnabb
    1012 Posts

    Re: assert in 3.4.0.17

    2012-12-14T00:44:52Z
    • sxiao
    • ‏2012-12-13T22:14:58Z
    You need to upgrade all the nodes in cluster A to 3.4.0-15 or later.
    You only need to upgrade the nodes designated as "manager" nodes in a cluster that owns filesystems. These are the only nodes that handle the block allocation RPC relay between nodes in two remote clusters.

    Below 3.4.0.15 the relay was not passing on the number of subblocks to allocate, so the node that finally received the allocation request would get a garbage value for nSubblocks and could allocate a different amount of space than the requesting node wanted. This can lead to filesystem corruption, so it would be wise to install this fix as soon as possible.
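    The failure mode described above can be sketched as a marshalling bug: the relay forwards a request struct but never copies the subblock count, so the final handler works from a junk value and ends up allocating zero subblocks, tripping the `actualSubblocks > 0` assert. All names below are hypothetical stand-ins; the real GPFS RPC structures and the suballocator internals are not public.

    ```cpp
    #include <cassert>
    #include <cstring>
    #include <iostream>

    // Hypothetical allocation request relayed between remote clusters.
    struct AllocRequest {
        unsigned int diskAddr;   // target disk address (illustrative)
        int          nSubblocks; // subblocks the requesting node wants
    };

    // Pre-fix relay (sketch): forwards the request but never copies
    // nSubblocks. We zero the output struct so the "garbage" value is
    // deterministic for this demo instead of real uninitialized memory.
    AllocRequest relayBuggy(const AllocRequest &in) {
        AllocRequest out;
        std::memset(&out, 0, sizeof out);  // stand-in for garbage
        out.diskAddr = in.diskAddr;        // forwarded
        // out.nSubblocks = in.nSubblocks; // <-- the missing field
        return out;
    }

    // Post-fix relay (sketch): forwards every field of the request.
    AllocRequest relayFixed(const AllocRequest &in) {
        return in;
    }

    // Stand-in for the suballocator: given a zero or garbage request it
    // allocates no subblocks, which is what "actualSubblocks > 0" catches.
    int handleAlloc(const AllocRequest &req) {
        return req.nSubblocks > 0 ? req.nSubblocks : 0;
    }

    int main() {
        AllocRequest req{0x1234, 8};
        std::cout << handleAlloc(relayFixed(req)) << "\n"; // prints 8
        std::cout << handleAlloc(relayBuggy(req)) << "\n"; // prints 0
    }
    ```

    The dangerous case is worse than the assert: if the garbage value happens to be positive but wrong, the allocation silently differs from what the requester asked for, which is the corruption risk mentioned above.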
  • VincenzoVagnoni
    112 Posts

    Re: assert in 3.4.0.17

    2012-12-14T07:07:59Z
    This looks very scary. We have some 10 clusters, hundreds of manager nodes, 20 filesystems, and tens of PB of storage. Furthermore, do you mean that once we upgrade we should run an offline fsck to be safe? That would be a disaster. I'm wondering why it is not stamped in capital letters somewhere that this bug can lead to filesystem corruption (maybe I missed it?). Please tell me something more reassuring...
  • dlmcnabb
    1012 Posts

    Re: assert in 3.4.0.17

    2012-12-14T14:46:58Z
    This looks very scary. We have some 10 clusters, hundreds of manager nodes, 20 filesystems, and tens of PB of storage. Furthermore, do you mean that once we upgrade we should run an offline fsck to be safe? That would be a disaster. I'm wondering why it is not stamped in capital letters somewhere that this bug can lead to filesystem corruption (maybe I missed it?). Please tell me something more reassuring...
    The problem is very rare. It happens when a filesystem gets nearly full and nodes have to ask other nodes to do allocations on their behalf. You should have already noticed a slowdown in performance when this occurs.