IBM Support

IV94666: LOGASSERTFAILED: SGDESCP->LASTEXTINODENUM == SGDESCP->NINODES

Subscribe to this APAR

By subscribing, you receive periodic emails alerting you to the status of the APAR, along with a link to the fix after it becomes available. You can track this item individually or track all items by product.

Notify me when this APAR changes.

Notify me when an APAR for this component changes.

 

APAR status

  • Closed as program error.

Error description

  • This assert frequently happened on quorum manager nodes
    right after GPFS expanding inode space.
    The cluster becomes unstable with mmfsd asserted several
    times per hours on all the quorum nodes.
    
    Mon Mar 27 10:29:05.901 2017: [N] Expanding <fs> inode
    space 0 current 48660480 inodes (4189282 free) by 4423680
    Mon Mar 27 10:29:05.910 2017: [X] logAssertFailed:
    sgDescP->lastExtInodeNum == sgDescP->nInodes
    Mon Mar 27 10:29:05.911 2017: [X] return code 0, reason
    code 0, log record tag 0
    Mon Mar 27 10:29:06.417 2017: [I] Accepted and connected
    to <c0n13>
    Mon Mar 27 10:29:08.297 2017: [I] Accepted and connected
    to <c0n5>
    Mon Mar 27 10:29:08.592 2017: [X] *** Assert
    exp(sgDescP->lastExtInodeNum == sgDescP->nInodes) in line
    313 of file
    /project/sprelttn2/build/rttn2s002a/src/avs/fs/mmfs/ts/fs
    /inoexp.C
    Mon Mar 27 10:29:08.593 2017: [E] *** Traceback:
    Mon Mar 27 10:29:08.594 2017: [E]
    2:0x7F3403C7D005 logAssertFailed + 0x2D5 at ??:0
    Mon Mar 27 10:29:08.595 2017: [E]
    3:0x7F3403AC8F50 SGInodeSpaceMap::expandSGnInodes(long
    long, int, unsigned int, unsigned int) + 0x580 at ??:0
    Mon Mar 27 10:29:08.596 2017: [E]
    4:0x7F3403AC9E36 SGInodeSpaceMap::Expand(int, long long,
    long long) + 0xBD6 at ??:0
    Mon Mar 27 10:29:08.597 2017: [E]
    5:0x7F3403ACA1B6 SGInodeMap::Expand(int,
    SGInodeSpaceMap*, int, long long, long long) + 0x2A6 at
    ??:0
    Mon Mar 27 10:29:08.598 2017: [E]
    6:0x7F340386B05A
    SGInodeSpaceMap::ExpandInodeFileThreadBody(void*) + 0x3A
    at ??:0
    Mon Mar 27 10:29:08.599 2017: [E]
    7:0x7F34037B5486 Thread::callBody(Thread*) + 0x46 at ??:0
    Mon Mar 27 10:29:08.600 2017: [E]
    8:0x7F34037A3162 Thread::callBodyWrapper(Thread*) + 0xA2
    at ??:0
    Mon Mar 27 10:29:08.601 2017: [E]
    9:0x7F3402E82AA1 start_thread + 0xD1 at ??:0
    Mon Mar 27 10:29:08.602 2017: [E]
    10:0x7F3402029AAD clone + 0x6D at ??:0
    mmfsd:
    /project/sprelttn2/build/rttn2s002a/src/avs/fs/mmfs/ts/fs
    /inoexp.C:313: void logAssertFailed(UInt32, const char*,
    UInt32, Int32, Int32, UInt32, const char*, const char*):
    Assertion 'sgDescP->lastExtI
    nodeNum == sgDescP->nInodes' failed.
    Mon Mar 27 10:29:08.603 2017: [E] Signal 6 at location
    0x7F3401F735E5 in process 5260, link reg
    0xFFFFFFFFFFFFFFFF.
    Mon Mar 27 10:29:08.604 2017: [I] rax
    0x0000000000000000  rbx    0x00007F34032FA000
    Mon Mar 27 10:29:08.605 2017: [I] rcx
    0xFFFFFFFFFFFFFFFF  rdx    0x0000000000000006
    Mon Mar 27 10:29:08.606 2017: [I] rsp
    0x00007F30DBCC0808  rbp    0x00007F34048B7220
    Mon Mar 27 10:29:08.607 2017: [I] rsi
    0x0000000000003362  rdi    0x000000000000148C
    Mon Mar 27 10:29:08.608 2017: [I] r8
    0xFEFEFEFEFEFEFEFF  r9     0xFEFEFEFEFF092D63
    Mon Mar 27 10:29:08.609 2017: [I] r10
    0x0000000000000008  r11    0x0000000000000202
    Mon Mar 27 10:29:08.610 2017: [I] r12
    0x00007F34048B7348  r13    0x00007F34049010E0
    Mon Mar 27 10:29:08.611 2017: [I] r14
    0x0000000000000000  r15    0x0000000000000000
    Mon Mar 27 10:29:08.612 2017: [I] rip
    0x00007F3401F735E5  eflags 0x0000000000000202
    Mon Mar 27 10:29:08.613 2017: [I] csgsfs
    0x0000000000000033  err    0x0000000000000000
    Mon Mar 27 10:29:08.614 2017: [I] trapno
    0x0000000000000000  oldmsk 0x0000000010017807
    Mon Mar 27 10:29:08.615 2017: [I] cr2
    0x0000000000000000
    Mon Mar 27 10:29:11.479 2017: [D] Traceback:
    Mon Mar 27 10:29:11.480 2017: [D] 0:00007F3401F735E5
    raise + 35 at ??:0
    Mon Mar 27 10:29:11.481 2017: [D] 1:00007F3401F74DC5
    abort + 175 at ??:0
    Mon Mar 27 10:29:11.482 2017: [D] 2:00007F3401F6C70E
    __assert_fail_base + 11E at ??:0
    Mon Mar 27 10:29:11.483 2017: [D] 3:00007F3401F6C7D0
    __assert_fail + 50 at ??:0
    Mon Mar 27 10:29:11.484 2017: [D] 4:00007F3403C7D029
    logAssertFailed + 2F9 at ??:0
    Mon Mar 27 10:29:11.485 2017: [D] 5:00007F3403AC8F50
    SGInodeSpaceMap::expandSGnInodes(long long, int, unsigned
    int, unsigned int) + 580 at ??:0
    Mon Mar 27 10:29:11.486 2017: [D] 6:00007F3403AC9E36
    SGInodeSpaceMap::Expand(int, long long, long long) + BD6
    at ??:0
    Mon Mar 27 10:29:11.487 2017: [D] 7:00007F3403ACA1B6
    SGInodeMap::Expand(int, SGInodeSpaceMap*, int, long long,
    long long) + 2A6 at ??:0
    Mon Mar 27 10:29:11.488 2017: [D] 8:00007F340386B05A
    SGInodeSpaceMap::ExpandInodeFileThreadBody(void*) + 3A at
    ??:0
    Mon Mar 27 10:29:11.489 2017: [D] 9:00007F34037B5486
    Thread::callBody(Thread*) + 46 at ??:0
    Mon Mar 27 10:29:11.490 2017: [D] 10:00007F34037A3162
    Thread::callBodyWrapper(Thread*) + A2 at ??:0
    Mon Mar 27 10:29:11.491 2017: [D] 11:00007F3402E82AA1
    start_thread + D1 at ??:0
    Mon Mar 27 10:29:11.492 2017: [D] 12:00007F3402029AAD
    clone + 6D at ??:0
    Mon Mar 27 10:29:11.493 2017: [N] Restarting mmsdrserv
    Mon Mar 27 10:29:14.920 2017: [E] Signal 6 at location
    0x7F3401FEDBDD in process 5260, link reg
    0xFFFFFFFFFFFFFFFF.
    

Local fix

  • The developer Ted Anderson has been able to reproduce the
    assert in lab
    and provide below work aroudn to customer.
    Below statements are from him:
    The key observation is that the inode values are
    different in dumps
    obtained following the assert (where inode expansion had
    begun) and
    those in dumps obtained from other nodes (showing the
    on-disk values).
    Specifically, the nInodes ("total inodes in inode file
    (with no holes)"
    57507840) value computed just before the assert was too
    high, compared
    to lastExtInodeNum ("total no of allocated inodes"
    53084160), by exactly
    the intended amount of increase (4423680).  This strongly
    indicated that
    the on-disk value had been updated "unexpectedly" in the
    first failed
    attempt and the subsequent attempts were getting confused
    when they
    tried to increase nInodes a second time.
    
    With this in mind, I added a failure-trigger to the code
    after the SG
    descriptor update (which got stuck at 9:33 on the 26th)
    and before the
    fileset update (whose value you can see in the
    mmlsfileset -L output).
    When I attempted to preallocate inodes using mmchfs
    --inode-limit X:X,
    the trigger induced a failure at that point leaving the
    fileset and SG
    descriptor values inconsistent.  Subsequent attempts to
    add inodes,
    cause the same assert the customer is seeing.
    
    The fix for the code, is to ensure we detect and handle
    this (apparently
    unusual) intermediate state and update the SG descriptor
    correctly no
    matter what happened to cause the first attempt to fail.
    
    Meanwhile, it would seem there are two approaches to
    correct the
    customer's file system.  The first is to patch the SG
    descriptor nInodes
    value back to its old value, or second, to patch the
    fileset AllocInodes
    to match the value in the SG descriptor.  In my test
    system, I took the
    first approach, which I think is a little safer.  This
    was immediately
    successful and allowed the inode expansion to complete
    without error.
    
    Specifically, I would run
    
    # mmlsfileset ASDC_archive4 -L
    # tsdbfs ASDC_archive4 desc | grep "total [in]"
    
    From the first command, I would expect AllocInodes to be
    the same
    48660480 "current" value we see in the log messages.
    From the tsdbfs
    desc output I'd look at "total no of allocated inodes",
    which should be
    48660480 + 4423680 = 53084160, and "total inodes in inode
    file (with no
    holes)", which should also be 53084160.  Then run
    
    # tsdbfs ASDC_archive4 patch desc ninodes <value from
    mmlsfileset -L / 48660480>
    
    There is a risk that before you can run these commands
    the system will
    attempt to run inode expansion automatically and crash
    again.  You
    should be able to avoid this by unmounting the SG
    everywhere first
    (mmunmount ASDC_archive4 -a).  After the patch, remount
    ASDC_archive4 if
    necessary, then try preallocating inodes to verify this
    patch was
    successful.
    
    # mmchfs ASDC_archive4 --inode-limit 134217728:50000000
    

Problem summary

  • If inode expansion is interrupted, it may
    leave nAllocatedInodes inconsistent between
    sg descriptor and fileset metadata file.
    

Problem conclusion

  • Fix is to make sgDesc value agree with the
    sum of nAllocatedInodes from all valid
    independent filesets during recovery. This
    is now done as part of SFSCompleteFilesetOps.
    

Temporary fix

Comments

APAR Information

  • APAR number

    IV94666

  • Reported component name

    SPECTRUM SCALE

  • Reported component ID

    5725Q01LX

  • Reported release

    422

  • Status

    CLOSED PER

  • PE

    NoPE

  • HIPER

    NoHIPER

  • Special Attention

    NoSpecatt / Xsystem

  • Submitted date

    2017-03-29

  • Closed date

    2017-11-29

  • Last modified date

    2017-11-29

  • APAR is sysrouted FROM one or more of the following:

  • APAR is sysrouted TO one or more of the following:

    IJ03086

Fix information

  • Fixed component name

    SPECTRUM SCALE

  • Fixed component ID

    5725Q01LX

Applicable component levels

[{"Business Unit":{"code":"BU058","label":"IBM Infrastructure w\/TPS"},"Product":{"code":"STXKQY","label":"IBM Spectrum Scale"},"Component":"","ARM Category":[],"Platform":[{"code":"PF025","label":"Platform Independent"}],"Version":"422","Edition":"","Line of Business":{"code":"LOB26","label":"Storage"}},{"Business Unit":{"code":"BU058","label":"IBM Infrastructure w\/TPS"},"Product":{"code":"SSFKCN","label":"General Parallel File System"},"Component":"","ARM Category":[],"Platform":[{"code":"PF025","label":"Platform Independent"}],"Version":"422","Edition":"","Line of Business":{"code":"","label":""}}]

Document Information

Modified date:
29 November 2017