IBM Support

IJ28610: SIGNAL 11 RGMASTER::GETNODEFULLDOMAINNAME


APAR status

  • Closed as program error.

Error description

  • [E] Signal 11 at location 0x559601506BCE in process 17113, link reg 0xFFFFFFFFFFFFFFFF.
    [I] rax    0x000000000000046B  rbx    0x0000000000000000
    [I] rcx    0x0000018032912EB8  rdx    0x0000000000000000
    [I] rsp    0x00007F64249C0010  rbp    0x000000000001046B
    [I] rsi    0x00007F64249C0808  rdi    0x0000018032912EB8
    [I] r8     0x00007F64211ADBE0  r9     0x0000000000000000
    [I] r10    0x0000000000000000  r11    0x0000000000000000
    [I] r12    0x0000018032912E10  r13    0x0000000000000000
    [I] r14    0x0000559602495A60  r15    0x00007F64249C0808
    [I] rip    0x0000559601506BCE  eflags 0x0000000000010246
    [I] csgsfs 0x002B000000000033  err    0x0000000000000004
    [I] trapno 0x000000000000000E  oldmsk 0x0000000010017807
    [I] cr2    0x0000000000000468
    [W] ------------------[GPFS Critical Thread Watchdog]------------------
    [W] PID: 19652 State: S (HealthCheckThread) is stuck for more than 13 seconds
    [W]  counter: 0 (mark-idle: 0 mark-active: 0 pre-work: 0 post-work: 0) sched: (nvcsw: 0 nivcsw: 0)
    [W]  waiting on ThMutex 0x18032912EB8 (0xFFFFBFEF32912EB8) (ClusterConfigurationClientMutex)
    [W]  waiting for 20.015288562 seconds
    [D] Traceback:
    [D] #0: 0x0000559601506BCE RGMaster::getNodeFullDomainName(NodeAddr, char**) + 0xAE at ??:0
    [D] #1: 0x000055960150CAA2 RGMaster::rgListServers(int, unsigned int) + 0x212 at ??:0
    [D] #2: 0x000055960145F21C runTSLsRecoveryGroupV2(int, StripeGroup*, int, char**) + 0xA8C at ??:0
    [D] #3: 0x0000559601460371 runTSLsRecoveryGroup(int, StripeGroup*, int, char**) + 0xB1 at ??:0
    [D] #4: 0x000055960106AF31 RunClientCmd(MessageHeader*, IpAddr, unsigned short, int, int, StripeGroup*, unsigned int*, RpcContext*) + 0x1851 at ??:0
    [D] #5: 0x00005596013DA13D RGCManager::rgcmClientRunCmd(LocalCmdMessage*, IpAddr, unsigned short, int, int) + 0xBFD at ??:0
    [D] #6: 0x000055960106CBBF HandleCmdMsg(void*) + 0x13BF at ??:0
    [D] #7: 0x0000559600B38313 Thread::callBody(Thread*) + 0x63 at ??:0
    [D] #8: 0x0000559600B25262 Thread::callBodyWrapper(Thread*) + 0xA2 at ??:0
    [D] #9: 0x00007F6428AD52DE start_thread + 0xFE at ??:0
    [D] #10: 0x00007F6427A80133 __GI___clone + 0x43 at ??:0
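
    The register dump above can be read with the standard x86 page-fault
    error-code layout. The following is a minimal Python sketch, not an
    IBM tool: the function name decode_page_fault and the 4 KiB
    "near null" threshold are assumptions made for illustration. With
    the values from this dump (trapno 0xE, err 0x4, cr2 0x468) it
    reports a user-mode read of a non-present page at a very low
    address, which is consistent with reading a field at a small offset
    from a null pointer.

```python
# Decode the page-fault fields of an x86-64 register dump.
# Bit meanings follow the x86 page-fault error code:
#   bit 0: 0 = page not present, 1 = protection violation
#   bit 1: 0 = read access,      1 = write access
#   bit 2: 0 = kernel mode,      1 = user mode
#   bit 4: 1 = instruction fetch

PAGE_FAULT_VECTOR = 0x0E  # trapno for a page fault (#PF)
NEAR_NULL_LIMIT = 0x1000  # assumption: treat the first 4 KiB as "near null"


def decode_page_fault(trapno: int, err: int, cr2: int) -> dict:
    """Return a readable summary of a page fault (hypothetical helper)."""
    if trapno != PAGE_FAULT_VECTOR:
        raise ValueError(f"trapno {trapno:#x} is not a page fault")
    return {
        "page_present": bool(err & 0x1),
        "access": "write" if err & 0x2 else "read",
        "mode": "user" if err & 0x4 else "kernel",
        "instruction_fetch": bool(err & 0x10),
        "fault_address": cr2,
        "near_null": cr2 < NEAR_NULL_LIMIT,
    }


# Values copied from the trapno, err and cr2 lines of the dump.
info = decode_page_fault(trapno=0x0E, err=0x4, cr2=0x468)
print(info)
```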
    

Local fix

Problem summary

  • While a node is trying to join a cluster,
    mmfsd startup can dereference a null
    pointer and crash with signal 11.
    The traceback looks like this:
    [D] #0: 0x0000559601506BCE RGMaster::getNodeFullDomainName(NodeAddr, char**) + 0xAE at ??:0
    [D] #1: 0x000055960150CAA2 RGMaster::rgListServers(int, unsigned int) + 0x212 at ??:0
    [D] #2: 0x000055960145F21C runTSLsRecoveryGroupV2(int, StripeGroup*, int, char**) + 0xA8C at ??:0
    [D] #3: 0x0000559601460371 runTSLsRecoveryGroup(int, StripeGroup*, int, char**) + 0xB1 at ??:0
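
    To check whether a crash in your own daemon log has the same top
    frames, the comparison can be sketched as below. This is an
    illustrative Python sketch under stated assumptions, not an IBM
    utility: the names APAR_SIGNATURE, frame_symbols and matches_apar
    are hypothetical, each frame is assumed to sit on one unwrapped log
    line, and the text is assumed to contain a single traceback.

```python
import re

# Top four frames of this APAR's traceback, innermost first.
APAR_SIGNATURE = [
    "RGMaster::getNodeFullDomainName",
    "RGMaster::rgListServers",
    "runTSLsRecoveryGroupV2",
    "runTSLsRecoveryGroup",
]

# Matches "[D] #<depth>: 0x<address> <symbol and args> + 0x<offset>".
FRAME_RE = re.compile(
    r"\[D\]\s+#(\d+):\s+0x[0-9A-Fa-f]+\s+(.+?)\s+\+\s+0x[0-9A-Fa-f]+"
)


def frame_symbols(log_text: str) -> list:
    """Return the function names of the traceback frames, ordered by depth."""
    frames = {}
    for line in log_text.splitlines():
        match = FRAME_RE.search(line)
        if match:
            depth = int(match.group(1))
            # Drop the argument list, keeping only the qualified name.
            frames[depth] = match.group(2).split("(")[0].strip()
    return [frames[depth] for depth in sorted(frames)]


def matches_apar(log_text: str) -> bool:
    """True if the innermost frames equal the signature of this APAR."""
    return frame_symbols(log_text)[: len(APAR_SIGNATURE)] == APAR_SIGNATURE


sample = """\
[D] #0: 0x0000559601506BCE RGMaster::getNodeFullDomainName(NodeAddr, char**) + 0xAE at ??:0
[D] #1: 0x000055960150CAA2 RGMaster::rgListServers(int, unsigned int) + 0x212 at ??:0
[D] #2: 0x000055960145F21C runTSLsRecoveryGroupV2(int, StripeGroup*, int, char**) + 0xA8C at ??:0
[D] #3: 0x0000559601460371 runTSLsRecoveryGroup(int, StripeGroup*, int, char**) + 0xB1 at ??:0
"""
print(matches_apar(sample))
```

    A match on these frames only indicates the same crash location; the
    fix level still has to be confirmed against the APAR information
    below.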
    

Problem conclusion

  • Benefits of the solution:
    The fix prevents the daemon from crashing in this situation.
    Workaround:
    N/A
    Problem trigger:
    A node trying to join a cluster.
    Symptom:
    Daemon crashes with signal 11.
    Platforms affected:
    N/A
    Functional Area affected:
    GNR
    Customer Impact:
    Suggested
    Changed Externals:
    No
    

Temporary fix

Comments

APAR Information

  • APAR number

    IJ28610

  • Reported component name

    SPEC SCALE STD

  • Reported component ID

    5737F33AP

  • Reported release

    505

  • Status

    CLOSED PER

  • PE

    NoPE

  • HIPER

    NoHIPER

  • Special Attention

    NoSpecatt / Xsystem

  • Submitted date

    2020-10-07

  • Closed date

    2020-10-19

  • Last modified date

    2020-10-19

  • APAR is sysrouted FROM one or more of the following:

  • APAR is sysrouted TO one or more of the following:

    IJ29155

Fix information

  • Fixed component name

    SPEC SCALE STD

  • Fixed component ID

    5737F33AP

Applicable component levels

  • Business Unit: IBM Infrastructure w/TPS (BU058)
  • Product: IBM Spectrum Scale (STXKQY)
  • Platform: Platform Independent (PF025)
  • Version: 505
  • Line of Business: Storage (LOB26)

Document Information

Modified date:
07 November 2020