APAR status
Closed as program error.
Error description
[E] Signal 11 at location 0x559601506BCE in process 17113, link reg 0xFFFFFFFFFFFFFFFF.
[I] rax 0x000000000000046B rbx 0x0000000000000000
[I] rcx 0x0000018032912EB8 rdx 0x0000000000000000
[I] rsp 0x00007F64249C0010 rbp 0x000000000001046B
[I] rsi 0x00007F64249C0808 rdi 0x0000018032912EB8
[I] r8 0x00007F64211ADBE0 r9 0x0000000000000000
[I] r10 0x0000000000000000 r11 0x0000000000000000
[I] r12 0x0000018032912E10 r13 0x0000000000000000
[I] r14 0x0000559602495A60 r15 0x00007F64249C0808
[I] rip 0x0000559601506BCE eflags 0x0000000000010246
[I] csgsfs 0x002B000000000033 err 0x0000000000000004
[I] trapno 0x000000000000000E oldmsk 0x0000000010017807
[I] cr2 0x0000000000000468
[W] ------------------[GPFS Critical Thread Watchdog]------------------
[W] PID: 19652 State: S (HealthCheckThread) is stuck for more than 13 seconds
[W] counter: 0 (mark-idle: 0 mark-active: 0 pre-work: 0 post-work: 0) sched: (nvcsw: 0 nivcsw: 0)
[W] waiting on ThMutex 0x18032912EB8 (0xFFFFBFEF32912EB8) (ClusterConfigurationClientMutex)
[W] waiting for 20.015288562 seconds
[D] Traceback:
[D] #0: 0x0000559601506BCE RGMaster::getNodeFullDomainName(NodeAddr, char**) + 0xAE at ??:0
[D] #1: 0x000055960150CAA2 RGMaster::rgListServers(int, unsigned int) + 0x212 at ??:0
[D] #2: 0x000055960145F21C runTSLsRecoveryGroupV2(int, StripeGroup*, int, char**) + 0xA8C at ??:0
[D] #3: 0x0000559601460371 runTSLsRecoveryGroup(int, StripeGroup*, int, char**) + 0xB1 at ??:0
[D] #4: 0x000055960106AF31 RunClientCmd(MessageHeader*, IpAddr, unsigned short, int, int, StripeGroup*, unsigned int*, RpcContext*) + 0x1851 at ??:0
[D] #5: 0x00005596013DA13D RGCManager::rgcmClientRunCmd(LocalCmdMessage*, IpAddr, unsigned short, int, int) + 0xBFD at ??:0
[D] #6: 0x000055960106CBBF HandleCmdMsg(void*) + 0x13BF at ??:0
[D] #7: 0x0000559600B38313 Thread::callBody(Thread*) + 0x63 at ??:0
[D] #8: 0x0000559600B25262 Thread::callBodyWrapper(Thread*) + 0xA2 at ??:0
[D] #9: 0x00007F6428AD52DE start_thread + 0xFE at ??:0
[D] #10: 0x00007F6427A80133 __GI___clone + 0x43 at ??:0
Local fix
Problem summary
While a node is trying to join a cluster, mmfsd startup can dereference a null pointer and crash with signal 11. The backtrace looks like this:
[D] #0: 0x0000559601506BCE RGMaster::getNodeFullDomainName(NodeAddr, char**) + 0xAE at ??:0
[D] #1: 0x000055960150CAA2 RGMaster::rgListServers(int, unsigned int) + 0x212 at ??:0
[D] #2: 0x000055960145F21C runTSLsRecoveryGroupV2(int, StripeGroup*, int, char**) + 0xA8C at ??:0
[D] #3: 0x0000559601460371 runTSLsRecoveryGroup(int, StripeGroup*, int, char**) + 0xB1 at ??:0
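The register dump in the error description is consistent with this summary. As a rough sketch (not part of the APAR, and assuming the standard x86-64 page-fault error-code layout), the `trapno`, `err`, and `cr2` values can be decoded as follows. The function name `decode_page_fault` and the 4 KiB page size are illustrative assumptions; the report does not show the faulting source code.

```python
# Sketch: interpret the x86-64 page-fault fields from the register dump.
# trapno 0xE is the page-fault vector; err is the page-fault error code;
# cr2 holds the address whose access faulted.
# decode_page_fault is a hypothetical helper, not an IBM tool.

def decode_page_fault(err: int, cr2: int, page_size: int = 4096) -> dict:
    return {
        # bit 0: 0 = page not present, 1 = protection violation
        "page_present": bool(err & 0x1),
        # bit 1: 0 = read access, 1 = write access
        "access": "write" if err & 0x2 else "read",
        # bit 2: 0 = kernel mode, 1 = user mode
        "mode": "user" if err & 0x4 else "kernel",
        # bit 4: 1 = fault on instruction fetch
        "instruction_fetch": bool(err & 0x10),
        "fault_address": hex(cr2),
        # An address inside the first page usually means a field was read
        # through a null base pointer (null + small member offset).
        "near_null": cr2 < page_size,
    }


# Values taken from the dump: err 0x4, cr2 0x468
info = decode_page_fault(err=0x4, cr2=0x468)
for key, value in info.items():
    print(f"{key}: {value}")
```

With the values from this dump, the decode indicates a user-mode read of a non-present page at address 0x468, which lies within the first page of the address space. That pattern is what a read through a null object pointer at a small member offset would typically produce, though the dump alone does not identify which pointer was null.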
Problem conclusion
Benefits of the solution: The fix prevents the daemon from crashing in this situation.
Work around: N/A
Problem trigger: A node trying to join a cluster.
Symptom: Daemon crashes with signal 11.
Platforms affected: N/A
Functional Area affected: GNR
Customer Impact: Suggested
Changed Externals: No
Temporary fix
Comments
APAR Information
APAR number
IJ28610
Reported component name
SPEC SCALE STD
Reported component ID
5737F33AP
Reported release
505
Status
CLOSED PER
PE
NoPE
HIPER
NoHIPER
Special Attention
NoSpecatt / Xsystem
Submitted date
2020-10-07
Closed date
2020-10-19
Last modified date
2020-10-19
APAR is sysrouted FROM one or more of the following:
APAR is sysrouted TO one or more of the following:
Fix information
Fixed component name
SPEC SCALE STD
Fixed component ID
5737F33AP
Applicable component levels
Business Unit: IBM Infrastructure w/TPS (BU058)
Product: IBM Spectrum Scale (STXKQY)
Platform: Platform Independent (PF025)
Version: 505
Line of Business: Storage (LOB26)
Document Information
Modified date:
07 November 2020