IJ28610: SIGNAL 11 RGMASTER::GETNODEFULLDOMAINNAME

APAR status

Closed as program error.

Error description

[E] Signal 11 at location 0x559601506BCE in process 17113, link reg 0xFFFFFFFFFFFFFFFF.
[I] rax    0x000000000000046B  rbx    0x0000000000000000
[I] rcx    0x0000018032912EB8  rdx    0x0000000000000000
[I] rsp    0x00007F64249C0010  rbp    0x000000000001046B
[I] rsi    0x00007F64249C0808  rdi    0x0000018032912EB8
[I] r8     0x00007F64211ADBE0  r9     0x0000000000000000
[I] r10    0x0000000000000000  r11    0x0000000000000000
[I] r12    0x0000018032912E10  r13    0x0000000000000000
[I] r14    0x0000559602495A60  r15    0x00007F64249C0808
[I] rip    0x0000559601506BCE  eflags 0x0000000000010246
[I] csgsfs 0x002B000000000033  err    0x0000000000000004
[I] trapno 0x000000000000000E  oldmsk 0x0000000010017807
[I] cr2    0x0000000000000468
[W] ------------------[GPFS Critical Thread Watchdog]------------------
[W] PID: 19652 State: S (HealthCheckThread) is stuck for more than 13 seconds
[W]  counter: 0 (mark-idle: 0 mark-active: 0 pre-work: 0 post-work: 0) sched: (nvcsw: 0 nivcsw: 0)
[W]  waiting on ThMutex 0x18032912EB8 (0xFFFFBFEF32912EB8) (ClusterConfigurationClientMutex)
[W]  waiting for 20.015288562 seconds
[D] Traceback:
[D] #0: 0x0000559601506BCE RGMaster::getNodeFullDomainName(NodeAddr, char**) + 0xAE at ??:0
[D] #1: 0x000055960150CAA2 RGMaster::rgListServers(int, unsigned int) + 0x212 at ??:0
[D] #2: 0x000055960145F21C runTSLsRecoveryGroupV2(int, StripeGroup*, int, char**) + 0xA8C at ??:0
[D] #3: 0x0000559601460371 runTSLsRecoveryGroup(int, StripeGroup*, int, char**) + 0xB1 at ??:0
[D] #4: 0x000055960106AF31 RunClientCmd(MessageHeader*, IpAddr, unsigned short, int, int, StripeGroup*, unsigned int*, RpcContext*) + 0x1851 at ??:0
[D] #5: 0x00005596013DA13D RGCManager::rgcmClientRunCmd(LocalCmdMessage*, IpAddr, unsigned short, int, int) + 0xBFD at ??:0
[D] #6: 0x000055960106CBBF HandleCmdMsg(void*) + 0x13BF at ??:0
[D] #7: 0x0000559600B38313 Thread::callBody(Thread*) + 0x63 at ??:0
[D] #8: 0x0000559600B25262 Thread::callBodyWrapper(Thread*) + 0xA2 at ??:0
[D] #9: 0x00007F6428AD52DE start_thread + 0xFE at ??:0
[D] #10: 0x00007F6427A80133 __GI___clone + 0x43 at ??:0
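
The register dump above can be read as a null pointer dereference. The following is a minimal sketch, not part of the APAR, that decodes the trapno, err and cr2 values using the standard x86-64 page-fault error-code bits. The function name decode_page_fault is hypothetical, and which register held the null base pointer cannot be determined from the dump alone.

```python
# Sketch: interpret the trapno / err / cr2 values from the register dump.
# Bit meanings follow the x86-64 page-fault error code layout; the helper
# name and structure are illustrative assumptions, not GPFS code.

PAGE_FAULT_VECTOR = 0x0E  # exception vector 14 is the page fault


def decode_page_fault(err: int) -> dict:
    """Split a page-fault error code into its documented flag bits."""
    return {
        "page_present": bool(err & 0x01),       # 0 = page not present
        "write_access": bool(err & 0x02),       # 0 = fault on a read
        "user_mode": bool(err & 0x04),          # 1 = fault raised in user mode
        "reserved_bit": bool(err & 0x08),       # 1 = reserved bit set in page table
        "instruction_fetch": bool(err & 0x10),  # 1 = fault on instruction fetch
    }


# Values copied from the dump in the error description.
trapno = 0x0E
err = 0x04
cr2 = 0x468  # faulting linear address

fault = decode_page_fault(err)
is_page_fault = trapno == PAGE_FAULT_VECTOR
near_null = cr2 < 0x1000  # address falls inside the first 4 KiB page

print(is_page_fault, fault, near_null)
```

Under these assumptions the fault is a user-mode read of a non-present page at address 0x468, which is consistent with reading a member at a small offset from a null base pointer.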

Local fix

Problem summary

While a node is trying to join a cluster,
mmfsd startup can dereference a null
pointer and crash with signal 11.
The backtrace looks like this:
[D] #0: 0x0000559601506BCE RGMaster::getNodeFullDomainName(NodeAddr, char**) + 0xAE at ??:0
[D] #1: 0x000055960150CAA2 RGMaster::rgListServers(int, unsigned int) + 0x212 at ??:0
[D] #2: 0x000055960145F21C runTSLsRecoveryGroupV2(int, StripeGroup*, int, char**) + 0xA8C at ??:0
[D] #3: 0x0000559601460371 runTSLsRecoveryGroup(int, StripeGroup*, int, char**) + 0xB1 at ??:0

Problem conclusion

Benefits of the solution:
The fix prevents the daemon from crashing when a node joins a cluster.
Workaround:
N/A
Problem trigger:
A node trying to join a cluster.
Symptom:
Daemon crashes with signal 11.

Platforms affected:
N/A
Functional Area affected:
GNR
Customer Impact:
Suggested
Changed Externals:
No

Temporary fix

Comments

APAR Information

APAR number
IJ28610
Reported component name
SPEC SCALE STD
Reported component ID
5737F33AP
Reported release
505
Status
CLOSED PER
PE
NoPE
HIPER
NoHIPER
Special Attention
NoSpecatt / Xsystem
Submitted date
2020-10-07
Closed date
2020-10-19
Last modified date
2020-10-19

APAR is sysrouted FROM one or more of the following:
APAR is sysrouted TO one or more of the following:

IJ29155

Fix information

Fixed component name
SPEC SCALE STD
Fixed component ID
5737F33AP

Applicable component levels

Business Unit
IBM Infrastructure w/TPS (BU058)
Product
IBM Spectrum Scale (STXKQY)
Platform
Platform Independent (PF025)
Version
505
Line of Business
Storage (LOB26)

Document Information

Modified date:
07 November 2020
