APAR status
Closed as program error.
Error description
In a cluster with QoS enabled, the GPFS daemon might deadlock. Operations on some files or folders might hang, and the "mmdiag --waiters" output shows many long waiters on the same SyncPairCondvar, for example:
Waiting 906.3732 sec since 03:07:25, monitored, thread 4445 QosStatsRunQueueThread: on ThCond 0x180637241A8 (SyncPairCondvar), reason 'waiting for exclusive ThSXLock'
Waiting 906.3732 sec since 03:07:25, monitored, thread 6636 SharedHashTabFetchHandlerThread: on ThCond 0x180637241A8 (SyncPairCondvar), reason 'waiting for exclusive ThSXLock'
Waiting 906.3732 sec since 03:07:25, monitored, thread 4652 Msg handler QosmnTimeMsg: on ThCond 0x180637241A8 (SyncPairCondvar), reason 'waiting for exclusive ThSXLock'
Waiting 906.3732 sec since 03:07:25, monitored, thread 5033 Msg handler QosmnTimeMsg: on ThCond 0x180637241A8 (SyncPairCondvar), reason 'waiting for exclusive ThSXLock'
Waiting 906.3732 sec since 03:07:25, monitored, thread 6342 Msg handler QosmnTimeMsg: on ThCond 0x180637241A8 (SyncPairCondvar), reason 'waiting for exclusive ThSXLock'
Waiting 906.3732 sec since 03:07:25, monitored, thread 6336 Msg handler QosmnTimeMsg: on ThCond 0x180637241A8 (SyncPairCondvar), reason 'waiting for exclusive ThSXLock'
Waiting 906.3732 sec since 03:07:25, monitored, thread 5031 Msg handler QosmnTimeMsg: on ThCond 0x180637241A8 (SyncPairCondvar), reason 'waiting for exclusive ThSXLock'
Waiting 856.2593 sec since 03:08:15, monitored, thread 3954 SharedHashTabFetchHandlerThread: on ThCond 0x180637241A8 (SyncPairCondvar), reason 'waiting for exclusive ThSXLock'
Waiting 680.0553 sec since 03:11:11, monitored, thread 6610 SharedHashTabFetchHandlerThread: on ThCond 0x180637241A8 (SyncPairCondvar), reason 'waiting for exclusive ThSXLock'
Reported in: Spectrum Scale 5.0.1 on RHEL 7.5
Known Impact: GPFS daemon deadlock; some file system operations hang.
Recovery action: Restart GPFS on the node where the longest waiter is seen.
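The recovery action depends on spotting the longest waiter and confirming that the waiters share one condition variable. The following Python sketch shows one way to check this from saved "mmdiag --waiters" text. It is not an IBM tool: the names WAITER_RE, group_waiters, and longest_waiter are hypothetical, and the pattern assumes the exact entry format shown in the example above (other releases may format waiters differently).

```python
import re
from collections import defaultdict

# Assumed entry format, taken from the sample waiters in this APAR only.
WAITER_RE = re.compile(
    r"Waiting (?P<secs>\d+(?:\.\d+)?) sec since (?P<since>\d{2}:\d{2}:\d{2}), "
    r"monitored, thread (?P<tid>\d+) (?P<name>.+?): "
    r"on ThCond (?P<addr>0x[0-9A-Fa-f]+) \((?P<kind>\w+)\), "
    r"reason '(?P<reason>[^']*)'"
)


def group_waiters(text):
    """Group waiter entries by the (address, type) of the condvar they wait on."""
    groups = defaultdict(list)
    for m in WAITER_RE.finditer(text):
        groups[(m["addr"], m["kind"])].append(
            {"secs": float(m["secs"]), "thread": int(m["tid"]), "name": m["name"]}
        )
    return dict(groups)


def longest_waiter(waiters):
    """Return the entry that has been waiting the longest."""
    return max(waiters, key=lambda w: w["secs"])


# Three entries copied from the error description above.
sample = (
    "Waiting 906.3732 sec since 03:07:25, monitored, thread 4445 "
    "QosStatsRunQueueThread: on ThCond 0x180637241A8 (SyncPairCondvar), "
    "reason 'waiting for exclusive ThSXLock'\n"
    "Waiting 856.2593 sec since 03:08:15, monitored, thread 3954 "
    "SharedHashTabFetchHandlerThread: on ThCond 0x180637241A8 (SyncPairCondvar), "
    "reason 'waiting for exclusive ThSXLock'\n"
    "Waiting 680.0553 sec since 03:11:11, monitored, thread 6610 "
    "SharedHashTabFetchHandlerThread: on ThCond 0x180637241A8 (SyncPairCondvar), "
    "reason 'waiting for exclusive ThSXLock'\n"
)

groups = group_waiters(sample)
for (addr, kind), waiters in groups.items():
    top = longest_waiter(waiters)
    print(f"{kind} {addr}: {len(waiters)} waiters, longest thread "
          f"{top['thread']} ({top['name']}) at {top['secs']} sec")
```

Many waiters grouped under a single SyncPairCondvar address, all with long and similar wait times, match the symptom described here. The script only reads text; collecting the output from each node is left to the administrator.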
Local fix
Disable QoS temporarily until the fix is applied.
Problem summary
A single-thread self-deadlock occurs when fine-grained QoS statistics are enabled.
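As a generic illustration of what a single-thread self-deadlock means, the Python sketch below has one thread try to take a non-reentrant lock that it already holds. This is not GPFS code and does not claim to reproduce the actual code path, which this APAR does not describe. The function names update_stats and run_queue are hypothetical, and a timeout is used so the example returns instead of hanging.

```python
import threading

# threading.Lock is non-reentrant: the owning thread cannot acquire it twice.
lock = threading.Lock()


def update_stats():
    """Hypothetical helper that needs the lock its caller already holds."""
    # Without the timeout this call would block forever (the self-deadlock).
    acquired = lock.acquire(timeout=0.1)
    if acquired:
        lock.release()
    return acquired


def run_queue():
    """Hypothetical caller: holds the lock, then calls into the helper."""
    with lock:
        return update_stats()


result = run_queue()
print("inner acquire succeeded:", result)
```

In the sketch the inner acquire fails because the only thread that could release the lock is the one waiting for it. Any other thread that later needs the same lock would queue up behind it, which is consistent with many waiters blocked on one condition variable.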
Problem conclusion
The deadlock is fixed, which also resolves the resulting I/O thread hangs.
Temporary fix
Comments
APAR Information
APAR number
IJ09645
Reported component name
SPEC SCALE ADV
Reported component ID
5737F35AP
Reported release
501
Status
CLOSED PER
PE
NoPE
HIPER
NoHIPER
Special Attention
NoSpecatt / Xsystem
Submitted date
2018-09-24
Closed date
2018-10-23
Last modified date
2018-10-23
APAR is sysrouted FROM one or more of the following:
APAR is sysrouted TO one or more of the following:
Fix information
Fixed component name
SPEC SCALE ADV
Fixed component ID
5737F35AP
Applicable component levels
R501 PSY
UP
Document Information
Modified date:
23 October 2018