IBM Support

IJ09645: QOS: GPFS DEADLOCK

APAR status

  • Closed as program error.

Error description

  • In a cluster with QOS enabled, the GPFS daemon might
    deadlock. Operations on some files or folders might hang,
    and the "mmdiag --waiters" output shows many long waiters
    on the same SyncPairCondvar, for example:
    
    Waiting 906.3732 sec since 03:07:25, monitored, thread
    4445 QosStatsRunQueueThread: on ThCond 0x180637241A8
    (SyncPairCondvar), reason 'waiting for exclusive
    ThSXLock'
    Waiting 906.3732 sec since 03:07:25, monitored, thread
    6636 SharedHashTabFetchHandlerThread: on ThCond
    0x180637241A8 (SyncPairCondvar), reason 'waiting for
    exclusive ThSXLock'
    Waiting 906.3732 sec since 03:07:25, monitored, thread
    4652 Msg handler QosmnTimeMsg: on ThCond 0x180637241A8
    (SyncPairCondvar), reason 'waiting for exclusive
    ThSXLock'
    Waiting 906.3732 sec since 03:07:25, monitored, thread
    5033 Msg handler QosmnTimeMsg: on ThCond 0x180637241A8
    (SyncPairCondvar), reason 'waiting for exclusive
    ThSXLock'
    Waiting 906.3732 sec since 03:07:25, monitored, thread
    6342 Msg handler QosmnTimeMsg: on ThCond 0x180637241A8
    (SyncPairCondvar), reason 'waiting for exclusive
    ThSXLock'
    Waiting 906.3732 sec since 03:07:25, monitored, thread
    6336 Msg handler QosmnTimeMsg: on ThCond 0x180637241A8
    (SyncPairCondvar), reason 'waiting for exclusive
    ThSXLock'
    Waiting 906.3732 sec since 03:07:25, monitored, thread
    5031 Msg handler QosmnTimeMsg: on ThCond 0x180637241A8
    (SyncPairCondvar), reason 'waiting for exclusive
    ThSXLock'
    Waiting 856.2593 sec since 03:08:15, monitored, thread
    3954 SharedHashTabFetchHandlerThread: on ThCond
    0x180637241A8 (SyncPairCondvar), reason 'waiting for
    exclusive ThSXLock'
    Waiting 680.0553 sec since 03:11:11, monitored, thread
    6610 SharedHashTabFetchHandlerThread: on ThCond
    0x180637241A8 (SyncPairCondvar), reason 'waiting for
    exclusive ThSXLock'
    
    
    Reported in:
    Spectrum Scale 5.0.1 on RHEL 7.5
    
    Known Impact:
    GPFS daemon deadlock; some file system operations hang.
    
    Recovery action:
    Restart GPFS on the node where the longest waiter is seen.
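
    To find the node-local longest waiter, the "mmdiag --waiters" text can be parsed with a short script. The following is a minimal sketch, assuming the waiter lines look like the excerpt quoted in this APAR; the regular expression and the function names (parse_waiters, longest_waiter, count_by_condvar) are illustrative and are not part of any GPFS tooling. Whitespace is flattened first so the parser tolerates line-wrapped output.

```python
import re
from collections import Counter

# Pattern modelled on the waiter lines quoted in this APAR (an assumption:
# other releases or waiter types may format their lines differently).
WAITER_RE = re.compile(
    r"Waiting (?P<secs>\d+(?:\.\d+)?) sec since (?P<since>\S+), "
    r"(?:monitored, )?thread (?P<tid>\d+) (?P<name>.+?): "
    r"on ThCond (?P<addr>0x[0-9A-Fa-f]+) \((?P<kind>\w+)\), "
    r"reason '(?P<reason>[^']*)'"
)


def parse_waiters(text):
    """Return one dict per waiter found in mmdiag --waiters style text."""
    # Collapse all whitespace so wrapped lines become a single stream.
    flat = " ".join(text.split())
    return [
        {
            "seconds": float(m.group("secs")),
            "since": m.group("since"),
            "thread_id": int(m.group("tid")),
            "name": m.group("name"),
            "condvar": m.group("addr"),
            "kind": m.group("kind"),
            "reason": m.group("reason"),
        }
        for m in WAITER_RE.finditer(flat)
    ]


def longest_waiter(waiters):
    """Return the waiter that has been waiting longest, or None if empty."""
    return max(waiters, key=lambda w: w["seconds"], default=None)


def count_by_condvar(waiters):
    """Count waiters per condition-variable address."""
    return dict(Counter(w["condvar"] for w in waiters))
```

    Running this against output captured on each node shows which node holds the longest waiter, and a high count on a single condvar address matches the symptom described above.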
    

Local fix

  • Temporarily disable QOS until the fix is applied.
    

Problem summary

  • A single-thread self-deadlock can occur when fine-grained
    QOS statistics are enabled.
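
    For readers unfamiliar with the term, a self-deadlock is a thread blocking on a lock that it already holds. The sketch below is a generic illustration in Python only; it does not reproduce the GPFS code path, which this APAR does not describe. The helper name reacquire is hypothetical, and a timeout is used so the example returns instead of hanging.

```python
import threading


def reacquire(lock, timeout=0.2):
    """Hold `lock`, then try to take it again on the same thread.

    Returns True if the second acquire succeeds, False if it times out.
    """
    with lock:
        got = lock.acquire(timeout=timeout)
        if got:
            # Undo the second acquire; the `with` block undoes the first.
            lock.release()
        return got


# A plain (non-reentrant) lock cannot be re-acquired by its holder.
# Without the timeout, this call would block forever.
plain_result = reacquire(threading.Lock())

# A reentrant lock allows the same thread to acquire it again.
reentrant_result = reacquire(threading.RLock())
```

    Here plain_result is False and reentrant_result is True. Other threads that need the same lock queue up behind the stuck holder, which is consistent with many waiters on one condition variable.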
    

Problem conclusion

  • The deadlock is fixed, which also resolves the resulting
    I/O thread hangs.
    

Temporary fix

Comments

APAR Information

  • APAR number

    IJ09645

  • Reported component name

    SPEC SCALE ADV

  • Reported component ID

    5737F35AP

  • Reported release

    501

  • Status

    CLOSED PER

  • PE

    NoPE

  • HIPER

    NoHIPER

  • Special Attention

    NoSpecatt / Xsystem

  • Submitted date

    2018-09-24

  • Closed date

    2018-10-23

  • Last modified date

    2018-10-23

  • APAR is sysrouted FROM one or more of the following:

  • APAR is sysrouted TO one or more of the following:

    IJ11043

Fix information

  • Fixed component name

    SPEC SCALE ADV

  • Fixed component ID

    5737F35AP

Applicable component levels

  • R501 PSY

       UP

Document Information

Modified date:
23 October 2018