IBM Support

IV35184: DEADLOCK HANG IN SNAPSHOT CODE DOING CHDIR AND RM OPERATIONS APPLIES TO AIX 7100-00

A fix is available

APAR status

  • Closed as program error.

Error description

  • Machine hangs when trying to perform cd and rm operations in
    filesystems that have snapshots. Other commands, such as snap or
    lspv, may also hang. The problem is caused by a race condition in
    the snapshot filesystem code, where two processes each hold a lock
    while blocked on the lock the other holds. This is a classic
    kernel deadlock that can only be cleared by a reboot. The
    processes involved were running the cd and rm commands.
    
    Stacks involved in hang look like:
    (0)> f 2657
    pvthread+0A6100 STACK:
    [0052D660]slock+000480 (00000000000D3870, 8000000000001032 [??])
    [00009558].simple_lock+000058 ()
    [0027EDC4]siAlloc+000044 (??, ??, ??, ??)
    [0027CCC0]siWriterReadSMap+0003C0 (F1000A06438EDC80, 00000000015AE844, F00000002FF457D0, 0000000100000001)
    [00283BC0]siIOD+000140 (??, ??, ??, ??, ??)
    [00273E10]txIODUpdateSMap+000110 (??, ??, ??, ??)
    [00277810]xtLog+000350 (??, ??, ??, ??)
    [0028EC60]xtTruncate+000300 (??, ??, ??, ??, ??)
    [00275DDC]txLog+00029C (??, ??, ??)
    [002781D4]txCommit+000654 (??, ??, ??, ??)
    [002AF178]j2_remove+000438 (??, ??, ??, ??)
    [0057C0A4]vnop_remove+0003E4 (??, ??, ??, ??)
    [00672580]kunlink+000300 (??, ??)
    [00003850]ovlya_addr_sc_flih_main+000130 ()
    [kdb_get_virtual_memory] no real storage @ 2FF228E8
    [10052490]10052490 ()
    [kdb_read_mem] no real storage @ FFFFFFFFFFF92A0
    ------------------------------------
    and
    (0)> f 2395
    pvthread+095B00 STACK:
    [00527BF4]complex_lock_sleep_ppc+0001D4 (00000000000D3870, 8000000000001032, 0000000044288848, F00000002FF461A0 [??])
    [0052927C]lock_read_ppc+00095C (??)
    [00280964]siReaderLookupSMap+0000E4 (??, ??, ??, ??, ??, ??, ??)
    [00263D38]smRead+0001B8 (??, ??)
    [00236D00]bmStartIOOne+000120 (??)
    [0023C6FC]bmRead+00027C (??, ??, ??, ??, ??, ??)
    [002896D4]xtSearch+0005D4 (??, ??, ??, ??, ??)
    [00291A24]xtLookup+000064 (??, ??, ??, ??, ??, ??, ??)
    [0023C9B8]bmRead+000538 (??, ??, ??, ??, ??, ??)
    [00272F30]diMount+000050 (??)
    [002857B4]siAttach+000414 (??, ??, ??, ??)
    [00346FE8]j2_lookup+0004C8 (??, ??, ??, ??, ??, ??)
    [0057E364]vnop_lookup+000184 (??, ??, ??, ??, ??, ??)
    [00540CE4]lookuppn+000A04 (??, ??, ??, ??, ??, ??, ??, ??)
    [005414A0]lookupname_internal+0000A0 (??, ??, ??, ??, ??, ??, ??, ??)
    [0067B9D8]chdirec+000058 (??, ??, ??, ??)
    [0067B844]chdir+000124 (??)
    [00003850]ovlya_addr_sc_flih_main+000130 ()
    [kdb_get_virtual_memory] no real storage @ 2FF22598
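
The two stacks show the rm thread (kunlink path) waiting in simple_lock while the cd thread (chdir path) waits in lock_read_ppc. The following user-space Python sketch illustrates the general lock-order inversion that the description refers to; it is not the kernel code. The lock names map_lock and alloc_lock are hypothetical stand-ins (the APAR does not name the kernel locks), plain mutexes replace the kernel's simple and complex locks, and a timeout is used so that the demonstration terminates instead of hanging.

```python
import threading


def run_inverted(timeout=0.5):
    """Run two threads that take the same two locks in opposite order.

    Returns the sorted names of the threads that gave up waiting for
    their second lock.
    """
    # Hypothetical stand-ins; the APAR does not name the kernel locks.
    map_lock = threading.Lock()    # plays the lock the cd thread waits to read
    alloc_lock = threading.Lock()  # plays the simple lock the rm thread waits on
    holding_first = threading.Barrier(2)
    finished_trying = threading.Barrier(2)
    stuck = []

    def worker(name, first, second):
        with first:
            holding_first.wait()  # both threads now own their first lock
            if second.acquire(timeout=timeout):
                second.release()
            else:
                stuck.append(name)  # a kernel thread would sleep here forever
            finished_trying.wait()  # keep `first` until the peer has given up too

    threads = [
        threading.Thread(target=worker, args=("rm", map_lock, alloc_lock)),
        threading.Thread(target=worker, args=("cd", alloc_lock, map_lock)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(stuck)


if __name__ == "__main__":
    print(run_inverted())
```

Both thread names are reported as stuck, because each thread keeps its first lock for as long as the other is waiting for it. In the kernel there is no timeout, which is why only a reboot clears the hang.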
    

Local fix

  • There is no local fix other than rebooting after the hang is
    discovered.
    The hang appears to be rare, since it occurs in down-level
    code that has been in use for some time.
    

Problem summary

  • Machine hangs when trying to perform cd and rm operations
    in filesystems that have snapshots. Other commands, such as snap
    or lspv, may also hang. The problem is caused by a race condition
    in the snapshot filesystem code, where two processes each hold a
    lock while blocked on the lock the other holds. This is a classic
    kernel deadlock that can only be cleared by a reboot. The
    processes involved were running the cd and rm commands.
    
    Stacks involved in hang look like:
    (0)> f 2657
    pvthread+0A6100 STACK:
     0052D660 slock+000480 (00000000000D3870,
     00009558 .simple_lock+000058 ()
     0027EDC4 siAlloc+000044 (??, ??, ??, ??)
     0027CCC0 siWriterReadSMap+0003C0 (F1000A06438EDC80,
     00283BC0 siIOD+000140 (??, ??, ??, ??, ??)
     00273E10 txIODUpdateSMap+000110 (??, ??, ??, ??)
     00277810 xtLog+000350 (??, ??, ??, ??)
     0028EC60 xtTruncate+000300 (??, ??, ??, ??, ??)
     00275DDC txLog+00029C (??, ??, ??)
     002781D4 txCommit+000654 (??, ??, ??, ??)
     002AF178 j2_remove+000438 (??, ??, ??, ??)
     0057C0A4 vnop_remove+0003E4 (??, ??, ??, ??)
     00672580 kunlink+000300 (??, ??)
     00003850 ovlya_addr_sc_flih_main+000130 ()
    ------------------------------------
    and
    (0)> f 2395
    pvthread+095B00 STACK:
     00527BF4 complex_lock_sleep_ppc+0001D4
     0052927C lock_read_ppc+00095C (??)
     00280964 siReaderLookupSMap+0000E4 (??, ??, ??, ??, ??,
     00263D38 smRead+0001B8 (??, ??)
     00236D00 bmStartIOOne+000120 (??)
     0023C6FC bmRead+00027C (??, ??, ??, ??, ??, ??)
     002896D4 xtSearch+0005D4 (??, ??, ??, ??, ??)
     00291A24 xtLookup+000064 (??, ??, ??, ??, ??, ??, ??)
     0023C9B8 bmRead+000538 (??, ??, ??, ??, ??, ??)
     00272F30 diMount+000050 (??)
     002857B4 siAttach+000414 (??, ??, ??, ??)
     00346FE8 j2_lookup+0004C8 (??, ??, ??, ??, ??, ??)
     0057E364 vnop_lookup+000184 (??, ??, ??, ??, ??, ??)
     00540CE4 lookuppn+000A04 (??, ??, ??, ??, ??, ??, ??,
     005414A0 lookupname_internal+0000A0 (??, ??, ??, ??, ??,
     0067B9D8 chdirec+000058 (??, ??, ??, ??)
     0067B844 chdir+000124 (??)
    

Problem conclusion

  • Fix serialization during smap page writes.
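
The APAR does not describe the code change beyond the sentence above. As a general illustration only, one common way to serialize two code paths that need the same pair of locks is to have both take the locks in a single agreed order, so that no thread can hold one lock while waiting for a peer that holds the other. In this Python sketch the lock names and the run_ordered helper are hypothetical and do not correspond to the actual kernel fix.

```python
import threading


def run_ordered(rounds=1000):
    """Run two threads that always take map_lock before alloc_lock.

    Returns how many iterations each thread completed.
    """
    # Hypothetical stand-ins; the APAR does not name the kernel locks.
    map_lock = threading.Lock()
    alloc_lock = threading.Lock()
    completed = {"rm": 0, "cd": 0}

    def worker(name):
        for _ in range(rounds):
            with map_lock:        # always acquired first
                with alloc_lock:  # always acquired second
                    completed[name] += 1

    threads = [threading.Thread(target=worker, args=(n,)) for n in ("rm", "cd")]
    for t in threads:
        t.start()
    for t in threads:
        t.join(timeout=30)
    if any(t.is_alive() for t in threads):
        raise RuntimeError("threads did not finish; possible deadlock")
    return completed


if __name__ == "__main__":
    print(run_ordered())
```

With a single acquisition order, a thread that holds alloc_lock already holds map_lock, so the circular wait needed for a deadlock cannot form and both threads run to completion.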
    

Temporary fix

Comments

  • 6100-06 - use AIX APAR IV23346
    6100-07 - use AIX APAR IV33759
    6100-08 - use AIX APAR IV29780
    6100-09 - use AIX APAR IV30215
    7100-00 - use AIX APAR IV35184
    7100-01 - use AIX APAR IV34863
    7100-02 - use AIX APAR IV29829
    

APAR Information

  • APAR number

    IV35184

  • Reported component name

    AIX V7.1

  • Reported component ID

    5765H4000

  • Reported release

    710

  • Status

    CLOSED PER

  • PE

    NoPE

  • HIPER

    NoHIPER

  • Submitted date

    2013-01-15

  • Closed date

    2013-01-15

  • Last modified date

    2013-11-23

  • APAR is sysrouted FROM one or more of the following:

    IV23346

  • APAR is sysrouted TO one or more of the following:

Fix information

  • Fixed component name

    AIX V7.1

  • Fixed component ID

    5765H4000

Applicable component levels

  • R710 PSY U854839

       UP13/04/25 I 1000

PTF to Fileset Mapping

Document Information

Modified date:
23 November 2013