IBM Support

IJ33254: AFM: RECOVERY WRITE MESSAGES ARE DROPPED WITH PARALLEL IO ENABLE


APAR status

  • Closed as program error.

Error description

  • During AFM recovery with parallel IO enabled, if a write
    split (WriteSplit) message gets an error from the home
    cluster, the write message is dropped, which causes a data
    mismatch between cache and home.
    
    The following error appears in the mmfs log on the
    AFM gateway node:
    
    2021-05-18_23:20:22.996+0800: [E] AFM: Write file system
    gpfsfs fileset myfileset file IDs
    [185628707.185628707.-1.-1,R] name  local error 233
    2021-05-18_23:20:22.996+0800: Host is down
    
    Reported In:
    Spectrum Scale 5.1.1.0
    
    Known Impact:
    Data mismatch between cache and home
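
A gateway node's log can be checked for this symptom with a small shell helper. This is a minimal sketch, not an IBM tool: the function name `count_afm_write_errors` is hypothetical, the match string is taken from the sample message above, and it assumes the message appears on a single line in the real log. The log path in the usage comment is the usual mmfs log location and should be treated as an assumption for your installation.

```shell
#!/bin/sh
# Hypothetical helper: count AFM write error lines in an mmfs log file.
# Assumes each "[E] AFM: Write file system ..." message is on one log line.
count_afm_write_errors() {
    logfile=$1
    # -F: match the text literally, so "[E]" is not read as a bracket expression.
    # -c: print only the number of matching lines (0 when there are none).
    grep -c -F -- '[E] AFM: Write file system' "$logfile"
}

# Example usage (path is an assumption):
#   count_afm_write_errors /var/adm/ras/mmfs.log.latest
```

A non-zero count only indicates that write errors were logged; it does not by itself prove that a write was dropped.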
    

Local fix

  • # Disable Parallel IO
     mmchfileset <fs> <fsetName> -p afmParallelWriteThreshold=disable
    # Trigger another recovery to sync the file again
     mmafmctl <fs> stop -j <fsetName>
     mmafmctl <fs> start -j <fsetName>
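
The three commands above can be wrapped in a script so they always run in the same order for a given fileset. The following is a sketch under stated assumptions: the function names and the `DRY_RUN` variable are hypothetical, and only the commands and options shown in this APAR are used. By default it prints the commands instead of running them.

```shell
#!/bin/sh
# Hypothetical wrapper around the local fix from this APAR.
# DRY_RUN=1 (the default) prints each command; DRY_RUN=0 executes it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

afm_disable_parallel_io_and_resync() {
    fs=$1
    fset=$2
    if [ -z "$fs" ] || [ -z "$fset" ]; then
        echo "usage: afm_disable_parallel_io_and_resync <fs> <fsetName>" >&2
        return 2
    fi
    # Disable parallel IO, then stop and start the fileset to trigger
    # another recovery. Stop at the first command that fails.
    run mmchfileset "$fs" "$fset" -p afmParallelWriteThreshold=disable &&
    run mmafmctl "$fs" stop -j "$fset" &&
    run mmafmctl "$fs" start -j "$fset"
}

# Example usage:
#   afm_disable_parallel_io_and_resync gpfsfs myfileset             # print only
#   DRY_RUN=0 afm_disable_parallel_io_and_resync gpfsfs myfileset   # execute
```

Defaulting to print-only is a deliberate choice here, since stopping a fileset affects replication.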
    

Problem summary

  • AFM might incorrectly drop write messages during an AFM
    recovery, causing a data mismatch between the cache or
    primary and the home or secondary cluster. AFM recovery is
    triggered when the in-memory queue is lost, for example
    after a gateway node restart. With parallel IO enabled,
    WriteSplit messages are sent to the worker gateway nodes
    to write the file in parallel. If a WriteSplit message
    fails on a worker gateway node, the failed WriteSplit
    request is retried 3 times and then dropped. Because the
    write request is dropped without replicating the data to
    the home or secondary, the result is a data mismatch
    between the cache or primary and the home or secondary.
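
The retry-then-drop behavior described above can be modeled in a few lines of shell. This is an illustrative model only, not GPFS source code: the function name `send_with_retries` and the `MAX_RETRIES` variable are hypothetical, and it assumes "retried 3 times" means one initial attempt followed by up to three retries.

```shell
#!/bin/sh
# Illustrative model of the pre-fix behavior; not GPFS code.
MAX_RETRIES=3

# send_with_retries CMD [ARGS...]
# Runs CMD once, then retries up to MAX_RETRIES times while it keeps failing.
send_with_retries() {
    attempt=0
    while [ "$attempt" -le "$MAX_RETRIES" ]; do
        if "$@"; then
            echo "replicated"
            return 0
        fi
        attempt=$((attempt + 1))
    done
    # Pre-fix behavior: the request is discarded here, so the data
    # never reaches the home or secondary.
    echo "dropped"
    return 1
}

# Example usage, with `false` standing in for a send that always fails:
#   send_with_retries false
```

The point of the model is the final branch: once the retries are exhausted, nothing requeues the write, which is how the mismatch arises.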
    

Problem conclusion

  • This problem is fixed in 5.1.1 PTF 2.
    To see all Spectrum Scale APARs and
    their respective fix solutions, refer to
    https://public.dhe.ibm.com/storage/spectrumscale/spectrum_scale_apars.html
    
    Benefits of the solution:
    The code no longer drops parallel IO write
    messages on network errors.
    
    Work around:
    Disable parallel IO using the command
    "mmchfileset device fileset -p afmParallelWriteThreshold=disable"
    
    Problem trigger:
    AFM recovery with parallel IO enabled.
    
    Symptom:
    Unexpected Results/Behavior
    
    Platforms affected:
    ALL Linux OS environments
    
    Functional Area affected:
    AFM and AFM DR
    
    Customer Impact:
    HiPER
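
Whether a node in the 5.1.1 stream is at or above the fixed level can be checked with a version comparison. This sketch assumes that "5.1.1 PTF 2" corresponds to the version string 5.1.1.2, and that GNU `sort -V` is available (typical on Linux). The function name `has_ij33254_fix` is hypothetical, and the result is only meaningful for 5.1.1 and later levels, since this APAR does not state the fix level for other streams.

```shell
#!/bin/sh
# Assumption: 5.1.1 PTF 2 is reported as version 5.1.1.2.
FIXED_LEVEL=5.1.1.2

# has_ij33254_fix VERSION
# Succeeds when VERSION is at or above FIXED_LEVEL.
has_ij33254_fix() {
    have=$1
    # sort -V orders dotted version strings; the first line is the lower one.
    lowest=$(printf '%s\n%s\n' "$FIXED_LEVEL" "$have" | sort -V | head -n 1)
    [ "$lowest" = "$FIXED_LEVEL" ]
}

# Example usage:
#   if has_ij33254_fix 5.1.1.0; then echo "fix level reached"; else echo "apply PTF or workaround"; fi
```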
    

Temporary fix

Comments

APAR Information

  • APAR number

    IJ33254

  • Reported component name

    SPEC SCALE STD

  • Reported component ID

    5737F33AP

  • Reported release

    511

  • Status

    CLOSED PER

  • PE

    NoPE

  • HIPER

    NoHIPER

  • Special Attention

    NoSpecatt / Xsystem

  • Submitted date

    2021-06-15

  • Closed date

    2021-07-01

  • Last modified date

    2021-07-01

  • APAR is sysrouted FROM one or more of the following:

  • APAR is sysrouted TO one or more of the following:

Fix information

  • Fixed component name

    SPEC SCALE STD

  • Fixed component ID

    5737F33AP

Applicable component levels

  • R511 PSY

       UP


Document Information

Modified date:
16 July 2021