IJ29826: MEMORY GROWTH DURING AFM FILESET RECOVERY CAN LEAD TO OOM


APAR status

  • Closed as program error.

Error description

  • glibc is known to create as many arenas (also known as
    memory pools) as 8 times the number of CPU threads a
    system has. This causes a multi-threaded program such as
    AFM, which allocates memory for its queues, to use far
    more memory than it actually needs.
    
    Although afmHardMemThreshold is honored and the AFM
    memory allocation does not exceed this limit, situations
    can arise where the AFM queue memory is freed by the
    mmfsd daemon but not necessarily returned to the kernel
    for reuse. Thus mmfsd believes it has freed the memory
    and allocates new structures up to afmHardMemThreshold,
    but in reality the overall mmfsd memory footprint
    continues to grow.
    
    Ultimately, all of the system memory could be exhausted,
    causing the Linux OOM killer to run.
    
    AFM exposed this problem, but it is not necessarily
    limited to systems using AFM.
    
    Reported In:
    
    The problem was reported from a system running Spectrum
    Scale version 5.0.4.4 on RHEL 7.7 for ppc64le, but the
    potential for this problem is believed to exist for all
    Spectrum Scale versions and operating systems.
    
    Known Impact:
    
    The impact of this problem is severe: system memory
    could be completely exhausted, leading to unpredictable
    failures across all system processes. The Linux OOM
    killer could be invoked, which would select processes to
    terminate in an effort to reduce memory usage.
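
    The glibc behavior described above can be illustrated
    outside of Spectrum Scale. The sketch below is a generic
    illustration, not the AFM fix itself: it uses the standard
    glibc calls mallopt(M_ARENA_MAX, ...) to cap the arena
    count and malloc_trim(0) to ask glibc to return freed heap
    pages to the kernel.

    ```c
    /* Illustration of the glibc arena behavior described in
     * this APAR. Uses only standard glibc interfaces; this is
     * not the Spectrum Scale fix, which is delivered via the
     * maxMallocArenas configuration option. */
    #include <malloc.h>   /* mallopt, malloc_trim (glibc-specific) */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        /* Cap the number of malloc arenas. Without a cap,
         * glibc may create many arenas on a machine with many
         * CPU threads, multiplying the heap footprint of a
         * heavily multi-threaded process such as mmfsd. */
        if (mallopt(M_ARENA_MAX, 2) == 0)
            fprintf(stderr, "mallopt(M_ARENA_MAX) failed\n");

        /* Allocate and free 64 MiB of queue-like buffers.
         * free() returns the memory to glibc, not necessarily
         * to the kernel -- the gap this APAR describes. */
        enum { NBUF = 64 };
        void *bufs[NBUF];
        for (int i = 0; i < NBUF; i++)
            bufs[i] = malloc(1u << 20);
        for (int i = 0; i < NBUF; i++)
            free(bufs[i]);

        /* Explicitly ask glibc to release free heap pages
         * back to the kernel. Returns 1 if memory was
         * released, 0 otherwise. */
        malloc_trim(0);
        return 0;
    }
    ```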
    

Local fix

  • The fix associated with this APAR also requires
    configuration settings to be put in place. Please contact
    IBM Support for those details.
    
    Prior to obtaining the fix, the only recovery action is
    to recycle Spectrum Scale on the affected node
    (mmshutdown and mmstartup).
    

Problem summary

  • AFM gateway nodes run out of memory during resync.
    glibc is known to create as many arenas as 8 times the
    number of CPU threads a system has. This causes a
    multi-threaded program such as AFM, which allocates
    memory for its queues, to use far more memory than it
    actually needs.
    

Problem conclusion

  • Benefits of the solution:
    Provided a configuration option, maxMallocArenas, to
    limit the number of malloc arenas. For AFM, this option
    also tries to reclaim unused memory when queue memory
    usage exceeds a soft limit.
    
    Work around:
    None
    
    Problem trigger:
    AFM resync under heavy workload
    
    Symptom: Crash/Abend
    
    Platforms affected: ALL Linux OS environments
    
    Functional Area affected: AFM
    
    Customer Impact: Critical
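    
    The Spectrum Scale maxMallocArenas option requires
    settings obtained from IBM Support. Shown here purely for
    illustration is the generic glibc mechanism for capping
    arenas on any process, the MALLOC_ARENA_MAX environment
    variable:

    ```shell
    # Generic glibc mechanism only -- the Spectrum Scale
    # maxMallocArenas option is configured with IBM Support.
    # Cap malloc arenas for a single process invocation:
    MALLOC_ARENA_MAX=2 sh -c 'echo "arena cap: $MALLOC_ARENA_MAX"'
    # prints: arena cap: 2

    # Or export it so every process started from this shell
    # inherits the cap:
    export MALLOC_ARENA_MAX=2
    ```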
    

Temporary fix

Comments

APAR Information

  • APAR number

    IJ29826

  • Reported component name

    SPEC SCALE DME

  • Reported component ID

    5737F34AP

  • Reported release

    505

  • Status

    CLOSED PER

  • PE

    NoPE

  • HIPER

    NoHIPER

  • Special Attention

    NoSpecatt / Xsystem

  • Submitted date

    2020-12-09

  • Closed date

    2020-12-16

  • Last modified date

    2020-12-16

  • APAR is sysrouted FROM one or more of the following:

  • APAR is sysrouted TO one or more of the following:

    IJ29960

Fix information

  • Fixed component name

    SPEC SCALE DME

  • Fixed component ID

    5737F34AP

Applicable component levels

Document Information

Modified date:
12 January 2021