APAR status
Closed as program error.
Error description
glibc is known to create up to 8 times as many arenas (memory pools) as the system has CPU threads. As a result, a multi-threaded program such as AFM, which allocates memory for queues, can use far more memory than it actually needs. Although afmHardMemThreshold is honored and AFM memory allocation does not exceed this limit, situations can arise where the mmfsd daemon frees AFM queue memory but that memory is not returned to the kernel for reuse. mmfsd therefore believes it has freed the memory and allocates new structures up to afmHardMemThreshold, while in reality the overall mmfsd memory footprint continues to grow. Ultimately, all system memory could be exhausted, causing the Linux OOM killer to run. AFM appears to have exposed this problem, but the problem should not be limited to systems that use AFM.

Reported In: The problem was reported from a system running Spectrum Scale version 5.0.4.4 on RHEL 7.7 for ppc64le, but the potential for this problem is believed to exist for all Spectrum Scale versions and operating systems.

Known Impact: The impact of this problem is severe. System memory could be completely exhausted, leading to unpredictable failures across all system processes. The Linux OOM killer could be invoked, which would select processes to terminate in an effort to reduce memory usage.
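To put the arena multiplier from the error description in concrete terms, the following minimal Python sketch computes the default arena ceiling for a given number of CPU threads, assuming the factor of 8 stated in this APAR. The function name `default_arena_ceiling` is hypothetical and exists only for illustration; the sketch models the stated default and does not query glibc itself.

```python
import os

# Multiplier stated in this APAR: up to 8 arenas per CPU thread.
ARENAS_PER_CPU_THREAD = 8


def default_arena_ceiling(cpu_threads: int) -> int:
    """Return the default upper bound on malloc arenas for a process.

    Hypothetical helper for illustration; it only applies the 8x factor.
    """
    if cpu_threads < 1:
        raise ValueError("cpu_threads must be at least 1")
    return ARENAS_PER_CPU_THREAD * cpu_threads


if __name__ == "__main__":
    threads = os.cpu_count() or 1
    print(f"{threads} CPU threads -> up to {default_arena_ceiling(threads)} arenas")
```

Under this assumption, a node with 128 CPU threads could end up with as many as 1024 arenas, each of which can hold freed memory that has not yet been returned to the kernel.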
Local fix
The fix associated with this APAR also requires configuration settings to be put in place. Contact IBM Support for those details. Until the fix is obtained, the only recovery action is to recycle Spectrum Scale on the affected node (mmshutdown followed by mmstartup).
Problem summary
AFM gateway nodes run out of memory during resync. glibc is known to create up to 8 times as many arenas as the system has CPU threads. As a result, a multi-threaded program such as AFM, which allocates memory for queues, can use far more memory than it actually needs.
Problem conclusion
Benefits of the solution: Provided a config option, maxMallocArenas, to limit the number of malloc arenas. For AFM, this option also tries to reclaim unused memory when queue memory usage is over a soft limit.
Work around: None
Problem trigger: AFM resync under heavy workload
Symptom: Crash/Abend
Platforms affected: ALL Linux OS environments
Functional Area affected: AFM
Customer Impact: Critical
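The two behaviors attributed to maxMallocArenas above (capping the arena count and reclaiming freed memory once usage passes a soft limit) have close analogs in the glibc API: `mallopt(M_ARENA_MAX, n)` and `malloc_trim(0)`. The sketch below calls them from Python through `ctypes` purely as an illustration. It is not the Spectrum Scale implementation: the helper names, the soft-limit check, and the example values are all assumptions, and every helper returns `None` when glibc is not available.

```python
import ctypes
import ctypes.util

M_ARENA_MAX = -8  # mallopt parameter number from glibc's <malloc.h>


def load_glibc():
    """Return a handle to glibc, or None when the C library is not glibc."""
    name = ctypes.util.find_library("c")
    if name is None:
        return None
    try:
        libc = ctypes.CDLL(name)
    except OSError:
        return None
    # mallopt and malloc_trim are glibc extensions; other libcs may lack them.
    if not (hasattr(libc, "mallopt") and hasattr(libc, "malloc_trim")):
        return None
    libc.mallopt.argtypes = [ctypes.c_int, ctypes.c_int]
    libc.mallopt.restype = ctypes.c_int
    libc.malloc_trim.argtypes = [ctypes.c_size_t]
    libc.malloc_trim.restype = ctypes.c_int
    return libc


def cap_arenas(libc, max_arenas: int):
    """Ask glibc to limit the number of malloc arenas; True on success."""
    if libc is None:
        return None
    return libc.mallopt(M_ARENA_MAX, max_arenas) == 1


def reclaim_if_over(libc, used_bytes: int, soft_limit_bytes: int):
    """Call malloc_trim(0) only when usage exceeds a (hypothetical) soft limit."""
    if libc is None or used_bytes <= soft_limit_bytes:
        return None
    # malloc_trim returns 1 if memory was released to the system, 0 if not.
    return libc.malloc_trim(0) == 1


if __name__ == "__main__":
    libc = load_glibc()
    # 4 is an arbitrary example value, not a recommended setting.
    print("arena cap applied:", cap_arenas(libc, 4))
    print("trim released memory:", reclaim_if_over(libc, used_bytes=2, soft_limit_bytes=1))
```

The same arena cap can be applied to an unmodified glibc program by setting the `MALLOC_ARENA_MAX` environment variable before it starts. The values that IBM recommends for maxMallocArenas are not given in this APAR and must be obtained from IBM Support.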
Temporary fix
Comments
APAR Information
APAR number
IJ29826
Reported component name
SPEC SCALE DME
Reported component ID
5737F34AP
Reported release
505
Status
CLOSED PER
PE
NoPE
HIPER
NoHIPER
Special Attention
NoSpecatt / Xsystem
Submitted date
2020-12-09
Closed date
2020-12-16
Last modified date
2020-12-16
APAR is sysrouted FROM one or more of the following:
APAR is sysrouted TO one or more of the following:
Fix information
Fixed component name
SPEC SCALE DME
Fixed component ID
5737F34AP
Applicable component levels
Product: IBM Spectrum Scale; Version: 505; Platform: Platform Independent; Business Unit: IBM Infrastructure w/TPS; Line of Business: Storage
Document Information
Modified date:
12 January 2021