APAR status
Closed as program error.
Error description
During AFM recovery, if parallel IO is enabled and a write split message gets an error from the home cluster, the write message is dropped, which causes a data mismatch between cache and home.
The following error can be seen in the mmfs log on the AFM gateway node:
2021-05-18_23:20:22.996+0800: [E] AFM: Write file system gpfsfs fileset myfileset file IDs [185628707.185628707.-1.-1,R] name local error 233
2021-05-18_23:20:22.996+0800: Host is down
Reported In: Spectrum Scale 5.1.1.0
Known Impact: Data mismatch between cache and home
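To check a gateway node's mmfs log for this signature, a small scanner along the following lines may help. This is a minimal sketch: the regular expression is modeled only on the single sample line quoted in this APAR, so other releases or message variants may not match, and the names `AFM_WRITE_ERROR`, `find_afm_write_errors` and `SAMPLE_LOG` are illustrative, not part of Spectrum Scale.

```python
import re

# Pattern derived from the one sample log line in this APAR (an assumption:
# the real message format may vary between releases).
AFM_WRITE_ERROR = re.compile(
    r"^(?P<timestamp>\S+): \[E\] AFM: Write file system (?P<fs>\S+) "
    r"fileset (?P<fileset>\S+) file IDs \[(?P<ids>[^\]]+)\]"
    r".*\berror (?P<errno>\d+)\s*$"
)


def find_afm_write_errors(lines):
    """Return one dict per log line that matches the AFM write error pattern."""
    hits = []
    for line in lines:
        match = AFM_WRITE_ERROR.match(line.strip())
        if match:
            hit = match.groupdict()
            hit["errno"] = int(hit["errno"])
            hits.append(hit)
    return hits


# The two log lines quoted in the error description above.
SAMPLE_LOG = [
    "2021-05-18_23:20:22.996+0800: [E] AFM: Write file system gpfsfs "
    "fileset myfileset file IDs [185628707.185628707.-1.-1,R] "
    "name local error 233",
    "2021-05-18_23:20:22.996+0800: Host is down",
]
```

Calling `find_afm_write_errors(SAMPLE_LOG)` yields a single entry carrying the file system, fileset, file IDs and error number. The same function can be fed an open log file, since it accepts any iterable of lines.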
Local fix
# Disable Parallel IO
mmchfileset <fs> <fsetName> -p afmParallelWriteThreshold=disable
# Trigger another recovery to sync the file again
mmafmctl <fs> stop -j <fsetName>
mmafmctl <fs> start -j <fsetName>
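Where the local fix has to be applied to several filesets, the three commands can be generated from a small helper. The sketch below only assembles the command lines given in this APAR; the function names `local_fix_commands` and `apply_local_fix` and the `dry_run` switch are hypothetical conveniences, and actually running the commands assumes the Spectrum Scale CLI is on the PATH of a suitably privileged user.

```python
import subprocess


def local_fix_commands(fs, fileset):
    """Build the workaround commands from this APAR as argv lists."""
    return [
        # Disable parallel IO on the fileset.
        ["mmchfileset", fs, fileset, "-p", "afmParallelWriteThreshold=disable"],
        # Stop and start the fileset to trigger another recovery.
        ["mmafmctl", fs, "stop", "-j", fileset],
        ["mmafmctl", fs, "start", "-j", fileset],
    ]


def apply_local_fix(fs, fileset, dry_run=True):
    """Print the commands (dry run) or execute them in order."""
    for argv in local_fix_commands(fs, fileset):
        if dry_run:
            print(" ".join(argv))
        else:
            # check=True stops at the first failing command.
            subprocess.run(argv, check=True)
```

For example, `apply_local_fix("gpfsfs", "myfileset")` prints the three commands without executing them, which allows a review before rerunning with `dry_run=False`.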
Problem summary
AFM might incorrectly drop write messages during an AFM recovery, causing a data mismatch between the cache (or primary) and the home (or secondary) cluster. AFM recovery is triggered when the in-memory queue is lost, for example after a gateway node restart. With parallel IO enabled, WriteSplit messages are sent to the worker gateway nodes so that the file is written in parallel. If a WriteSplit message fails on a worker gateway node, the failed WriteSplit request is retried 3 times and then dropped. Because the write request is dropped without the data being replicated to the home or secondary, the result is a data mismatch between the cache or primary and the home or secondary.
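The effect of dropping a request after its retries are exhausted can be illustrated with a toy model. This is not AFM code: only the retry count of 3 comes from the text above, while the function `replay_write_splits`, the `home_is_up` callback, the choice of one initial attempt plus 3 retries, and the representation of a file as offset-to-bytes chunks are all assumptions made for illustration.

```python
MAX_RETRIES = 3  # retry count stated in the problem summary


def replay_write_splits(cache, home_is_up):
    """Toy model of the pre-fix behavior.

    cache:      dict mapping chunk offset -> bytes held in the cache
    home_is_up: callable(offset, attempt) -> bool, True if the send succeeds
    Returns (home, dropped): the chunks that reached home, and the offsets
    whose requests were dropped after all attempts failed.
    """
    home, dropped = {}, []
    for offset, data in sorted(cache.items()):
        # Assumed: one initial attempt followed by MAX_RETRIES retries.
        for attempt in range(1 + MAX_RETRIES):
            if home_is_up(offset, attempt):
                home[offset] = data
                break
        else:
            # Request dropped without replicating the data.
            dropped.append(offset)
    return home, dropped


cache = {0: b"aaaa", 4: b"bbbb", 8: b"cccc"}
# Simulate a network error that affects only the chunk at offset 4.
home, dropped = replay_write_splits(cache, lambda offset, attempt: offset != 4)
```

In this run the chunk at offset 4 never reaches `home`, so `home` and `cache` differ even though the replay itself completed, which mirrors the mismatch described in this APAR.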
Problem conclusion
This problem is fixed in 5.1.1 PTF 2.
To see all Spectrum Scale APARs and their respective fix solutions, refer to the page https://public.dhe.ibm.com/storage/spectrumscale/spectrum_scale_apars.html
Benefits of the solution: Fixed the code so that parallel IO write messages are not dropped during network errors.
Work around: Disable parallel IO using the command "mmchfileset device fileset -p afmParallelWriteThreshold=disable"
Problem trigger: AFM recovery with parallel IO enabled.
Symptom: Unexpected Results/Behavior
Platforms affected: ALL Linux OS environments
Functional Area affected: AFM and AFM DR
Customer Impact: HiPER
Temporary fix
Comments
APAR Information
APAR number
IJ33254
Reported component name
SPEC SCALE STD
Reported component ID
5737F33AP
Reported release
511
Status
CLOSED PER
PE
NoPE
HIPER
NoHIPER
Special Attention
NoSpecatt / Xsystem
Submitted date
2021-06-15
Closed date
2021-07-01
Last modified date
2021-07-01
APAR is sysrouted FROM one or more of the following:
APAR is sysrouted TO one or more of the following:
Fix information
Fixed component name
SPEC SCALE STD
Fixed component ID
5737F33AP
Applicable component levels
R511 PSY
UP
Document Information
Modified date:
16 July 2021