Flashes (Alerts)
Abstract
IBM has identified certain issues affecting Active File Management (AFM) and AFM Asynchronous Disaster Recovery (ADR) in IBM Spectrum Scale, which might result in undetected data corruption.
Content
AFM might incorrectly drop write messages during an AFM recovery, causing a data mismatch between the cache or primary cluster and the home or secondary cluster. AFM recovery is triggered when the in-memory queue is lost, for example during a gateway node restart. With parallel IO enabled, WriteSplit messages are sent to the worker gateway nodes so that the file is written in parallel. If a WriteSplit message fails on a worker gateway node, the failed request is retried 3 times and is then dropped. Because the write request is dropped without replicating the data to the home or secondary, the data at the cache or primary no longer matches the data at the home or secondary. Parallel IO is enabled for NFS by creating an export map, and it is enabled by default for the NSD protocol (remote cluster mount). If the file size is more than afmParallelWriteThreshold (default 1GB), the file is split into smaller chunks of size afmParallelWriteChunkSize (default 128MB) and is written from the multiple gateway nodes that are part of the export map.
Parallel IO documentation link:
https://www.ibm.com/docs/en/spectrum-scale/5.1.1?topic=features-parallel-data-transfers
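The threshold and chunk-size arithmetic described above can be illustrated with a short sketch. It is not AFM code: the function name split_into_chunks is hypothetical, the defaults are assumed to be binary units (1 GiB and 128 MiB), and the way AFM actually assigns chunks to gateway nodes is internal to the product and is not modeled here.

```python
MIB = 1024 * 1024
GIB = 1024 * MIB


def split_into_chunks(file_size, threshold=1 * GIB, chunk_size=128 * MIB):
    """Return a list of (offset, length) pairs for one file.

    Illustrative only: a file is split when its size is MORE than the
    threshold (the role of afmParallelWriteThreshold in the text), and each
    piece is at most chunk_size bytes (the role of afmParallelWriteChunkSize).
    """
    if file_size <= threshold:
        # At or below the threshold: written as a single request.
        return [(0, file_size)]
    chunks = []
    offset = 0
    while offset < file_size:
        length = min(chunk_size, file_size - offset)
        chunks.append((offset, length))
        offset += length
    return chunks


# A 2 GiB file exceeds the assumed 1 GiB threshold and yields 16 chunks.
chunks_2g = split_into_chunks(2 * GIB)
print(len(chunks_2g))
```

Each chunk corresponds to one WriteSplit request in the description above, so a single dropped request leaves up to one chunk of the file unreplicated at the home or secondary.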
Users affected:
Users running AFM and AFM DR on IBM Spectrum Scale V5.0.0.0 through V5.1.1.0 are potentially affected.
Problem Determination:
At the gateway node, Write operation errors are reported in the /var/adm/ras/mmfs.log file in the following manner:
2021-05-18_22:30:10.225+0800: [I] AFM: Detecting operations to be recovered...
2021-05-18_22:30:10.229+0800: [I] AFM: Found 1 update operations...
2021-05-18_22:30:10.235+0800: [I] AFM: Starting 'queue' operation for fileset 'fileset1' in filesystem 'gpfs1'.
2021-05-18_22:30:10.235+0800: [I] Command: tspcache gpfs1 1 fileset1 0 3 578832184 70 0 35 0
2021-05-18_22:30:10.287+0800: [I] Command: successful tspcache gpfs1 1 fileset1 0 3 578832184 70 0 35 0
2021-05-18_22:30:10.293+0800: [I] AFM: Finished queuing recovery operations for /gpfs/gpfs1
2021-05-18_23:20:22.996+0800: [E] AFM: Write file system gpfs1 fileset fileset1 file IDs [185628707.185628707.-1.-1,R] name local error 233
2021-05-18_23:20:22.996+0800: Host is down
2021-05-14_15:37:24.370+0800: [E] AFM: Write file system gpfs1 fileset fileset1 file IDs [185628707.185628707.-1.-1,R] name local error 233
2021-05-15_08:03:15.908+0800: [E] AFM: Write file system gpfs1 fileset fileset1 file IDs [181824526.181824526.-1.-1,R] name local error 233
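A log in this form can be checked for the symptom with a small script. The following is a minimal sketch, assuming the error lines have exactly the layout shown in the excerpt above; the function name affected_filesets and the regular expression are illustrative and are not part of any IBM tool.

```python
import re
from collections import Counter

# Matches AFM write errors of the form shown in the excerpt, for example:
#   ...: [E] AFM: Write file system gpfs1 fileset fileset1 file IDs [...] name local error 233
PATTERN = re.compile(
    r"\[E\] AFM: Write file system (?P<fs>\S+) fileset (?P<fileset>\S+) "
    r".*local error 233"
)


def affected_filesets(lines):
    """Count 'local error 233' write errors per (file system, fileset)."""
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            counts[(match.group("fs"), match.group("fileset"))] += 1
    return counts


# Sample lines copied from the log excerpt above.
SAMPLE = """\
2021-05-18_22:30:10.293+0800: [I] AFM: Finished queuing recovery operations for /gpfs/gpfs1
2021-05-18_23:20:22.996+0800: [E] AFM: Write file system gpfs1 fileset fileset1 file IDs [185628707.185628707.-1.-1,R] name local error 233
2021-05-18_23:20:22.996+0800: Host is down
2021-05-14_15:37:24.370+0800: [E] AFM: Write file system gpfs1 fileset fileset1 file IDs [185628707.185628707.-1.-1,R] name local error 233
2021-05-15_08:03:15.908+0800: [E] AFM: Write file system gpfs1 fileset fileset1 file IDs [181824526.181824526.-1.-1,R] name local error 233
"""

result = affected_filesets(SAMPLE.splitlines())
for (fs, fileset), count in sorted(result.items()):
    print(fs, fileset, count)

# On a gateway node the same function could be applied to the log file:
#   with open("/var/adm/ras/mmfs.log", errors="replace") as log:
#       print(affected_filesets(log))
```

Grouping by file system and fileset gives the list of filesets for which the workaround in the Recommendations section would need to be run.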
Recommendations:
Any customer seeing the "local error 233" message in the /var/adm/ras/mmfs.log file needs to apply the fix on the AFM gateway nodes at the cache or primary cluster, by requesting a fix for APAR IJ33428 for IBM Spectrum Scale V5.0.x or APAR IJ33424 for IBM Spectrum Scale V5.1.x.
An update will be issued when the IBM Spectrum Scale fix versions are generally released.
If you believe that your GPFS file system might be affected by this issue, contact IBM Service as soon as possible for further guidance and assistance.
Until a fix (in PTF or efix form) is applied, when the described symptom (shown in the Problem Determination section) is observed, the user should:
a) For single writer mode fileset, perform resync:
mmafmctl device resync -j fileset
b) For AFM DR primary mode fileset, perform changeSecondary:
mmafmctl device changeSecondary -j fileset --new-target existingAfmTarget --inband
c) For independent writer mode fileset, perform failover:
mmafmctl device failover -j fileset --new-target existingAfmTarget
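The choice among the three commands above can be summarized in a short sketch. It only builds the argument list and does not run anything. The function name workaround_command and the mode labels "single-writer", "primary" and "independent-writer" are assumptions made for this illustration; the subcommands and options are the ones listed in steps a) to c).

```python
def workaround_command(mode, device, fileset, afm_target=None):
    """Build the mmafmctl argument list for the given fileset mode.

    The mode strings are illustrative labels for the three cases in the
    text; afm_target stands for the existing AFM target of the fileset.
    """
    if mode == "single-writer":
        # a) single writer: resync
        return ["mmafmctl", device, "resync", "-j", fileset]
    if mode in ("primary", "independent-writer"):
        if not afm_target:
            raise ValueError("afm_target is required for mode " + mode)
        if mode == "primary":
            # b) AFM DR primary: changeSecondary with --inband
            return ["mmafmctl", device, "changeSecondary", "-j", fileset,
                    "--new-target", afm_target, "--inband"]
        # c) independent writer: failover
        return ["mmafmctl", device, "failover", "-j", fileset,
                "--new-target", afm_target]
    raise ValueError("no workaround is listed for mode " + mode)


# Example with placeholder names: prints the command line for review.
print(" ".join(workaround_command("single-writer", "gpfs1", "fileset1")))
```

Returning the arguments instead of executing them leaves the decision to run the command with the administrator, which fits the guidance to contact IBM Service when in doubt.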
Document Information
Modified date:
19 July 2021
UID
ibm16471519