
IBM Spectrum Scale Active File Management (AFM) and AFM Asynchronous Disaster Recovery (DR)

Flashes (Alerts)


Abstract

IBM has identified certain situations with respect to Active File Management (AFM) and AFM Asynchronous Disaster Recovery (DR) in IBM Spectrum Scale that may result in undetected data corruption:

- AFM may intermittently read files from the home cluster incorrectly, which could result in undetected data corruption due to Direct IO usage.
- AFM may have undetected data corruption when eviction and read operations run in parallel on the same file.
- AFM cache may incorrectly read a file from the home cluster due to the incorrect calculation of the file sparseness information, potentially resulting in undetected data corruption.
- If parallel IO is enabled, AFM and AFM Asynchronous DR may experience undetected data corruption with failover, resync and changeSecondary commands.
- AFM Asynchronous DR failback may read HSM migrated files from the acting AFM Primary cluster (originally the AFM Secondary cluster) as sparse files, potentially causing the AFM cache to return incorrect data (all zeros) to an application on a read.

Content

IBM has identified certain situations with respect to Active File Management (AFM) and AFM Asynchronous Disaster Recovery (DR) in IBM Spectrum Scale that may result in undetected data corruption:

1. AFM may intermittently read files from the home cluster incorrectly, which could result in undetected data corruption due to Direct IO usage.


    Problem Summary:
    As a result of Direct IO usage, undetected data corruption may occur while reading a file from the home cluster. Applications may fail after reading the file, or the corruption may go entirely undetected.

    Users affected:
    Users may be affected when using AFM caching (all modes) running IBM Spectrum Scale V4.2.0.0 thru 4.2.0.4.

    Recommendations:
    - Any affected users should apply an efix for APAR IV87388 for their level of code by contacting IBM Service.
    - If you believe that your GPFS file system may be affected by this issue, please contact IBM Service as soon as possible for further guidance and assistance.


2. In the event of manual eviction and read operations running in parallel on the same file, AFM may experience undetected data corruption.

    Problem Summary:
    Undetected data corruption may occur when manual eviction and read operations run in parallel on the same file. If a manual eviction is started on a file while the application or AFM prefetch has already begun reading that file from home, the eviction clears all of the data blocks allocated to the file, and the read process then incorrectly caches the file as a sparse file, so the application sees zeros on a read. This situation does not occur during AFM auto-eviction, because auto-eviction selects files using a least-recently-used (LRU) approach.
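    The race described above can be sketched conceptually as follows. This is not GPFS code; the function and block layout are hypothetical, purely to show how an eviction that clears allocated blocks mid-read makes the reader treat the file as sparse and cache zeros.

```python
def read_with_eviction_race(file_blocks, evict_during_read=True):
    """Simulate a cache read racing with a manual eviction (illustrative only).

    file_blocks: mutable list of data blocks; None models an unallocated block.
    Returns the blocks as cached by the read path.
    """
    cached = []
    for i in range(len(file_blocks)):
        if evict_during_read and i == 1:
            # Manual eviction fires mid-read and clears ALL allocated
            # data blocks of the file, as described in the flash.
            for j in range(len(file_blocks)):
                file_blocks[j] = None
        # The read path treats an unallocated block as a sparse hole and
        # caches zeros instead of re-fetching the block from home.
        block = file_blocks[i]
        cached.append(block if block is not None else b"\x00" * 4)
    return cached
```

    With the eviction interleaved, only blocks read before the eviction survive; the rest are cached as zeros, which is exactly the "application displaying zeros on a read" symptom.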

    Users affected:
    Users may be affected when both of the following conditions are met:
    1. AFM caching running GPFS V3.5.0.11 thru 3.5.0.31, V4.1.0.0 thru 4.1.0.8, or IBM Spectrum Scale V4.1.1.0 thru V4.1.1.8, or V4.2.0.0 thru V4.2.0.4; and
    2. The initiation of a manual eviction command on a file concurrent with the AFM gateway node reading the same file from the home cluster.

    Recommendations:
    - Any user meeting both conditions should apply an efix for their level of code by contacting IBM Service:
    V3.5.0.0 thru 3.5.0.31, apply 3.5.0.32 or contact IBM Service for APAR IV87370
    V4.1.0.0 thru 4.1.0.8, apply APAR IV87371
    V4.1.1.0 thru 4.1.1.8, apply APAR IV87372
    V4.2.0.0 thru 4.2.0.4, apply APAR IV87368
    - If you believe that your GPFS file system may be affected by this issue, please contact IBM Service as soon as possible for further guidance and assistance.

3. AFM cache may incorrectly read a file from the home cluster due to the incorrect calculation of the file sparseness information, potentially resulting in undetected data corruption.

    Problem Summary:
    Before reading a file, AFM queries the file's sparseness information from the home cluster so that it can read exactly the same number of blocks and reproduce the file as a sparse file in the cache. If the file's metadata has not yet been committed to disk at the home cluster, recent disk address changes may not be reflected in the indirect block when the cache queries the sparseness information, so incorrect sparseness information may be returned. If the file is larger than afmReadSparseThreshold (default 128MB), the incorrect sparseness information causes the AFM cache to read the file as a sparse file even though the file is not sparse at the home cluster. This situation may occur when the cache starts reading the file immediately after the file was written at home and home is running GPFS with AFM enabled.
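    The role of the threshold can be sketched as follows. This is a simplified, hypothetical model (the real AFM read path differs); it only illustrates why files above afmReadSparseThreshold are exposed to stale sparseness information while smaller files are not.

```python
AFM_READ_SPARSE_THRESHOLD = 128 * 1024 * 1024  # default 128MB, per this flash

def plan_cache_read(file_size, home_sparse_map):
    """Decide which (offset, length) extents to fetch from home (illustrative).

    home_sparse_map: data extents reported by the home cluster. If home's
    metadata was not yet committed, this map may be empty or incomplete
    even for a fully written file.
    """
    if file_size <= AFM_READ_SPARSE_THRESHOLD:
        # Below the threshold the whole file is read, so stale
        # sparseness information cannot cause missing data.
        return [(0, file_size)]
    # Above the threshold, only the reported data extents are fetched;
    # everything else becomes holes (zeros) in the cached copy.
    return list(home_sparse_map)
```

    With an empty (stale) sparse map and a file above the threshold, no extents are fetched and the cached file is entirely holes, matching the corruption described above.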

    Users affected:
    Users may be affected when all of the following conditions are met:
    1. IBM Spectrum Scale (GPFS) V4.1.0.0 thru 4.1.0.8, V4.1.1.0 thru 4.1.1.8, or V4.2.0.0 thru 4.2.0.4 is running;
    2. The home cluster is running GPFS and home is enabled for AFM (the mmafmconfig command was executed); and
    3. File size exceeds afmReadSparseThreshold (default 128MB).

    Recommendations:
    - Any customer meeting these conditions should apply an efix for their level of code by contacting IBM Service:
    V4.1.0.0 thru 4.1.0.8, apply APAR IV87384
    V4.1.1.0 thru 4.1.1.8, apply APAR IV87385
    V4.2.0.0 thru 4.2.0.4, apply APAR IV87383
    - If you believe that your GPFS file system may be affected by this issue, please contact IBM Service as soon as possible for further guidance and assistance.

4. When parallel IO is enabled, the use of the failover or resync commands (AFM caching modes) or the changeSecondary command (AFM DR mode) may result in undetected data corruption.

    Problem Summary:
    When parallel IO is enabled, the master gateway node splits the write request into multiple chunks and assigns the work of writing the file to multiple gateway nodes. After each write, the modification time (mtime) of the home file is updated with the cache file's modification time. AFM uses the file modification time to verify whether the file has changed between cache and home. Because the modification times then match between cache and home (or between primary and secondary) when the next chunk is processed, that write chunk is incorrectly dropped. This situation affects only the failover, resync, and changeSecondary paths with parallel IO enabled.
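    The flawed check can be sketched as follows. The function and its arguments are hypothetical, for illustration only; the sketch models how copying the mtime to home after the first chunk makes every later chunk look like a no-op.

```python
def flush_chunks(chunks, cache_mtime, home_mtime):
    """Model the flawed parallel-IO chunk flush (illustrative only).

    Each chunk should be written, but after a write the home mtime is
    updated to the cache mtime, so subsequent chunks see matching mtimes
    ("file unchanged") and are dropped.
    """
    written = []
    for chunk in chunks:
        if home_mtime == cache_mtime:
            continue  # mtimes match: file looks unchanged, chunk dropped
        written.append(chunk)
        home_mtime = cache_mtime  # home mtime updated after each write
    return written
```

    For a file split into several chunks (i.e. larger than afmParallelWriteThreshold), only the first chunk survives; the remainder of the file is silently never transferred.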

    Users affected:
    Users may be affected when all of the following conditions are met:
    1. IBM Spectrum Scale V4.1.1.4 thru 4.1.1.7 or V4.2.0.0 thru 4.2.0.3 is running;
    2. Failover or resync commands (AFM caching modes) or the changeSecondary command (AFM DR mode) with parallel IO enabled;
    3. File size exceeds afmParallelWriteThreshold (default 1GB).

    Recommendations:
    - Any customer meeting all of the conditions should apply an efix for their level of code by contacting IBM Service:
    V4.1.1.4 thru 4.1.1.7, apply 4.1.1.8 or contact IBM Service for APAR IV85385
    V4.2.0.0 thru 4.2.0.3, apply 4.2.0.4 or contact IBM Service for APAR IV86161
    - If you believe that your GPFS file system may be affected by this issue, please contact IBM Service as soon as possible for further guidance and assistance.

5. AFM Asynchronous DR failback may read HSM migrated files from the acting AFM Primary cluster (originally the AFM Secondary cluster) as sparse files. This situation may result in the AFM cache returning incorrect data (all zeros) to an application on a read.


    Problem Summary:
    As a result of a missing read on HSM migrated files at the acting Primary cluster, upon failback the original AFM DR Primary cluster may incorrectly read the migrated file from the acting Primary cluster as a fully sparse file.

    Users affected:
    Users may be affected when both of these conditions are met:
    1. IBM Spectrum Scale v4.2.0.0 thru 4.2.0.4 is running at the AFM Primary cluster; and
    2. HSM migration is enabled on the AFM Secondary cluster.

    Recommendations:
    - Any customer planning to use AFM DR with the affected IBM Spectrum Scale V4.2 code levels (V4.2.0.0 thru 4.2.0.4) should refrain from enabling HSM migration on the Secondary side of an AFM DR fileset relationship until an efix for APAR IV87373 for their level of code has been applied; contact IBM Service to obtain the efix.
    - If you believe that your GPFS file system may be affected by this issue, please contact IBM Service as soon as possible for further guidance and assistance.


Document Information

Modified date:
25 September 2022

UID

isg3T1024249