Volatile write cache detection

IBM Storage Scale Erasure Code Edition now has the ability to test if volatile write caching mode is enabled on the physical disks.

Many SCSI and NVMe drives support a volatile write caching mode in which a drive reports success back from write operations as soon as data has been received into the drive's internal cache memory. IBM Storage Scale Erasure Code Edition cannot be used with drives operating in this mode because on power failure, the cached data is lost, causing already committed data to revert to an older version. This can lead to corruption of both the RAID and file system metadata, resulting in data integrity issues.

If IBM Storage Scale Erasure Code Edition detects a drive with volatile write caching mode enabled, it puts the pdisk into the volatile write cache enabled (VWCE) state and drains all data from the drive. If it detects many drives with volatile write caching enabled, it stops service of the recovery group and waits for volatile write caching mode to be disabled on the drives.

The volatile write cache detection feature is enabled for all new IBM Storage Scale Erasure Code Edition installations starting from version 5.0.4. On installations upgraded from earlier versions, the feature is disabled by default and must be enabled manually.

ECE introduces asynchronous detection of volatile write cache in version 5.1.6. In earlier releases, ECE checked the volatile write cache mode only when it rediscovered disk state, for example during recovery group (RG) initialization or RG master failover, or when the command to refresh pdisk information was run.

Check IBM Storage Scale Erasure Code Edition cluster configuration for VWCE

IBM Storage Scale Erasure Code Edition supports volatile write cache detection from version 5.0.4. Clusters upgraded from earlier versions must enable it manually.

Use the following commands to check the current IBM Storage Scale Erasure Code Edition configuration for VWCE detection:

```
# mmdiag --config | grep nsdRAIDDiskCheckVWCE
```
If nsdRAIDDiskCheckVWCE is 1, detection is enabled. If nsdRAIDDiskCheckVWCE is 0, detection is disabled. Before you enable detection, check the volatile write cache state of all physical disks.
After confirming that volatile write cache is disabled on all disks, use this command to enable detection:
```
# mmchconfig nsdRAIDDiskCheckVWCE=yes -i
```
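
To confirm the setting from a script, before or after changing it, the mmdiag output can be interpreted along the following lines. This is a minimal sketch, not part of the product: the function name vwce_detection_status is hypothetical, and it assumes that mmdiag --config prints each setting as a name followed by its value, optionally preceded by a "!" marker.

```shell
# Sketch: report whether VWCE detection is on, reading `mmdiag --config`
# output from stdin.
# Assumption: each setting appears as "name value", optionally preceded
# by a "!" marker; the function name is hypothetical.
vwce_detection_status() {
  awk '
    {
      # Look for the parameter name in any column and take the next field.
      for (i = 1; i < NF; i++) {
        if ($i == "nsdRAIDDiskCheckVWCE") { value = $(i + 1); found = 1 }
      }
    }
    END {
      if (!found)            print "unknown"
      else if (value == "1") print "enabled"
      else if (value == "0") print "disabled"
      else                   print "unexpected:" value
    }'
}

# Usage on a cluster node:
#   mmdiag --config | vwce_detection_status
```

The "unknown" result covers the case where the parameter does not appear in the output at all.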

Rediscover the disk state with this command:

```
# mmvdisk rg change --recovery-group rg_name --refresh-pdisk-info
```

Creation of recovery group fails if volatile write cache mode is enabled on disk

Before you install IBM Storage Scale Erasure Code Edition and create a recovery group, run the ece_os_readiness tool. It checks the disks for volatile write cache and issues warning messages. If disks have volatile write cache mode enabled, creation of the recovery group fails, and error messages are written to the /var/adm/ras/mmfs.log.latest file.

Asynchronous detection of volatile write cache

IBM Storage Scale Erasure Code Edition introduces asynchronous detection of volatile write cache mode in version 5.1.6. The detection interval is controlled by the nsdRAIDDiskSMARTUpdateInterval configuration parameter, which defaults to 24 hours. While a recovery group is running, ECE rediscovers the volatile write cache mode of its disks once per interval (every 24 hours by default). ECE also rediscovers the volatile write cache mode of a pdisk when the pdisk state changes from missing to ok. If volatile write cache is found to be enabled, the pdisk is put into the VWCE state and its data is drained.
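
Because asynchronous detection can move a pdisk into the VWCE state at any time, it can be useful to scan pdisk listings for that state. The following is a minimal sketch under stated assumptions: the function name list_vwce_pdisks is hypothetical, and it assumes that the listing has one pdisk per line and that the state column contains the token VWCE for affected pdisks.

```shell
# Sketch: print only the pdisk lines whose state mentions VWCE.
# Assumption: input has one pdisk per line and the state text contains
# the word "VWCE"; the function name is hypothetical.
list_vwce_pdisks() {
  # -w matches VWCE as a whole word; "|| true" keeps the exit status 0
  # when no pdisk is in the VWCE state.
  grep -w 'VWCE' || true
}

# Usage on a cluster node (assumes VWCE pdisks are reported as not ok):
#   mmvdisk pdisk list --recovery-group rg_name --not-ok | list_vwce_pdisks
```

An empty result means that no listed pdisk reported the VWCE state at the time of the query.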

Failure of disk replacement

When you replace a failed disk, check the new physical disk and disable volatile write cache mode on it. If volatile write cache mode is not disabled, the replace command fails.

Scale out IBM Storage Scale Erasure Code Edition by adding new node

Run the ece_os_readiness tool on the new node first, and disable volatile write cache mode on each disk if needed, before you add the node to an IBM Storage Scale Erasure Code Edition recovery group.

What to do if volatile write cache is detected

For instructions on how to disable volatile write caching on SCSI and NVMe disks, see Hardware checklist.
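
As an illustration of checking a SCSI drive before following those instructions, the Write Cache Enable (WCE) bit can be read with the sdparm utility. This is a minimal sketch, not the documented procedure: the function name scsi_wce_state is hypothetical, /dev/sdX is a placeholder, and the parsing assumes that sdparm prints a line that starts with WCE followed by the current value. The Hardware checklist remains the reference for disabling the cache and for NVMe drives.

```shell
# Sketch: interpret the output of `sdparm --get=WCE <device>` read from stdin.
# Assumption: the output contains a line such as "WCE  1  [cha: y, ...]";
# the function name is hypothetical.
scsi_wce_state() {
  awk '
    $1 == "WCE" {
      print (($2 == "1") ? "enabled" : "disabled")
      found = 1
    }
    END { if (!found) print "unknown" }'
}

# Usage (replace /dev/sdX with the drive to check):
#   sdparm --get=WCE /dev/sdX | scsi_wce_state
```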