Planning for the Highly Available Write Cache feature (HAWC)

Learn about the Highly Available Write Cache feature (HAWC).

HAWC operation

The highly available write cache is a disk-caching component that consists of caching software and nonvolatile storage. HAWC also uses the file system recovery logs, in which the file system records metadata about its pending write operations. For HAWC purposes, the recovery logs must be located in nonvolatile storage.

When a file write operation arrives at a node, the first part of the processing is the same whether HAWC is active or not. The write data and metadata are copied into an entry in the page pool and the entry is added to a list of similar entries that are waiting for processing. When the entry is processed, the processing depends on whether HAWC is active.
Note: If the write operation is nonsynchronous, it returns to its caller after its write data and metadata are copied into the page pool entry. If the write operation is synchronous, it waits for a notification that the file data has been written to disk.

When HAWC is not active, the write data is copied from the page pool entry and written to the file on hard disk. If the write operation is synchronous, the system notifies the write operation that the write succeeded, and the write operation returns to its caller.

When HAWC is active, the write data can take either of two paths:
  • If the write operation is synchronous and the size of the file data is less than or equal to the write cache threshold, HAWC copies the file data from the page pool entry into the recovery log, along with any I/O metadata that is required for recovery. The write cache threshold variable is set by the mmcrfs command or the mmchfs command. Next, HAWC notifies the original write operation that the file data is successfully written to hard disk. In fact, the file data is not yet written to hard disk, but it is preserved in the recovery log as a backup. HAWC then starts a write-behind thread that eventually writes the file data to the hard disk. When the data is safely written, HAWC purges the file data and I/O metadata from the recovery log, because they are no longer needed.
  • If the write operation is not synchronous or if the size of the write data is greater than the write cache threshold, then the write data follows the same path that is followed when HAWC is not active. The system copies the write data from the page pool entry and writes it to hard disk. If the original write operation is synchronous, the system notifies it that the file data is safely written to the hard disk.

HAWC improves the performance of small synchronous write operations in two ways. First, it allows synchronous write operations to return to the calling application as soon as the write data is written into the recovery log. The calling application does not have to wait for the much lengthier process of writing the data to hard disk. Second, the HAWC caching software can consolidate small sequential writes into one larger write. This consolidation eliminates all but one of the disk seeks that would be required if the data were written as multiple separate writes.
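
As a hypothetical illustration of the kind of workload that benefits most, the following command issues many small sequential synchronous writes; the path /gpfs/fs1/testfile is an assumed mount point and file name, not a value from this document:

  # 1,000 sequential 4 KiB writes, each forced to stable storage (oflag=dsync).
  # With HAWC active and a write cache threshold of at least 4 K, each write
  # can be acknowledged as soon as it is hardened in the recovery log, and the
  # small writes can be consolidated into larger disk writes later.
  dd if=/dev/zero of=/gpfs/fs1/testfile bs=4k count=1000 oflag=dsync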

The write cache threshold variable can be adjusted by specifying a value for the --write-cache-threshold parameter of the mmchfs command. The valid range is 0 - 64 K in multiples of 4 K. You can also set this variable when you create the file system by specifying the same parameter in the mmcrfs command. Setting the write cache threshold to zero disables HAWC. You can update the write cache threshold variable at any time; the file system does not have to be mounted on the node.
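
For example, the following commands set the write cache threshold of a file system to 32 K and then set it back to zero to disable HAWC; the file system name fs1 is an assumption for illustration:

  mmchfs fs1 --write-cache-threshold 32K
  mmchfs fs1 --write-cache-threshold 0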

HAWC storage scenarios

You can set up the HAWC storage in either of two configurations or scenarios. In the first scenario, the nonvolatile storage is located in a centralized fast storage device, such as a controller with SSDs:
Figure 1. Shared fast storage

In this scenario, when a synchronous write operation arrives at a node, the file data and metadata are copied to a page pool entry in the usual way. If the size of the file data is less than the write cache threshold, HAWC copies the file data into the recovery log along with any I/O metadata that is required for recovery. Next, HAWC returns an acknowledgment to the write operation that indicates that the file data is successfully written to hard disk. HAWC then starts a write-behind thread that eventually writes the file data to the hard disk. When the data is safely written, HAWC purges the file data and I/O metadata for the write operation from the recovery log.

In the second scenario, the nonvolatile storage consists of multiple storage devices that are distributed across some or all of the nodes in the cluster:
Figure 2. Distributed fast storage

Although the hardware configuration is different in the second scenario, the data flow is similar to that of the first scenario. The synchronous write operation arrives at a node and the write data and I/O metadata are written into a page pool entry. If the size of the write data is smaller than the write cache threshold, HAWC copies the file data and relevant I/O metadata to the recovery log. The data is striped over the various disks that belong to the recovery log storage pool. HAWC returns a successful acknowledgment to the synchronous write operation and starts a write-behind thread that later writes the file data from the page pool entry to a hard disk. When the data is safely written, HAWC purges the file data and I/O metadata from the recovery log.

HAWC software configuration

After you install the nonvolatile storage devices, complete the following steps to configure IBM Storage Scale for HAWC. These steps assume that you are adding HAWC storage to an existing file system; a consolidated command sketch follows the steps:
  1. Stop the GPFS daemon on all the nodes of the cluster.
  2. Create NSD stanzas for the nonvolatile storage devices. In the stanzas, assign all the nonvolatile storage devices to one storage pool, which must be either the system pool or the system.log pool.
  3. Run mmcrnsd to create the NSDs.
  4. Run mmadddisk to add the NSDs to the file system and to create the system.log pool if necessary.
  5. Start the GPFS daemons on all nodes.
  6. Optionally, run the mmchfs command with the -L parameter to set the size of the recovery logs to a non-default value.
  7. Optionally, run the mmchfs command with the --log-replicas parameter to set the number of replicas of the recovery log to a non-default value. This option is applicable only if the recovery logs are stored in the system.log pool.
  8. To activate HAWC, run the mmchfs command with the --write-cache-threshold parameter set to a nonzero value.
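
As a consolidated illustration of steps 1 through 8, the following sketch shows one possible command sequence; the file system name fs1, the stanza file path /tmp/hawc.stanza, the device, NSD, and server names, and the chosen sizes are assumptions for illustration, not values from this document:

  # Step 1: stop the GPFS daemon on all nodes.
  mmshutdown -a

  # Step 2: create an NSD stanza file (for example, /tmp/hawc.stanza) that
  # assigns all the nonvolatile devices to one pool, here system.log:
  #   %nsd: device=/dev/nvme0n1 nsd=hawcnsd1 servers=node1 pool=system.log
  #   %nsd: device=/dev/nvme1n1 nsd=hawcnsd2 servers=node2 pool=system.log

  # Steps 3 and 4: create the NSDs, then add them to the file system.
  mmcrnsd -F /tmp/hawc.stanza
  mmadddisk fs1 -F /tmp/hawc.stanza

  # Step 5: start the GPFS daemon on all nodes.
  mmstartup -a

  # Steps 6 and 7 (optional): set the recovery log size and the number of
  # log replicas to non-default values.
  mmchfs fs1 -L 128M
  mmchfs fs1 --log-replicas 2

  # Step 8: activate HAWC by setting a nonzero write cache threshold.
  mmchfs fs1 --write-cache-threshold 32K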

HAWC is now active.
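
To confirm the configuration, you can list the file system attributes; fs1 is again an assumed file system name:

  # mmlsfs lists the file system attributes, including the current
  # write cache threshold.
  mmlsfs fs1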