File Placement Optimizer

A cluster in which each disk planned for IBM Spectrum Scale can be accessed from only one server (that is, no disk can be accessed by two or more servers) is called a shared nothing cluster.

A shared nothing cluster has two typical configurations: replica-based IBM Spectrum Scale and IBM Spectrum Scale FPO.

If you do not run any workloads that benefit from data locality (for example, SAP HANA + Spectrum Scale on X86_64 machines, Hadoop, Spark, IBM® DB2® DPF, or IBM DashDB), do not configure the shared nothing cluster as Spectrum Scale FPO; configure replica-based IBM Spectrum Scale instead. Otherwise, configure the cluster as IBM Spectrum Scale FPO (File Placement Optimizer). With Spectrum Scale FPO, you can control the replica locations in the file system.

When you create a storage pool in a shared nothing cluster, configuring allowWriteAffinity=yes for the storage pool enables data locality for the data stored in that pool; this is called FPO mode. Configuring allowWriteAffinity=no for the storage pool is called replica-based shared nothing mode. After the file system is created, the storage pool property allowWriteAffinity cannot be modified.

In this chapter, the data locality related concepts (for example, allowWriteAffinity, chunks, extended failure groups, write affinity failure group, and write affinity depth) apply only to IBM Spectrum Scale FPO mode. The other concepts in this chapter apply to replica-based shared nothing clusters as well.

Note: This feature is available with IBM Spectrum Scale Standard Edition or higher.
FPO uses the following entities and policies:
Chunks
A chunk is a logical grouping of blocks that allows the grouping to behave like one large block, which is useful for applications that need high sequential bandwidth. Chunks are specified by a block group factor that dictates how many file system blocks are laid out sequentially on disk to behave like a large block. A chunk size can be defined per file through the block group factor, or defined globally as the default for a storage pool.

On the file level, the block group factor can be specified by the --block-group-factor argument of the mmchattr command. You can also specify the block group factor by the setBGF argument of the mmchpolicy and mmapplypolicy commands. The range of the block group factor is 1 - 1024. The default value is 1. You can also specify the block group factor through the blockGroupFactor argument in a storage pool stanza (as input to the mmadddisk or mmcrfs command).

The effective chunk size is the product of the block group factor and the GPFS block size. For example, a block size of 1 MB and a block group factor of 128 yield an effective large block size of 128 MB.
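For example, the following command is a sketch that assumes a hypothetical file, /gpfs/fpofs/data/file1, in a file system created with a 1 MB block size; it sets the block group factor of the existing file so that its data is laid out in 128 MB chunks:

# hypothetical file path; with a 1 MB block size, a block group factor of 128 gives 128 MB chunks
mmchattr --block-group-factor 128 /gpfs/fpofs/data/file1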

See the following command descriptions in the IBM Spectrum Scale: Command and Programming Reference:
  • mmadddisk
  • mmchattr
  • mmcrfs
  • mmchpolicy
  • mmapplypolicy
Extended failure groups
A failure group is defined as a set of disks that share a common point of failure that might cause them all to become simultaneously unavailable. Traditionally, GPFS failure groups are identified by simple integers. In an FPO-enabled environment, a failure group might be specified as not just a single number, but as a vector of up to three comma-separated numbers. This vector conveys topology information that GPFS exploits when making data placement decisions.

In general, a topology vector is a way for the user to specify which disks are closer together and which are farther away. In practice, the three elements of the failure group topology vector might represent the rack number of a disk, a position within the rack, and a node number. For example, the topology vector 2,1,0 identifies rack 2, bottom half, first node.

Also, the first two elements of the failure group represent the failure group ID and the three elements together represent the locality group ID. For example, 2,1 is the failure group ID and 2,1,0 is the locality group ID for the topology vector 2,1,0.

GPFS makes data block placement decisions, that is, which disks to select for each data replica, based on the failure group. When considering two disks for striping or replica placement purposes, it is important to understand the following:
  • Disks that differ in the first of the three numbers are farthest apart (as they are in different racks).
  • Disks that have the same first number but differ in the second number are closer (as they are in the same rack, but in different halves).
  • Disks that differ only in the third number reside in different nodes in the same half of the same rack.
  • Only disks that have all three numbers in common reside in the same node.

The data block placement decisions are also affected by the level of replication and the value of the writeAffinityDepth parameter. For example, when using replication 3, GPFS might place two replicas far apart (different racks) to minimize chances of losing both. However, the third replica can be placed close to one of the others (same rack, but different half), to reduce network traffic between racks when writing the three replicas.

To specify the topology vector that identifies a failure group, you use the failureGroup=FailureGroup attribute in an NSD stanza (as input to the mmadddisk or mmcrfs command).
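The following NSD stanza is a sketch only; the NSD name, device, server name, and pool name are hypothetical. The topology vector 2,1,0 places the disk in rack 2, bottom half, first node, matching the example above:

# Sketch only: NSD name, device, server, and pool name are hypothetical
%nsd:
  nsd=node1_sdb
  device=/dev/sdb
  servers=c8f2n01
  usage=dataOnly
  failureGroup=2,1,0
  pool=fpodata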

See the following command descriptions in the IBM Spectrum Scale: Command and Programming Reference:
  • mmadddisk
  • mmcrfs
Write affinity depth
Write affinity depth is a policy that allows the application to determine the layout of a file in the cluster to optimize for typical access patterns. The write affinity is specified by a depth that indicates the number of localized copies (as opposed to wide-striped copies). It can be specified at the storage pool or file level. When write affinity depth is enabled, the first replica is written on the node that triggers the write, and the second and third replicas (if any) are written to disks on other nodes.

To specify write affinity depth, you use the writeAffinityDepth attribute in a storage pool stanza (as input to the mmadddisk or mmcrfs command) or the --write-affinity-depth argument of the mmchattr command. You can use the --block-group-factor argument of the mmchpool command to change a storage pool's block group factor, and the --write-affinity-depth argument of mmchpool to change a storage pool's write affinity depth. You can also specify the write affinity depth for a file by the setWAD argument of the mmchpolicy and mmapplypolicy commands.
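For example, the following commands are a sketch that assumes a hypothetical file system fpofs, a hypothetical storage pool fpodata, and a hypothetical file path; the first command changes the write affinity depth for the pool, and the second overrides it for a single file:

mmchpool fpofs fpodata --write-affinity-depth 1            # pool-level setting (hypothetical device and pool names)
mmchattr --write-affinity-depth 1 /gpfs/fpofs/data/file1   # file-level setting (hypothetical file path)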

A write affinity depth of 0 indicates that each replica is to be striped across the disks in a cyclical fashion with the restriction that no two disks are in the same failure group. By default, the unit of striping is a block; however, if the block group factor is specified in order to exploit chunks, the unit of striping is a chunk.

A write affinity depth of 1 indicates that the first copy is written to the writer node. The second copy is written to a different rack. The third copy is written to the same rack as the second copy, but on a different half (which can be composed of several nodes).

A write affinity depth of 2 indicates that the first copy is written to the writer node. The second copy is written to the same rack as the first copy, but on a different half (which can be composed of several nodes). The target node is determined by a hash value on the fileset ID of the file, or it is chosen randomly if the file does not belong to any fileset. The third copy is striped across the disks in a cyclical fashion with the restriction that no two disks are in the same failure group. To get evenly allocated space across all disks when using a write affinity depth of 2, the following conditions must be met:
  1. The configuration in disk number, disk size, and node number for each rack must be similar.
  2. The number of nodes must be the same in the bottom half and the top half of each rack.

This behavior can be altered on an individual file basis by using the --write-affinity-failure-group option of the mmchattr command.

Note: At the fileset level, a write affinity depth of 2 is designed to assign (write) all the files in a fileset to the same second-replica node. However, this behavior depends on node status in the cluster. After a node is added to or deleted from a cluster, a different node might be selected as the second replica for files in a fileset.
See the description of storage pool stanzas that follows. Also, see the following command descriptions in the IBM Spectrum Scale: Command and Programming Reference:
  • mmadddisk
  • mmchattr
  • mmcrfs
  • mmchpolicy
  • mmapplypolicy
  • mmchpool
Write affinity failure group
Write affinity failure group is a policy that indicates the range of nodes (in a shared nothing architecture) where replicas of blocks in a particular file are to be written. The policy allows the application to determine the layout of a file in the cluster to optimize for typical access patterns.

You specify the write affinity failure group through the --write-affinity-failure-group WafgValueString argument of the mmchattr command. You can also specify the write affinity failure group through the setWADFG attribute of the mmchpolicy and mmapplypolicy commands. Failure group topology vector ranges specify the nodes, and the specification is repeated for each replica of the blocks in a file.

For example, the attribute 1,1,1:2;2,1,1:2;2,0,3:4 indicates:
  • The first replica is on rack 1, rack location 1, nodes 1 or 2.
  • The second replica is on rack 2, rack location 1, nodes 1 or 2.
  • The third replica is on rack 2, rack location 0, nodes 3 or 4.
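As a sketch, the following command applies this write affinity failure group to an existing file; the file path is hypothetical, and the vector is quoted because it contains semicolons:

mmchattr --write-affinity-failure-group "1,1,1:2;2,1,1:2;2,0,3:4" /gpfs/fpofs/data/file1   # hypothetical file path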
The default policy is a null specification, which indicates that each replica must follow the write affinity depth (WAD) definition of the storage pool or the file for data placement, rather than being wide-striped over all disks.

When data in an FPO pool is backed up to an IBM Spectrum Protect server and then restored, the original placement map is broken unless you set the write affinity failure group for each file before backup.

Note: To change the failure group of a disk in a write-affinity–enabled storage pool, you must use the mmdeldisk and mmadddisk commands. You cannot use mmchdisk to change it directly.
See the following command descriptions in the IBM Spectrum Scale: Command and Programming Reference:
  • mmchpolicy
  • mmapplypolicy
  • mmchattr
Enabling the FPO features
To efficiently support write affinity and the rest of the FPO features, GPFS internally requires the creation of special allocation map formats. When you create a storage pool that is to contain files that make use of FPO features, you must specify allowWriteAffinity=yes in the storage pool stanza.
To enable the policy to read from preferred replicas, issue one of the following commands:
  • To specify that the policy read from the first replica, regardless of whether there is a replica on the disk, issue the following command:
    mmchconfig readReplicaPolicy=default
    
  • To specify that the policy read replicas from the local disk, if the local disk has data, issue the following command:
    mmchconfig readReplicaPolicy=local
    
  • To specify that the policy read replicas from the fastest disk, based on the disk's read I/O statistics, issue the following command:
    mmchconfig readReplicaPolicy=fastest
Note: In an FPO-enabled file system, if you run data-locality-aware workloads over FPO, such as Hadoop or Spark, configure readReplicaPolicy=local so that data is read from the local disks and network bandwidth consumption is reduced.
See the description of storage pool stanzas that follows. Also, see the following command descriptions in the IBM Spectrum Scale: Command and Programming Reference:
  • mmadddisk
  • mmchconfig
  • mmcrfs
Storage pool stanzas
Storage pool stanzas are used to specify the type of layout map and write affinity depth, and to enable write affinity, for each storage pool.
Storage pool stanzas have the following format:
%pool: 
  pool=StoragePoolName
  blockSize=BlockSize
  usage={dataOnly | metadataOnly | dataAndMetadata}
  layoutMap={scatter | cluster}
  allowWriteAffinity={yes | no}
  writeAffinityDepth={0 | 1 | 2}
  blockGroupFactor=BlockGroupFactor
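The following pool stanza is an illustrative sketch only; the pool name and values are assumptions, not recommendations. A stanza file containing such a %pool stanza, together with the %nsd stanzas, is passed to the mmcrfs or mmadddisk command with the -F option:

# Sketch only: pool name and values are illustrative assumptions
%pool:
  pool=fpodata
  blockSize=1M
  usage=dataOnly
  layoutMap=cluster
  allowWriteAffinity=yes
  writeAffinityDepth=1
  blockGroupFactor=128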
See the following command descriptions in the IBM Spectrum Scale: Command and Programming Reference:
  • mmadddisk
  • mmcrfs
  • mmchpool
Recovery from disk failure
A typical shared nothing cluster is built with nodes that have direct-attached disks. Disks are not shared between nodes as in a regular GPFS cluster, so if the node is inaccessible, its disks are also inaccessible. GPFS provides means for automatic recovery from these and similar common disk failure situations.
The following command sets up and activates the disk recovery features:
mmchconfig restripeOnDiskFailure=yes -i

Usually, auto recovery must be enabled in an FPO cluster to protect data from multiple node failures; set it with mmchconfig restripeOnDiskFailure=yes -N all. However, if a file system has only two failure groups for metadata or data with a default replica count of two, or only three failure groups for metadata or data with a default replica count of three, auto recovery must be disabled (mmchconfig restripeOnDiskFailure=no -N all) for Spectrum Scale 4.1.x, 4.2.x, and 5.0.0. This issue is fixed in Spectrum Scale 5.0.1.

Whether a file system goes through recovery is determined by the maximum replication values for the file system. If the mmlsfs -M or -R value is greater than one, the recovery code is run. The recovery actions are asynchronous, and GPFS continues its processing while the recovery attempts take place. The results of the recovery actions and any errors that are encountered are recorded in the GPFS logs.
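For example, the following command displays the maximum metadata (-M) and data (-R) replication values that determine whether recovery is run; the device name fpofs is hypothetical:

mmlsfs fpofs -M -R   # fpofs is a hypothetical file system device name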

Two more parameters are available for fine-tuning the recovery process:
mmchconfig metadataDiskWaitTimeForRecovery=seconds
mmchconfig dataDiskWaitTimeForRecovery=seconds

The default value for metadataDiskWaitTimeForRecovery is 1800 seconds. The default value for dataDiskWaitTimeForRecovery is 3600 seconds.
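For example, the following command is a sketch with illustrative values, not recommendations; it shortens both wait times so that recovery starts sooner after a disk or node failure:

mmchconfig metadataDiskWaitTimeForRecovery=900,dataDiskWaitTimeForRecovery=1800   # illustrative values only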

See the following command description in the IBM Spectrum Scale: Command and Programming Reference:
  • mmchconfig