Distributed array properties

The properties of a distributed array determine the configuration attributes of the array.

Distributed array configurations create large-scale internal MDisks. These arrays, which can contain 4 - 128 drives, also contain rebuild areas that are used to maintain redundancy after a drive fails. If not enough drives are available on the system (for example, in configurations with fewer than four flash drives), you cannot configure a distributed array. Distributed RAID arrays solve rebuild bottlenecks in nondistributed array configurations because rebuild areas are distributed across all the drives in the array. The rebuild write workload is spread across all the drives rather just the single spare drive which results in faster rebuilds on an array. Distributed arrays remove the need for separate drives that are idle until a failure occurs. Instead of allocating one or more drives as spares, the spare capacity is distributed over specific rebuild areas across all the member drives. Data can be copied faster to the rebuild area and redundancy is restored much more rapidly. Additionally, as the rebuild progresses, the performance of the pool is more uniform because all of the available drives are used for every volume extent. After the failed drive is replaced, data is copied back to the drive from the distributed spare capacity. Unlike hot spares drives, read/write requests are processed on other parts of the drive that are not being used as rebuild areas. The number of rebuild areas is based on the width of the array. The size of the rebuild area determines how many times the distributed array can recover failed drives without risking becoming degraded. For example, a distributed array that uses RAID 6 drives can handle two concurrent failures. After the failed drives have been rebuilt, the array can tolerate another two drive failures. If all of the rebuild areas are used to recover data, the array becomes degraded on the next drive failure. Verify that your model supports distributed arrays before completing array configuration.

Supported RAID levels

The system supports the following RAID levels for distributed arrays:

Distributed RAID 5: Distributed RAID 5 arrays stripe data over the member drives with one parity strip on every stripe. These distributed arrays can support 4 - 128 drives. RAID 5 distributed arrays can tolerate only one failed member drive.
Distributed RAID 6: Distributed RAID 6 arrays stripe data over the member drives with two parity strips on every stripe. These distributed arrays can support 6 - 128 drives. A RAID 6 distributed array can tolerate any two concurrent member drive failures.

Example of a distributed array

Figure 1 shows an example of a distributed array that is configured with RAID level 6; all of the drives in the array are active. The rebuild areas are distributed across all of the drives and the drive count includes all of the drives.

1 An active drive
2 Rebuild areas, which are distributed across all drives
3 Drive count, which includes all drives
4 Stripes of data (two stripes are shown)
5 Stripe width
6 Pack, which equals the drive count that is multiplied by stripe width
7 Additional packs in the array (not shown)

This figure shows an example of a distributed array with a RAID 6 level configuration. — Figure 1. Distributed array (RAID 6 level)

Figure 2 shows a distributed array that contains a failed drive. To recover data, the data is read from multiple drives. The recovered data is then written to the rebuild areas, which are distributed across all of the drives in the array. The remaining rebuild areas are distributed across all drives.

1 Failed drive
2 Rebuild areas, which are distributed across all drives
3 Remaining rebuild areas rotate across each remaining drive
4 Additional packs in the array (not shown)

This figure shows an example of a distributed array with a RAID 6 level configuration that has a failed drive. — Figure 2. Distributed array (RAID 6 level) with a failed drive

Array width

The array width, which is also referred to as the drive count, indicates the total number of drives in a distributed array. This total includes the number of drives that are used for data capacity and parity, and the rebuild area that is used to recover data.

Rebuild area

The rebuild area is the disk capacity that is reserved within a distributed array to regenerate data after a drive failure; it provides no usable capacity. Unlike a nondistributed array, the rebuild area is distributed across all of the drives in the array. As data is rebuilt during the copyback process, the rebuild area contributes to the performance of the distributed array because all of the volumes perform I/O requests.

Stripe and stripe width

A stripe, which can also be referred to as a redundancy unit, is the smallest amount of data that can be addressed. For distributed arrays, the stripe size can be 128 or 256 KiB.

The stripe width indicates the number of stripes of data that can be written at one time when data is regenerated after a drive fails. This value is also referred to as the redundancy unit width. In Figure 1, the stripe width of the array is 5.

Drive class

To enhance performance of a distributed array, all of the drives must come from the same, or superior, drive class. Each drive class is identified by its drive_class_id. The system uses the following information to determine the drive class of each drive:

Block size

Indicates the block size of the drive class. Valid block size is either 512 or 4096.

Capacity

Indicates the capacity of the drive class.

I/O group

Indicates the I/O group name that is associated with the drive class.

RPM speed

Indicates the speed of the drive class. Valid RPM speed can be 7.2 K, 10 K, or 15 K. For SSDs, this value is blank.

Technology

Indicates the technology for the drive class. The following technology types are supported:

Tier 0 flash: Tier 0 flash drives are high-performance flash drives that process read and write operations and provide faster access to data than enterprise or nearline drives. For most Tier 0 flash drives, as they are used the system monitors their wear level and issues warnings when the drive is nearing replacement. Drives that use NVMe architecture are considered Tier 0 flash drives.
Tier 1 flash: Tier 1 flash drives are lower-cost flash drives, typically with larger capacities, but slightly lower performance and write endurance characteristics. As these drives are used, the system monitors their wear level and issues warnings when the drive is nearing replacement.
Enterprise disks: Enterprise disks are disk drives that are optimized for performance.
Nearline disks: Nearline disks are disk drives that are optimized for capacity.

Transport protocol

Indicates the transport protocol of the drive. The possible values are either SAS or NVMe protocols.

Compressed

Indicates whether the drive is self-compressing. Self-compressing drives are only supported on systems the use NVMe as the transport protocol.

Physical Capacity

For compressed drives, this value represents the total amount of physical capacity on the drive. This value can be smaller than the logical capacity presented by the capacity value.

For non-compressed drives, the physical capacity is the same as the logical capacity.

To replace a failed member drive in the distributed array, the system can use another drive that has the same drive class as the failed drive. The system can also select a drive from a superior drive class. For example, two drive classes can contain drives of the same technology type but different data capacities. In this case, the superior drive class is the drive class that contains the higher capacity drives.

To display information about all of the drive classes that are available on the system, use the lsdriveclass command.

Example output from the lsdriveclass command shows four drive classes on the system. Drive class 209 contains drives with a capacity of 278.9 GB; drive class 337 contains drives with a capacity of 558.4 GB. Although the drives have the same RPM speed, technology type, and block size, drive class 337 is considered to be superior to drive class 209.

Example output from the lsdriveclass command

id RPM   capacity IO_group_id IO_group_name tech_type       block_size candidate_count superior_count total_count transport_protocol compressed
0        7.0TB    0           io_grp0       tier0_flash     4096       6               6              6           nvme               no
1        20.0TB   0           io_grp0       tier0_flash     512        2               2              8           nvme               yes
2        744.7GB  0           io_grp0       tier0_flash     512        2               2              2           sas                no
3        1.7TB    0           io_grp0       tier1_flash     512        2               2              2           sas                no

Slow write priority settings

When a redundant array is doing read/write I/O operations, the performance of the array is bound by the performance of the slowest member drive. If the SAS network is unstable or if too much work is being driven to the array when drives do internal ERP processes, performance to member drives can be far worse than usual. In this situation, arrays that offer redundancy can accept a short interruption to redundancy to avoid writing to, or reading from, the slow component. Writes that are mapped to a poorly performing drive are committed to the other copy or parity, and are then completed with good status (assuming no other failures). When the member drive recovers, the redundancy is restored by a background process of writing the strips that were marked out of sync while the member was slow.

This technique is governed by the setting of the slow_write_priority attribute of the distributed array, which defaults to latency when the array is created. When set to latency, the array is allowed to become out of sync in an attempt to smooth poor member performance. You can use the charray command to change the slow_write_priority attribute to redundancy. When set to redundancy, the array is not allowed to become out of sync. However, the array can avoid suffering read performance loss by returning reads to the slow component from redundant paths.

When the array uses latency mode or attempts to avoid reading a component that is in redundancy mode, the system evaluates the drive regularly to assess when it becomes a reliable part of the system again. If the drive never offers good performance or causes too many performance failures in the array, the system fails the hardware to prevent ongoing exposure to the poor-performing drive. The system fails the hardware only if it cannot detect another explanation for the bad performance from the drive.

Distributed drive replacement

If the fault LED light on a drive is lit, the drive is marked as failed and is no longer used in the distributed array. When the system detects that a failed drive was replaced, it automatically removes the failed hardware from the array configuration. If the new drive is suitable (for example, in the same drive class), the system begins a copyback operation to make a rebuild area available in the distributed array. When a new NVMe-attached drive is added to the control enclosure to replace a failed drive, the drive is formatted before it is added into the array. For non-compressing drives, this process takes a few seconds, but for self-compressing drives this process might take a couple of minutes. All NVMe-attached drives must finish formatting before they are added into an existing array or before a new array can be created. When an array is created in the management GUI or by using either the mkarray or mkdistributedarray commands, the action fails and an error is logged if the formatting does not complete after 165 seconds.

Note: The mkarray command is not allowed for IBM FlashCore® Modules.