Understanding IBM Storage Scale Erasure Code Edition fault tolerance

IBM Storage Scale RAID uses erasure codes, which are selected by the user, to protect data and metadata. The selected erasure code, the available disk space, and the current disk hardware configuration together determine which levels of failure can be survived. IBM Storage Scale RAID has a placement algorithm for distributing the strips of the erasure code. The placement algorithm is aware of the hardware groupings of disks, for example, the storage nodes that are present in an IBM Storage Scale Erasure Code Edition system. It attempts to segregate the individual strips of an erasure code stripe across as many groups as possible, which allows the system to survive the concurrent failure of larger hardware units.
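As an illustration of this placement idea, the following Python sketch spreads the strips of one stripe across the available nodes as evenly as possible. The round-robin strategy and the place_strips name are simplifications assumed for this example; the real IBM Storage Scale RAID placement also accounts for disk-level groupings, free space, and rebuild state.

    def place_strips(num_strips, nodes):
        # Spread the strips of one erasure code stripe across the nodes as
        # evenly as possible, so that the loss of any single node affects
        # the fewest strips of the stripe.
        layout = {node: 0 for node in nodes}
        for i in range(num_strips):
            layout[nodes[i % len(nodes)]] += 1
        return layout

    # A 4+2p stripe (6 strips) on 4 nodes -> 2,2,1,1 strips per node
    print(place_strips(6, ["node1", "node2", "node3", "node4"]))

With six nodes, the same call places one strip per node, which matches the 4+2p layout for six nodes in the table below.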

For example, if the IBM Storage Scale Erasure Code Edition hardware configuration includes six storage nodes and a vdisk is created with the 4+2p erasure code, there are 6 strips (4 data and 2 parity) for each block, and each strip of a stripe can be placed on a separate node. The vdisk can tolerate two random disk failures without data loss. Furthermore, if two random storage nodes fail completely (potentially several or tens of disks on each node), the surviving erasure code strips on the other nodes still ensure that no data is lost. The vdisk can also tolerate one random disk failure plus one random node failure. The following table shows various numbers of storage nodes, erasure codes, the number of strips per node, and the fault tolerance for each combination of nodes and erasure code.
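The node-level part of this reasoning can be expressed as a small worst-case calculation. The sketch below is a simplification assumed for this example, not the actual IBM Storage Scale RAID logic; it derives how many whole-node failures a stripe can survive from the strips-per-node layout and the number of strips the code can afford to lose.

    def node_fault_tolerance(strips_per_node, loss_budget):
        # Worst case: the failed nodes are the ones holding the most strips.
        # Count how many such nodes can fail before the lost strips exceed
        # what the erasure code can reconstruct (its parity strips).
        lost, nodes_down = 0, 0
        for strips in sorted(strips_per_node, reverse=True):
            if lost + strips > loss_budget:
                break
            lost += strips
            nodes_down += 1
        return nodes_down

    print(node_fault_tolerance([1, 1, 1, 1, 1, 1], 2))  # 4+2p on 6 nodes -> 2
    print(node_fault_tolerance([2, 2, 1, 1], 2))        # 4+2p on 4 nodes -> 1

With one strip per node, a 4+2p stripe survives any two node failures; with the 2,2,1,1 layout on four nodes, a single node failure can already consume both parity strips, so only one node failure is tolerated.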

Note:
  • The actual vdisk fault tolerance for a given number of nodes and RAID code might also be limited by the Recovery Group Descriptor (RGD) placement, not only by the RAID code itself. For the actual fault tolerance and recommendations, see Recommendations.
Table 1. An example of erasure code layout and tolerances for various RAID codes on different numbers of nodes
Nodes | Code | RAID Layout (strips of a stripe per node) | RAID Tolerance (N = node failures, D = disk failures)*
3 | 3WayReplication | 1,1,1 | 2N, N+D, 2D
3 | 4WayReplication | 2,1,1 | 2N, N+D, 3D
3 | 4+2p | 2,2,2 | 1N, 2D
3 | 4+3p | 3,2,2 | 1N, 3D
4 | 3WayReplication | 1,1,1,0 | 2N, N+D, 2D
4 | 4WayReplication | 1,1,1,1 | 3N, 2N+D, N+2D, 3D
4 | 4+2p | 2,2,1,1 | N, 2D
4 | 4+3p | 2,2,2,1 | N+D, 3D
5 | 4+2p | 2,1,1,1,1 | N, 2D
5 | 4+3p | 2,2,1,1,1 | N+D, 3D
6 | 4+2p | 1,1,1,1,1,1 | 2N, N+D, 2D
6 | 4+3p | 2,1,1,1,1,1 | 2N, N+D, 3D
6 | 8+2p | 2,2,2,2,1,1 | N, 2D
6 | 8+3p | 2,2,2,2,2,1 | N+D, 3D
7 | 8+2p | 2,2,2,1,1,1,1 | N, 2D
7 | 8+3p | 2,2,2,2,1,1,1 | N+D, 3D
8 | 8+2p | 2,2,1,1,1,1,1,1 | N, 2D
8 | 8+3p | 2,2,2,1,1,1,1,1 | N+D, 3D
9 | 8+2p | 2,1,1,1,1,1,1,1,1 | N, 2D
9 | 8+3p | 2,2,1,1,1,1,1,1,1 | N+D, 3D
10 | 8+2p | 1,1,1,1,1,1,1,1,1,1 | 2N, N+D, 2D
10 | 8+3p | 2,1,1,1,1,1,1,1,1,1 | 2N, N+D, 3D
10 | 16+2p | 2,2,2,2,2,2,2,2,1,1 | N, 2D
10 | 16+3p | 2,2,2,2,2,2,2,2,2,1 | N+D, 3D
11 | 8+2p | 1,1,1,1,1,1,1,1,1,1,0 | 2N, N+D, 2D
11 | 8+3p | 1,1,1,1,1,1,1,1,1,1,1 | 3N, 2N+D, N+2D, 3D
18 | 16+2p | 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1 | 2N, N+D, 2D
18 | 16+3p | 2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1 | 2N, N+D, 3D
19 | 16+2p | 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0 | 2N, N+D, 2D
19 | 16+3p | 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1 | 3N, 2N+D, N+2D, 3D

IBM Storage Scale Erasure Code Edition discovers the disk hardware groups and their current status automatically and uses this information to rebuild or rebalance the erasure code strips. If the disk hardware configuration changes, for example, if new disks or storage nodes are added to the recovery group, IBM Storage Scale RAID recognizes the change automatically and performs a rebalancing operation in the background. Additionally, the rebuild operation that follows a hardware failure is also cognizant of the hardware groupings, so failed erasure code strips are rebuilt in a manner that respects the current disk hardware grouping.
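As a rough view of what a rebalancing operation accomplishes, the hypothetical sketch below compares the target strip layout of a 4+2p stripe before and after two storage nodes are added to the recovery group. The function and node names are assumptions for this example; the real rebalance moves the actual strip data automatically in the background.

    def rebalance_moves(old_layout, new_layout):
        # Compare old and new target layouts (strips per node) and report
        # which nodes give up strips and which gain them.
        moves = {}
        for node in sorted(set(old_layout) | set(new_layout)):
            delta = new_layout.get(node, 0) - old_layout.get(node, 0)
            if delta:
                moves[node] = delta
        return moves

    # A 4+2p stripe rebalanced from 4 nodes (2,2,1,1) to 6 nodes (1,1,1,1,1,1)
    old = {"node1": 2, "node2": 2, "node3": 1, "node4": 1}
    new = {"node1": 1, "node2": 1, "node3": 1, "node4": 1, "node5": 1, "node6": 1}
    print(rebalance_moves(old, new))
    # {'node1': -1, 'node2': -1, 'node5': 1, 'node6': 1}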

When you plan IBM Storage Scale Erasure Code Edition fault tolerance or maintain an IBM Storage Scale Erasure Code Edition system, it is important to understand the two kinds of hardware failures: complete disk failures and latent sector errors. The former are more noticeable, while the latter are hidden and easier to overlook.

  • Complete disk failures can be detected by any I/O to the disk and affect the whole disk. A storage node outage, a storage adapter or backplane failure, a SAS cable problem, or an internal disk fault can lead to a complete disk failure. The failure can be permanent, for example, a dead disk that cannot be read or written because of internal faults. It can also be transient, for example, a storage node that soon returns to service together with its disks. When a complete disk failure happens, the whole disk is taken out of service. If the disk cannot provide service anymore or does not return to service within a period, all of its data is rebuilt elsewhere to restore the fault tolerance.
  • Latent sector errors go undetected until the corresponding disk sectors are accessed. IBM Storage Scale RAID is designed with comprehensive end-to-end data integrity protection and validation to catch and fix such errors. It runs an automatic scrub process in the background: each block is read and examined within a period, for example, every 14 days, and is repaired if latent sector errors are detected, as sketched after this list. There remains a low probability that a latent sector error occurs before the next scrub.
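The sketch below illustrates only the idea of such a scrub pass. The block representation and the rebuild callback are simplifications assumed for this example; the actual end-to-end checksums, scrub schedule, and repair logic are internal to IBM Storage Scale RAID.

    import hashlib

    def scrub(blocks, rebuild):
        # One scrub pass: read every block, verify its stored checksum, and
        # repair any block that fails the check by reconstructing it from
        # the redundancy of the erasure code (the rebuild callback here).
        repaired = 0
        for block in blocks:
            if hashlib.sha256(block["data"]).hexdigest() != block["checksum"]:
                block["data"] = rebuild(block)
                block["checksum"] = hashlib.sha256(block["data"]).hexdigest()
                repaired += 1
        return repaired

    good = b"stripe 0 contents"
    checksum = hashlib.sha256(good).hexdigest()
    blocks = [{"data": good, "checksum": checksum},
              {"data": b"silently corrupted", "checksum": checksum}]
    print(scrub(blocks, rebuild=lambda block: good))  # 1 block repaired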

It is highly recommended to plan the fault tolerance with at least one node plus one disk so that the system is well prepared for both types of failures. For example, suppose a vdisk is created with one node plus one disk fault tolerance. When a node goes down or is under maintenance, one disk of fault tolerance is still available to absorb latent sector errors. Otherwise, some blocks might exceed the fault tolerance when both types of failures happen at the same time. Users must also be aware that they cannot always take down as many nodes and disks as the fault tolerance allows and expect IBM Storage Scale Erasure Code Edition to keep functioning normally. For example, suppose another vdisk is created with two-node fault tolerance. Because of potential latent sector errors, this does not mean that two nodes can always be taken down safely and unconditionally. The vdisk tolerates two node failures (or one node and one disk, or two disks), but those failures might be complete disk failures or latent sector errors, so a latent sector error that surfaces while two nodes are already down exceeds the fault tolerance.
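To make this caution concrete, the sketch below (an illustration only, with assumed function and parameter names) checks whether a combination of down nodes and additional strip losses, such as latent sector errors, still fits within a stripe's redundancy. Taking down as many nodes as the rated tolerance leaves no margin for a latent sector error.

    def survives(strips_per_node, loss_budget, nodes_down, disk_failures):
        # Worst case: the downed nodes hold the most strips of the stripe,
        # and each extra disk failure or latent sector error costs one more
        # strip out of the same loss budget (the parity strips).
        worst = sum(sorted(strips_per_node, reverse=True)[:nodes_down])
        return worst + disk_failures <= loss_budget

    # 4+3p on 6 nodes (layout 2,1,1,1,1,1) is rated 2N, N+D, 3D:
    layout = [2, 1, 1, 1, 1, 1]
    print(survives(layout, 3, nodes_down=2, disk_failures=0))  # True  (2N)
    print(survives(layout, 3, nodes_down=1, disk_failures=1))  # True  (N+D)
    print(survives(layout, 3, nodes_down=2, disk_failures=1))  # False (2N plus one latent error)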