Preparing for disk failures

Because your data is spread across your disks, it is important that you consider how to protect your data if one of those disks fails. Disk protection helps ensure the availability of data stored on the disks.

Disk storage is the storage that is either internal to your system or is attached to it. This disk space, together with your system's main memory, is regarded by your system as one large storage area. When you save a file, you do not assign it to a storage location; instead, the system places the file in the location that ensures the best performance. It might spread the data in the file across multiple disk units. When you add more records to the file, the system assigns additional space on one or more disk units. This way of addressing storage is known as single-level storage.

In addition to internal disk storage, you can also use IBM® System Storage® products to attach a large volume of external disk units. These storage products provide enhanced disk protection, the ability to copy data quickly and efficiently to other storage servers, and the capability of assigning multiple paths to the same data to protect against connection failures. For additional information about IBM System Storage products and to determine whether this solution is right for you, see disk storage.

Another way of preparing for disk failures is by using DS8000® Full System HyperSwap®, which is part of IBM PowerHA® for i Express® Edition. DS8000 Full System HyperSwap is a single-system solution that uses two IBM System Storage servers and does not require a cluster. IBM i can switch between the DS8000 servers for planned and unplanned storage-side outages without losing access to the data during the switch. For more information about DS8000 Full System HyperSwap, see PowerHA data replication technologies.

Device parity protection

Device parity protection allows your system to continue to operate when a disk fails or is damaged. When you use device parity protection, the disk input/output adapter (IOA) calculates and saves a parity value for each bit of data. The IOA computes the parity value from the data at the same location on each of the other disk units in the device parity set. When a disk failure occurs, the data can be reconstructed by using the parity value and the values of the bits in the same locations on the other disks. Your system continues to run while the data is being reconstructed.
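The reconstruction idea described above can be illustrated with a small sketch. It assumes XOR parity, which is the conventional single-parity technique; the text does not state the exact calculation that the IOA performs, and the function name and sample data here are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the bytewise XOR of equal-length byte blocks.

    Assumption: parity is plain XOR across the same location on each unit.
    """
    if not blocks:
        raise ValueError("at least one block is required")
    length = len(blocks[0])
    if any(len(block) != length for block in blocks):
        raise ValueError("all blocks must have the same length")
    # zip(*blocks) groups the bytes found at the same offset on every unit.
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Hypothetical data held at the same location on three disk units.
data_units = [b"\x0f\xa0\x33", b"\xf0\x0a\x55", b"\x3c\xc3\x99"]
parity = xor_blocks(data_units)

# Simulate the loss of the second unit: rebuild it from the survivors
# plus the parity value.
survivors = [data_units[0], data_units[2], parity]
rebuilt = xor_blocks(survivors)
```

Because XOR is its own inverse, combining the parity value with the surviving units yields the contents of the missing unit. This also shows why a single parity value cannot recover two simultaneous failures in the same set.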

IBM i supports the following types of device parity protection:

RAID 5

With RAID 5, the system can continue to operate if one disk fails in a parity set. If more than one disk fails, data will be lost and you must restore the data for the entire system (or only the affected disk pool) from the backup media. Logically, the capacity of one disk is dedicated to storing parity data in a parity set consisting of 3 to 18 disk units.

RAID 6

With RAID 6, the system can continue to operate if one or two disks fail in a parity set. If more than two disk units fail, you must restore the data for the entire system (or only the affected disk pool) from the backup media. Logically, the capacity of two disk units is dedicated to storing parity data in a parity set consisting of 4 to 18 disk units.

RAID 10

With RAID 10, the system can continue to operate if one of the disk units in a pair fails, because the other unit in the pair sustains the functions of the failed unit. If both units in a pair fail, you must restore the data for the entire system (or only the affected disk pool) from the backup media.

RAID 10 is expected to provide significant performance advantages over RAID 5 and RAID 6 protection. RAID 10 can be started only on internal storage IOAs that support the function.
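As a rough illustration of the capacity trade-off between these protection types, the following sketch computes how many disk units' worth of capacity remain for data. The RAID 5 and RAID 6 figures follow the parity-set sizes given above. The RAID 10 figure (half of the units) is an assumption based on the units being paired, and the function name is hypothetical.

```python
def usable_units(raid_level, disk_count):
    """Return the number of disk units' worth of capacity available for data."""
    if raid_level == 5:
        # One unit's capacity holds parity; a set has 3 to 18 units.
        if not 3 <= disk_count <= 18:
            raise ValueError("a RAID 5 parity set has 3 to 18 disk units")
        return disk_count - 1
    if raid_level == 6:
        # Two units' capacity holds parity; a set has 4 to 18 units.
        if not 4 <= disk_count <= 18:
            raise ValueError("a RAID 6 parity set has 4 to 18 disk units")
        return disk_count - 2
    if raid_level == 10:
        # Assumption: each unit is paired with one other unit of equal size,
        # so half of the raw capacity is available for data.
        if disk_count < 2 or disk_count % 2 != 0:
            raise ValueError("RAID 10 needs an even number of disk units")
        return disk_count // 2
    raise ValueError(f"unsupported RAID level: {raid_level}")


# Compare the three levels for a hypothetical set of eight disk units.
comparison = {level: usable_units(level, 8) for level in (5, 6, 10)}
```

For eight equal-size units, this gives seven units of data capacity with RAID 5, six with RAID 6, and four with RAID 10 under the pairing assumption.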

Write cache and auxiliary write cache IOA

When the system sends a write operation, the data is first written to the write cache on the disk IOA and then later written to the disk. If the IOA fails, the data in the cache might be lost, which can cause an extended outage while the system is recovered.

The auxiliary write cache is an additional IOA that has a one-to-one relationship with a disk IOA. The auxiliary write cache protects against extended outages due to the failure of a disk IOA or its cache by providing a copy of the write cache which can be recovered following the repair of the disk IOA. This avoids a potential system reload and gets the system back online as soon as the disk IOA is replaced and the recovery procedure completes. However, the auxiliary write cache is not a failover device and cannot keep the system operational if the disk IOA (or its cache) fails.

Hot-spare disks

A disk designated as a hot-spare disk is used when another disk that is part of a parity set on the same IOA fails. The hot-spare disk joins the parity set, and the IOA starts rebuilding the data onto it without user intervention. Because the rebuild operation does not have to wait for a new disk to be installed, the time that the parity set is exposed is greatly reduced. See Hot spare protection for additional information.

Mirrored protection

Disk mirroring is recommended to provide the best system availability and the maximum protection against disk-related component failures. Data is protected because the system keeps two copies of the data on two separate disk units. When a disk-related component fails, the system can continue to operate without interruption by using the mirrored copy of the data until the failed component is repaired.

Different levels of mirrored protection are possible, depending on what hardware is duplicated. The level of mirrored protection determines whether the system keeps running when different levels of hardware fail. To understand these different levels of protection, see Determining the level of mirrored protection that you want.

You can duplicate the following disk-related hardware:

Disk units
Disk controllers
I/O bus units
I/O adapters
I/O processors
Buses
Expansion towers

Hot-spare disks

A disk designated as a hot-spare disk is used when another disk that is mirror-protected fails. A hot-spare disk unit is kept on the system as a non-configured disk. When a disk failure occurs, the system exchanges the hot-spare disk unit with the failed disk unit. The exchange of a mirrored subunit with the hot-spare disk unit does not occur until mirrored protection has been suspended for 5 minutes and the replacement disk has been formatted. After the exchange occurs, the system synchronizes the data on the new disk unit. See Hot spare protection for additional information.

Independent disk pools

With independent disk pools (also called independent auxiliary storage pools), you can prevent certain unplanned outages because the data on them is isolated from the rest of your system. If an independent disk pool fails, your system can continue to operate on data in other disk pools. Combined with different levels of disk protection, independent disk pools provide more control in isolating the effect of a disk-related failure as well as better prevention and recovery techniques.

PowerHA replication technologies

The IBM PowerHA SystemMirror® for i product offers different replication technologies that protect against disk outages. Some examples of these technologies are Geographic mirroring, Metro Mirror, and Global Mirror. Some of the technologies combine IBM System Storage copy services technology with IBM i clustering technology.

For more information on PowerHA technologies, see PowerHA data replication technologies.

Multipath disk units

You can define up to eight connections from each logical unit number (LUN) created on the IBM System Storage products to the input/output processors (IOPs) on the system. Assigning multiple paths to the same data allows the data to be accessed even if failures occur in other connections to the data. Each connection for a multipath disk unit functions independently. Several connections provide availability by allowing disk storage to be used even if a single path fails.
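The following sketch models the multipath idea in general terms: a read succeeds as long as at least one independent path to the unit still works. It is a conceptual illustration only and not how IBM i implements multipath. The class, exception, and path functions are all hypothetical; only the limit of eight connections comes from the text.

```python
MAX_PATHS = 8  # up to eight connections per LUN, as stated above


class PathFailed(Exception):
    """Raised when a connection to the disk unit cannot be used."""


class MultipathUnit:
    """Hypothetical model of a disk unit reachable over several paths."""

    def __init__(self, paths):
        if not 1 <= len(paths) <= MAX_PATHS:
            raise ValueError(f"define between 1 and {MAX_PATHS} paths")
        self.paths = list(paths)

    def read(self, block):
        # Each path functions independently, so try them in turn and
        # return the first successful result.
        for path in self.paths:
            try:
                return path(block)
            except PathFailed:
                continue
        raise PathFailed(f"all {len(self.paths)} paths failed")


def broken_path(block):
    # Simulates a failed connection.
    raise PathFailed("connection lost")


def working_path(block):
    # Simulates a healthy connection returning the block's data.
    return f"data for block {block}"


unit = MultipathUnit([broken_path, working_path])
result = unit.read(7)
```

In this model the failure of the first path is invisible to the caller, because the second path still reaches the same data. Only when every defined path fails does the read itself fail.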