Disk replacement

You can use the ESS GUI to detect failed disks and to replace them.

When one disk fails, the system rebuilds the data that was on the failed disk onto spare space and continues to operate normally, but with slightly reduced performance because the same workload is shared among fewer disks. With the default setting of two spare disks for each large declustered array, the failure of a single disk is typically not a sufficient reason for maintenance.

When several disks fail, the system continues to operate even after the spare space is used up. From then on, however, one more disk failure would leave the system unable to maintain the redundancy that the user requested during vdisk creation. At this point, a service request is sent to a maintenance management application. The request asks for replacement of the failed disks and specifies the disk FRU numbers and locations.

In general, disk maintenance is requested when the number of failed disks in a declustered array reaches the disk replacement threshold. By default, that threshold is identical to the number of spare disks. For a more conservative disk replacement policy, the threshold can be set to smaller values using the mmchrecoverygroup command.
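The threshold rule can be sketched as a small shell helper. This is only an illustration: needs_replacement is a hypothetical function, not part of ESS, and the recovery group and declustered array names (rgL, DA1) are placeholders. The mmchrecoverygroup invocation is printed rather than executed, and its options should be checked against the command reference for your release.

```shell
#!/bin/sh
# Sketch of the replacement rule: maintenance is requested once the number
# of failed disks in a declustered array reaches the replacement threshold.
# needs_replacement is a hypothetical helper, not an ESS command.
needs_replacement() {
    failed=$1
    threshold=$2
    [ "$failed" -ge "$threshold" ]
}

# With the default of two spare disks, the default threshold is also 2.
if needs_replacement 1 2; then echo "1 failed: replace"; else echo "1 failed: wait"; fi
if needs_replacement 2 2; then echo "2 failed: replace"; else echo "2 failed: wait"; fi

# A more conservative policy lowers the threshold to 1. The names rgL and
# DA1 are placeholders; the command is printed here, not run.
echo mmchrecoverygroup rgL --declustered-array DA1 --replace-threshold 1
```

With a threshold of 1, the first disk failure in the array would already trigger a replacement request, at the cost of more frequent maintenance visits.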

Disk maintenance is performed using the mmchcarrier command with the --release option, which:

Suspends any functioning disks on the carrier if the multi-disk carrier is shared with the disk that is being replaced.
If possible, powers down the disk to be replaced or all of the disks on that carrier.
Turns on indicators on the disk enclosure and disk or carrier to help locate and identify the disk that needs to be replaced.
If necessary, unlocks the carrier for disk replacement.

After the disk is replaced and the carrier is reinserted, running the mmchcarrier command again with the --replace option powers the disks back on.
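The release and replace steps above can be sketched as a dry-run script. The recovery group name (rgL), the pdisk name (e1d1s01), the run helper, and the DRY_RUN variable are all assumptions made for illustration; take the real names from the service request or the event details. By default the script only prints the commands, so it is safe to try on any machine.

```shell
#!/bin/sh
# Dry-run sketch of the two-step carrier procedure. RG and PDISK are
# placeholder values; substitute the names reported for the failed disk.
RG="rgL"
PDISK="e1d1s01"

# run is a hypothetical helper: it prints each command by default.
# Set DRY_RUN=0 on an ESS node to execute the commands instead.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Step 1: release the carrier (suspend, power down, light indicators, unlock).
run mmchcarrier "$RG" --release --pdisk "$PDISK"

# Step 2: after physically swapping the disk and reinserting the carrier,
# power the disks back on.
run mmchcarrier "$RG" --replace --pdisk "$PDISK"
```

Printing the commands first gives you a chance to confirm the recovery group and pdisk names before anything is powered down.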

The ESS GUI automatically detects failed disks and displays events in its event log in the Monitoring > Events view. In addition, the ESS GUI provides a guided procedure you can use to replace disks by selecting the Run fix procedure ... action on the related event.