Naughty Drives, Nice Rebuild
Jack of Maryland
The excitement is mounting on this end! My family is ready for Christmas to start. I do hope the kids sleep in a bit and allow us to start the celebration closer to 0800 rather than 0500 tomorrow.
Back on the work front, I'm still on duty today. Somebody has to pay for the gifts tomorrow, and I'm not counting on Santa to do it.
In my last entry to this blog I talked a bit about availability being an important attribute of an enterprise array, following up on the question posed by Robin Harris.
I'm going to cover a couple of techniques used by our DS8870 and XIV storage arrays. The DS8870 uses a fairly clever technique called "Smart Drive Rebuild", or simply "Smart Rebuild". The DS8000 RAS microcode examines each spinning disk in the DS8870 twice a day. In a RAID 5 array, if the number of errors on a disk exceeds a threshold, the data from the offending disk is copied to a spare disk, and then the source disk is retired. The advantage here is that two precious resources are conserved: (1) I/Os on the other six or seven drives in the array, which would otherwise have to be read to reconstruct the failing drive's contents from parity, and (2) cycles in the POWER processors that would be used to do the reverse parity computation. The really good news is that the more you need it, the better it works. I say that because the technique gives results that are much better than a normal RAID rebuild when the system is under a heavy load and you need every I/O and CPU cycle that you can get. You can see a graph to this effect on Jim Kelly's blog. So what happens if the drive fails during this process? Simple - the DS8870 reverts to a normal RAID 5 rebuild.
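To make the savings concrete, here is a toy sketch of the two paths for a single stripe. It is not the DS8870 microcode, just an illustration under simple assumptions: the function names, the `source_readable` flag, and the 7+P stripe of 4-byte blocks are all made up for this example.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def replace_member(stripe, suspect, source_readable=True):
    """Return (data for the spare, number of member reads) for one stripe.

    stripe          -- list of equal-length blocks, one per drive (data + parity)
    suspect         -- index of the drive being replaced
    source_readable -- hypothetical flag: can the suspect drive still be read?
    """
    if source_readable:
        # Smart-rebuild style path: one read from the suspect drive, no XOR work.
        return stripe[suspect], 1
    # Fallback: classic RAID 5 reconstruction from every other member.
    others = [blk for i, blk in enumerate(stripe) if i != suspect]
    return xor_blocks(others), len(others)


# Toy 7+P stripe: seven data blocks plus their XOR parity.
data = [bytes([n] * 4) for n in range(1, 8)]
stripe = data + [xor_blocks(data)]

copied, copy_reads = replace_member(stripe, suspect=2)
rebuilt, rebuild_reads = replace_member(stripe, suspect=2, source_readable=False)
print(copy_reads, rebuild_reads)  # 1 read for the copy, 7 for the parity rebuild
```

Both paths hand the spare the same bytes; the difference is that the copy touches one drive and does no XOR, while the parity path reads every other member of the array and burns CPU on the reconstruction.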
Now you're thinking that this can't have any applicability to the XIV - it doesn't have the normal RAID 5 scheme in its repertoire. True - but it also collects data on disk errors, much like the DS8870, and makes a similar determination on when it thinks a disk is going to fail in the near future. As you may know, the XIV always has two copies of the data scattered throughout the system, in 1 MB chunks. When it detects that a drive has gone over the threshold, it makes a THIRD copy of that drive's data, again scattered throughout the system. Upon the completion of this process the source drive is "administratively failed" (taken offline). The net effect here is that there is even more redundancy when a drive exhibits bad behavior. And we wouldn't want anything bad on our record when Santa is watching, would we?
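For the curious, the XIV behavior described above can be modeled in a few lines. This is only a sketch under loud assumptions: real XIV placement is not uniformly random, and `place_chunks`, `evacuate`, the 12-drive system, and the chunk count are all hypothetical names and numbers chosen for illustration.

```python
import random


def place_chunks(num_chunks, drives, rng):
    """Give every chunk two copies on two distinct drives (toy placement)."""
    return {chunk: set(rng.sample(drives, 2)) for chunk in range(num_chunks)}


def evacuate(placement, suspect, drives, rng):
    """Add a third copy of every chunk on the suspect drive, then retire it.

    Returns the number of chunks that received a third copy.
    """
    recopied = 0
    for holders in placement.values():
        if suspect in holders:
            # Pick any drive that does not already hold this chunk.
            candidates = [d for d in drives if d not in holders]
            holders.add(rng.choice(candidates))  # chunk now has three copies
            recopied += 1
    # Only after every third copy exists is the drive "administratively failed".
    for holders in placement.values():
        holders.discard(suspect)
    return recopied


rng = random.Random(0)          # fixed seed so the toy run is repeatable
drives = list(range(12))        # hypothetical 12-drive system
placement = place_chunks(100, drives, rng)
on_suspect = sum(1 for holders in placement.values() if 3 in holders)
moved = evacuate(placement, suspect=3, drives=drives, rng=rng)
```

The point of the ordering is that no chunk ever drops below two copies: redundancy goes from two to three while the suspect drive is still online, and back to two only once it is taken offline.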