I had an interesting discussion with one of my colleagues yesterday about what happens after a drive has failed in a regular RAID array. This does not apply to DRAID.
So the sequence of events is basically as follows (obviously step 4 can happen at any time between steps 1 and 5):
1. Drive fails
2. RAID starts a rebuild to restore redundancy onto a spare
3. Rebuild completes
4. The failed drive gets replaced with a new one
5. The system starts a process called "member exchange", which copies all of the data off the spare drive back onto the brand new drive - whilst both the spare drive and the new drive are in the array.
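If you want to watch steps 2 and 5 happening, a rough sketch of what that looks like from the Storwize CLI is below. The command names are my best recollection of the Storwize/Spectrum Virtualize CLI - check the documentation for your code level before relying on them:

```shell
# Hedged sketch of a Storwize CLI session - verify command names
# against your release's documentation.

# List arrays and their state ("degraded" while a rebuild is running,
# back to "online" once it completes)
lsarray

# Show per-member progress for one array - a rebuild (step 2) or a
# member exchange (step 5) should appear here with a percentage
# complete and an estimated completion time
lsarraymemberprogress <array ID or name>
```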
So the discussion was about whether customers really want step 5 to happen. Hopefully it's obvious that step 5 is not at all mandatory.
As one of the people involved in making the decision to actually have a step 5 - I'm for it. My colleague was against it - especially for Nearline drives.
The reason for this blog post is that if you decide you agree with my colleague - you can actually prevent step 5 from occurring - although it's a bit of an overhead on you. I have put the arguments as I have seen them below - but please let me know in the comments what you think we should be doing:
Pros for Step 5
A number of customers care about where their arrays are in the enclosures. They like to be able to walk up to the front of the machine and intuitively know that the first 12 drives in the enclosure are all in a single RAID array. This is helpful when you are considering servicing components, because you can "just know" whether two drives in the system are likely to be in the same array, without having to load up the GUI.
Cons for Step 5
Step 5 doesn't add any redundancy to the system, and it can take many hours to complete (again, especially if the drive is a Nearline drive). During this time your system is running with one fewer spare drive.
If you don't care which enclosure slots are being used for which arrays - then step 5 is entirely a waste of system resources. If the new drive was simply marked as a spare and not used until the next drive failure, you would be happy.
How to manually avoid step 5
The first thing to say is that you shouldn't do this if you are using the "Balanced RAID 10" arrays. This has the potential to remove the balancing from the array.
If you've read this and decide that you just don't want step 5 to run - you can tell the Storwize code to re-configure the array so that it will now prefer to use the "current" drive slots, rather than the drive slots that were being used before the drive failure. The rough steps are:
- Wait for the rebuild to complete
- Run the command
charray -balanced <array ID or name>
This tells the system to reconfigure to use the current enclosure slots
- Now replace the failed drive in the same way you always have.
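Putting those steps together, the whole procedure might look something like the sketch below. Only charray -balanced comes from the steps above; the other command names are assumptions based on my memory of the Storwize CLI, so verify them against your code level's documentation:

```shell
# Hedged sketch - command names other than "charray -balanced"
# are assumptions; check your release's CLI reference.

# 1. Confirm the rebuild has completed - the array is back "online"
#    and no rebuild entries remain in the member-progress view
lsarray <array ID or name>
lsarraymemberprogress <array ID or name>

# 2. Tell the system to adopt the current enclosure slots, so the
#    spare's slot becomes the member's permanent home
charray -balanced <array ID or name>

# 3. Now replace the failed drive in the same way you always have -
#    the new drive simply becomes a spare, and no member exchange runs
```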