The blogosphere has quieted down a bit over the two papers on MTBF estimates for Disk Drive Modules (DDM).One article on SearchStorage.com by Arun Taneja asksIs RAID passé?
Disk capacity is growing at a faster rate than DDM reliability. During the hours to rebuild a DDM, companies are at risk of additional failures that could require recovery from a copy, or result in data loss, depending on how well your Business Continuity (BC) plan is written and followed.
I'll discuss two comments in particular.
Joerg Hallbauer felt I did not address all the issues raised:
... The problem with that is that it's the DISK ARRAY that determines when a drive has failed an starts the rebuild process. That IS under the control of IBM, specifically the controller. But more importantly, it effects my risk of data loss.
As I see it, my risk of data loss with RAID-5 is influenced by two main factors. 1 - The drive replacement rate and 2 - The rebuild time (which to a great extent is a function of the drive size) both of which IBM has some control over.
So, I think that the question in my mind is, what's the tipping point? Where does the risk of using RAID-5 protection exceed what I'm willing to accept, and I need to move to some other protection mechanism like RAID-6? Is it when the rebuild times exceed 12 hours? 24 hours? 48 hours?
Also, I wonder why IBM isn't publishing some information to help me make these kinds of decisions?
Bill Todd felt I was not technical enough:
Oh, dear - while Tony doesn’t seem to be parrying vigorously (as Seagate, Hitachi, and Chunk were doing), his contribution sounds more like IBM marketing than the kind of detailed, technical response one might have hoped for
... well, he *is* a manager, and a marketing one at that, so perhaps we shouldn’t expect more).
Both are fair comments. Disk arrays do run microcode to assist or perform the RAID function, detect failures and start the rebuild process, and so clever designs to support spare disks, process the rebuild quickly, and so on, can differentiate one vendor's offering from another.
On the issue of what does IBM provide to help its clients make the right decisions for their environments, Jon William Toigo at DrunkenData points his readers to IBM's Business Continuity Self-Assessment tool. In normal data center conditions, DDMs will fail, and a Business Continuity plan shouldbe written and developed to handle this fact. Using 2-site and 3-site mirroring, complemented with versions of tape backups, can help address some of these concerns and mitigate some of the risks involved with using disk systems.
For those who want a more technical answer, IBM has just published a series of IBM Redbooks.
Each client's situation is different, so no simple answer is possible. However, IBM does have a lot of experience in this area, and would be glad to help you write or update your existing Business Continuity plan.
technorati tags: IBM, disk, MTBF, estimates, papers, Arun Taneja, Jon William Toigo, StorageMojo, Business Continuity, Redbooks