The blogosphere has quieted down a bit over the two papers on MTBF estimates for Disk Drive Modules (DDM).One article on SearchStorage.com by Arun Taneja asksIs RAID passé?
Disk capacity is growing at a faster rate than DDM reliability. During the hours to rebuild a DDM, companies are at risk of additional failures that could require recovery from a copy, or result in data loss, depending on how well your Business Continuity (BC) plan is written and followed.
I'll discuss two comments in particular.
Joerg Hallbauer felt I did not address all the issues raised:
... The problem with that is that it's the DISK ARRAY that determines when a drive has failed an starts the rebuild process. That IS under the control of IBM, specifically the controller. But more importantly, it effects my risk of data loss.
As I see it, my risk of data loss with RAID-5 is influenced by two main factors. 1 - The drive replacement rate and 2 - The rebuild time (which to a great extent is a function of the drive size) both of which IBM has some control over.
So, I think that the question in my mind is, what's the tipping point? Where does the risk of using RAID-5 protection exceed what I'm willing to accept, and I need to move to some other protection mechanism like RAID-6? Is it when the rebuild times exceed 12 hours? 24 hours? 48 hours?
Also, I wonder why IBM isn't publishing some information to help me make these kinds of decisions?
Bill Todd felt I was not technical enough:
Oh, dear - while Tony doesn’t seem to be parrying vigorously (as Seagate, Hitachi, and Chunk were doing), his contribution sounds more like IBM marketing than the kind of detailed, technical response one might have hoped for
... well, he *is* a manager, and a marketing one at that, so perhaps we shouldn’t expect more).
Both are fair comments. Disk arrays do run microcode to assist or perform the RAID function, detect failures and start the rebuild process, and so clever designs to support spare disks, process the rebuild quickly, and so on, can differentiate one vendor's offering from another.
On the issue of what does IBM provide to help its clients make the right decisions for their environments, Jon William Toigo at DrunkenData points his readers to IBM's Business Continuity Self-Assessment tool. In normal data center conditions, DDMs will fail, and a Business Continuity plan shouldbe written and developed to handle this fact. Using 2-site and 3-site mirroring, complemented with versions of tape backups, can help address some of these concerns and mitigate some of the risks involved with using disk systems.
For those who want a more technical answer, IBM has just published a series of IBM Redbooks.
Each client's situation is different, so no simple answer is possible. However, IBM does have a lot of experience in this area, and would be glad to help you write or update your existing Business Continuity plan.
technorati tags: IBM, disk, MTBF, estimates, papers, Arun Taneja, Jon William Toigo, StorageMojo, Business Continuity, Redbooks
Tuesday is always good for announcements. Today, Gartner, Inc.
announced that IBM has taken over HP in its climb to the top. I'll quote directly from today's press release:
STAMFORD, Conn., March 6, 2007 — Worldwide external controller-based (ECB) disk storage revenue totaled $15.2 billion in 2006, a 4.1 percent increase over 2005 revenue of $14.6 billion, according to Gartner, Inc.IBM overtook Hewlett-Packard for the No. 2 position in 2006 (see Table 1). IBM’s worldwide ECB market share increased to 15.8 percent, while HP’s market share dropped to 13.1 percent.
IBM beat HP both in 4Q06, as well as 2006 full year.You can read more about it from Gartner Dataquest report “Market Share: Disk Array Storage, All Regions, All Countries, 1Q05-4Q06" on their website. (Note: non-IBMers might need an account with Gartner to access this, not sure)
The focus was on external controller-based disk, not external controller-less SCSI/SAS disk, not disk arrays posing as virtual tape libraries, nor any disk sold inside HP, Sun, IBM or Dell servers. This is to compare with disk-only vendors such as EMC and HDS. The revenues reflect hardware only, including hardware-related parts of financial leases and managed services. Revenues from optional priced software features such as multi-pathing drivers, management software, or advanced copy services were excluded.I discussed these types of analyst reports back in blog post last September: Space Race Heats Up.
These marketshare numbers are based on revenues, not units or terabytes. When a box gets sold, the revenue was counted toward the vendor that sold it, not the manufacturer that built it. In this last report:
- When Dell sells an EMC box, it gets counted as Dell. When Fujitsu Siemens sells an EMC box, it gets counted as "Other".
- When HP sells an HDS box, it gets counted as HP. When Sun sells the HDS box, it gets counted as Sun.
- When IBM sells its System Storage N series (from the OEM agreement with NetApp), it gets counted as IBM. Both IBM and NetApp experienced growth in the NAS/unified storage arena.
It's still cold here in the Washington DC area, but at least good news like this helps warm me up!
technorati tags: IBM, disk, external controller-based, ECB, Gartner, 4Q06, 2006, revenue, marketshare, HP, EMC, Sun, Dell, NetApp, HDS, NAS
Well, this week I am in Maryland, just outside of Washington DC. It's a bit cold here.
Robin Harris over at StorageMojo put out this Open Letter to Seagate, Hitachi GST, EMC, HP, NetApp, IBM and Sun about the results of two academic papers, one from Google, and another from Carnegie Mellon University (CMU). The papers imply that the disk drive module (DDM) manufacturers have perhaps misrepresented their reliability estimates, and asks major vendors to respond. So far, NetAppand EMC have responded.
I will not bother to re-iterate or repeat what others have said already, but make just a few points. Robin, you are free to consider this "my" official response if you like to post it on your blog, or point to mine, whatever is easier for you. Given that IBM no longer manufacturers the DDMs we use inside our disk systems, there may not be any reason for a more formal response.
- Coke and Pepsi buy sugar, Nutrasweet and Splenda from the same sources
Somehow, this doesn't surprise anyone. Coke and Pepsi don't own their own sugar cane fields, and even their bottlers are separate companies. Their job is to assemble the components using super-secret recipes to make something that tastes good.
IBM, EMC and NetApp don't make DDMs that are mentioned in either academic study. Different IBM storage systems uses one or more of the following DDM suppliers:
- Seagate (including Maxstor they acquired)
- Hitachi Global Storage Technologies, HGST (former IBM division sold off to Hitachi)
In the past, corporations like IBM was very "vertically-integrated", making every component of every system delivered.IBM was the first to bring disk systems to market, and led the major enhancements that exist in nearly all disk drives manufactured today. Today, however, our value-add is to take standard components, and use our super-secret recipe to make something that provides unique value to the marketplace. Not surprisingly, EMC, HP, Sun and NetApp also don't make their own DDMs. Hitachi is perhaps the last major disk systems vendor that also has a DDM manufacturing division.
So, my point is that disk systems are the next layer up. Everyone knows that individual components fail. Unlike CPUs or Memory, disks actually have moving parts, so you would expect them to fail more often compared to just "chips".
If you don't feel the MTBF or AFR estimates posted by these suppliers are valid, go after them, not the disk systems vendors that use their supplies. While IBM does qualify DDM suppliers for each purpose, we are basically purchasing them from the same major vendors as all of our competitors. I suspect you won't get much more than the responses you posted from Seagate and HGST.
- American car owners replace their cars every 59 months
According to a frequently cited auto market research firm, the average time before the original owner transfers their vehicle -- purchased or leased -- is currently 59 months.Both studies mention that customers have a different "definition" of failure than manufacturers, and often replace the drives before they are completely kaput. The same is true for cars. Americans give various reasons why they trade in their less-than-five-year cars for newer models. Disk technologies advance at a faster pace, so it makes sense to change drives for other business reasons, for speed and capacity improvements, lower power consumption, and so on.
The CMU study indicated that 43 percent of drives were replaced before they were completely dead.So, if General Motors estimated their cars lasted 9 years, and Toyota estimated 11 years, people still replace them sooner, for other reasons.
At IBM, we remind people that "data outlives the media". True for disk, and true for tape. Neither is "permanent storage", but rather a temporary resting point until the data is transferred to the next media. For this reason, IBM is focused on solutions and disk systems that plan for this inevitable migration process. IBM System Storage SAN Volume Controller is able to move active data from one disk system to another; IBM Tivoli Storage Manager is able to move backup copies from one tape to another; and IBM System Storage DR550 is able to move archive copies from disk and tape to newer disk and tape.
If you had only one car, then having that one and only vehicle die could be quite disrupting. However, companies that have fleet cars, like Hertz Car Rentals, don't wait for their cars to completely stop running either, they replace them well before that happens. For a large company with a large fleet of cars, regularly scheduled replacement is just part of doing business.
This brings us to the subject of RAID. No question that RAID 5 provides better reliability than having just a bunch of disks (JBOD). Certainly, three copies of data across separate disks, a variation of RAID 1, will provide even more protection, but for a price.
Robin mentions the "Auto-correlation" effect. Disk failures bunch up, so one recent failure might mean another DDM, somewhere in the environment, will probably fail soon also. For it to make a difference, it would (a) have to be a DDM in the same RAID 5 rank, and (b) have to occur during the time the first drive is being rebuilt to a spare volume.
- The human body replaces skin cells every day
So there are individual DDMs, manufactured by the suppliers above; disk systems, manufactured by IBM and others, and then your entire IT infrastructure. Beyond the disk system, you probably have redundant fabrics, clustered servers and multiple data paths, because eventually hardware fails.
People might realize that the human body replaces skin cells every day. Other cells are replaced frequently, within seven days, and others less frequently, taking a year or so to be replaced. I'm over 40 years old, but most of my cells are less than 9 years old. This is possible because information, data in the form of DNA, is moved from old cells to new cells, keeping the infrastructure (my body) alive.
Our clients should approach this in a more holistic view. You will replace disks in less than 3-5 years. While tape cartridges can retain their data for 20 years, most people change their tape drives every 7-9 years, and so tape data needs to be moved from old to new cartridges. Focus on your information, not individual DDMs.
What does this mean for DDM failures. When it happens, the disk system re-routes requests to a spare disk, rebuilding the data from RAID 5 parity, giving storage admins time to replace the failed unit. During the few hours this process takes place, you are either taking a backup, or crossing your fingers.Note: for RAID5 the time to rebuild is proportional to the number of disks in the rank, so smaller ranks can be rebuilt faster than larger ranks. To make matters worse, the slower RPM speeds and higher capacities of ATA disks means that the rebuild process could take longer than smaller capacity, higher speed FC/SCSI disk.
According to the Google study, a large portion of the DDM replacements had no SMART errors to warn that it was going to happen. To protect your infrastructure, you need to make sure you have current backups of all your data. IBM TotalStorage Productivity Center can help identify all the data that is "at risk", those files that have no backup, no copy, and no current backup since the file was most recently changed. A well-run shop keeps their "at risk" files below 3 percent.
So, where does that leave us?
- ATA drives are probably as reliable as FC/SCSI disk. Customers should chose which to use based on performance and workload characteristics. FC/SCSI drives are more expensive because they are designed to run at faster speeds, required by some enterprises for some workloads. IBM offers both, and has tools to help estimate which products are the best match to your requirements.
- RAID 5 is just one of the many choices of trade-offs between cost and protection of data. For some data, JBOD might be enough. For other data that is more mission critical, you might choose keeping two or three copies. Data protection is more than just using RAID, you need to also consider point-in-time copies, synchronous or asynchronous disk mirroring, continuous data protection (CDP), and backup to tape media. IBM can help show you how.
- Disk systems, and IT environments in general, are higher-level concepts to transcend the failures of individual components. DDM components will fail. Cache memory will fail. CPUs will fail. Choose a disk systems vendor that combines technologies in unique and innovative ways that take these possibilities into account, designed for no single point of failure, and no single point of repair.
So, Robin, from IBM's perspective, our hands are clean. Thank you for bringing this to our attention and for giving me the opportunity to highlight IBM's superiority at the systems level.
technorati tags: IBM, Seagate, Hitachi, HGST, EMC, NetApp, HP, HDS, Sun, Google, CMU, DDM, Fujitsu, MTBF, MTTF, AFR, ARR, JBOD, RAID, Tivoli, SVC, DR550, CDP, FC, SCSI, disk, tape, SAN,
Sometimes, it's difficult to explain the products I manage to people outside the IT storage industry. How do you explain FCP vs. FICON, Giant Magnetoresistive (GMR) heads, the SMI-S interface, etc. enough to then explain how your job relates to those technologies. At least my friends and family read this blog, so they can somewhat understand some of the things I am working on. When I visit my folks on Sundays, we sometimes discuss items they read in my blog that week.
In addition to a "take your children to work day", we have discussed within IBM a "take your parents to work day", especially for the young new hires who have a hard time explaining what their new job is to the rest of their family.
Seth Godin points to a video ad to fill a job position and the confusion therein with the "recruiter" who just doesn't understand the job involved.
The problem is not just your parents, but any of your co-workers old enough to be parents who haven't bothered to keep up with the latest advancements in Web 2.0 technology. Here are some examples:
- A project leader working with a technology partner asked if me if there was a difference between a "blog" and a "wiki" and which should his team use. This was not a simple yes/no answer, and involved some explanation, conversation and understanding of what he was trying to accomplish.
- For one of my meetings, someone instant-messaged me asking where it was, was it "face-to-face" (F2F) or Conference call (CC). I replied back, "A2A w/CC" (avatar-to-avatar with voice over conference call). When you are meeting other avatars in-world in Second Life, it gets quite distracting having everyone typing away, with their hands and fingers moving furiously, so we use a conference call to complement our 3D interaction.
That's why I was very excited to seeLinden Lab announces voice beta in Second Life. It won't be fully ready until later this year, but adding voice to Second Life will greatly reduce the hurdles we now have trying to coordinate conference calls with in-world activity.
I realize not everyone can keep up with all the new and different technologies, but the social networking aspects of some of these new developments are worth looking into.
technorati tags: IBM, blog, wiki, social networking, technology, avatar, voice, Second Life, GMR, FICON, FCP, SMI-S