The BP oil spill in the Gulf of Mexico is a good reminder that all organizations should consider practice and execution of their contingency plans. In this most recent case, the [Deepwater Horizon] oil platform had an explosion on April 20, resulting in oil spewing out at an estimated 19,000 barrels per day. While some bloggers have argued that BP failed to plan, and therefore planned to fail, I found that hard to believe. How can a billion-dollar multinational company not have contingency plans?
The truth is, BP did have plans. Karen Dalton Beninato of New Orleans' City Voices discusses BP's Gulf of Mexico Regional Oil Spill Response Plan (OSRP) in her article [BP's Spill Plan: What they knew and when they knew it]. A
[redacted 90-page version of the OSRP] is available on their website.
The plan indicates that it may be 30 days from the time a deep offshore leak reaches the shoreline, giving OSRP participants plenty of time to take action.
(Having former politicians [blame environmentalists] for this crisis does not help much either. At least the deep shore rigs give you 30 days to react to a leak before the oil gets to the shoreline. Having oil rigs closer to shore will just shorten this time to react. Allowing onshore oil rigs does not mean oil companies would discontinue their deep offshore operations. There are thousands of oil rigs in the Gulf of Mexico. Extracting oil in the beautiful Alaska National Wildlife Reserve [ANWR] might be safer, it does not eliminate the threat entirely, and any leak there would be damaging to the local plant and animals in the same manner.)
So perhaps the current crisis was not the result of a lack of planning, but inadequate practice and execution. The same is true for IT Business Continuity / Disaster Recovery (BC/DR) plans. In all cases, there are four critical parts:
The planning team needs to anticipate every possible incident, determine the risks involved and the likelihood of impact, and either accept them, or decide to mitigate them. This can include natural disasters (hurricanes, fires, floods) and technical issues (computer viruses, power outages, network disruption).
Mitigation can involve taking backups, having replicated copies at a remote location, creating bootable media, training all of the appropriate employees, and having written documented procedures. IBM's Unified Recovery Management approach can protect your entire IT operations, from laptops of mobile employees, to remote office/branch office (ROBO) locations, to regional and central data centers.
When was the last time you practiced your Business Continuity / Disaster Recovery plan? I have seen this done at a variety of levels. At the lowest level, it is all done on paper, in a conference room, with all participants talking through their respective actions. These are often called "walk-throughs". At the highest level, you turn off power to your data center --on a holiday weekend to minimize impact to operating revenues-- and have the team bring up applications at the alternate site.
As many as 80 percent of these BC/DR exercises are considered failures, in that if a real disaster would have occurred, the participants are convinced they would not have achieved their target goals of Recovery Time Objective (RTO). However, they are not complete failures if they can help improve the plans, help identify new incidents that were not previously considered, and help train the participants in recovery procedures.
The last part is execution. In my career, I have been onsite for many Disaster Recovery exercises as well as after real disasters have occured. I am not surprised how many people assume that if they have plans in place, have made preparations, and have one to three practice drills per year, that the actual "execution" would directly follow. While the book [Execution] by Bossidy and Charan is not focused on IT BC/DR plans per se, it is a great read on how to manage the actual execution of any kind of business plan. I have read this book and recommend it.
If you have not tested out your IT department's BC/DR plans. Perhaps its time to dust off your copy, review it, and schedule some time for practice.
technorati tags: , Gulf of Mexico, Deepwater Horizon, oil spill, OSRP, Business Continuity, Disaster Recovery
“In times of universal deceit, telling the truth will be a revolutionary act.”
-- George Orwell
Well, it has been over two years since I first covered IBM's acquisition of the XIV company. Amazingly, I still see a lot of misperceptions out in the blogosphere, especially those regarding double drive failures for the XIV storage system. Despite various attempts to [explain XIV resiliency] and to [dispel the rumors], there are still competitors making stuff up, putting fear, uncertainty and doubt into the minds of prospective XIV clients.
Clients love the IBM XIV storage system! In this economy, companies are not stupid. Before buying any enterprise-class disk system, they ask the tough questions, run evaluation tests, and all the other due diligence often referred to as "kicking the tires". Here is what some IBM clients have said about their XIV systems:
“3-5 minutes vs. 8-10 hours rebuild time...”
-- satisfied XIV client
“...we tested an entire module failure - all data is re-distributed in under 6 hours...only 3-5% performance degradation during rebuild...”
-- excited XIV client
“Not only did XIV meet our expectations, it greatly exceeded them...”
-- delighted XIV client
In this blog post, I hope to set the record straight. It is not my intent to embarrass anyone in particular, so instead will focus on a fact-based approach.
- Fact: IBM has sold THOUSANDS of XIV systems
XIV is "proven" technology with thousands of XIV systems in company data centers. And by systems, I mean full disk systems with 6 to 15 modules in a single rack, twelve drives per module. That equates to hundreds of thousands of disk drives in production TODAY, comparable to the number of disk drives studied by [Google], and [Carnegie Mellon University] that I discussed in my blog post [Fleet Cars and Skin Cells].
- Fact: To date, no customer has lost data as a result of a Double Drive Failure on XIV storage system
This has always been true, both when XIV was a stand-alone company and since the IBM acquisition two years ago. When examining the resilience of an array to any single or multiple component failures, it's important to understand the architecture and the design of the system and not assume all systems are alike. At it's core, XIV is a grid-based storage system. IBM XIV does not use traditional RAID-5 or RAID-10 method, but instead data is distributed across loosely connected data modules which act as independent building blocks. XIV divides each LUN into 1MB "chunks", and stores two copies of each chunk on separate drives in separate modules. We call this "RAID-X".
Spreading all the data across many drives is not unique to XIV. Many disk systems, including EMC CLARiiON-based V-Max, HP EVA, and Hitachi Data Systems (HDS) USP-V, allow customers to get XIV-like performance by spreading LUNs across multiple RAID ranks. This is known in the industry as "wide-striping". Some vendors use the terms "metavolumes" or "extent pools" to refer to their implementations of wide-striping. Clients have coined their own phrases, such as "stripes across stripes", "plaid stripes", or "RAID 500". It is highly unlikely that an XIV will experience a double drive failure that ultimately requires recovery of files or LUNs, and is substantially less vulnerable to data loss than an EVA, USP-V or V-Max configured in RAID-5. Fellow blogger Keith Stevenson (IBM) compared XIV's RAID-X design to other forms of RAID in his post [RAID in the 21st Centure].
- Fact: IBM XIV is designed to minimize the likelihood and impact of a double drive failure
The independent failure of two drives is a rare occurrence. More data has been lost from hash collisions on EMC Centera than from double drive failures on XIV, and hash collisions are also very rare. While the published worst-case time to re-protect from a 1TB drive failure for a fully-configured XIV is 30 minutes, field experience shows XIV regaining full redundancy on average in 12 minutes. That is 40 times less likely than a typical 8-10 hour window for a RAID-5 configuration.
A lot of bad things can happen in those 8-10 hours of traditional RAID rebuild. Performance can be seriously degraded. Other components may be affected, as they share cache, connected to the same backplane or bus, or co-dependent in some other manner. An engineer supporting the customer onsite during a RAID-5 rebuild might pull the wrong drive, thereby causing a double drive failure they were hoping to avoid. Having IBM XIV rebuild in only a few minutes addresses this "human factor".
In his post [XIV drive management], fellow blogger Jim Kelly (IBM) covers a variety of reasons why storage admins feel double drive failures are more than just random chance. XIV avoids load stress normally associated with traditional RAID rebuild by evenly spreading out the workload across all drives. This is known in the industry as "wear-leveling". When the first drive fails, the recovery is spread across the remaining 179 drives, so that each drive only processes about 1 percent of the data. The [Ultrastar A7K1000] 1TB SATA disk drives that IBM uses from HGST have specified 1.2 million hours mean-time-between-failures [MTBF] would average about one drive failing every nine months in a 180-drive XIV system. However, field experience shows that an XIV system will experience, on average, one drive failure per 13 months, comparable to what companies experience with more robust Fibre Channel drives. That's innovative XIV wear-leveling at work!
- Fact: In the highly unlikely event that a DDF were to occur, you will have full read/write access to nearly all of your data on the XIV, all but a few GB.
Even though it has NEVER happened in the field, some clients and prospects are curious what a double drive failure on an XIV would look like. First, a critical alert message would be sent to both the client and IBM, and a "union list" is generated, identifying all the chunks in common. The worst case on a 15-module XIV fully loaded with 79TB data is approximately 9000 chunks, or 9GB of data. The remaining 78.991 TB of unaffected data are fully accessible for read or write. Any I/O requests for the chunks in the "union list" will have no response yet, so there is no way for host applications to access outdated information or cause any corruption.
(One blogger compared losing data on the XIV to drilling a hole through the phone book. Mathematically, the drill bit would be only 1/16th of an inch, or 1.60 millimeters for you folks outside the USA. Enough to knock out perhaps one character from a name or phone number on each page. If you have ever seen an actor in the movies look up a phone number in a telephone booth then yank out a page from the phone book, the XIV equivalent would be cutting out 1/8th of a page from an 1100 page phone book. In both cases, all of the rest of the unaffected information is full accessible, and it is easy to identify which information is missing.)
If the second drive failed several minutes after the first drive, the process for full redundancy is already well under way. This means the union list is considerably shorter or completely empty, and substantially fewer chunks are impacted. Contrast this with RAID-5, where being 99 percent complete on the rebuild when the second drive fails is just as catastrophic as having both drives fail simultaneously.
- Fact: After a DDF event, the files on these few GB can be identified for recovery.
Once IBM receives notification of a critical event, an IBM engineer immediately connects to the XIV using remote service support method. There is no need to send someone physically onsite, the repair actions can be done remotely. The IBM engineer has tools from HGST to recover, in most cases, all of the data.
Any "union" chunk that the HGST tools are unable to recover will be set to "media error" mode. The IBM engineer can provide the client a list of the XIV LUNs and LBAs that are on the "media error" list. From this list, the client can determine which hosts these LUNs are attached to, and run file scan utility to the file systems that these LUNs represent. Files that get a media error during this scan will be listed as needing recovery. A chunk could contain several small files, or the chunk could be just part of a large file. To minimize time, the scans and recoveries can all be prioritized and performed in parallel across host systems zoned to these LUNs.
As with any file or volume recovery, keep in mind that these might be part of a larger consistency group, and that your recovery procedures should make sense for the applications involved. In any case, you are probably going to be up-and-running in less time with XIV than recovery from a RAID-5 double failure would take, and certainly nowhere near "beyond repair" that other vendors might have you believe.
- Fact: This does not mean you can eliminate all Disaster Recovery planning!
To put this in perspective, you are more likely to lose XIV data from an earthquake, hurricane, fire or flood than from a double drive failure. As with any unlikely disaster, it is best to have a disaster recovery plan than to hope it never happens. All disk systems that sit on a single datacenter floor are vulnerable to such disasters.
For mission-critical applications, IBM recommends using disk mirroring capability. IBM XIV storage system offers synchronous and asynchronous mirroring natively, both included at no additional charge.
For more about IBM XIV reliability, read this whitepaper [IBM XIV© Storage System: Reliability Reinvented]. To find out why so many clients LOVE their XIV, contact your local IBM storage sales rep or IBM Business Partner.
technorati tags: IBM, XIV, DDF, RAID-5, RAID-10, RAID-X, RAID-6, RAID-DP, HP, EVA, HDS, USP-V, EMC, CLARiiON, V-Max, Disaster Recovery, HGST, UltraStar, A7K1000
I'm down here in Australia, where the government is a bit stalled for the past two weeks at the moment, known formally as being managed by the [Caretaker government]. Apparently, there is a gap between the outgoing administration and the incoming administration, and the caretaker government is doing as little as possible until the new regime takes over. They are still counting votes, including in some cases dummy ballots known as "donkey votes", the Australian version of the hanging chad. Three independent parties are also trying to decide which major party they will support to finalize the process.
While we are on the topic of a government stalled, I feel bad for the state of Virginia in the United States. Apparently, one of their supposedly high-end enterprise class EMC Symmetrix DMX storage systems, supporting 26 different state agencies in Virginia, crashed on August 25th and now more than a week later, many of those agencies are still down, including the Department of Motor Vehicles and the Department of Taxation and Revenue.
Many of the articles in the press on this event have focused on what this means for the reputation of EMC. Not surprisingly, EMC says that this failure is unprecedented, but really this is just one in a long series of failures from EMC. It reminds me of the last time EMC had a public failure with a dual-controller CLARiiON a few months ago that stopped another company from their operations. There is nothing unique in the physical equipment itself, all IT gear can break or be taken down by some outside force, such as a natural disaster. The real question, though, is why haven’t EMC and the State Government been able to restore operations many days after the hardware was fixed?
In the Boston Globe, Zeus Kerravala, a data storage analyst at Yankee Group in Boston, is quoted as saying that such a high-profile breakdown could undermine EMC’s credibility with large businesses and government agencies. “I think it’s extremely important for them,’’ said Kerravala. “When you see a failure of this magnitude, and their inability to get a customer like the state of Virginia up and running almost immediately, all companies ought to look at that and raise their eyebrows.’’
Was the backup and disaster recovery solution capable of the scale and service level requirements needed by vital state
agencies? Had they tested their backups to ensure they were running correctly, and had they tested their recovery plans? Were they monitoring the success of recent backup operations?
Eventually, the systems will be back up and running, fines and penalties will be paid, and perhaps the guy who chose to go with EMC might feel bad enough to give back that new set of golf clubs, or whatever ridiculously expensive gift EMC reps might offer to government officials these days to influence the purchase decision making process.
(Note: I am not accusing any government employee in particular working at the state of Virginia of any wrongdoing, and mention this only as a possibility of what might have happened. I am sure the media will dig into that possibility soon enough during their investigations, so no sense in me discussing that process any further.)
So what lessons can we learn from this?
- Lesson 1: You don't just buy technology, you also are choosing to work with a particular vendor
IBM stands behind its products. Choosing a product strictly on its speeds and feeds misses the point. A study IBM and Mercer Consulting Group conducted back in 2007 found that only 20 percent of the purchase decision for storage was from the technical capabilities. The other 80 percent were called "wrapper attributes", such as who the vendor was, their reputation, the service, support and warranty options.
- Lesson 2: Losing a single disk system is a disaster, so disaster recovery plans should apply
IBM has a strong Business Continuity and Recovery Services (BCRS) services group to help companies and government agencies develop their BC/DR plans. In the planning process, various possible incidents are identified, recovery point objectives (RPO) and recovery time objectives (RTO) and then appropriate action plans are documentede on how to deal with them. For example, if the state of Virginia had an RPO of 48 hours, and an RTO of 5 days, then when the failure occurred on August 25, they could have recovered up to August 23 level data(48 hours prior to the incident) and be up and running by August 30 (five days after the incident). I don't personally know what RPO and RTO they planned for, but certainly it seems like they missed it by now already.
- Lesson 3: BC/DR Plans only work if you practice them often enough
Sadly, many companies and government agencies make plans, but never practice them, so they have no idea if the plans will work as expected, or if they are fundamentally flawed. Just as we often have fire drills that force everyone to stop what they are doing and vacate the office building, anyone with an IT department needs to practice BC/DR plans often enough so that you can ensure the plan itself is solid, but also so that the people involved know what to do and their respective roles in the recovery process.
- Lesson 4: This can serve as a wake-up call to consider Cloud Computing as an alternative option
Are you still doing IT in your own organization? Do you feel all of the IT staff have been adequately trained for the job? If your biggest disk system completely failed, not just a minor single or double drive failure, but a huge EMC-like failure, would your IT department know how to recover in less than five days? Perhaps this will serve as a wake-up call to consider alternative IT delivery options. The advantage of big Cloud Service Providers (Microsoft, Google, Yahoo, Amazon, SalesForce.com and of course, IBM) is that they are big enough to have worked out all the BC/DR procedures, and have enough resources to switch over to in case any individual disk system fails.
To learn more on this event, see the following articles: Washington
technorati tags: EMC, Symmetrix, DMX, Failure, Virginia, State Government, DMV, IBM, Business Continuity, Disaster Recovery
Continuing my coverage of the [IBM System x and System Storage Technical Symposium], here is a recap of Day 2:
- XIV: Implementation, Migration and Optimization
Special thanks to Anthony Vandewerdt, who sent me his version of this presentation that he planned to present in Australia next week. I "smartened it up" (or whatever the appropriate phrase is the opposite of "dumbed it down") for the technical audience.
- Recovery procedures for single and double drive failures. A double drive failure on an XIV typically involves less recovery effort than traditional RAID5-based disk systems, and in many cases results in no data loss whatsoever. I provided details on this in my blog post [Double Drive Failure Debunked: XIV Two Years Later], so no need to repeat myself here.
- Replacing the Automatic Transfer Switch (ATS) non-disruptively. To support either single-phase and triple-phase power sources, the XIV uses an ATS to take two independent power feeds, and distribute this out to the three Uninterruptible Power Supplies (UPS).
- Built-in Migration capability to copy data off other disk systems over to the XIV.
- Configuring Synchronous and Asynchronous mirroring using either the Fibre Channel or Internet Protocol ports.
- Optimizing the use of XIV for VMware, AIX and other operating systems.
The IBM XIV Storage System is quite popular in New Zealand, with four times more boxes sold per capita than the other countries in the Asia Pacific region. I covered both the A14 model as well as the new Gen3 model.
- Business Continuity/Disaster Recovery (BC/DR) Update: Lessons, Planning, Solutions
My colleague Vic Peltz from IBM Almaden presented on lessons learned from Hurricane Katrina and various other natural disasters. Unlike tradtional presentations the focus on technology, Vic took a different approach, focusing on people and procedures. I was here last year when the earthquake hit Christchurch on the south island, so I was well aware that BC/DR was top of mind for many of the attendees. Throughout this week, I have felt tremors, and many of the locals told me that these happen all the time.
- Introduction to IBM Storwize V7000
I knew I was in trouble when the request for me to present Storwize sounded like something from [Mission Impossible]:
"Good morning, Mr. Pearson. Your mission, should you choose to accept it, involves presenting Storwize V7000 in Auckland, New Zealand. You may also present the Storwize V7000 Unified, but it is essential that you not cover the SAN Volume Controller or SONAS products from which they are based upon, as you will not have enough time. The audience is very technical, so be careful. As always, should any questions come up that you cannot answer, the conference coordinators will disavow all knowledge of your actions, nor reimburse your laundry charges. This message will self-destruct in five seconds."
Well, I accomplished my mission in 75 minutes. I was able to cover the block-only version of the IBM Storwize V7000, with support for clustering the control enclosures, expansion drawers and external storage virtualization. I then spent a few minutes on the block-and-file Storwize V7000 Unified, which adds support for CIFS, NFS, HTTPS, FTP and SCP protocols through two new "file modules", with integrated support for backup and anti-virus checking. I covered both IBM Easy Tier for sub-LUN automated tiering between Solid-State Drives (SSD) and spinning disk, as well as Active Cloud Engine for file-based movement between disk and tape.
- SAN Best Practices
Rich Swain presented IBM's best practices for deploying a Storage Area Network. This was an [updated version of the one from Orlando] by Jim Blue from IBM's SAN Central team.
technorati tags: IBM, System x, Storage Symposium, Auckland, New Zealand, XIV, BC/DR, Business Continuity, Disaster Recovery, Storwize V7000, V7000 Unified, SAN, best practices
Continuing my coverage of the [IBM Storage Innovation Executive Summit], that occurred May 9 in New York City, this is my third in a series of blog posts on this event.
During lunch, people were able to take a look at our solutions. Here are Dan Thompson and Brett Cooper striking a pose.
- Hyper-Efficient Backup and Recovery
The afternoon was kicked off by Dr. Daniel Sabbah, IBM General Manager of Tivoli software. He started with some shocking statistics: 42 percent of small companies have experienced data loss, 32 percent have lost data forever. IBM has a solution that offers "Unified Recovery Management". This involves a combination of periodic backups, frequent snapshots, and remote mirroring.
IBM Tivoli Storage Manager (TSM) was introduced in 1993, and was the first backup software solution to support backup to disk storage pools. Today, TSM is now also part of Cloud Computing services, including IBM Information Protection Services. IBM announced today a new bundle called IBM Storwize Rapid Application Backup, which combines IBM Storwize V7000 midrange disk system, Tivoli FlashCopy Manager, implementation services, with a full three-year hardware and software warranty. This could be used, for example, to protect a Microsoft Exchange email system with 9000 mailboxes.
IBM also announced that its TS7600 ProtecTIER data deduplication solutions have been enhanced to support many-to-many bi-direction remote mirroring. Last year, University of Pittsburgh Medical Center (UPMC) reported that they were average 24x data deduplication factor in their environment using IBM ProtecTIER.
"You are out of your mind if you think you can live without tape!"
-- Dick Crosby, Director of System Administration, Estes
The new IBM TS1140 enterprise class tape drive process 2.3 TB per hour, and provides a density of 1.2 PB per square foot. The new 3599 tape media can hold 4TB of data uncompressed, which could hold up to 10TB at a 2.5x compression ratio.
The United States Golfers Association [USGA] uses IBM's backup cloud, which manages over 100PB of data from 750 locations across five continents.
- Customer Testimonial - Graybar
Randy Miller, Manager of Technical System Administration at Graybar, provided the next client testimonial. Graybar is an employee-owned company focused on supply-chain management, serving as a distributor for electical, lighting, security, power and cooling equipment.
Their problem was that they had 240 different locations, and expecting local staff to handle tape backups was not working out well. They centralized their backups to their main data center. In the event that a system fails in one of their many remote locations, they can rebuild a new machine at their main data center across high-speed LAN, and then ship overnight to the remote location. The result, the remote location has a system up and running by 10:30am, faster than they would have had from local staff trying to figure out how to recover from tape. In effect, Graybar had implemented a "private cloud" for backup in the 1990s, long before the concept was "cool" or "popular".
In 2001, they had an 18TB SAP ERP application data repository. To back this up, they took it down for 1 minute per day, six days a week, and 15 minutes down on Sundays. The result was less than 99.8 percent availability. To fix this, they switched to XIV, and use Snapshots that are non-disruptive and do not impact application performance.
Over 85 percent of the servers at Graybar are virtualized.
Their next challenge is Disaster Recovery. Currently, they have two datacenters, one in St. Louis and the other in Kansas City. However, in the aftermath of Japan's earthquakes, they realize there is a nuclear power plan between their two locations, so a single incident could impact both data centers. They are working with IBM, their trusted advisors, to investigate a three-site solution.
This week, May 15-22, I am in Auckland, New Zealand teaching IBM Storage Top Gun sales class. Next week, I will be in Sydney, Australia.
technorati tags: IBM, summit, NYC, Daniel Sabbah, TSM, Storwize, , TS7600, ProtecTIER, TS1140, tape, USGA, Graybar, Randy Miller, SAP, ERP, Disaster Recovery, New Zealand, Australia, Top Gun