Tony Pearson is a Master Inventor and Senior IT Architect for the IBM Storage product line at the IBM Executive Briefing Center in Tucson, Arizona, and a featured contributor to IBM's developerWorks. In 2016, Tony celebrates his 30th anniversary with IBM Storage. He is the author of the Inside System Storage series of books. This blog is for the open exchange of ideas relating to storage and storage networking hardware, software and services.
(Short URL for this blog: ibm.co/Pearson )
Safe Harbor Statement: The information on IBM products is intended to outline IBM's general product direction and it should not be relied on in making a purchasing decision. The information on the new products is for informational purposes only and may not be incorporated into any contract. The information on IBM products is not a commitment, promise, or legal obligation to deliver any material, code, or functionality. The development, release, and timing of any features or functionality described for IBM products remains at IBM's sole discretion.
Tony Pearson is an active participant in local, regional, and industry-specific interests, and does not receive any special payments to mention them on this blog.
Tony Pearson receives part of the revenue proceeds from sales of books he has authored listed in the side panel.
Tony Pearson is not a medical doctor, and this blog does not reference any IBM product or service that is intended for use in the diagnosis, treatment, cure, prevention or monitoring of a disease or medical condition, unless otherwise specified on individual posts.
“In times of universal deceit, telling the truth will be a revolutionary act.”
-- George Orwell
Well, it has been over two years since I first covered IBM's acquisition of the XIV company. Amazingly, I still see a lot of misperceptions out in the blogosphere, especially those regarding double drive failures for the XIV storage system. Despite various attempts to [explain XIV resiliency] and to [dispel the rumors], there are still competitors making stuff up, putting fear, uncertainty and doubt into the minds of prospective XIV clients.
Clients love the IBM XIV storage system! In this economy, companies are not stupid. Before buying any enterprise-class disk system, they ask the tough questions, run evaluation tests, and perform all the other due diligence often referred to as "kicking the tires". Here is what some IBM clients have said about their XIV systems:
“3-5 minutes vs. 8-10 hours rebuild time...”
-- satisfied XIV client
“...we tested an entire module failure - all data is re-distributed in under 6 hours...only 3-5% performance degradation during rebuild...”
-- excited XIV client
“Not only did XIV meet our expectations, it greatly exceeded them...”
In this blog post, I hope to set the record straight. It is not my intent to embarrass anyone in particular, so instead I will focus on a fact-based approach.
Fact: IBM has sold THOUSANDS of XIV systems
XIV is "proven" technology with thousands of XIV systems in company data centers. And by systems, I mean full disk systems with 6 to 15 modules in a single rack, twelve drives per module. That equates to hundreds of thousands of disk drives in production TODAY, comparable to the number of disk drives studied by [Google], and [Carnegie Mellon University] that I discussed in my blog post [Fleet Cars and Skin Cells].
Fact: To date, no customer has lost data as a result of a Double Drive Failure on XIV storage system
This has always been true, both when XIV was a stand-alone company and since the IBM acquisition two years ago. When examining the resilience of an array to any single or multiple component failures, it's important to understand the architecture and the design of the system and not assume all systems are alike. At its core, XIV is a grid-based storage system. IBM XIV does not use traditional RAID-5 or RAID-10 methods, but instead distributes data across loosely connected data modules which act as independent building blocks. XIV divides each LUN into 1MB "chunks", and stores two copies of each chunk on separate drives in separate modules. We call this "RAID-X".
Spreading all the data across many drives is not unique to XIV. Many disk systems, including EMC CLARiiON-based V-Max, HP EVA, and Hitachi Data Systems (HDS) USP-V, allow customers to get XIV-like performance by spreading LUNs across multiple RAID ranks. This is known in the industry as "wide-striping". Some vendors use the terms "metavolumes" or "extent pools" to refer to their implementations of wide-striping. Clients have coined their own phrases, such as "stripes across stripes", "plaid stripes", or "RAID 500". It is highly unlikely that an XIV will experience a double drive failure that ultimately requires recovery of files or LUNs, and an XIV is substantially less vulnerable to data loss than an EVA, USP-V or V-Max configured in RAID-5. Fellow blogger Keith Stevenson (IBM) compared XIV's RAID-X design to other forms of RAID in his post [RAID in the 21st Century].
Fact: IBM XIV is designed to minimize the likelihood and impact of a double drive failure
The independent failure of two drives is a rare occurrence. More data has been lost from hash collisions on EMC Centera than from double drive failures on XIV, and hash collisions are also very rare. While the published worst-case time to re-protect from a 1TB drive failure for a fully-configured XIV is 30 minutes, field experience shows XIV regaining full redundancy on average in 12 minutes. Because that window is about 40 times shorter than the typical 8-10 hour rebuild window for a RAID-5 configuration, a second drive failure during it is about 40 times less likely.
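As a rough check of the "40 times" figure, here is a minimal sketch. It assumes the chance of a second independent drive failure scales linearly with the length of the unprotected window, which is a simplification for illustration and not IBM's published reliability model. The function name is hypothetical.

```python
def exposure_ratio(raid5_rebuild_hours: float, xiv_reprotect_minutes: float) -> float:
    """How many times longer the RAID-5 rebuild window is than the XIV window.

    Assumption: risk of a second independent failure is proportional to
    the time spent without full redundancy.
    """
    return raid5_rebuild_hours * 60 / xiv_reprotect_minutes


if __name__ == "__main__":
    # 12 minutes is the average field re-protect time quoted above.
    print(exposure_ratio(8, 12))   # 40.0
    print(exposure_ratio(10, 12))  # 50.0
```

With the 8-hour end of the RAID-5 range the ratio is 40; with the 10-hour end it is 50, so "40 times" is the conservative figure.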
A lot of bad things can happen in those 8-10 hours of traditional RAID rebuild. Performance can be seriously degraded. Other components may be affected, as they share cache, are connected to the same backplane or bus, or are co-dependent in some other manner. An engineer supporting the customer onsite during a RAID-5 rebuild might pull the wrong drive, thereby causing the double drive failure they were hoping to avoid. Having IBM XIV rebuild in only a few minutes addresses this "human factor".
In his post [XIV drive management], fellow blogger Jim Kelly (IBM) covers a variety of reasons why storage admins feel double drive failures are more than just random chance. XIV avoids load stress normally associated with traditional RAID rebuild by evenly spreading out the workload across all drives. This is known in the industry as "wear-leveling". When the first drive fails, the recovery is spread across the remaining 179 drives, so that each drive only processes about 1 percent of the data. The [Ultrastar A7K1000] 1TB SATA disk drives that IBM uses from HGST have a specified mean-time-between-failures [MTBF] of 1.2 million hours, which would average about one drive failing every nine months in a 180-drive XIV system. However, field experience shows that an XIV system will experience, on average, one drive failure per 13 months, comparable to what companies experience with more robust Fibre Channel drives. That's innovative XIV wear-leveling at work!
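The nine-month figure follows from dividing the per-drive MTBF by the number of drives. The sketch below reproduces that arithmetic, assuming independent drives with a constant failure rate and an average month of 730.5 hours. The function names are illustrative only.

```python
HOURS_PER_MONTH = 365.25 * 24 / 12  # 730.5 hours in an average month


def months_between_failures(mtbf_hours: float, drive_count: int) -> float:
    """Expected months between drive failures across a whole system.

    Assumption: drives fail independently at a constant rate, so the
    system-wide interval is the per-drive MTBF divided by the drive count.
    """
    return mtbf_hours / drive_count / HOURS_PER_MONTH


def implied_mtbf_hours(months_per_failure: float, drive_count: int) -> float:
    """Per-drive MTBF implied by an observed system-wide failure interval."""
    return months_per_failure * HOURS_PER_MONTH * drive_count


if __name__ == "__main__":
    print(round(months_between_failures(1_200_000, 180), 1))  # about 9.1 months
    print(round(implied_mtbf_hours(13, 180)))                  # about 1.7 million hours
    print(round(100 / 179, 2))  # share of rebuild work per surviving drive, in percent
```

Under these assumptions, the 13-month field experience corresponds to an effective per-drive MTBF of roughly 1.7 million hours, and each of the 179 surviving drives handles a little over half a percent of the rebuild.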
Fact: In the highly unlikely event that a DDF were to occur, you will have full read/write access to nearly all of your data on the XIV, all but a few GB.
Even though it has NEVER happened in the field, some clients and prospects are curious what a double drive failure on an XIV would look like. First, a critical alert message would be sent to both the client and IBM, and a "union list" is generated, identifying all the chunks the two failed drives have in common. The worst case on a 15-module XIV fully loaded with 79TB of data is approximately 9000 chunks, or 9GB of data. The remaining 78.991 TB of unaffected data are fully accessible for read or write. Any I/O requests for the chunks in the "union list" will have no response yet, so there is no way for host applications to access outdated information or cause any corruption.
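To make the "union list" idea concrete, here is a small simulation sketch. It assumes a uniformly random placement of the two copies of each chunk on drives in two different modules. The real XIV distribution algorithm is not described in this post, so the placement function, names, and chunk counts are illustrative assumptions only.

```python
import random

MODULES = 15            # fully configured rack, as described above
DRIVES_PER_MODULE = 12  # twelve drives per module


def place_chunks(num_chunks, seed=0):
    """Assign each chunk two copies on drives in two different modules.

    Assumption: uniform random placement, used only for illustration.
    A drive is identified as a (module, slot) tuple.
    """
    rng = random.Random(seed)
    placement = []
    for _ in range(num_chunks):
        m1, m2 = rng.sample(range(MODULES), 2)  # two distinct modules
        copy_a = (m1, rng.randrange(DRIVES_PER_MODULE))
        copy_b = (m2, rng.randrange(DRIVES_PER_MODULE))
        placement.append((copy_a, copy_b))
    return placement


def union_list(placement, failed_a, failed_b):
    """Indexes of chunks whose two copies sit on exactly the two failed drives."""
    failed = {failed_a, failed_b}
    return [i for i, copies in enumerate(placement) if set(copies) == failed]


def worst_case_gb(chunks, chunk_mb=1):
    """Capacity covered by a union list, in decimal GB."""
    return chunks * chunk_mb / 1000


if __name__ == "__main__":
    placement = place_chunks(200_000)
    # Two drives in the same module never share a chunk, so nothing is at risk.
    print(len(union_list(placement, (0, 0), (0, 1))))
    # Two drives in different modules share only a small fraction of chunks.
    print(len(union_list(placement, (0, 0), (1, 0))))
    print(worst_case_gb(9000), 79 - worst_case_gb(9000) / 1000)
```

Every chunk outside the union list still has at least one good copy, which is why the rest of the data stays available for reads and writes.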
(One blogger compared losing data on the XIV to drilling a hole through the phone book. Mathematically, the drill bit would be only 1/16th of an inch, or about 1.6 millimeters for you folks outside the USA. Enough to knock out perhaps one character from a name or phone number on each page. If you have ever seen an actor in the movies look up a phone number in a telephone booth then yank out a page from the phone book, the XIV equivalent would be cutting out 1/8th of a page from an 1100 page phone book. In both cases, all of the rest of the unaffected information is fully accessible, and it is easy to identify which information is missing.)
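The phone book analogy can be checked with a few lines of arithmetic. This sketch uses decimal units (1 TB = 1000 GB) and hypothetical function names.

```python
def xiv_fraction_affected(affected_gb: float, total_tb: float) -> float:
    """Fraction of stored data in the worst-case union list (decimal units)."""
    return affected_gb / (total_tb * 1000)


def phone_book_fraction(page_fraction: float, pages: int) -> float:
    """Fraction of a phone book removed by cutting part of a single page."""
    return page_fraction / pages


if __name__ == "__main__":
    xiv = xiv_fraction_affected(9, 79)       # 9 GB out of 79 TB
    book = phone_book_fraction(1 / 8, 1100)  # 1/8th of a page out of 1100 pages
    print(f"{xiv:.6%} vs {book:.6%}")
```

Both fractions come out to a little over one hundredth of one percent, so the analogy holds.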
If the second drive failed several minutes after the first drive, the process for full redundancy is already well under way. This means the union list is considerably shorter or completely empty, and substantially fewer chunks are impacted. Contrast this with RAID-5, where being 99 percent complete on the rebuild when the second drive fails is just as catastrophic as having both drives fail simultaneously.
Fact: After a DDF event, the files on these few GB can be identified for recovery.
Once IBM receives notification of a critical event, an IBM engineer immediately connects to the XIV using the remote service support method. There is no need to send someone physically onsite; the repair actions can be done remotely. The IBM engineer has tools from HGST to recover, in most cases, all of the data.
Any "union" chunk that the HGST tools are unable to recover will be set to "media error" mode. The IBM engineer can provide the client a list of the XIV LUNs and LBAs that are on the "media error" list. From this list, the client can determine which hosts these LUNs are attached to, and run a file scan utility against the file systems that these LUNs represent. Files that get a media error during this scan will be listed as needing recovery. A chunk could contain several small files, or the chunk could be just part of a large file. To minimize time, the scans and recoveries can all be prioritized and performed in parallel across host systems zoned to these LUNs.
As with any file or volume recovery, keep in mind that these might be part of a larger consistency group, and that your recovery procedures should make sense for the applications involved. In any case, you are probably going to be up-and-running in less time with XIV than recovery from a RAID-5 double failure would take, and certainly nowhere near the "beyond repair" state that other vendors might have you believe.
Fact: This does not mean you can eliminate all Disaster Recovery planning!
To put this in perspective, you are more likely to lose XIV data from an earthquake, hurricane, fire or flood than from a double drive failure. As with any unlikely disaster, it is better to have a disaster recovery plan than to hope it never happens. All disk systems that sit on a single datacenter floor are vulnerable to such disasters.
For mission-critical applications, IBM recommends using a disk mirroring capability. The IBM XIV storage system offers synchronous and asynchronous mirroring natively, both included at no additional charge.
In keeping with the spirit of a kinder, gentler 2011, I decided last week to refrain from raining on someone else's parade, as often happens immediately before, during or after a competitor's announcement or annual conference, and to let EMC have their few moments in the spotlight. This of course allowed me more time to learn about the announcements and reflect on marketplace reactions. Here's a quick look at the [EMC Press Release]:
A new VNXe disk system
Of the 41 new storage technologies and products EMC announced last week, the VNXe is EMC's "me-too" product to compete against other low-end disk systems like the IBM System Storage DS3524 and N3000 series. It looks truly new, developed organically from the ground up, with a new architecture and a new OS. It comes in either the 2U-high VNXe3100 or the 3U-high VNXe3300. These employ 3.5-inch SAS drives to provide Ethernet-based NFS, CIFS and iSCSI host attachment. The $10K USD price tag appears to be for the hardware only. As is typical for EMC, they charge for software features in bundles or "suites", so the actual TCO will be much higher. I have not seen any announcements on whether Dell plans to resell either the VNXe or the VNX models, now that they have acquired Compellent.
A new VNX disk system
Despite having a similar name to the VNXe, the VNX appears to be a re-hash of the Celerra/CLARiiON mess that EMC has been selling already, based on the old FLARE and DART operating systems of these older disk systems. This scales from 75 to 1000 SAS drives. While EMC calls the VNX "unified", it is currently available only in block-only and file-only models, with a promise from EMC that they will offer a combined block-and-file version sometime in the future. EMC claims that the VNX will be faster than its predecessors, so hopefully that means EMC has joined the rest of the planet and will publish SPC-1 and SPC-2 benchmarks to back up that claim. They can compare against the SPC-1 benchmarks that our friends at NetApp ran against EMC CLARiiON.
New software for the VMAX
A long time ago, EMC announced they would provide non-disruptive automated tiering. Their first delivery, "FAST V1", handled entire LUNs at a time. EMC now finally has "FAST VP" (which we expected to be called "FAST V2"), providing sub-LUN automated tiering between solid-state and spinning disk drives. Meanwhile, IBM has been delivering "Easy Tier" on the IBM System Storage DS8000 series, SAN Volume Controller, and Storwize V7000 disk systems.
Data Domain Archiver
Competing against IBM, HP and Oracle in the tape arena, EMC's latest addition to the Data Domain family is designed for the long-term retention of backups? Archives of backups? Backups are short-lived, protecting against the unexpected loss from hardware failure or data corruption. Keeping backups as "archives" is generally a mistake, as it makes it hard to e-Discover the data you need when you need it, and you may not have the appropriate hardware to restore these old backups when you do find them.
I will have to dig deeper into all of these different technologies in separate posts in the future.
Continuing coverage of my week in Washington DC for the annual [2010 System Storage Technical University], I attended several XIV sessions throughout the week. There were many XIV sessions. I could not attend all of them. Jack Arnold, one of my colleagues at the IBM Tucson Executive Briefing Center, often presents XIV to clients and Business Partners. He covered all the basics of XIV architecture, configuration, and features like snapshots and migration. Carlos Lizarralde presented "Solving VMware Challenges with XIV". Ola Mayer presented "XIV Active Data Migration and Disaster Recovery".
Here is my quick recap of two in particular that I attended:
XIV Client Success Stories - Randy Arseneau
Randy reported that IBM had its best quarter ever for the XIV, reflecting an unexpected surge shortly after my blog post debunking the DDF myth last April. He presented successful case studies of client deployments. Many followed a familiar pattern. First, the client would only purchase one or two XIV units. Second, the client would beat the crap out of them, putting all kinds of stress from different workloads. Third, the client would discover that the XIV is really as amazing as IBM and IBM Business Partners have told them. Finally, in the fourth phase, the client would deploy the XIV for mission-critical production applications.
A large US bank holding company managed to get 5.3 GB/sec from a pair of XIV boxes for their analytics environment. They now have 14 XIV boxes deployed in mission-critical applications.
A large equipment manufacturer compared the offerings among seven different storage vendors, and IBM XIV came out the winner. They now have 11 XIV boxes in production and another four boxes for development/test. They have moved their entire VMware infrastructure to IBM XIV, running over 12,000 guest instances.
A financial services company bought their first XIV in early 2009 and now has 34 XIV units in production attached to a variety of Windows, Solaris, AIX, Linux servers and VMware hosts. Their entire Microsoft Exchange was moved from HP and EMC disk to IBM XIV, and experienced noticeable performance improvement.
When a University health system replaced two competitive disk systems with XIV, their data center temperature dropped from 74 to 68 degrees Fahrenheit. In general, XIV systems are 20 to 30 percent more energy efficient per usable TB than traditional disk systems.
A service provider that had used EMC disk systems for over 10 years evaluated the IBM XIV versus upgrading to EMC V-Max. The three year total cost of ownership (TCO) of EMC's V-Max was $7 Million US dollars higher, so EMC counter-proposed CLARiiON CX4 instead. But, in the end, IBM XIV proved to be the better fit, and now the customer is happy having made the switch.
The manager of an information communications technology service provider was impressed that the XIV was up and running in just a couple of days. They now have over two dozen XIV systems.
Another XIV client had lost all of their Computer Room Air Conditioning (CRAC) units for several hours. The data center heated up to 126 degrees Fahrenheit, but the customer did not lose any data on either of their two XIV boxes, which continued to run in these extreme conditions.
Optimizing XIV Performance - Brian Cormody
This session was an update from the [one presented last year] by Izhar Sharon. Brian presented various best practices for optimizing the performance when using specific application workloads with IBM XIV disk systems.
Oracle ASM: Many people allocate lots of small LUNs, because this made sense a long time ago when all you had was just a bunch of disks (JBOD). In fact, many of the practices that DBAs use to configure databases across disks become unnecessary with XIV. With XIV, you are better off allocating a small number of very large LUNs from the XIV. The best option was a 1-volume ASM pool with 8MB AU stripe. A single LUN can contain multiple Oracle databases. A single LUN can be used to store all of the logs.
VMware: Over 70 percent of XIV customers use it with VMware. For VMFS, IBM recommends allocating a small number of large LUNs. You can specify the maximum of 2181 GB. Do not use VMware's internal LUN extension capability, as IBM XIV already has thin provisioning and it works better to let XIV do this for you. XIV Snapshots provide crash-consistent copies without all the overhead of VMware Snapshots.
SAP: For planning purposes, the "SAPS" unit equates roughly to 0.4 IOPS for ERP OLTP workloads, and 0.6 IOPS for BW/BI OLAP workloads. In general, an XIV can deliver 25,000-30,000 IOPS at 10-15 msec response time, and 60,000 IOPS at 30 msec response time. With SAP, our clients have managed to get 60,000 IOPS at less than 15 msec.
Microsoft Exchange: Even my friends in Redmond could not believe how awesome XIV was during ESRP testing. Five Exchange 2010 servers connected to a pair of XIV boxes using the new 2TB drawers managed 40,000 mailboxes at the high profile (0.15 IOPS per mailbox). Another client found four XIV boxes (720 drives) were able to handle 60,000 mailboxes (5GB max), which would have taken over 4000 drives if internal disk drives were used instead. Who said SANs are obsolete for MS Exchange?
Asynchronous Replication: IBM now has an "Async Calculator" to model and help design an XIV async replication solution. In general, dark fiber works best, and MPLS clouds had the worst results. The latest 10.2.2 microcode for the IBM XIV can now handle links as slow as 10 Mbps at less than 250 msec roundtrip. During the initial sync between locations, IBM recommends setting "schedule=never" to consume as much bandwidth as possible. If you don't trust the bandwidth measurements your telco provider is reporting, consider testing the bandwidth yourself with the [iPerf] open source tool.
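The rules of thumb from this session lend themselves to quick back-of-the-envelope sizing. The sketch below is not the IBM "Async Calculator" or any IBM tool. The function names, the 50,000 SAPS example, and the 1 TB initial-sync example are hypothetical, and the sync estimate ignores protocol overhead, compression and competing traffic.

```python
# Rule-of-thumb IOPS per SAPS, as quoted in the SAP item above.
IOPS_PER_SAPS = {"erp_oltp": 0.4, "bw_olap": 0.6}


def sap_iops(saps: float, workload: str) -> float:
    """Estimated IOPS for an SAP landscape of the given SAPS rating."""
    return saps * IOPS_PER_SAPS[workload]


def exchange_iops(mailboxes: int, iops_per_mailbox: float = 0.15) -> float:
    """Estimated IOPS for Exchange at a given per-mailbox profile."""
    return mailboxes * iops_per_mailbox


def initial_sync_days(capacity_tb: float, link_mbps: float) -> float:
    """Days to copy capacity_tb (decimal TB) over a fully used link.

    Assumption: the whole link is available to replication and there is
    no protocol overhead, so this is a lower bound.
    """
    bits = capacity_tb * 1e12 * 8
    seconds = bits / (link_mbps * 1e6)
    return seconds / 86_400


if __name__ == "__main__":
    print(sap_iops(50_000, "erp_oltp"))        # hypothetical 50,000 SAPS ERP system
    print(exchange_iops(40_000))               # the ESRP mailbox count quoted above
    print(round(initial_sync_days(1, 10), 1))  # 1 TB over a 10 Mbps link
```

The 40,000-mailbox ESRP configuration works out to about 6,000 IOPS, well within the 25,000-30,000 IOPS range quoted for a single XIV. The last line shows why the initial sync should be given as much bandwidth as possible: at 10 Mbps, even 1 TB takes more than nine days.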
Here I am, day 11 of a 17-day business trip, on my last leg of the trip this week, in Kuala Lumpur in Malaysia. I have been flooded with requests to give my take on EMC's latest re-interpretation of storage virtualization, VPLEX.
I'll leave it to my fellow IBM master inventor Barry Whyte to cover the detailed technical side-by-side comparison. Instead, I will focus on the business side of things, using Simon Sinek's Why-How-What sequence. Here is a [TED video] from Garr Reynolds' post [The importance of starting from Why].
Let's start with the problem we are trying to solve.
Problem: migration from old gear to new gear, old technology to new technology, from one vendor to another vendor, is disruptive, time-consuming and painful.
Given that IT storage is typically replaced every 3-5 years, then pretty much every company with an internal IT department has this problem, the exception being those companies that don't last that long, and those that use public cloud solutions. IT storage can be expensive, so companies would like their new purchases to be fully utilized on day 1, and be completely empty on day 1500 when the lease expires. I have spoken to clients who have spent 6-9 months planning for the replacement or removal of a storage array.
A solution to make the data migration non-disruptive would benefit the clients (make it easier for their IT staff to keep their data center modern and current) as well as the vendors (reduce the obstacle of selling and deploying new features and functions). Storage virtualization can be employed to help solve this problem. I define virtualization as "technology that makes one set of resources look and feel like a different set of resources, preferably with more desirable characteristics." By making different storage resources, old and new, look and feel like a single type of resource, migration can be performed without disrupting applications.
Before VPLEX, here is a breakdown of each solution:
Why
IBM: Non-disruptive tech refresh, and a unified platform to provide management and functionality across heterogeneous storage.
HDS: Non-disruptive tech refresh, and a unified platform to provide management and functionality between internal tier-1 HDS storage, and external tier-2 heterogeneous storage.
EMC: Non-disruptive tech refresh, with unified multi-pathing driver that allows host attachment of heterogeneous storage.
How
IBM: New in-band storage virtualization device
HDS: Add in-band storage virtualization to existing storage array
EMC: New out-of-band storage virtualization device with new "smart" SAN switches
What
IBM: SAN Volume Controller
HDS: HDS USP-V and USP-VM
EMC: Invista
For IBM, the motivation was clear: Protect customers' existing investment in older storage arrays and introduce new IBM storage with a solution that allows both to be managed with a single set of interfaces and provide a common set of functionality, improving capacity utilization and availability. IBM SAN Volume Controller eliminated vendor lock-in, providing clients choice in multi-pathing driver, and allowing any-to-any migration and copy services. For example, IBM SVC can be used to help migrate data from an old HDS USP-V to a new HDS USP-V.
With EMC, however, the motivation appeared to be to protect software revenues from their PowerPath multi-pathing driver, TimeFinder and SRDF copy services. Back in 2005, when EMC Invista was first announced, these three software products represented 60 percent of EMC's bottom-line profit. (Ok, I made that last part up, but you get my point! EMC charges a lot for these.)
Back in 2006, fellow blogger Chuck Hollis (EMC) suggested that SVC was just a [bump in the wire] which could not possibly improve performance of existing disk arrays. IBM showed clients that putting cache (SVC) in front of other cache (back-end devices) does indeed improve performance, in the same way that multi-core processors successfully use L1/L2/L3 cache. Now, EMC is claiming their cache-based VPLEX improves performance of back-end disk. My, how EMC's story has changed!
So now, EMC announces VPLEX, which sports a blend of SVC-like and Invista-like characteristics. Based on blogs, tweets and publicly available materials I found on EMC's website, I have been able to put together the following comparison of SVC, Invista and VPLEX. (Of course, VPLEX is not yet generally available, so what is eventually delivered may differ.)
Scalability
SVC: Scalable, 1 to 4 node-pairs
Invista: One size fits all, single pair of CPCs
VPLEX: SVC-like, 1 to 4 director-pairs
SAN switches
SVC: Works with any SAN switches or directors
Invista: Required special "smart" switches (vendor lock-in)
VPLEX: SVC-like, works with any SAN switches or directors
Multi-pathing drivers
SVC: Broad selection: IBM Subsystem Device Driver (SDD) offered at no additional charge, as well as OS-native drivers Windows MPIO, AIX MPIO, Solaris MPxIO, HP-UX PV-Links, VMware MPP, Linux DM-MP, and commercial third-party driver Symantec DMP.
Invista: Limited selection, with focus on priced PowerPath driver
VPLEX: Invista-like, PowerPath and Windows MPIO
Cache
SVC: Read cache, and choice of fast-write or write-through cache, offering the ability to improve performance.
Invista: No cache. Split-Path architecture cracked open Fibre Channel packets in flight, delayed every IO by 20 nanoseconds, and redirected modified packets to the appropriate physical device.
VPLEX: SVC-like, read and write-through cache, offering the ability to improve performance.
Space-Efficient Point-in-Time copies
SVC: SVC FlashCopy supports up to 256 space-efficient targets, copies of copies, read-only or writeable, and incremental persistent pairs.
Invista: No
VPLEX: Like Invista, No
Remote distance mirror
SVC: Choice of SVC Metro Mirror (synchronous up to 300km) and Global Mirror (asynchronous), or use the functionality of the back-end storage arrays
Invista: No native support, use functionality of back-end storage arrays, or purchase separate product called EMC RecoverPoint to cover this lack of functionality
VPLEX: Limited synchronous remote-distance mirror within VPLEX (up to 100km only), no native asynchronous support, use functionality of back-end storage arrays
Thin provisioning
SVC: Provides thin provisioning to devices that don't offer this natively
Invista: No
VPLEX: Like Invista, No
Split-cluster access
SVC: SVC Split-Cluster allows concurrent read/write access to data from hosts at two different locations several miles apart
Invista: I don't think so
VPLEX: VPLEX-Metro, similar in concept but implemented differently
Non-disruptive tech refresh
SVC: Can upgrade or replace storage arrays, SAN switches, and even the SVC nodes software AND hardware themselves, non-disruptively
Invista: Tech refresh for storage arrays, but not for Invista CPCs
VPLEX: Tech refresh of back end devices, and upgrade of VPLEX software, non-disruptively. Not clear if VPLEX engines themselves can be upgraded non-disruptively like the SVC.
Heterogeneous Storage Support
SVC: Broad support of over 140 different storage models from all major vendors, including all CLARiiON, Symmetrix and VMAX from EMC, and storage from many smaller startups you may not have heard of
VPLEX: Invista-like. VPLEX claims to support a variety of arrays from a variety of vendors, but as far as I can find, only DS8000 is supported from the list of IBM devices. Fellow blogger Barry Burke (EMC) suggests [putting SVC between VPLEX and third party storage devices] to get the heterogeneous coverage most companies demand.
Back-end storage requirement
SVC: Must define quorum disks on any IBM or non-IBM back end storage array. SVC can run entirely on non-IBM storage arrays
VPLEX: HP SVSP-like, requires at least one EMC storage array to hold metadata
Solid-state drives
SVC: SVC 2145-CF8 model supports up to four solid-state drives (SSD) per node that can be treated as managed disk to store end-user data
VPLEX: Invista-like. VPLEX has an internal 30GB SSD, but this is used only for operating system and logs, not for end-user data.
In-band virtualization solutions from IBM and HDS dominate the market. Being able to migrate data from old devices to new ones non-disruptively turned out to be only the [tip of the iceberg] of benefits from storage virtualization. In today's highly virtualized server environment, being able to non-disruptively migrate data comes in handy all the time. SVC is one of the best storage solutions for VMware, Hyper-V, XEN and PowerVM environments. EMC watched and learned in the shadows, taking notes of what people like about the SVC, and decided to follow IBM's time-tested leadership to provide a similar offering.
EMC re-invented the wheel, and it is round. On a scale from Invista (zero) to SVC (ten), I give EMC's new VPLEX a six.