Tony Pearson is a Master Inventor and Senior IT Architect for the IBM Storage product line at the
IBM Executive Briefing Center in Tucson Arizona, and featured contributor
to IBM's developerWorks. In 2016, Tony celebrates his 30th year anniversary with IBM Storage. He is
author of the Inside System Storage series of books. This blog is for the open exchange of ideas relating to storage and storage networking hardware, software and services.
(Short URL for this blog: ibm.co/Pearson )
My books are available on Lulu.com! Order your copies today!
Safe Harbor Statement: The information on IBM products is intended to outline IBM's general product direction and it should not be relied on in making a purchasing decision. The information on the new products is for informational purposes only and may not be incorporated into any contract. The information on IBM products is not a commitment, promise, or legal obligation to deliver any material, code, or functionality. The development, release, and timing of any features or functionality described for IBM products remains at IBM's sole discretion.
Tony Pearson is a an active participant in local, regional, and industry-specific interests, and does not receive any special payments to mention them on this blog.
Tony Pearson receives part of the revenue proceeds from sales of books he has authored listed in the side panel.
Tony Pearson is not a medical doctor, and this blog does not reference any IBM product or service that is intended for use in the diagnosis, treatment, cure, prevention or monitoring of a disease or medical condition, unless otherwise specified on individual posts.
The new [IBM System Storage Tape Controller 3592 Model C07] is an upgrade to the previous C06 controller. Like the C06, the new 3592-C07 can have up to four FICON (4Gbps) ports, four FC ports, and connect up to 16 drives. The difference is that the C07 supports 8Gbps speed FC ports, and can support the [new TS1140 tape drives that were announced on May 9]. A cool feature of the C07 is that it has a built-in library manager function for the mainframe. On the previous models, you had to have a separate library manager server.
Crossroads ReadVerify Appliance (3222-RV1)
IBM has entered an agreement to resell [Crossroads ReadVerify Appliance], or "RV1" for short. The RV1 is a 1U-high server with software that gathers information on the utilization, performance and health for a physical tape environment, such as an IBM TS3500 Tape Library. The RV1 also offers a feature called "ArchiveVerify" which validates long-term retention archive tapes, providing an audit trail on the readability of tape media. This can be useful for tape libraries attached behind IBM Information Archive compliance storage solution, or the IBM Scale-Out Network Attached Storage (SONAS).
As an added bonus, Crossroads has great videos! Here's one, titled [Tape Sticks]
Linear Tape File System (LTFS) Library Edition Version 2.1
While the hardware is all refreshed, the overall "scale-out" architecture is unchanged. Kudos to the XIV development team for designing a system that is based entirely on commodity hardware, allowing new hardware generations to be introduced with minimal changes to the vast number of field-proven software features like thin provisioning, space-efficient read-only and writeable snapshots, synchronous and asynchronous mirroring, and Quality of Service (QoS) performance classes.
The new XIV Gen3 features an Infiniband interconnect, faster 8Gbps FC ports, more iSCSI ports, faster motherboard and processors, SAS-NL 2TB drives, 24GB cache memory per XIV module, all in a single frame IBM rack that supports the IBM Rear Door Heat Exchanger. The results are a 2x to 4x boost in performance for various workloads. Here are some example performance comparisons:
Disclaimer: Performance is based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput that any user will experience will vary depending upon considerations such as the amount of multiprogramming in the user's job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve throughput improvements equivalent to the performance ratios stated here. Your mileage may vary.
In a Statement of Direction, IBM also has designed the Gen3 modules to be "SSD-ready" which means that you can insert up to 500GB of Solid-State drive capacity per XIV module, up to 7.5TB in a fully-configured 15 module frame. This SSD would act as an extension of DRAM cache, similar to how Performance Accelerator Modules (PAM) on IBM N series.
IBM will continue to sell XIV Gen2 systems for the next 12-18 months, as some clients like the smaller 1TB disk drives. The new Gen3 only comes with 2TB drives. There are some clients that love the XIV so much, that they also use it for less stringent Tier 2 workloads. If you don't need the blazing speed of the new Gen3, perhaps the lower cost XIV Gen2 might be a great fit!
As if I haven't said this enough times already, the IBM XIV is a Tier-1, high-end, enterprise-class disk storage system, optimized for use with mission critical workloads on Linux, UNIX and Windows operating systems, and is the ideal cost-effective replacement for EMC Symmetrix VMAX, HDS USP-V and VSP, and HP P9000 series disk systems, . Like the XIV Gen2, the XIV Gen3 can be used with IBM System i using VIOS, and with IBM System z mainframes running Linux, z/VM or z/VSE. If you run z/OS or z/TPF with Count-Key-Data (CKD) volumes and FICON attachment, go with the IBM System Storage DS8000 instead, IBM's other high-end disk system.
Continuing my post-week coverage of the [Data Center 2010 conference], Thursday morning had some interesting sessions for those that did not leave town last night.
Interactive Session Results
In addition to the [Profile of Data Center 2010] that identifies the demographics of this year's registrants, the morning started with highlights of the interactive polls during the week.
External or Heterogeneous Storage Virtualization
The analyst presented his views on the overall External/Heterogeneous Storage Virtualization marketplace. He started with the key selling points.
Avoid vendor lock-in. Unlike the IBM SAN Volume Controller, many of the other storage virtualization products result in vendor lock-in.
Leverage existing back-end capacity. Limited to what back-end storage devices are supported.
Simplify and unify management of storage. Yes, mostly.
Lower storage costs. Unlike the IBM SAN Volume Controller, many using other storage virtualization discover an increase in total storage costs.
Migration tools. Yes, as advertised.
Consolidation/Transition. Yes, over time.
Better functionality. Potentially.
Shortly after several vendors started selling external/heterogeneous storage virtualization solutions, either as software or pre-installed appliances, major storage vendors that were caught with their pants down immediately started calling everything internally as also "storage virtualization" to buy some time and increase confusion.
While the analyst agreed that storage virtualization simplifies the view of storage from the host server side, it can complicate the management of storage on the storage end. This often comes up at the Tucson Briefing Center. I explain this as the difference between manual and automatic transmission cars. My father was a car mechanic, and since he is the sole driver and sole mechanic, he prefers manual transmission cars, easier to work on. However, rental car companies, such as Hertz or Avis, prefer automatic transmission cars. This might require more skills on behalf of their mechanics, but greatly simplifies the experience for those driving.
The analyst offered his views on specific use cases:
Data Migration. The analyst feels that external virtualization serves as one of the best tools for data migration. But what about tech refresh of the storage virtualization devices themselves? Unlike IBM SAN Volume Controller, which allows non-disruptive upgrades of the nodes themselves, some of the other solutions might make such upgrades difficult.
Consolidation/Transition. External virtualization can also be helpful, depending on how aggressive the schedule for consolidation/transition is performed.
Improved Functionality/Usability. IBM SAN Volume Controller is a good example, an unexpected benefit. Features like thin provisioning, automated storage tiering, and so on, can be added to existing storage equipment.
The analyst mentioned that there were different types of solutions. The first category were those that support both internal storage and external storage virtualization, like the HDS USP-V or IBM Storwize V7000. He indicated that roughly 40 percent of HDS USP-V are licensed for virtualization. The second category were those that support external virtualization only, such as IBM SAN Volume Controller, HP Lefthand and SVSP, and so on. The third category were software-only Virtual Guest images that could provide storage virtualization capabilities.
The analyst mentioned EMC's failed product Invista, which sold less than 500 units over the past five years. The low penetration for external virtualization, estimated between 2-5 percent, could be explained from the bad taste that left in everyone considering their options. However, the analyst predicts that by 2015, external virtualization will reach double digit marketshare.
Having a feel for the demographics of the registrants, and specific interactive polling in each meeting, provides a great view on who is interested in what topic, and some insight into their fears and motivations.
This week, Hitachi Ltd. announced their next generation disk storage virtualization array, the Virtual Storage Platform, following on the success of its USP V line. It didn't take long for fellow blogger Chuck Hollis (EMC) to comment on this in his blog post [Hitachi's New VSP: Separating The Wheat From The Chaff]. Here are some excerpts:
"Well, we all knew that Hitachi (through HDS and HP) would be announcing some sort of refresh to their high-end storage platform sooner or later.
As EMC is Hitachi's only viable competitor in this part of the market, I think people are expecting me to say something.
If you're a high-end storage kind of person, your universe is basically a binary star: EMC and Hitachi orbiting each other, with the interesting occasional sideshow from other vendors trying to claim relevance in this space."
Chuck implies that neither Hewlett-Packard (HP) nor Hitachi Data Systems (HDS) as vendors provide any value-add from the box manufactured by Hitachi Ltd. so combines them into a single category. I suspect the HP and HDS folks might disagree with that opinion.
When I reminded Chuck that IBM was also a major player in the high-end disk space, his response included the following gem:
"Many of us in the storage industry believe that IBM currently does not field a competitive high-end storage platform. IDC market share numbers bear out this assertion, as you probably know."
While Chuck is certainly entitled to his own beliefs and opinions, believing the world is flat does not make it so. Certainly, I doubt IDC or any other market research firm has put out a survey asking "Do you think IBM offers a competitive high-end disk storage platform?" Of course, if Chuck is basing his opinion on anecdotal conversations with existing EMC customers, I can certainly see how he might have formed this misperception. However, IDC market share numbers don't support Chuck's assertion at all.
There is no industry-standard definition of what is a "high-end" or "enterprise-class" disk system. Some define high-end as having the option for mainframe attachment via ESCON and/or FICON protocol. Others might focus on features, functionality, scalability and high 99.999+ percent availability. Others insist high-end requires block-oriented protocols like FC and iSCSI, rather than file-based protocols like NAS and CIFS.
For the most demanding mission-critical mix of random and sequential workloads, IBM offers the [IBM System Storage DS8000 series] high-end disk system which connects to mainframes and distributed servers, via FCP and FICON attachment, and supports a variety of drive types and RAID levels. The features that HP and HDS are touting today for the VSP are already available on the IBM DS8000, including sub-LUN automatic tiering between Solid-State drives and spinning disk, called [Easy Tier], thin provisioning, wide striping, point-in-time copies, and long distance synchronous and asynchronous replication.
There are lots of analysts that track market share for the IT storage industry, but since Chuck mentions [IDC] specifically, I reviewed the most recent IDC data, published a few weeks ago in their "IDC Worldwide Quarter Disk Storage Tracker" for 2Q 2010, representing April 1 to June 30, 2010 sales. Just in case any of the rankings have changed over time, I also looked at the previous four quarters: 2Q 2009, 3Q 2009, 4Q 2009 and 1Q 2010.
(Note: IDC considers its analysis proprietary, out of respect for their business model I will not publish any of the actual facts and figures they have collected. If you would like to get any of the IDC data to form your own opinion, contact them directly.)
In the case of IDC, they divide the disk systems into three storage classes: entry-level, midrange and high-end. Their definition of "high-end" is external RAID-protected disk storage that sells for $250,000 USD or more, representing roughly 25 to 30 percent of the external disk storage market overall. Here are IDC's rankings of the four major players for high-end disk systems:
By either measure of market share, units (disk systems) or revenue (US dollars), IDC reports that IBM high-end disk outsold both HDS and HP combined. This has been true for the past five quarters. If a smaller start-up vendor has single digit percent market share, I could accept it being counted as part of Chuck's "occasional sideshow from other vendors trying to claim relevance", but IBM high-end disk has consistently had 20 to 30 percent market share over the past five quarters!
Not all of these high-end disk systems are connected to mainframes. According to IDC data, only about 15 to 25 percent of these boxes are counted under their "Mainframe" topology.
Chuck further writes:
"It's reasonable to expect IBM to sell a respectable amount of storage with their mainframes using a protocol of their own design -- although IBM's two competitors in this rather proprietary space (notably EMC and Hitachi) sell more together than does IBM."
The IDC data doesn't support that claim either, Chuck. By either measure of market share, units (disk systems) or revenue (US dollars), IDC reports that IBM disk for mainframes outsold all other vendors (including EMC, HDS, and HP) combined. And again, this has been true for the past five quarters. Here is the IDC ranking for mainframe disk storage:
IBM has over 50 percent market share in this case, primarily because IBM System Storage DS8000 is the industry leader in mainframe-related features and functions, and offers synergy with the rest of the z/Architecture stack.
So Chuck, I am not picking a fight with you or asking you to retract or correct your blog post. Your main theme, that the new VSP presents serious competition to EMC's VMAX high-end disk arrays, is certainly something I can agree with. Congratulations to HDS and HP for putting forth what looks like a viable alternative to EMC's VMAX.
To learn more about IBM's upcoming products, register for next week's webcast "Taming the Information Explosion with IBM Storage" featuring Dan Galvan, IBM Vice President, and Steve Duplessie, Senior Analyst and Founder of Enterprise Storage Group (ESG).
Last week, in Computer Technology Review's article [Tiering: Scale Up? Scale Out? Do Both], Mark Ferelli interviews fellow blogger Hu Yoshida, CTO of Hitachi Data Systems (HDS). Here's an excerpt:
"MF/CTR: A global cache should be required to implement that common pool that you’re talking about going across all tiers.
Hu/HDS: Right. So that is needed to get to all the resources. Now with our system, we can also attach external storage behind it for capacity so that as the storage ages out or becomes less active we can move it to the external storage. They would certainly have less performance capability, but you don’t need it for the stale data that we’re aging down. Right now we’re the only vendor that can provide this type of tiering.
If you look at other people who do virtualization like IBM’s SVC, the SVC has no storage within it because it’s sitting so if you attach any storage behind it, there is some performance degradation because you have this appliance sitting in front. That appliance is also very limited in cache and very limited in the number of storage boards on it. It cannot really provide you additional performance than what is attached behind it. And in fact, it will always degrade what is attached behind it because it’s not storage, where as our USP is storage and it has a global cache and it has thousands of port connections, load balancing and all that. So our front end can enhance existing storage that sits behind it."
This is not the first time I have had to correct Hu and others of misperceptions of IBM's SAN Volume Controller (SVC). This month marks my four year "blogoversary", and I seem to spend a large portion of my blogging time setting the record straight. Here are just a few of my favorite posts setting the record straight on SVC back in 2007:
Since day 1, SAN Volume Controllers has focused primarily on external storage. Initially, the early models had just battery-protected DRAM cache memory, but the most recent model of the SVC, the 2145-CF8, adds support for internal SLC NAND flash solid state drives. To fully appreciate how SVC can help improve the performance of the disks that are managed, I need to use some visual aids.
In this first chart, we look at a 70/30/50 workload. This indicates that 70 percent of the IOPS are reads, 30 percent writes, and 50 percent can be satisfied as cache hits directly from the SVC. For the reads, this means that 50 percent are read-hits satisfied from SVC DRAM cache, and 50 percent are read-miss that have to get the data from the managed disk, either from the managed disk's own cache, or from the actual spinning drives inside that managed disk array.
For writes, all writes are cache-hits, but some of them will be destaged to the managed disk. Typically, we find that a third of writes are over-written before this happens, so only two-thirds are written down to managed disk.
In this example, the SVC reduced the burden of the managed disk from 100,000 IOPS down to 55,000, which is 35,000 reads and 20,000 writes. Some have argued against putting one level of cache (SVC) in front of another level of cache (managed disk arrays). However, CPU processor designers have long recognized the value of hierarchical cache with L1, L2, L3 and sometimes even L4 caches. The cache-hits on SVC are faster than most disk system's cache-hits.
This is a Ponder curve, mapping millisecond response (MSR) times for different levels of I/O per second, named after the IBM scientist John Ponder that created them. Most disk array vendors will publish similar curves for each of their products. In this case, we see that 100,000 IOPS would cause a 25 millisecond response (MSR) time, but when the load is reduced to 55,000 IOPS, the average response time drops to only 7 msec.
To be fair, the SVC does introduce 0.06 msec of additional latency on read-misses, so let's call this 7.06 msec. This tiny amount of latency could be what Hu Yoshida was referring to when he said there was "some performance degradation". There are other storage virtualization products in the market that do not provide caching to boost performance, but rather just map incoming requests to outgoing requests, and these can indeed slow down every I/O they process. Perhaps Hu was thinking of those instead of IBM's SVC when he made his comments.
Of course, not all workloads are 70/30/50, and not every disk array is driven to its maximum capability, so your mileage may vary. As we slide down the left of the curve where things are flatter, the improvement in performance lowers.
IOPS before SVC
IOPS after SVC
MSR before SVC
MSR after SVC
Hitachi's offerings, including the HDS USP-V, USP-VM and their recently announced Virtual Storage Platform (VSP) sold also by HP under the name P9500, have similar architecture to the SVC and can offer similar benefits, but oddly the Hitachi engineers have decided to treat externally attached storage as second-class citizens instead. Hu mentions data that "ages out or becomes less active we can move it to the external storage." IBM has chosen not to impose this "caste" system onto its design of the SAN Volume Controller.
The SVC has been around since 2003, before the USP-V came to market, and has sold over 20,000 SVC nodes over the past seven years. The SVC can indeed improve performance of managed disk systems, in some cases by a substantial amount. The 0.06 msec latency on read-miss requests represents less than 1 percent of total performance in production workloads. SVC nearly always improves performance, and in the worst case, provides same performance but with added functionality and flexibility. For the most part, the performance boost comes as a delightful surprise to most people who start using the SVC.
To learn more about IBM's upcoming products and how IBM will lead in storage this decade, register for next week's webcast "Taming the Information Explosion with IBM Storage" featuring Dan Galvan, IBM Vice President, and Steve Duplessie, Senior Analyst and Founder of Enterprise Storage Group (ESG).
Here I am, day 11 of a 17-day business trip, on my last leg of the trip this week, in Kuala Lumpur in Malaysia. I have been flooded with requests to give my take on EMC's latest re-interpretation of storage virtualization, VPLEX.
I'll leave it to my fellow IBM master inventor Barry Whyte to cover the detailed technical side-by-side comparison. Instead, I will focus on the business side of things, using Simon Sinek's Why-How-What sequence. Here is a [TED video] from Garr Reynold's post
[The importance of starting from Why].
Let's start with the problem we are trying to solve.
Problem: migration from old gear to new gear, old technology to new technology, from one vendor to another vendor, is disruptive, time-consuming and painful.
Given that IT storage is typically replaced every 3-5 years, then pretty much every company with an internal IT department has this problem, the exception being those companies that don't last that long, and those that use public cloud solutions. IT storage can be expensive, so companies would like their new purchases to be fully utilized on day 1, and be completely empty on day 1500 when the lease expires. I have spoken to clients who have spent 6-9 months planning for the replacement or removal of a storage array.
A solution to make the data migration non-disruptive would benefit the clients (make it easier for their IT staff to keep their data center modern and current) as well as the vendors (reduce the obstacle of selling and deploying new features and functions). Storage virtualization can be employed to help solve this problem. I define virtualization as "technology that makes one set of resources look and feel like a different set of resources, preferably with more desirable characteristics.". By making different storage resources, old and new, look and feel like a single type of resource, migration can be performed without disrupting applications.
Before VPLEX, here is a breakdown of each solution:
Non-disruptive tech refresh, and a unified platform to provide management and functionality across heterogeneous storage.
Non-disruptive tech refresh, and a unified platform to provide management and functionality between internal tier-1 HDS storage, and external tier-2 heterogeneous storage.
Non-disruptive tech refresh, with unified multi-pathing driver that allows host attachment of heterogeneous storage.
New in-band storage virtualization device
Add in-band storage virtualization to existing storage array
New out-of-band storage virtualization device with new "smart" SAN switches
SAN Volume Controller
HDS USP-V and USP-VM
For IBM, the motivation was clear: Protect customers existing investment in older storage arrays and introduce new IBM storage with a solution that allows both to be managed with a single set of interfaces and provide a common set of functionality, improving capacity utilization and availability. IBM SAN Volume Controller eliminated vendor lock-in, providing clients choice in multi-pathing driver, and allowing any-to-any migration and copy services. For example, IBM SVC can be used to help migrate data from an old HDS USP-V to a new HDS USP-V.
With EMC, however, the motivation appeared to protect software revenues from their PowerPath multi-pathing driver, TimeFinder and SRDF copy services. Back in 2005, when EMC Invista was first announced, these three software represented 60 percent of EMC's bottom-line profit. (Ok, I made that last part up, but you get my point! EMC charges a lot for these.)
Back in 2006, fellow blogger Chuck Hollis (EMC) suggested that SVC was just a [bump in the wire] which could not possibly improve performance of existing disk arrays. IBM showed clients that putting cache(SVC) in front of other cache(back end devices) does indeed improve performance, in the same way that multi-core processors successfully use L1/L2/L3 cache. Now, EMC is claiming their cache-based VPLEX improves performance of back-end disk. My how EMC's story has changed!
So now, EMC announces VPLEX, which sports a blend of SVC-like and Invista-like characteristics. Based on blogs, tweets and publicly available materials I found on EMC's website, I have been able to determine the following comparison table. (Of course, VPLEX is not yet generally available, so what is eventually delivered may differ.)
Scalable, 1 to 4 node-pairs
One size fits all, single pair of CPCs
SVC-like, 1 to 4 director-pairs
Works with any SAN switches or directors
Required special "smart" switches (vendor lock-in)
SVC-like, works with any SAN switches or directors
Broad selection of IBM Subsystem Device Driver (SDD) offered at no additional charge, as well as OS-native drivers Windows MPIO, AIX MPIO, Solaris MPxIO, HP-UX PV-Links, VMware MPP, Linux DM-MP, and comercial third-party driver Symantec DMP.
Limited selection, with focus on priced PowerPath driver
Invista-like, PowerPath and Windows MPIO
Read cache, and choice of fast-write or write-through cache, offering the ability to improve performance.
No cache, Split-Path architecture cracked open Fibre Channel packets in flight, delayed every IO by 20 nanoseconds, and redirected modified packets to the appropriate physical device.
SVC-like, Read and write-through cache, offering the ability to improve performance.
Space-Efficient Point-in-Time copies
SVC FlashCopy supports up to 256 space-efficient targets, copies of copies, read-only or writeable, and incremental persistent pairs.
Like Invista, No
Remote distance mirror
Choice of SVC Metro Mirror (synchronous up to 300km) and Global Mirror (asynchronous), or use the functionality of the back-end storage arrays
No native support, use functionality of back-end storage arrays, or purchase separate product called EMC RecoverPoint to cover this lack of functionality
Limited synchronous remote-distance mirror within VPLEX (up to 100km only), no native asynchronous support, use functionality of back-end storage arrays
Provides thin provisioning to devices that don't offer this natively
Like Invista, No
SVC Split-Cluster allows concurrent read/write access of data to be accessed from hosts at two different locations several miles apart
I don't think so
PLEX-Metro, similar in concept but implemented differently
Non-disruptive tech refresh
Can upgrade or replace storage arrays, SAN switches, and even the SVC nodes software AND hardware themselves, non-disruptively
Tech refresh for storage arrays, but not for Invista CPCs
Tech refresh of back end devices, and upgrade of VPLEX software, non-disruptively. Not clear if VPLEX engines themselves can be upgraded non-disruptively like the SVC.
Heterogeneous Storage Support
Broad support of over 140 different storage models from all major vendors, including all CLARiiON, Symmetrix and VMAX from EMC, and storage from many smaller startups you may not have heard of
Invista-like. VPLEX claims to support a variety of arrays from a variety of vendors, but as far as I can find, only DS8000 supported from the list of IBM devices. Fellow blogger Barry Burke (EMC) suggests [putting SVC between VPLEX and third party storage devices] to get the heterogeneous coverage most companies demand.
Back-end storage requirement
Must define quorum disks on any IBM or non-IBM back end storage array. SVC can run entirely on non-IBM storage arrays
HP SVSP-like, requires at least one EMC storage array to hold metadata
SVC 2145-CF8 model supports up to four solid-state drives (SSD) per node that can treated as managed disk to store end-user data
Invista-like. VPLEX has an internal 30GB SSD, but this is used only for operating system and logs, not for end-user data.
In-band virtualization solutions from IBM and HDS dominate the market. Being able to migrate data from old devices to new ones non-disruptively turned out to be only the [tip of the iceberg] of benefits from storage virtualization. In today's highly virtualized server environment, being able to non-disruptively migrate data comes in handy all the time. SVC is one of the best storage solutions for VMware, Hyper-V, XEN and PowerVM environments. EMC watched and learned in the shadows, taking notes of what people like about the SVC, and decided to follow IBM's time-tested leadership to provide a similar offering.
EMC re-invented the wheel, and it is round. On a scale from Invista (zero) to SVC (ten), I give EMC's new VPLEX a six.
“In times of universal deceit, telling the truth will be a revolutionary act.”
-- George Orwell
Well, it has been over two years since I first covered IBM's acquisition of the XIV company. Amazingly, I still see a lot of misperceptions out in the blogosphere, especially those regarding double drive failures for the XIV storage system. Despite various attempts to [explain XIV resiliency] and to [dispel the rumors], there are still competitors making stuff up, putting fear, uncertainty and doubt into the minds of prospective XIV clients.
Clients love the IBM XIV storage system! In this economy, companies are not stupid. Before buying any enterprise-class disk system, they ask the tough questions, run evaluation tests, and all the other due diligence often referred to as "kicking the tires". Here is what some IBM clients have said about their XIV systems:
“3-5 minutes vs. 8-10 hours rebuild time...”
-- satisfied XIV client
“...we tested an entire module failure - all data is re-distributed in under 6 hours...only 3-5% performance degradation during rebuild...”
-- excited XIV client
“Not only did XIV meet our expectations, it greatly exceeded them...”
In this blog post, I hope to set the record straight. It is not my intent to embarrass anyone in particular, so instead will focus on a fact-based approach.
Fact: IBM has sold THOUSANDS of XIV systems
XIV is "proven" technology with thousands of XIV systems in company data centers. And by systems, I mean full disk systems with 6 to 15 modules in a single rack, twelve drives per module. That equates to hundreds of thousands of disk drives in production TODAY, comparable to the number of disk drives studied by [Google], and [Carnegie Mellon University] that I discussed in my blog post [Fleet Cars and Skin Cells].
Fact: To date, no customer has lost data as a result of a Double Drive Failure on XIV storage system
This has always been true, both when XIV was a stand-alone company and since the IBM acquisition two years ago. When examining the resilience of an array to any single or multiple component failures, it's important to understand the architecture and the design of the system and not assume all systems are alike. At it's core, XIV is a grid-based storage system. IBM XIV does not use traditional RAID-5 or RAID-10 method, but instead data is distributed across loosely connected data modules which act as independent building blocks. XIV divides each LUN into 1MB "chunks", and stores two copies of each chunk on separate drives in separate modules. We call this "RAID-X".
Spreading all the data across many drives is not unique to XIV. Many disk systems, including EMC CLARiiON-based V-Max, HP EVA, and Hitachi Data Systems (HDS) USP-V, allow customers to get XIV-like performance by spreading LUNs across multiple RAID ranks. This is known in the industry as "wide-striping". Some vendors use the terms "metavolumes" or "extent pools" to refer to their implementations of wide-striping. Clients have coined their own phrases, such as "stripes across stripes", "plaid stripes", or "RAID 500". It is highly unlikely that an XIV will experience a double drive failure that ultimately requires recovery of files or LUNs, and is substantially less vulnerable to data loss than an EVA, USP-V or V-Max configured in RAID-5. Fellow blogger Keith Stevenson (IBM) compared XIV's RAID-X design to other forms of RAID in his post [RAID in the 21st Centure].
Fact: IBM XIV is designed to minimize the likelihood and impact of a double drive failure
The independent failure of two drives is a rare occurrence. More data has been lost from hash collisions on EMC Centera than from double drive failures on XIV, and hash collisions are also very rare. While the published worst-case time to re-protect from a 1TB drive failure for a fully-configured XIV is 30 minutes, field experience shows XIV regaining full redundancy on average in 12 minutes. That is 40 times less likely than a typical 8-10 hour window for a RAID-5 configuration.
A lot of bad things can happen in those 8-10 hours of traditional RAID rebuild. Performance can be seriously degraded. Other components may be affected, as they share cache, connected to the same backplane or bus, or co-dependent in some other manner. An engineer supporting the customer onsite during a RAID-5 rebuild might pull the wrong drive, thereby causing a double drive failure they were hoping to avoid. Having IBM XIV rebuild in only a few minutes addresses this "human factor".
In his post [XIV drive management], fellow blogger Jim Kelly (IBM) covers a variety of reasons why storage admins feel double drive failures are more than just random chance. XIV avoids load stress normally associated with traditional RAID rebuild by evenly spreading out the workload across all drives. This is known in the industry as "wear-leveling". When the first drive fails, the recovery is spread across the remaining 179 drives, so that each drive only processes about 1 percent of the data. The [Ultrastar A7K1000] 1TB SATA disk drives that IBM uses from HGST have specified 1.2 million hours mean-time-between-failures [MTBF] would average about one drive failing every nine months in a 180-drive XIV system. However, field experience shows that an XIV system will experience, on average, one drive failure per 13 months, comparable to what companies experience with more robust Fibre Channel drives. That's innovative XIV wear-leveling at work!
Fact: In the highly unlikely event that a DDF were to occur, you will have full read/write access to nearly all of your data on the XIV, all but a few GB.
Even though it has NEVER happened in the field, some clients and prospects are curious what a double drive failure on an XIV would look like. First, a critical alert message would be sent to both the client and IBM, and a "union list" is generated, identifying all the chunks in common. The worst case on a 15-module XIV fully loaded with 79TB data is approximately 9000 chunks, or 9GB of data. The remaining 78.991 TB of unaffected data are fully accessible for read or write. Any I/O requests for the chunks in the "union list" will have no response yet, so there is no way for host applications to access outdated information or cause any corruption.
(One blogger compared losing data on the XIV to drilling a hole through the phone book. Mathematically, the drill bit would be only 1/16th of an inch, or 1.60 millimeters for you folks outside the USA. Enough to knock out perhaps one character from a name or phone number on each page. If you have ever seen an actor in the movies look up a phone number in a telephone booth then yank out a page from the phone book, the XIV equivalent would be cutting out 1/8th of a page from an 1100 page phone book. In both cases, all of the rest of the unaffected information is full accessible, and it is easy to identify which information is missing.)
If the second drive failed several minutes after the first drive, the process for full redundancy is already well under way. This means the union list is considerably shorter or completely empty, and substantially fewer chunks are impacted. Contrast this with RAID-5, where being 99 percent complete on the rebuild when the second drive fails is just as catastrophic as having both drives fail simultaneously.
Fact: After a DDF event, the files on these few GB can be identified for recovery.
Once IBM receives notification of a critical event, an IBM engineer immediately connects to the XIV using remote service support method. There is no need to send someone physically onsite, the repair actions can be done remotely. The IBM engineer has tools from HGST to recover, in most cases, all of the data.
Any "union" chunk that the HGST tools are unable to recover will be set to "media error" mode. The IBM engineer can provide the client a list of the XIV LUNs and LBAs that are on the "media error" list. From this list, the client can determine which hosts these LUNs are attached to, and run file scan utility to the file systems that these LUNs represent. Files that get a media error during this scan will be listed as needing recovery. A chunk could contain several small files, or the chunk could be just part of a large file. To minimize time, the scans and recoveries can all be prioritized and performed in parallel across host systems zoned to these LUNs.
As with any file or volume recovery, keep in mind that these might be part of a larger consistency group, and that your recovery procedures should make sense for the applications involved. In any case, you are probably going to be up-and-running in less time with XIV than recovery from a RAID-5 double failure would take, and certainly nowhere near "beyond repair" that other vendors might have you believe.
Fact: This does not mean you can eliminate all Disaster Recovery planning!
To put this in perspective, you are more likely to lose XIV data from an earthquake, hurricane, fire or flood than from a double drive failure. As with any unlikely disaster, it is best to have a disaster recovery plan than to hope it never happens. All disk systems that sit on a single datacenter floor are vulnerable to such disasters.
For mission-critical applications, IBM recommends using disk mirroring capability. IBM XIV storage system offers synchronous and asynchronous mirroring natively, both included at no additional charge.
A long time ago, perhaps in the early 1990s, I was an architect on the component known today as DFSMShsm on z/OS mainframe operationg system. One of my job responsibilities was to attend the biannual [SHARE conference to listen to the requirements of the attendees on what they would like added or changed to the DFSMS, and ask enough questions so that I can accurately present the reasoning to the rest of the architects and software designers on my team. One person requested that the DFSMShsm RELEASE HARDCOPY should release "all" the hardcopy. This command sends all the activity logs to the designated SYSOUT printer. I asked what he meant by "all", and the entire audience of 120 some attendees nearly fell on the floor laughing. He complained that some clever programmer wrote code to test if the activity log contained only "Starting" and "Ending" message, but no error messages, and skip those from being sent to SYSOUT. I explained that this was done to save paper, good for the environment, and so on. Again, howls of laughter. Most customers reroute the SYSOUT from DFSMS from a physical printer to a logical one that saves the logs as data sets, with date and time stamps, so having any "skipped" leaves gaps in the sequence. The client wanted a complete set of data sets for his records. Fair enough.
When I returned to Tucson, I presented the list of requests, and the immediate reaction when I presented the one above was, "What did he mean by ALL? Doesn't it release ALL of the logs already?" I then had to recap our entire dialogue, and then it all made sense to the rest of the team. At the following SHARE conference six months later, I was presented with my own official "All" tee-shirt that listed, and I am not kidding, some 33 definitions for the word "all", in small font covering the front of the shirt.
I am reminded of this story because of the challenges explaining complicated IT concepts using the English language which is so full of overloaded words that have multiple meanings. Take for example the word "protect". What does it mean when a client asks for a solution or system to "protect my data" or "protect my information". Let's take a look at three different meanings:
The first meaning is to protect the integrity of the data from within, especially from executives or accountants that might want to "fudge the numbers" to make quarterly results look better than they are, or to "change the terms of the contract" after agreements have been signed. Clients need to make sure that the people authorized to read/write data can be trusted to do so, and to store data in Non-Erasable, Non-Rewriteable (NENR) protected storage for added confidence. NENR storage includes Write-Once, Read-Many (WORM) tape and optical media, disk and disk-and-tape blended solutions such as the IBM Grid Medical Archive Solution (GMAS) and IBM Information Archive integrated system.
The second meaning is to protect access from without, especially hackers or other criminals that might want to gather personally-identifiably information (PII) such as social security numbers, health records, or credit card numbers and use these for identity theft. This is why it is so important to encrypt your data. As I mentioned in my post [Eliminating Technology Trade-Offs], IBM supports hardware-based encryption FDE drives in its IBM System Storage DS8000 and DS5000 series. These FDE drives have an AES-128 bit encryption built-in to perform the encryption in real-time. Neither HDS or EMC support these drives (yet). Fellow blogger Hu Yoshida (HDS) indicates that their USP-V has implemented data-at-rest in their array differently, using backend directors instead. I am told EMC relies on the consumption of CPU-cycles on the host servers to perform software-based encryption, either as MIPS consumed on the mainframe, or using their Powerpath multi-pathing driver on distributed systems.
There is also concern about internal employees have the right "need-to-know" of various research projects or upcoming acquisitions. On SANs, this is normally handled with zoning, and on NAS with appropriate group/owner bits and access control lists. That's fine for LUNs and files, but what about databases? IBM's DB2 offers Label-Based Access Control [LBAC] that provides a finer level of granularity, down to the row or column level. For example, if a hospital database contained patient information, the doctors and nurses would not see the columns containing credit card details, the accountants would not see the columnts containing healthcare details, and the individual patients, if they had any access at all, would only be able to access the rows related to their own records, and possibly the records of their children or other family members.
The third meaning is to protect against the unexpected. There are lots of ways to lose data: physical failure, theft or even incorrect application logic. Whatever the way, you can protect against this by having multiple copies of the data. You can either have multiple copies of the data in its entirety, or use RAID or similar encoding scheme to store parts of the data in multiple separate locations. For example, with RAID-5 rank containing 6+P+S configuration, you would have six parts of data and one part parity code scattered across seven drives. If you lost one of the disk drives, the data can be rebuilt from the remaining portions and written to the spare disk set aside for this purpose.
But what if the drive is stolen? Someone can walk up to a disk system, snap out the hot-swappable drive, and walk off with it. Since it contains only part of the data, the thief would not have the entire copy of the data, so no reason to encrypt it, right? Wrong! Even with part of the data, people can get enough information to cause your company or customers harm, lose business, or otherwise get you in hot water. Encryption of the data at rest can help protect against unauthorized access to the data, even in the case when the data is scattered in this manner across multiple drives.
To protect against site-wide loss, such as from a natural disaster, fire, flood, earthquake and so on, you might consider having data replicated to remote locations. For example, IBM's DS8000 offers two-site and three-site mirroring. Two-site options include Metro Mirror (synchronous) and Global Mirror (asynchronous). The three-site is cascaded Metro/Global Mirror with the second site nearby (within 300km) and the third site far away. For example, you can have two copies of your data at site 1, a third copy at nearby site 2, and two more copies at site 3. Five copies of data in three locations. IBM DS8000 can send this data over from one box to another with only a single round trip (sending the data out, and getting an acknowledgment back). By comparison, EMC SRDF/S (synchronous) takes one or two trips depending on blocksize, for example blocks larger than 32KB require two trips, and EMC SRDF/A (asynchronous) always takes two trips. This is important because for many companies, disk is cheap but long-distance bandwidth is quite expensive. Having five copies in three locations could be less expensive than four copies in four locations.
Fellow blogger BarryB (EMC Storage Anarchist) felt I was unfair pointing out that their EMC Atmos GeoProtect feature only protects against "unexpected loss" and does not eliminate the need for encryption or appropriate access control lists to protect against "unauthorized access" or "unethical tampering".
(It appears I stepped too far on to ChuckH's lawn, as his Rottweiler BarryB came out barking, both in the [comments on my own blog post], as well as his latest titled [IBM dumbs down IBM marketing (again)]. Before I get another rash of comments, I want to emphasize this is a metaphor only, and that I am not accusing BarryB of having any canine DNA running through his veins, nor that Chuck Hollis has a lawn.)
As far as I know, the EMC Atmos does not support FDE disks that do this encryption for you, so you might need to find another way to encrypt the data and set up the appropriate access control lists. I agree with BarryB that "erasure codes" have been around for a while and that there is nothing unsafe about using them in this manner. All forms of RAID-5, RAID-6 and even RAID-X on the IBM XIV storage system can be considered a form of such encoding as well. As for the amount of long-distance bandwidth that Atmos GeoProtect would consume to provide this protection against loss, you might question any cost savings from this space-efficient solution. As always, you should consider both space and bandwidth costs in your total cost of ownership calculations.
Of course, if saving money is your main concern, you should consider tape, which can be ten to twenty times cheaper than disk, affording you to keep a dozen or more copies, in as many time zones, at substantially lower cost. These can be encrypted and written to WORM media for even more thorough protection.