Tony Pearson is a Master Inventor and Senior IT Architect for the IBM Storage product line at the
IBM Systems Client Experience Center in Tucson Arizona, and featured contributor
to IBM's developerWorks. In 2016, Tony celebrates his 30th year anniversary with IBM Storage. He is
author of the Inside System Storage series of books. This blog is for the open exchange of ideas relating to storage and storage networking hardware, software and services.
(Short URL for this blog: ibm.co/Pearson )
My books are available on Lulu.com! Order your copies today!
Safe Harbor Statement: The information on IBM products is intended to outline IBM's general product direction and it should not be relied on in making a purchasing decision. The information on the new products is for informational purposes only and may not be incorporated into any contract. The information on IBM products is not a commitment, promise, or legal obligation to deliver any material, code, or functionality. The development, release, and timing of any features or functionality described for IBM products remains at IBM's sole discretion.
Tony Pearson is a an active participant in local, regional, and industry-specific interests, and does not receive any special payments to mention them on this blog.
Tony Pearson receives part of the revenue proceeds from sales of books he has authored listed in the side panel.
Tony Pearson is not a medical doctor, and this blog does not reference any IBM product or service that is intended for use in the diagnosis, treatment, cure, prevention or monitoring of a disease or medical condition, unless otherwise specified on individual posts.
“In times of universal deceit, telling the truth will be a revolutionary act.”
-- George Orwell
Well, it has been over two years since I first covered IBM's acquisition of the XIV company. Amazingly, I still see a lot of misperceptions out in the blogosphere, especially those regarding double drive failures for the XIV storage system. Despite various attempts to [explain XIV resiliency] and to [dispel the rumors], there are still competitors making stuff up, putting fear, uncertainty and doubt into the minds of prospective XIV clients.
Clients love the IBM XIV storage system! In this economy, companies are not stupid. Before buying any enterprise-class disk system, they ask the tough questions, run evaluation tests, and all the other due diligence often referred to as "kicking the tires". Here is what some IBM clients have said about their XIV systems:
“3-5 minutes vs. 8-10 hours rebuild time...”
-- satisfied XIV client
“...we tested an entire module failure - all data is re-distributed in under 6 hours...only 3-5% performance degradation during rebuild...”
-- excited XIV client
“Not only did XIV meet our expectations, it greatly exceeded them...”
In this blog post, I hope to set the record straight. It is not my intent to embarrass anyone in particular, so instead will focus on a fact-based approach.
Fact: IBM has sold THOUSANDS of XIV systems
XIV is "proven" technology with thousands of XIV systems in company data centers. And by systems, I mean full disk systems with 6 to 15 modules in a single rack, twelve drives per module. That equates to hundreds of thousands of disk drives in production TODAY, comparable to the number of disk drives studied by [Google], and [Carnegie Mellon University] that I discussed in my blog post [Fleet Cars and Skin Cells].
Fact: To date, no customer has lost data as a result of a Double Drive Failure on XIV storage system
This has always been true, both when XIV was a stand-alone company and since the IBM acquisition two years ago. When examining the resilience of an array to any single or multiple component failures, it's important to understand the architecture and the design of the system and not assume all systems are alike. At it's core, XIV is a grid-based storage system. IBM XIV does not use traditional RAID-5 or RAID-10 method, but instead data is distributed across loosely connected data modules which act as independent building blocks. XIV divides each LUN into 1MB "chunks", and stores two copies of each chunk on separate drives in separate modules. We call this "RAID-X".
Spreading all the data across many drives is not unique to XIV. Many disk systems, including EMC CLARiiON-based V-Max, HP EVA, and Hitachi Data Systems (HDS) USP-V, allow customers to get XIV-like performance by spreading LUNs across multiple RAID ranks. This is known in the industry as "wide-striping". Some vendors use the terms "metavolumes" or "extent pools" to refer to their implementations of wide-striping. Clients have coined their own phrases, such as "stripes across stripes", "plaid stripes", or "RAID 500". It is highly unlikely that an XIV will experience a double drive failure that ultimately requires recovery of files or LUNs, and is substantially less vulnerable to data loss than an EVA, USP-V or V-Max configured in RAID-5. Fellow blogger Keith Stevenson (IBM) compared XIV's RAID-X design to other forms of RAID in his post [RAID in the 21st Centure].
Fact: IBM XIV is designed to minimize the likelihood and impact of a double drive failure
The independent failure of two drives is a rare occurrence. More data has been lost from hash collisions on EMC Centera than from double drive failures on XIV, and hash collisions are also very rare. While the published worst-case time to re-protect from a 1TB drive failure for a fully-configured XIV is 30 minutes, field experience shows XIV regaining full redundancy on average in 12 minutes. That is 40 times less likely than a typical 8-10 hour window for a RAID-5 configuration.
A lot of bad things can happen in those 8-10 hours of traditional RAID rebuild. Performance can be seriously degraded. Other components may be affected, as they share cache, connected to the same backplane or bus, or co-dependent in some other manner. An engineer supporting the customer onsite during a RAID-5 rebuild might pull the wrong drive, thereby causing a double drive failure they were hoping to avoid. Having IBM XIV rebuild in only a few minutes addresses this "human factor".
In his post [XIV drive management], fellow blogger Jim Kelly (IBM) covers a variety of reasons why storage admins feel double drive failures are more than just random chance. XIV avoids load stress normally associated with traditional RAID rebuild by evenly spreading out the workload across all drives. This is known in the industry as "wear-leveling". When the first drive fails, the recovery is spread across the remaining 179 drives, so that each drive only processes about 1 percent of the data. The [Ultrastar A7K1000] 1TB SATA disk drives that IBM uses from HGST have specified 1.2 million hours mean-time-between-failures [MTBF] would average about one drive failing every nine months in a 180-drive XIV system. However, field experience shows that an XIV system will experience, on average, one drive failure per 13 months, comparable to what companies experience with more robust Fibre Channel drives. That's innovative XIV wear-leveling at work!
Fact: In the highly unlikely event that a DDF were to occur, you will have full read/write access to nearly all of your data on the XIV, all but a few GB.
Even though it has NEVER happened in the field, some clients and prospects are curious what a double drive failure on an XIV would look like. First, a critical alert message would be sent to both the client and IBM, and a "union list" is generated, identifying all the chunks in common. The worst case on a 15-module XIV fully loaded with 79TB data is approximately 9000 chunks, or 9GB of data. The remaining 78.991 TB of unaffected data are fully accessible for read or write. Any I/O requests for the chunks in the "union list" will have no response yet, so there is no way for host applications to access outdated information or cause any corruption.
(One blogger compared losing data on the XIV to drilling a hole through the phone book. Mathematically, the drill bit would be only 1/16th of an inch, or 1.60 millimeters for you folks outside the USA. Enough to knock out perhaps one character from a name or phone number on each page. If you have ever seen an actor in the movies look up a phone number in a telephone booth then yank out a page from the phone book, the XIV equivalent would be cutting out 1/8th of a page from an 1100 page phone book. In both cases, all of the rest of the unaffected information is full accessible, and it is easy to identify which information is missing.)
If the second drive failed several minutes after the first drive, the process for full redundancy is already well under way. This means the union list is considerably shorter or completely empty, and substantially fewer chunks are impacted. Contrast this with RAID-5, where being 99 percent complete on the rebuild when the second drive fails is just as catastrophic as having both drives fail simultaneously.
Fact: After a DDF event, the files on these few GB can be identified for recovery.
Once IBM receives notification of a critical event, an IBM engineer immediately connects to the XIV using remote service support method. There is no need to send someone physically onsite, the repair actions can be done remotely. The IBM engineer has tools from HGST to recover, in most cases, all of the data.
Any "union" chunk that the HGST tools are unable to recover will be set to "media error" mode. The IBM engineer can provide the client a list of the XIV LUNs and LBAs that are on the "media error" list. From this list, the client can determine which hosts these LUNs are attached to, and run file scan utility to the file systems that these LUNs represent. Files that get a media error during this scan will be listed as needing recovery. A chunk could contain several small files, or the chunk could be just part of a large file. To minimize time, the scans and recoveries can all be prioritized and performed in parallel across host systems zoned to these LUNs.
As with any file or volume recovery, keep in mind that these might be part of a larger consistency group, and that your recovery procedures should make sense for the applications involved. In any case, you are probably going to be up-and-running in less time with XIV than recovery from a RAID-5 double failure would take, and certainly nowhere near "beyond repair" that other vendors might have you believe.
Fact: This does not mean you can eliminate all Disaster Recovery planning!
To put this in perspective, you are more likely to lose XIV data from an earthquake, hurricane, fire or flood than from a double drive failure. As with any unlikely disaster, it is best to have a disaster recovery plan than to hope it never happens. All disk systems that sit on a single datacenter floor are vulnerable to such disasters.
For mission-critical applications, IBM recommends using disk mirroring capability. IBM XIV storage system offers synchronous and asynchronous mirroring natively, both included at no additional charge.
Here I am, day 11 of a 17-day business trip, on my last leg of the trip this week, in Kuala Lumpur in Malaysia. I have been flooded with requests to give my take on EMC's latest re-interpretation of storage virtualization, VPLEX.
I'll leave it to my fellow IBM master inventor Barry Whyte to cover the detailed technical side-by-side comparison. Instead, I will focus on the business side of things, using Simon Sinek's Why-How-What sequence. Here is a [TED video] from Garr Reynold's post
[The importance of starting from Why].
Let's start with the problem we are trying to solve.
Problem: migration from old gear to new gear, old technology to new technology, from one vendor to another vendor, is disruptive, time-consuming and painful.
Given that IT storage is typically replaced every 3-5 years, then pretty much every company with an internal IT department has this problem, the exception being those companies that don't last that long, and those that use public cloud solutions. IT storage can be expensive, so companies would like their new purchases to be fully utilized on day 1, and be completely empty on day 1500 when the lease expires. I have spoken to clients who have spent 6-9 months planning for the replacement or removal of a storage array.
A solution to make the data migration non-disruptive would benefit the clients (make it easier for their IT staff to keep their data center modern and current) as well as the vendors (reduce the obstacle of selling and deploying new features and functions). Storage virtualization can be employed to help solve this problem. I define virtualization as "technology that makes one set of resources look and feel like a different set of resources, preferably with more desirable characteristics.". By making different storage resources, old and new, look and feel like a single type of resource, migration can be performed without disrupting applications.
Before VPLEX, here is a breakdown of each solution:
Non-disruptive tech refresh, and a unified platform to provide management and functionality across heterogeneous storage.
Non-disruptive tech refresh, and a unified platform to provide management and functionality between internal tier-1 HDS storage, and external tier-2 heterogeneous storage.
Non-disruptive tech refresh, with unified multi-pathing driver that allows host attachment of heterogeneous storage.
New in-band storage virtualization device
Add in-band storage virtualization to existing storage array
New out-of-band storage virtualization device with new "smart" SAN switches
SAN Volume Controller
HDS USP-V and USP-VM
For IBM, the motivation was clear: Protect customers existing investment in older storage arrays and introduce new IBM storage with a solution that allows both to be managed with a single set of interfaces and provide a common set of functionality, improving capacity utilization and availability. IBM SAN Volume Controller eliminated vendor lock-in, providing clients choice in multi-pathing driver, and allowing any-to-any migration and copy services. For example, IBM SVC can be used to help migrate data from an old HDS USP-V to a new HDS USP-V.
With EMC, however, the motivation appeared to protect software revenues from their PowerPath multi-pathing driver, TimeFinder and SRDF copy services. Back in 2005, when EMC Invista was first announced, these three software represented 60 percent of EMC's bottom-line profit. (Ok, I made that last part up, but you get my point! EMC charges a lot for these.)
Back in 2006, fellow blogger Chuck Hollis (EMC) suggested that SVC was just a [bump in the wire] which could not possibly improve performance of existing disk arrays. IBM showed clients that putting cache(SVC) in front of other cache(back end devices) does indeed improve performance, in the same way that multi-core processors successfully use L1/L2/L3 cache. Now, EMC is claiming their cache-based VPLEX improves performance of back-end disk. My how EMC's story has changed!
So now, EMC announces VPLEX, which sports a blend of SVC-like and Invista-like characteristics. Based on blogs, tweets and publicly available materials I found on EMC's website, I have been able to determine the following comparison table. (Of course, VPLEX is not yet generally available, so what is eventually delivered may differ.)
Scalable, 1 to 4 node-pairs
One size fits all, single pair of CPCs
SVC-like, 1 to 4 director-pairs
Works with any SAN switches or directors
Required special "smart" switches (vendor lock-in)
SVC-like, works with any SAN switches or directors
Broad selection of IBM Subsystem Device Driver (SDD) offered at no additional charge, as well as OS-native drivers Windows MPIO, AIX MPIO, Solaris MPxIO, HP-UX PV-Links, VMware MPP, Linux DM-MP, and comercial third-party driver Symantec DMP.
Limited selection, with focus on priced PowerPath driver
Invista-like, PowerPath and Windows MPIO
Read cache, and choice of fast-write or write-through cache, offering the ability to improve performance.
No cache, Split-Path architecture cracked open Fibre Channel packets in flight, delayed every IO by 20 nanoseconds, and redirected modified packets to the appropriate physical device.
SVC-like, Read and write-through cache, offering the ability to improve performance.
Space-Efficient Point-in-Time copies
SVC FlashCopy supports up to 256 space-efficient targets, copies of copies, read-only or writeable, and incremental persistent pairs.
Like Invista, No
Remote distance mirror
Choice of SVC Metro Mirror (synchronous up to 300km) and Global Mirror (asynchronous), or use the functionality of the back-end storage arrays
No native support, use functionality of back-end storage arrays, or purchase separate product called EMC RecoverPoint to cover this lack of functionality
Limited synchronous remote-distance mirror within VPLEX (up to 100km only), no native asynchronous support, use functionality of back-end storage arrays
Provides thin provisioning to devices that don't offer this natively
Like Invista, No
SVC Split-Cluster allows concurrent read/write access of data to be accessed from hosts at two different locations several miles apart
I don't think so
PLEX-Metro, similar in concept but implemented differently
Non-disruptive tech refresh
Can upgrade or replace storage arrays, SAN switches, and even the SVC nodes software AND hardware themselves, non-disruptively
Tech refresh for storage arrays, but not for Invista CPCs
Tech refresh of back end devices, and upgrade of VPLEX software, non-disruptively. Not clear if VPLEX engines themselves can be upgraded non-disruptively like the SVC.
Heterogeneous Storage Support
Broad support of over 140 different storage models from all major vendors, including all CLARiiON, Symmetrix and VMAX from EMC, and storage from many smaller startups you may not have heard of
Invista-like. VPLEX claims to support a variety of arrays from a variety of vendors, but as far as I can find, only DS8000 supported from the list of IBM devices. Fellow blogger Barry Burke (EMC) suggests [putting SVC between VPLEX and third party storage devices] to get the heterogeneous coverage most companies demand.
Back-end storage requirement
Must define quorum disks on any IBM or non-IBM back end storage array. SVC can run entirely on non-IBM storage arrays
HP SVSP-like, requires at least one EMC storage array to hold metadata
SVC 2145-CF8 model supports up to four solid-state drives (SSD) per node that can treated as managed disk to store end-user data
Invista-like. VPLEX has an internal 30GB SSD, but this is used only for operating system and logs, not for end-user data.
In-band virtualization solutions from IBM and HDS dominate the market. Being able to migrate data from old devices to new ones non-disruptively turned out to be only the [tip of the iceberg] of benefits from storage virtualization. In today's highly virtualized server environment, being able to non-disruptively migrate data comes in handy all the time. SVC is one of the best storage solutions for VMware, Hyper-V, XEN and PowerVM environments. EMC watched and learned in the shadows, taking notes of what people like about the SVC, and decided to follow IBM's time-tested leadership to provide a similar offering.
EMC re-invented the wheel, and it is round. On a scale from Invista (zero) to SVC (ten), I give EMC's new VPLEX a six.
My series last week on IBM Watson (which you can read [here], [here], [here], and [here]) brought attention to IBM's Scale-Out Network Attached Storage [SONAS]. IBM Watson used a customized version of SONAS technology for its internal storage, and like most of the components of IBM Watson, IBM SONAS is commercially available as a stand-alone product.
Like many IBM products, SONAS has gone through various name changes. First introduced by Linda Sanford at an IBM SHARE conference in 2000 under the IBM Research codename Storage Tank, it was then delivered as a software-only offering SAN File System, then as a services offering Scale-out File Services (SoFS), and now as an integrated system appliance, SONAS, in IBM's Cloud Services and Systems portfolio.
If you are not familiar with SONAS, here are a few of my previous posts that go into more detail:
This week, IBM announces that SONAS has set a world record benchmark for performance, [a whopping 403,326 IOPS for a single file system]. The results are based on comparisons of publicly available information from Standard Performance Evaluation Corporation [SPEC], a prominent performance standardization organization with more than 60 member companies. SPEC publishes hundreds of different performance results each quarter covering a wide range of system performance disciplines (CPU, memory, power, and many more). SPECsfs2008_nfs.v3 is the industry-standard benchmark for NAS systems using the NFS protocol.
(Disclaimer: Your mileage may vary. As with any performance benchmark, the SPECsfs benchmark does not replicate any single workload or particular application. Rather, it encapsulates scores of typical activities on a NAS storage system. SPECsfs is based on a compilation of workload data submitted to the SPEC organization, aggregated from tens of thousands of fileservers, using a wide variety of environments and applications. As a result, it is comprised of typical workloads and with typical proportions of data and metadata use as seen in real production environments.)
The configuration tested involves SONAS Release 1.2 on 10 Interface Nodes and 8 Storage Pods, resulting a single file system over 900TB usable capacity.
10 Interface Nodes; each with:
Maximum 144 GB of memory
One active 10GbE port
8 Storage Pods; each with:
2 Storage nodes and 240 drives
Drive type: 15K RPM SAS hard drives
Data Protection using RAID-5 (8+P) ranks
Six spare drives per Storage Pod
IBM wanted a realistic "no compromises" configuration to be tested, by choosing:
Regular 15K RPM SAS drives, rather than a silly configuration full of super-expensive Solid State Drives (SSD) to plump up the results.
Moderate size, typical of what clients are asking for today. The Goldilocks rule applies. This SONAS is not a small configuration under 100TB, and nowhere close to the maximum supported configuration of 7,200 disks across 30 Interface Nodes and 30 Storage Pods.
Single file system, often referred to as a global name space, rather than using an aggregate of smaller file systems added together that would be more complicated to manage. Having multiple file systems often requires changes to applications to take advantage of the aggregate peformance. It is also more difficult to load-balance your performance and capacity across multiple file systems. Of course, SONAS can support up to 256 separate file systems if you have a business need for this complexity.
The results are stunning. IBM SONAS handled three times more workload for a single file system than the next leading contender. All of the major players are there as well, including NetApp, EMC and HP.
Well, it's Wednesday, and you know what that means... IBM Announcements!
(Actually most IBM announcements are on Tuesdays, but IBM gave me extra time to recover from my trip to Europe!)
Today, IBM announced [IBM PureSystems], a new family of expert-integrated systems that combine storage, servers, networking, and software, based on IBM's decades of experience in the IT industry. You can register for the [Launch Event] today (April 11) at 2pm EDT, and download the companion "Integrated Expertise" event app for Apple, Android or Blackberry smartphones.
(If you are thinking, "Hey, wait a minute, hasn't this been done before?" you are not alone. Yes, IBM introduced the System/360 back in 1964, and the AS/400 back in 1988, so today's announcement is on scheduled for this 24-year cycle. Based on IBM's past success in this area, others have followed, most recently, Oracle, HP and Cisco.)
Initially, there are two offerings:
IBM PureFlex™ System
IBM PureFlex is like IaaS-in-a-box, allowing you to manage the system as a pool of virtual resources. It can be used for private cloud deployments, hybrid cloud deployments, or by service providers to offer public cloud solutions. IBM drinks its own champagne, and will have no problem integrating these into its [IBM SmartCloud] offerings.
To simplify ordering, the IBM PureFlex comes in three tee-shirt sizes: Express, Standard and Enterprise.
IBM PureFlex is based on a 10U-high, 19-inch wide, standard rack-mountable chassis that holds 14 bays, organized in a 7 by 2 matrix. Unlike BladeCenter where blades are inserted vertically, the IBM PureFlex nodes are horizontal. Some of the nodes take up a single bay (half-wide), but a few are full-wide, take up two bays, the full 19-inch width of the chassis. Compute and storage snap in the front, while power supplies, fans, and networking snap in the back. You can fit up to four chassis in a standard 42U rack.
Unlike competitive offerings, IBM does not limit you to x86 architectures. Both x86 and POWER-based compute nodes can be mixed into a single chassis. Out of the box, the IBM PureFlex supports four operating systems (AIX, IBM i, Linux and Windows), four server hypervisors (Hyper-V, Linux KVM, PowerVM, and VMware), and two storage hypervisors (SAN Volume Controller and Storwize V7000).
There are a variety of storage options for this. IBM will offer SSD and HDD inside the compute nodes themselves, direct-attached storage nodes, and an integrated version of the Storwize V7000 disk system. Of course, every IBM System Storage product is supported as external storage. Since Storwize V7000 and SAN Volume Controller support external virtualization, many non-IBM devices will be supported automatically as well.
Networking is also optimized, with options for 10Gb and 40Gb Ethernet/FCoE, 40Gb and 56Gb Infiniband, 8Gbps and 16Gbps Fibre Channel. Much of the networking traffic can be handled within the chassis, to minimize traffic on external switches and directors.
For management, IBM offers the Flex System Manager, that allows you to manage all the resources from a single pane of glass. The goal is to greatly simplify the IT lifecycle experience of procurement, installation, deployment and maintenance.
IBM PureApplication™ System
IBM PureApplication is like PaaS-in-a-box. Based on the IBM PureFlex infrastructure, the IBM PureApplication adds additional software layers focused on transactional web, business logic, and database workloads. Initially, it will offer two platforms: Linux platform based on x86 processors, Linux KVM and Red Hat Enterprise Linux (RHEL); and a UNIX platform based on POWER7 processors, PowerVM and AIX operating system. It will be offered in four tee-shirt sizes (small, medium, large and extra large).
In addition to having IBM's middleware like DB2 and WebSphere optimized for this platform, over 600 companies will announce this week that they will support and participate in the IBM PureSystems ecosystem as well. Already, there are 150 "Patterns of Expertise" ready to deploy from IBM PureSystem Centre, a kind of a "data center app store", borrowing an idea used today with smartphones.
By packaging applications in this manner, workloads can easily shift between private, hybrid and public clouds.
If you are unhappy with the inflexibility of your VCE Vblock, HP Integrity, or Oracle ExaLogic, talk to your local IBM Business Partner or Sales Representative. We might be able to buy your boat anchor off your hands, as part of an IBM PureSystems sale, with an attractive IBM Global Financing plan.
Each quarter since 2006, the [IBM Migration Factory] team has tallied the number of clients who have moved to IBM severs and storage systems from competitive hardware. We'll I've just seen the latest numbers, for the third quarter of 2010, and it looks like we set a new quarterly record with nearly 400 total migrations to IBM from Oracle/Sun and HP.
It's clear that companies and governments worldwide are seeing greater value in IBM systems, while Oracle and HP watch their customer bases erode. In just this past 3Q 2010, nearly 400 clients have moved over to IBM -- almost all of them from Oracle/Sun and HP. Of these, 286 clients migrated to IBM Power Systems, running AIX, Linux and IBM i operating systems, from competitors alone -- nearly 175 from Oracle/Sun and nearly 100 from HP. The number of migrations to IBM Power Systems through the first three quarters of 2010 is nearly 800, already exceeding the total for all of last year by more than 200.
Let's do the math.... Since IBM established its Migration Factory program in 2006, more than 4,500 clients have switched to IBM. More than 1,000 from Oracle/Sun and HP joined the exodus this year alone. In less than five years, almost 3,000 of these clients -- including more than 1,500 from Oracle/Sun and more than 1,000 from HP -- have chosen to run their businesses on IBM's Power Systems. That's more than a client per day making the move to IBM!
And as the servers go, so goes the storage. Clients are re-discovering IBM as a server and storage powerhouse, offering a strong portfolio in servers, disk and tape systems, and how synergies between servers and storage can provide them real business benefits.
Adding it all up, it's clear that IBM's multi-billion dollar investment in helping to build a smarter planet with workload-optimized systems is paying off -- and that, more and more, clients are selecting IBM over the competition to help them meet their business needs.