Tony Pearson is a Master Inventor and Senior IT Architect for the IBM Storage product line at the
IBM Executive Briefing Center in Tucson, Arizona, and featured contributor
to IBM's developerWorks. In 2016, Tony celebrates his 30th anniversary with IBM Storage. He is
author of the Inside System Storage series of books. This blog is for the open exchange of ideas relating to storage and storage networking hardware, software and services.
(Short URL for this blog: ibm.co/Pearson )
Safe Harbor Statement: The information on IBM products is intended to outline IBM's general product direction and it should not be relied on in making a purchasing decision. The information on the new products is for informational purposes only and may not be incorporated into any contract. The information on IBM products is not a commitment, promise, or legal obligation to deliver any material, code, or functionality. The development, release, and timing of any features or functionality described for IBM products remains at IBM's sole discretion.
Tony Pearson is an active participant in local, regional, and industry-specific interests, and does not receive any special payments to mention them on this blog.
Tony Pearson receives part of the revenue proceeds from sales of books he has authored listed in the side panel.
Tony Pearson is not a medical doctor, and this blog does not reference any IBM product or service that is intended for use in the diagnosis, treatment, cure, prevention or monitoring of a disease or medical condition, unless otherwise specified on individual posts.
Christopher Carfi on his Social Customer Manifesto blog has a great post [Let's Look at the Big Picture] that talks about Information as the new form of "money" by looking at how the concept of "money" was first formed 150 years ago. Here's an excerpt:
Lesson 1: "Money" was very fragmented for a very long period of time after the colonization of North America
"Money" as we think of it in the form of cash/paper currency has only been around for about 150 years. Over a period of almost two hundred years both before and after that time, a number of fragmented methods were used to exchange value.
Lesson 2: Everybody needs to win
After the ideas of "cash" and "checks" had taken hold and become widespread, there were still many inefficiencies in the system. Cash is cumbersome, and subject to loss. Checks may bounce. This continued until the mid-1900's.
Enter the credit card.*
The credit card resonated with both customers and vendors because both parties received benefits.
Now, the widespread usage of credit cards was not something that occurred overnight. Instead, it was something that occurred over a generation. In 1970, only 16% of American households had credit cards. However, by 1995, that number had climbed to 65%.
We are now looking at Information in much the same way. It is fragmented, it is used to represent value, it is hoarded by some, shared by others. In much the same way that "brown" is the new "black", does that mean "information" is the new "money"?
A related blog post from Shawn over at Anecdote discusses a panel discussion of Albert Camus' work, The Stranger. Here is an excerpt:
... meaning is not pre-inscribed in the world around us and we are continuously seeking meaning in an inherently meaningless world. I almost toppled off the step machine. Do we live in an inherently meaningless world? On first thought I think the answer is yes. The onus is on us to make sense of our world.
And here is where information, by itself, is not of value unless people place value on it. Just as people valued Wampum and Furs, and could therefore trade them for other goods, people trade information for other items of value. But the onus is on us to make sense of the information, to determine the meaning of it, and use this to help drive business or other accomplishments.
Are you leveraging information as well as investors leverage other people's money? If not, IBM can help.
Rather than a target weight, I chose a target waist measurement, but did not quite make this one. I did keep up with my weekly exercise regime, but we recently installed an "ice cream freezer" here at work, and I have failed to resist temptation.
Reduce, Reuse and Recycle
In my post [Staying on Budget], I resolved to "reduce, reuse and recycle". I have taken measures to de-clutter and simplify my life, and already things are paying off. So I am happy about this one.
Learn to Better use Lotus Notes and Office 2007 software
In my post [Hone your Tools and Skills], I resolved to learn how to better use Lotus Notes and Office 2007. We never got Office 2007. In a surprise move, IBM put out Lotus Symphony, an Office 2007 replacement. Lotus Symphony works on IBM's three approved desktop platforms (Windows XP, Linux and Mac OS X). Here's a collection of [IBM Press Releases about Lotus Symphony].
I did learn how to better use Lotus Notes, thanks to Alan Lepofsky's blog [IBM Lotus Notes Hints, Tips, and Tricks]. Ironically, the best help for dealing with Lotus Notes was not the software itself, but the skills in handling email in general. This includes:
Resist the urge to copy the world, and better use "bcc" to be kind to upper management on "reply all" respondents.
Avoid attaching large documents, but use URLs to NAS file shares, websites, or [YouSendIt.com] instead. Obviously, the recipient has to have access to whatever you point to, but it greatly reduces total email volume and improves transmission over wireless.
Delegate. A lot of times I was the "middleman" between someone asking a question, and someone else I knew had the answer. Now, I just introduce them to each other and step out of the way.
Checking email only a few times a day. I used to check my email every 5-10 minutes, now only 2-4 times per day.
In my post, [Lighten Up], I resolved to laugh more, stretch more, get enough sleep, and listen to music more. I participated in monthly [Tucson Laughter Club] events, incorporated stretching in my weekly exercise program, have gotten more sleep, and rediscovered some of my older music that I hadn't listened to in a while. Overall, I feel happy I met this one.
My New Year's Resolutions for 2008:
Improve my writing skills
Going back through my past blog postings, some of my sentences and paragraphs were frightful. I resolve to improve my sentence and paragraph structure, and make better use of HTML tags to improve the layout and formatting.
Improve my HTML and Web design skills
Contribute to the OLPC Foundation
Last year, as a "Day 1 Donor", I had donated to this important charitable organization to help educate the children of third world nations. This year, I plan to learn Python and other programming languages used on the XO laptop, and see how I can contribute my skills and expertise on the OLPC forums.
Eat Healthier and Drink more
I think my downfall with last year's resolution was that it was merely a goal, 35 inch waist, rather than a "call for action". This year, I plan to eat more fish, salads, whole grains and other heart-healthy foods.
While many people resolve to "Quit Drinking", I need to drink more. My doctor, my personal trainer, and even my interpreter teams, have asked me to do so. We live in Tucson, Arizona, during a century of global warming, and dehydration can cause stress on the body.
Attend more movies and film-making events
Last year, I joined the Tucson Film Society, and produced [my first film], part of which was filmed from Bogota, Colombia. I got invited to see a lot of independent films, premieres, and film-maker events, but did not attend many. I resolve to attend more in 2008.
Get better Organized
Moving offices from one building to another brought to light that I wasn't well organized. While I have made some efforts to de-clutter my home, I need to step this up to my work as well.
I decided to start with something very non-tech, a [Hipster PDA]. I have now met or heard several people who use this approach successfully, and have decided to give it a try.
Hopefully, this list might inspire you to come up with your own resolutions. Not surprisingly, writing them in a public forum helped me keep most of them, and stick to my resolutions throughout the year.
Whew! I am glad that is over. The BarryB circus has left town, he has decided to [move on to other topics], and I am now left to clean up the ["circus gold"] left behind. I would like to remind everyone that all of these discussions have been about the architecture, not the product. IBM will come out with its own version of a product based on Nextra later in 2008, which may be different than the product that XIV currently sells to its customers.
RAID-X does not protect against double-drive failures as well as RAID-6, but it's very close
BarryB calls this the "Elephant in the room", that RAID-6 protects better against double-drive failures. I don't dispute that. He also credits me with the term "RAID-X", but I got this directly from the XIV guys. It turns out this was already a term used among academic research circles for [distributed RAID environments]. Meanwhile, Jon Toigo feels the term RAID-X sounds like a brand of bug spray in his post [XIV Architecture: What’s Not to Like?]. Perhaps IBM can change this to RAID-5.99 instead.
If you measure the risk of a second drive failing during the rebuild or re-replication process of a first drive failure, you can measure the exposure by multiplying the amount of GB at risk by the number of hours during which the second failure could occur, resulting in a unit of "GB-hours". Here I list best-case rebuild times; your mileage may vary depending on whether other workloads exist on the system competing for resources. Notice that 8-disk configurations of RAID-10 and RAID-5 for smaller FC disk are in the triple digits, and larger SATA disk in five digits, but that with RAID-X it is only single digits. That is orders of magnitude closer to the ideal.
For each RAID type, the risk is proportional to the square of the individual drive size. Doubling the drive size makes the risk four times greater. This is not the first time this has been discussed. In [Is RAID-5 Getting Old?], Ramskov quotes NetApp's response in Robin Harris' [NetApp Weighs In On Disks]:
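To make the GB-hours idea concrete, here is a rough Python sketch. It is only an illustration under simplifying assumptions: one drive's worth of data is at risk, rebuild time is simply capacity divided by a sustained rebuild rate, and 1 GB is treated as 1000 MB. The function names and the 40 MB/sec default are illustrative choices, not taken from any product.

```python
def rebuild_hours(drive_gb, rebuild_mb_per_sec):
    """Best-case hours to rewrite one drive, assuming 1 GB = 1000 MB."""
    return drive_gb * 1000 / rebuild_mb_per_sec / 3600


def exposure_gb_hours(at_risk_gb, hours_at_risk):
    """Exposure: amount of data at risk multiplied by the time it stays at risk."""
    return at_risk_gb * hours_at_risk


def single_drive_exposure(drive_gb, rebuild_mb_per_sec=40):
    # Simplifying assumption: one drive's worth of data is at risk
    # for as long as it takes to rebuild that one drive.
    return exposure_gb_hours(drive_gb, rebuild_hours(drive_gb, rebuild_mb_per_sec))


small = single_drive_exposure(375)
large = single_drive_exposure(750)
print(f"375GB drive: {small:,.0f} GB-hours")
print(f"750GB drive: {large:,.0f} GB-hours")
print(f"ratio: {large / small:.1f}")
```

Both factors, the GB at risk and the hours of exposure, grow with drive size, which is where the square comes from.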
...protecting online data only via RAID 5 today verges on professional malpractice.
As disks get older, RAID-6 will not be able to protect against 3-drive failures. A chart similar to the one above could show the risk to data after the second drive fails and both rebuilds are going on, compared to the risk of a third drive failure during this time. The RAID-X scheme protects much better against 3-drive failures than RAID-6.
Nothing in the Nextra architecture prevents a RAID-6, Triple-copy, or other blob-level scheme
In much the same way that EMC Centera is RAID-5 based for its blobs, there is nothing in the Nextra architecture that prevents taking additional steps to provide even better protection, using a RAID-6 scheme, making three copies of the data instead of two copies, or something even more advanced. The current two-copy scheme for RAID-X is better than all the RAID-5 and RAID-10 systems out in the marketplace today.
Mirrored Cache won't protect against Cosmic rays, but ECC detection/correction does
BarryB incorrectly states that since some implementations of cache are non-mirrored, that this implies they are unprotected against Cosmic rays. Mirroring does not protect against bit-flips unless both copies are compared for differences. Unfortunately, even if you compared them, the best you can do is detect they are different, there is no way of knowing which version is correct. Mirroring cache is normally done to protect uncommitted writes. Reads in cache are expendable copies of data already written to disk, so ECC detection/correction schemes are adequate protection. ECC is like RAID for DRAM memory. A single bit-flip can be corrected, multiple bit-flips can be detected. In the case of detection, the cache copy is discarded and read fresh again from disk. IBM DS8000, XIV and probably most other major vendor offerings use ECC of some kind. BarryB is correct that some cheaper entry-level and midrange offerings from other vendors might cut corners in this area. I don't doubt BarryB's assertion that the ECC method used in the EMC products may be differently implemented than the ECC in the IBM DS8000, but that doesn't mean the IBM DS8000's ECC implementation is flawed.
ECC protection is important for all RAID systems that perform rebuild, and even more important the larger the GB-hours listed in the table above.
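For readers curious how "correct one bit-flip, detect two" works, here is a toy Python sketch of a textbook extended Hamming code on 4 data bits. This is not the ECC used in the DS8000, XIV, or any EMC product; real DRAM ECC works on much wider words. The function names are made up for illustration.

```python
def encode(nibble):
    """Encode 4 data bits into 8 bits: index 0 is overall parity, 1-7 are Hamming positions."""
    code = [0] * 8
    code[3], code[5], code[6], code[7] = [(nibble >> i) & 1 for i in range(4)]
    code[1] = code[3] ^ code[5] ^ code[7]   # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]   # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]   # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2             # overall parity over positions 1-7
    return code


def decode(received):
    """Return (nibble, status); nibble is None when the error cannot be corrected."""
    code = list(received)
    syndrome = 0
    for position in range(1, 8):
        if code[position]:
            syndrome ^= position            # XOR of the positions holding a 1
    overall = sum(code) % 2
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        code[syndrome] ^= 1                 # single flip; syndrome 0 means the parity bit itself
        status = "corrected"
    else:
        return None, "uncorrectable"        # two flips: detected, but cannot tell which bits
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, status


word = encode(0b1011)
one_flip = list(word)
one_flip[5] ^= 1
two_flips = list(one_flip)
two_flips[6] ^= 1
print(decode(word), decode(one_flip), decode(two_flips))
```

In the "uncorrectable" case, a cache would do what the post describes: discard its copy and read the data fresh from disk.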
XIV is designed for high-utilization, not less than 50 percent
I mentioned that the typical Linux, UNIX or Windows LUN is only 30-50 percent full, and perhaps BarryB thought I was referring to the typical "XIV customer". This average is for all disk storage systems connected to these operating systems, based on IBM market research and analyst reports. The XIV is expected to run at much higher utilization rates, and offers features like "thin provisioning" and "differential snapshot" to make this simple to implement in practice.
Most often, disks don't fail without warning. Usually, they give out temporary errors first, and then fail permanently. The XIV architecture allows for pre-emptive self-repair, initiating the re-replication process after detecting temporary errors, rather than waiting for a complete drive failure.
I had mentioned that this process used "spare capacity, not spare drives" but I was notified that there are three spare drives per system to ensure that there is enough spare capacity, so I stand corrected.
New drives don't have to match the speed/capacity of the existing drives, so three to five years from now, when it might be hard to find a matching 500GB SATA drive anymore, you won't have to.
No RAID scheme eliminates backups or Business Continuity Planning
The XIV supports both synchronous and asynchronous disk mirroring to remote locations. Backup software will be able to back up data from the XIV to tape. A double drive failure would require a "recovery action", either from the disk mirror, or from tape, for the few GB of data that need to be recovered.
A third alternative is to allow end-users to receive backups of their own user-generated content. For example, I have over 15,000 photos uploaded over the past six years to Kodak Photo Gallery, which I use to share with my friends and family. For about $180 US dollars, they will cut DVDs containing all of my uploaded files and send them to me, so that I do not have to worry about Kodak losing my photos. In many cases, if a company or product fails to deliver on its promises, the most you will get is your money back, but for "free services" like HotMail, FreeDrive, FlickR and others, you didn't pay anything in the first place, and they may point out this limitation of liability in the "terms of service".
XIV can be used for databases and other online transaction processing
The XIV will have FCP and iSCSI interfaces, and systems can use these to store any kind of data you want. I mentioned that the design was intended for large volumes of unstructured digital content, but there is nothing to prevent the use of other workloads. In today's Wall Street Journal article [To Get Back Into the Storage Game, IBM Calls In an Old Foe]:
Today, XIV's Nextra system is used by Bank Leumi, a large Israeli bank, and a few other customers for traditional data-storage tasks such as recording hundreds of transactions a minute.
BarryB, thanks for calling the truce. I look forward to talking about other topics myself. These past two weeks have been exhausting!
In my post yesterday [Spreading out the Re-Replication process], fellow blogger BarryB [aka The Storage Anarchist] raises some interesting points and questions in the comments section about the new IBM XIV Nextra architecture. I answer these below not just for the benefit of my friends at EMC, but also for my own colleagues within IBM, IBM Business Partners, Analysts and clients that might have similar questions.
If RAID 5/6 makes sense on every other platform, why not so on the Web 2.0 platform?
Your attempt to justify the expense of Mirrored vs. RAID 5 makes no sense to me. Buying two drives for every one drive's worth of usable capacity is expensive, even with SATA drives. Isn't that why you offer RAID 5 and RAID 6 on the storage arrays that you sell with SATA drives?
And if RAID 5/6 makes sense on every other platform, why not so on the (extremely cost-sensitive) Web 2.0 platform? Is faster rebuild really worth the cost of 40+% more spindles? Or is the overhead of RAID 6 really too much for those low-cost commodity servers to handle.
Let's take a look at various disk configurations, for example 3TB on 750GB SATA drives:
JBOD: 4 drives
JBOD here is industry slang for "Just a Bunch of Disks" and was invented as the term for "non-RAID". Each drive would be accessible independently, at native single-drive speed, with no data protection. Putting four drives in a single cabinet like this provides simplicity and convenience only over four separate drives in their own enclosures.
RAID-10: 8 drives
RAID-10 is a combination of RAID-1 (mirroring) and RAID-0 (striping). In a 4x2 configuration, data is striped across disks 1-4, then these are mirrored across to disks 5-8. You get performance improvement and protection against a single drive failure.
RAID-5: 5 drives
This would be a 4+P configuration, where there would be four drives' worth of data scattered across five drives. This gives you almost the same performance improvement as RAID-10, similar protection against single drive failure, but with fewer drives per usable TB capacity.
RAID-6: 6 drives
This would be a 4+2P configuration, where the first P represents linear parity, and the second represents a diagonal parity. Similar in performance improvement as RAID-5, but protects against single and double drive failures, and still better than RAID-10 in terms of drives per TB usable capacity.
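The drive counts above follow from simple arithmetic, sketched here in Python. The function name and scheme labels are illustrative, and the sketch assumes a single rank with no spare drives and 1 TB = 1000 GB.

```python
import math

# Extra drives each scheme adds on top of the data drives (single rank, no spares).
OVERHEAD = {
    "JBOD": lambda data: 0,        # no protection
    "RAID-10": lambda data: data,  # every data drive is mirrored
    "RAID-5": lambda data: 1,      # one drive's worth of parity
    "RAID-6": lambda data: 2,      # two drives' worth of parity
}


def drives_needed(usable_tb, drive_gb, scheme):
    """Total drives required to present usable_tb of capacity."""
    data_drives = math.ceil(usable_tb * 1000 / drive_gb)
    return data_drives + OVERHEAD[scheme](data_drives)


for scheme in OVERHEAD:
    print(f"{scheme}: {drives_needed(3, 750, scheme)} drives")
```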
For all the RAID configurations, rebuild would require a spare drive, but often spares are shared among multiple RAID ranks, not dedicated to a single rank. To this end, you often have to have several spares per I/O loop, and a different set of spares for each kind of speed and capacity. If you had a mix of 15K/73GB, 10K/146GB, and 7200/500GB drives, then you would have three sets of spares to match.
In contrast, IBM XIV's innovative RAID-X approach doesn't require any spare drives, just spare capacity on existing drives being used to hold data. The objects can be mirrored between any two types of drives, so no need to match one with another.
All of these RAID levels represent some trade-off between cost, protection and performance, and IBM offers each of these on various disk systems platforms. Calculating parity is more complicated than just mirrored copies, but this can be done with specialized chips in cache memory to minimize performance impact. IBM generally recommends RAID-5 for high-performance FC disk, and RAID-6 for slower, large capacity SATA disk.
However, the question assumes that the drive cost is a large portion of the overall "disk system" cost. It isn't. For example, Jon Toigo discusses the cost of EMC's new AX4 disk system in his post [National Storage Rip-Off Day]:
EMC is releasing its low end Clariion AX4 SAS/SATA array with 3TB capacity for $8600. It ships with four 750GB SATA drives (which you and I could buy at list for $239 per unit). So, if the disk drives cost $956 (presumably far less for EMC), that means buyers of the EMC wares are paying about $7700 for a tin case, a controller/backplane, and a 4Gbps iSCSI or FC connector. Hmm.
Dell is offering EMC’s AX4-5 with same configuration for $13,000 adding a 24/7 warranty.
(Note: I checked these numbers. $8599 is the list price that EMC has on its own website. External 750GB drives available at my local Circuit City ranged from $189 to $329 list price. I could not find anything on Dell's own website, but found [The Register] to confirm the $13,000 with 24x7 warranty figure.)
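Using only the figures quoted above ($8600 for the system, four drives at $239 list), a few lines of Python show how small the drive share is. The function name is illustrative, and list price is used even though a vendor presumably pays less.

```python
def drive_share(system_price, drive_price, drive_count):
    """Return (drive total, everything else, drives as a fraction of system price)."""
    drives_total = drive_price * drive_count
    return drives_total, system_price - drives_total, drives_total / system_price


drives_total, everything_else, share = drive_share(8600, 239, 4)
print(f"drives: ${drives_total}, everything else: ${everything_else}, drive share: {share:.1%}")
```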
Disk capacity is a shrinking portion of the total cost of ownership (TCO). In addition to capacity, you are paying for cache, microcode and electronics of the system itself, along with software and services that are included in the mix, and your own storage administrators to deal with configuration and management. For more on this, see [XIV storage - Low Total Cost of Ownership].
EMC Centera has been doing this exact type of blob striping and protection since 2002
As I've noted before, there's nothing "magic" about it - Centera has been employing the same type of object-level replication for years. Only EMC's engineers have figured out how to do RAID protection instead of mirroring to keep the hardware costs low while not sacrificing availability.
I agree that IBM XIV was not the first to do an object-level architecture, but it was one of the first to apply object-level technologies to the particular "use case" and "intended workload" of Web 2.0 applications.
RAID-5 based EMC Centera was designed instead to hold fixed-content data that needed to be protected for a specific period of time, such as to meet government regulatory compliance requirements. This is data that you most likely will never look at again unless you are hit with a lawsuit or investigation. For this reason, it is important to get it on the cheapest storage configuration possible. Before EMC Centera, customers stored this data on WORM tape and optical media, so EMC came up with a disk-only alternative offering. IBM System Storage DR550 offers disk-level access for the most recent archives, with the ability to migrate to much less expensive tape for long-term retention. The end result is that storing on a blended disk-plus-tape solution can help reduce the cost by a factor of 5x to 7x, making RAID level discussion meaningless in this environment. For more on this, see my post [Optimizing Data Retention and Archiving].
While both the Centera and DR550 are based on SATA, neither is designed for Web 2.0 platforms. When EMC comes out with their own "me, too" version, they will probably make a similar argument.
IBM XIV Nextra is not a DS8000 replacement
Nextra is anything but Enterprise-class storage, much less a DS8000 replacement. How silly of all those folks to suggest such a thing.
I did searches on the Web and could not find anybody, other than EMC employees, who suggested that IBM XIV Nextra architecture represented a replacement for IBM System Storage DS8000. The IBM XIV press release does not mention or imply this, and certainly nobody I know at IBM has suggested this.
The DS8000 is designed for a different "use case" and set of "intended workloads" than what the IBM XIV was designed for. The DS8000 is the most popular disk system for our IBM System z mainframe platform, for activities like Online Transaction Processing (OLTP) and large databases, supporting ESCON and FICON attachment to high-speed 15K RPM FC drives. Web 2.0 customers that might choose IBM XIV Nextra for their digital content might run their financial operations or metadata search indexes on DS8000. Different storage for different purposes.
As for the opinion that this is not "enterprise class", there are a variety of definitions for this phrase. Some analysts look at "price band" of units that cost over $300,000 US dollars. Other analysts define this as being attachable to mainframe servers via ESCON or FICON. Others use the term to refer to five-nines reliability, having about 5 minutes of downtime per year. In this regard, based on the past two years' experience at 40 customer locations, I would argue that it meets this last definition, with non-disruptive upgrades, microcode updates and hot-swappable components.
By comparison, when EMC introduced its object-level Centera architecture, nobody suggested it was the replacement for their Symmetrix or CLARiiON devices. Was it supposed to be?
Given drive growth rates have slowed, improving utilization is mandatory to keep up with 60-70 percent CAGR
Look around you, Tony- all of your competitors are implementing thin provisioning specifically to drive physical utilization upwards towards 60-80%, and that's on top of RAID 5/RAID 6 storage and not RAID 1. Given that disk drive growth rates and $/GB cost savings have slowed significantly, improving utilization is mandatory just to keep up with the 60-70% CAGR of information growth.
Disk drive capacity growth has slowed for FC disk because much of the attention and investment has been re-directed to ATA technology. Dollar-per-GB price reduction is slowing for disks in general, as researchers are hitting physical limitations on the number of bits they can pack per square inch of disk media, and is now around 25 percent per year. The 60-70 percent Compound Annual Growth Rate (CAGR) is real, and can grow even faster for Web 2.0 providers. While hardware costs drop, the big ticket items to watch will be software, services and storage administrator labor costs.
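To see why these two rates matter together, here is a back-of-the-envelope Python sketch. It assumes, purely for illustration, that stored capacity grows at the CAGR and that all capacity is bought at each year's price per GB; the function name is made up.

```python
def annual_spend_multiplier(capacity_growth, price_decline):
    """Year-over-year change in capacity spend when data grows and $/GB falls."""
    return (1 + capacity_growth) * (1 - price_decline)


for growth in (0.60, 0.70):
    multiplier = annual_spend_multiplier(growth, 0.25)
    print(f"{growth:.0%} growth, 25% cheaper per GB: spend changes by {multiplier - 1:+.1%} per year")
```

Under these assumptions, a 25 percent annual price decline does not offset 60-70 percent growth, so hardware spend still rises each year unless utilization improves.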
To this end, IBM XIV Nextra offers thin provisioning and differential space-efficient snapshots. It is designed for 60-90 percent utilization, and can be expanded to larger capacities non-disruptively in a very scalable manner.
On his The Storage Architect blog, Chris Evans wrote [Two for the Price of One]. He asks: why use RAID-1 compared to, say, a 14+2 RAID-6 configuration, which would be much cheaper in terms of the disk cost? Perhaps without realizing it, he answers it with his post today [XIV part II]:
So, as a drive fails, all drives could be copying to all drives in an attempt to ensure the recreated lost mirrors are well distributed across the subsystem. If this is true, all drives would become busy for read/writes for the rebuild time, rather than rebuild overhead being isolated to just one RAID group.
Let me try to explain. (Note: This is an oversimplification of the actual algorithm in an effort to make it more accessible to most readers, based on written materials I have been provided as part of the acquisition.)
In a typical RAID environment, say 7+P RAID-5, you might have to read 7 drives to rebuild one drive, and in the case of a 14+2 RAID-6, reading 15 drives to rebuild one drive. It turns out the performance bottleneck is the one drive to write, and today's systems can rebuild faster Fibre Channel (FC) drives at about 50-55 MB/sec, and slower ATA disk at around 40-42 MB/sec. At these rates, a 750GB SATA rebuild would take at least 5 hours.
In the IBM XIV Nextra architecture, let's say we have 100 drives. We lose drive 13, and we need to re-replicate any at-risk 1MB objects. An object is at-risk if it is the last and only remaining copy on the system. A 750GB drive that is 90 percent full would have 700,000 or so at-risk object re-replications to manage. These can be sorted by drive. Drive 1 might have about 7000 objects that need re-replication, drive 2 might have slightly more or slightly less, and so on, up to drive 100. The re-replication of objects on these other 99 drives goes through three waves.
Select 49 drives as "source volumes", and pair each randomly with a "destination volume". For example, drive 1 mapped to drive 87, drive 2 to drive 59, and so on. Initiate 49 tasks in parallel, each will re-replicate the blocks that need to be copied from the source volume to the destination volume.
50 volumes left. Select another 49 drives as "source volumes", and pair each with a "destination volume". For example, drive 87 mapped to drive 15, drive 59 to drive 42, and so on. Initiate 49 tasks in parallel, each will re-replicate the blocks that need to be copied from the source volume to the destination volume.
Only one drive left. We select the last volume as the source volume, pair it off with a random destination volume, and complete the process.
Each wave can take as little as 3-5 minutes. The actual algorithm is more complicated than this, since as tasks complete early the source and destination drives become available for re-assignment to another task, but you get the idea. XIV has demonstrated that the entire process, identifying all at-risk objects, sorting them by drive location, randomly selecting drive pairs, and then performing most of these tasks in parallel, can be done in 15-20 minutes. Over 40 customers have been using this architecture over the past 2 years, and by now all have probably experienced at least one drive failure to validate this methodology.
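The three waves can be sketched in Python as follows. This models only the oversimplified description above, not the actual XIV algorithm; the function names are made up, and the 40 MB/sec per-task copy rate is an assumption borrowed from the ATA rebuild figure mentioned earlier.

```python
import random


def at_risk_objects(drive_gb, fill_percent, object_mb=1):
    """Number of objects that lost a copy when the drive failed (1 GB = 1000 MB)."""
    return drive_gb * 1000 * fill_percent // (100 * object_mb)


def plan_waves(surviving_drives, rng):
    """Group source drives into waves; no drive is used twice within a wave."""
    if len(surviving_drives) < 2:
        raise ValueError("need at least two surviving drives")
    pending = list(surviving_drives)
    rng.shuffle(pending)
    tasks_per_wave = len(surviving_drives) // 2
    waves = []
    while pending:
        sources, pending = pending[:tasks_per_wave], pending[tasks_per_wave:]
        busy = set(sources)
        idle = [d for d in surviving_drives if d not in busy]
        destinations = rng.sample(idle, len(sources))  # random pairing
        waves.append(list(zip(sources, destinations)))
    return waves


failed = 13
surviving = [d for d in range(1, 101) if d != failed]
objects = at_risk_objects(750, 90)
per_source = objects / len(surviving)
waves = plan_waves(surviving, random.Random(2008))

print([len(wave) for wave in waves])  # [49, 49, 1]
# Assumed copy rate of 40 MB/sec per task (hypothetical).
print(f"{objects:,} objects, about {per_source / 40 / 60:.1f} minutes per task")
```

Because each task only moves its own drive's small share of the at-risk objects, the per-task copy time is measured in minutes rather than the hours needed to rewrite a whole drive.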
In the unlikely event that a second drive fails during this short time, only one of the 99 tasks fails. The other 98 tasks continue to help protect the data. By comparison, in a RAID-5 rebuild, no data is protected until all the blocks are copied.
As for requiring spare capacity on each drive to handle this case, the best disks in production environments are typically only 85-90 percent full, leaving plenty of spare capacity to handle the re-replication process. On average, Linux, UNIX and Windows systems tend to only fill disks 30 to 50 percent full, so the fear that there is not enough spare capacity should not be an issue.
The difference in cost between RAID-1 and RAID-5 becomes minimal as hardware gets cheaper and cheaper. For every dollar you spend on storage hardware, you spend $5-$8 managing the environment. As hardware gets cheaper still, it might even be worth making three copies of every 1MB object; the parallel process to perform re-replications would be the same. This could be done using policy-based management, where some data gets triple-copied, and other data gets only double-copied, based on whether the user selected "premium" or "basic" service.
The beauty of this approach is that it works with 100 drives, 1000 drives, or even a million drives. Parallel processing is how supercomputers are able to perform feats of amazing mathematical computations so quickly, and how Web 2.0 services like Google and Yahoo can perform web searches so quickly. Spreading the re-replication process across many drives in parallel, rather than performing them serially onto a single drive, is just one of the many unique features of this new architecture.
Wrapping up my week's theme on IBM's acquisition of XIV, we have gotten hundreds of positive articles and reviews in the press, but it has caused quite a stir with the [Not-Invented-Here] folks at EMC. We've heard already from EMC bloggers [Chuck Hollis] and [Mark Twomey]. The latest is fellow EMC blogger BarryB's missive [Obligatory "IBM buys XIV" Post], which piles on the "Fear, Uncertainty and Doubt" [FUD], including this excerpt here:
In a block storage device, only the host file system or database engine "knows" what's actually stored in there. So in the Nextra case that Tony has described, if even only 7,500-15,000 of the 750,000 total 1MB blobs stored on a single 750GB drive (that's "only" 1 to 2%) suddenly become inaccessible because the drive that held the backup copy also failed, the impact on a file system could be devastating. That 1MB might be in the middle of a 13MB photograph (rendering the entire photo unusable). Or it might contain dozens of little files, now vanished without a trace. Or worst yet, it could actually contain the file system metadata, which describes the names and locations of all the rest of the files in the file system. Each 1MB lost to a double drive failure could mean the loss of an enormous percentage of the files in a file system.
And in fact, with Nextra, the impact will be across not just one, but more likely several dozens or even hundreds of file systems.
Worse still, the Nextra can't do anything to help recover the lost files.
Nothing could be further from the truth. If any disk drive module failed, the system would know exactly whichone it was, what blobs (binary large objects) were on it, and where the replicated copies of those blobs are located. In the event of a rare double-drive failure, the system would know exactly which unfortunate blobs were lost, and couldidentify them by host LUN and block address numbers, so that appropriate repair actions could be taken from remote mirrored copies or tape file backups.
Second, nobody is suggesting we are going to put a delicate FAT32-like Circa-1980 file system that breaks with the loss of a single block and requires tools like "fsck" to piece back together. Today's modern file systems--including Windows NTFS, Linux ext3, and AIX JFS2--are journaled and have sophisticated algorithms to handle the loss of individual structure inode blocks. IBM has its own General Parallel File System [GPFS] and corresponding Scale out File Services [SOFS], and thus brings a lot of expertise to the table. Advanced distributed clustered file systems, like [Google File System] and Yahoo's [Hadoop project] take this one step further, recognizing that individual node and drive failures at the Petabyte-scale are inevitable.
In other words, XIV Nextra architecture is designed to eliminate or reduce recovery actions after disk failures, not make them worse. Back in 2003, when IBM introduced the new and innovative SAN Volume Controller (SVC), EMC claimed this in-band architecture would slow down applications and "brain-damage" their EMC Symmetrix hardware. Reality has proved the opposite: SVC can improve application performance and help reduce wear-and-tear on the managed devices. Since then, EMC acquired Kashya to offer its own in-band architecture in a product called EMC RecoverPoint, which offers some of the features that SVC offers.
If you thought fear mongering like this was unique to the IT industry, consider that 105 years ago, [Edison electrocuted an elephant]. To understand this horrific event, you have to understand what was going on at the time. Thomas Edison, inventor of the light bulb, wanted to power the entire city of New York with Direct Current (DC). Nikola Tesla proposed a different, but more appropriate architecture, called Alternating Current (AC), that had lower losses over distances required for a city as large and spread out as New York. But Thomas Edison was heavily invested in DC technology, and would lose out on royalties if AC was adopted. In an effort to show that AC was too dangerous to have in homes and businesses, Thomas Edison held a press conference in front of 1500 witnesses, electrocuting an elephant named Topsy with 6600 volts, and filmed the event so that it could be shown later to other audiences (Edison invented the movie camera also).
Today's nationwide electric grid would not exist without Alternating Current. We enjoy both AC for what it is best used for, and DC for what it is best used for. Both are dangerous at high voltage levels if not handled properly. The same is the case for storage architectures. Traditional high-performance disk arrays, like the IBM System Storage DS8000, will continue to be used for large mainframe applications, online transaction processing and databases. New architectures, like IBM XIV Nextra, will be used for new Web 2.0 applications, where scalability, self-tuning, self-repair, and management simplicity are the key requirements.
(Update: Dear readers, this was meant as a metaphor only, relating the concerns expressed above, that the use of new innovative technology may result in the loss or corruption of "several dozen or even hundreds of file systems" and is thus too dangerous to use, to the old claim that AC electricity was too dangerous to use in homes. To clarify, EMC did not re-enact Thomas Edison's event, no animals were hurt by EMC, and I was not trying to make political commentary about the current controversy of electrocution as a method of capital punishment. The opinions of individual bloggers do not necessarily reflect the official positions of EMC, and I am not implying that anyone at EMC enjoys torturing animals of any size, nor implying anything about their positions on capital punishment in general. This is not an attack on any of the above-mentioned EMC bloggers, but rather an attempt to point out faulty logic. Children should not put foil gum wrappers in electrical sockets. BarryB and I have apologized to each other over these posts for any feelings hurt, and discussion should focus instead on the technologies and architectures.)
While EMC might try to tell people today that nobody needs unique storage architectures for Web 2.0 applications, digital media and archive data, because their existing products support SATA disk and can be used instead for these workloads, they are probably working hard behind the scenes on their own "me, too" version. And with a bit of irony, Edison's film of the elephant is available on YouTube, one of the many Web 2.0 websites we are talking about. (Out of a sense of decency, I decided not to link to it here, so don't ask.)
Yesterday's announcement that IBM had acquired XIV to offer storage for Web 2.0 applications prompted a lot of discussion in both the media and the blogosphere. Several indicated that it was about time that one of the major vendors stepped forward to provide this, and it made sense that IBM, the leader in storage hardware marketshare, would be the first. Others were perhaps confused on what is unique with Web 2.0 applications. What has changed?
I'll use this graphic to help explain how we have transitioned through three eras of storage.
The first era: Server-centric
In the 1950s, IBM introduced both tape and disk systems into a very server-centric environment. Dumb terminals and dumb storage devices were managed entirely by the brains inside the server. These machines were designed for Online Transaction Processing (OLTP), everywhere from booking flights on airlines to handling financial transfers.
The second era: Network-centric
In the 1980s and 1990s, dumb terminals were replaced with smarter workstations and personal computers, and dumb storage was replaced with smarter storage controllers. Local Area Networks (LANs) and Storage Area Networks (SANs) allowed more cooperative processing between users, servers and storage. However, servers maintained their role as gatekeepers. Users had to go through a specific server or server cluster to access the storage they had access to. These servers continued their role in OLTP, but also managed informational databases, file sharing and web serving.
The third era: Information-centric
Today, we are entering a third era. Servers are no longer the gatekeepers. Smart workstations and personal computers are now supplemented with even more intelligent handheld devices, BlackBerry and iPhones, for example. Storage is more intelligent too, with some being able to offer file sharing and web serving directly, without the need of an intervening server. The roles of servers have changed, from gatekeepers to ones that focus on crunching the numbers, and making information presentable and useful.
Here is where Web 2.0 applications, digital media and archives fit in. These are focused on unstructured data that doesn't require relational database management systems. So long as the user is authorized, subscribed and/or has made the appropriate payment, she can access the information. With the appropriate schemes in place, information can now be mashed-up in a variety of ways, combined with other information that can render insights and help drive new innovations.
Of course, we will still have databases and online transaction processing to book our flights and transfer our funds, but this new era brings in new requirements for information storage, and new architectures that help optimize this new approach.
Well, it's 2008, which could mark the end to RAID5 and mark the beginnings of a new disk storage architecture. IBM starts the year with exciting news, acquiring new disk technology from a small start-up called XIV, led by former-EMCer Moshe Yanai. Moshe was ousted publicly in 2001 from his position as EMC's VP of engineering, and formed his own company. It didn't take long for EMC bloggers to poke fun at this already. Mark Twomey, in his StorageZilla blog, had mentioned XIV back in August, [XIV], and again today in [IBM Buys XIV].
To address the new requirements associated with next generation digital content, IBM chose XIV and its NEXTRA™ architecture for its ability to scale dynamically, heal itself in the event of failure, and self-tune for optimum performance, all while eliminating the significant management burden typically associated with rapid growth environments. The architecture also is designed to automatically optimize resource utilization of all the components within the system, which can allow for easier management and configuration and improved performance and data availability.
"We are pleased to become a significant part of the IBM family, allowing for our unique storage architecture, our engineers and our storage industry experience to be part of IBM's overall storage business," said Moshe Yanai, chairman, XIV. "We believe the level of technological innovation achieved by our development team is unparalleled in the storage industry. Combining our storage architectural advancements with IBM's world-wide research, sales, service, manufacturing, and distribution capabilities will provide us with the ability to have these technologies tackle the emerging Web 2.0 technology needs and reach every corner of the world."
The NEXTRA architecture has been in production for more than two years, with more than four petabytes of capacity being used by customers today.
Current disk arrays were designed for online transaction processing (OLTP) databases. The focus was on using the fastest, most expensive 10K and 15K RPM Fibre Channel drives, with clever caching algorithms for quick small updates of large relational databases. However, the world is changing, and people now are looking for storage designed for digital media, archives, and other Web 2.0 applications.
One problem that NEXTRA architecture addresses is RAID rebuild. In a standard RAID5 6+P+S configuration of 146GB 10K RPM drives, the loss of one disk drive module (DDM) was recovered by reconstructing the data from parity of the other drives onto the spare drive. The process took 46 minutes or longer, depending on how busy the system was doing other things. During this time, if a second drive in the same rank fails, all 876GB of data are lost. Double-drive failures are rare, but unpleasant when they happen, and hopefully you have a backup on tape to recover the data from. Moving to slower, less expensive SATA drives made this situation worse. The drives have higher capacity, but run at slower speeds. When a SATA drive fails in a RAID5 array, it could take several hours to rebuild, and that is more time exposure for a second drive failure. A rebuild for a 750GB SATA drive would take five hours or more, with 4.5 TB of data at risk during the process if a second drive failure occurs.
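The arithmetic behind those figures is easy to check. The short Python sketch below reproduces the 876GB and 4.5 TB exposure numbers from the 6+P+S layout, and derives the average rebuild rate implied by the quoted times. The function names are hypothetical, decimal units (1 GB = 1000 MB) are assumed, and the rates are implied averages rather than measured throughput.

```python
def data_at_risk_gb(data_drives: int, drive_gb: int) -> int:
    """User data lost if a second drive in the rank fails mid-rebuild.
    In a 6+P+S rank, six drives' worth of capacity holds data."""
    return data_drives * drive_gb


def implied_rebuild_mb_s(drive_gb: int, rebuild_minutes: float) -> float:
    """Average write rate onto the spare implied by a rebuild time."""
    return drive_gb * 1000 / (rebuild_minutes * 60)


fc_risk = data_at_risk_gb(6, 146)          # Fibre Channel rank
sata_risk = data_at_risk_gb(6, 750)        # SATA rank
fc_rate = implied_rebuild_mb_s(146, 46)    # 46-minute rebuild
sata_rate = implied_rebuild_mb_s(750, 300) # five-hour rebuild

print(f"FC:   {fc_risk} GB at risk, about {fc_rate:.1f} MB/s")
print(f"SATA: {sata_risk} GB at risk, about {sata_rate:.1f} MB/s")
```

The SATA rank holds roughly five times the data and, at these implied rates, rebuilds somewhat slower per megabyte, which is why the exposure window grows from under an hour to five hours or more.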
The Nextra architecture doesn't use traditional RAID ranks or spare DDMs. Instead, data is carved up into 1MB objects, and each object is stored on two physically-separate drives. In the event of a DDM loss, all the data is readable from the second copies that are spread across hundreds of drives. New copies are made on the empty disk space of the remaining system. This process can be done for a lost 750GB drive in under 20 minutes. A double-drive failure would only lose those few objects that were on both drives, so perhaps 1 to 2 percent of the total data stored on that logical volume.
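To see how small the overlap between two failed drives can be, here is a minimal Python sketch. It assumes each 1MB object's two copies land on a uniformly random pair of distinct drives; that placement rule, the function names and the drive counts are illustrative assumptions, since the actual distribution algorithm is not described here.

```python
import random


def expected_shared_fraction(total_drives: int) -> float:
    """Fraction of one drive's objects whose other copy sits on one
    specific second drive, under uniform random pair placement."""
    return 1 / (total_drives - 1)


def simulated_shared_fraction(total_drives: int, objects: int, seed: int = 0) -> float:
    """Place objects on random drive pairs, then fail drives 0 and 1 and
    measure what share of drive 0's objects lost both copies."""
    rng = random.Random(seed)
    on_first = on_both = 0
    for _ in range(objects):
        pair = rng.sample(range(total_drives), 2)  # two distinct drives
        if 0 in pair:
            on_first += 1
            if 1 in pair:
                on_both += 1
    return on_both / on_first


for drives in (51, 101):
    share = expected_shared_fraction(drives)
    print(f"{drives} drives: {share:.1%} of a drive's objects, "
          f"about {round(750_000 * share)} of 750,000")
print(f"simulated, 51 drives: {simulated_shared_fraction(51, 200_000):.1%}")
```

Under this simplified model, the 1 to 2 percent figure corresponds to the mirror copies of a drive's objects being spread over roughly 50 to 100 other drives, and the share shrinks further as more drives participate.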
Losing 1 to 2 percent of data might be devastating to a large relational database, as this could impact the entire access to the internal structure. However, this box was designed for unstructured content, like medical images, music, videos, Web pages, and other discrete files. In the event of a double-drive failure, individual files would be recovered, such as with IBM Tivoli Storage Manager backup software.
IBM will continue to offer high-speed disk arrays like the IBM System Storage DS8000 and DS4800 for OLTP applications, and offer NEXTRA for this new surge in digital content of unstructured data. Recognizing this trend, disk drive module manufacturers will phase out 10K RPM drives, and focus on 15K RPM for OLTP, and low-speed SATA for everything else.
Update: This blog post was focused on the version of the XIV box available as of January 2008 that was built by XIV prior to the IBM acquisition. IBM has since made a major revision, made available August 2008, that addresses a variety of workloads, including database, OLTP, email, as well as digital content and unstructured files. Contact your IBM representative or IBM Business Partner for the latest details!
Bottom line, IBM continues to celebrate the new year, while the EMC folks in Hopkinton, MA will continue to nurse their hangovers. Now that's a good way to start the new year!