Last year, I started my post[Hu Yoshida should know better
I am still wiping the coffee off my computer screen, inadvertently sprayed when I took a sip while reading HDS' uber-blogger Hu Yoshida's post on storage virtualization and vendor lock-in.
HDS is a major vendor for disk storage virtualization, and Hu Yoshida has been around for a while, so I felt it was fair to disagree with some of the generalizations he made to set the record straight. He's been more careful ever since.
However, his latest post [The Greening of IT: Oxymoron or Journey to a New Reality] mentions an expert panel at SNW that includedMark O’Gara Vice President of Infrastructure Management at Highmark. I was not at the SNW conference last week in Orlando, so I will just give the excerpt from Hu's account of what happened:
"Later I had the opportunity to have lunch with Mark O’Gara. Mark is a West Point graduate so he takes a very disciplined approach to addressing the greening of IT. He emphasized the need for measurements and setting targets. When he started out he did an analysis of power consumption based on vendor specifications and came up with a number of 513 KW for his data center infrastructure....
The physical measurements showed that the biggest consumers of power were in order: Business Intelligence Servers, SAN Storage, Robotic tape Library, and Virtual tape servers....
Another surprise may be that tape libraries are such large consumers of power. Since tape is not spinning most of the time they should consume much less power than spinning disk - right? Apparently not if they are sitting in a robotic tape library with a lot of mechanical moving parts and tape drives that have to accelerate and decelerate at tremendous speeds. A Virtual Tape Library with de-duplication factor of 25:1 and large capacity disks may draw significantly less power than a robotic tape library for a given amount of capacity.
Obviously, I know better than to sip coffee whenever reading Hu's blog. I am down here in South America this week, the coffee is very hot and very delicious, so I am glad I didn't waste any on my laptop screen this time, especially reading that last sentence!
Last month, in my post [Disk only customers going back to tape], I mentioned some statistics from the Clipper Group's whitepaper[Disk and Tape Square Off Again —Tape Remains King of the Hill with LTO-4] by analysts David Reine and Mike Kahn.
In that report, a 5-year comparison found that a repository based on SATA disk was 23 times more expensive overall, and consumed 290 times more energy, than a tape library based on LTO-4 tape technology. The analysts even considered a disk-based Virtual Tape Library (VTL). Focusing just on backups, at a 20:1 deduplication ratio, the VTL solution was still 5 times per expensive than the tape library. If you use the 25:1 ratio that Hu Yoshida mentions in his post above, that would still be 4 times more than a tape library.
I am not disputing Mark O'Gara's disciplined approach.
It is possible that Highmark is using a poorly written backup program, taking full backups every day, to an older non-IBM tape library, in a manner that causes no end of activity to the poor tape robotics inside. But rather than changing over to a VTL, perhaps Mark might be better off investigating the use of IBM Tivoli Storage Manager, using progressive backup techniques, appropriate policies, parameters and settings, to a more energy-efficient IBM tape library.In well tuned backup workloads, the robotics are not very busy. The robot mounts the tape, and then the backup runs for a long time filling up that tape, all the meanwhile the robot is idle waiting for another request.
(Update: My apologies to Mark and his colleagues at Highmark. The above paragraph implied that Mark was using badproducts or configured them incorrectly, and was inappropriate. Mark, my full apology [here])
If you do decide to go with a Virtual Tape Library, for reasons other than energy consumption, doesn't it make sense to buy it from a vendor that understands tape systems, rather than buying it from one that focuses on disk systems? Tape system vendors like IBM, HP or Sun understand tape workloads as well as related backup and archive software, and can provide better guidance and recommendations based on years of experience. Asking advice abouttape systems, including Virtual Tape Libraries, from a disk vendor is like asking for advice on different types of bread from your butcher, or advice about various cuts of meat at the bakery.
The butchers and bakers might give you answers, but it may not be the best advice.
technorati tags: HDS, Hu Yoshida, Mark O'Gara, Highmark, SNW, Orlando, Florida, de-duplication, deduplication, dedupe, robotic, tape library, virtual, VTL, Clipper Group, David Reine, Mike Kahn, SATA, disk, systems, HP, Sun, backup, archive, workloads, butcher, baker, bakery, meat, bread, advice, IBM, Tivoli Storage Manager, TSM, LTO, LTO-4
The IBM Challenge was a big success. One of the contestants, Ken Jennings, [welcomes our new computer overlords]. Congratulations are in order to the IBM Research team who pulled off this Herculean effort!
Some folks have poked fun at some of the odd responses and wager amounts from the IBM Watson computer during the three-day tournament. Others were surprised as I was that the impressive feat was done with less than 1TB of stored data. Here is what John Webster wrote in CNET yesterday, in hist article [What IBM's Watson says to storage systems developers]:
"All well and good. But here's what I find most interesting as a result of what IBM has done in response to the Grand Challenge that motivated Watson's creators. We know, from Tony Pearson's blog, that the foundation of Watson's data storage system is a modified IBM SONAS cluster with a total of 21.6TB of raw capacity. But Pearson also reveals another very significant, and to me, surprising data point: "When Watson is booted up, the 15TB of total RAM are loaded up, and thereafter the DeepQA processing is all done from memory. According to IBM Research, the actual size of the data (analyzed and indexed text, knowledge bases, etc.) used for candidate answer generation and evidence evaluation is under 1 Terabyte."
What Pearson just said is that the data set Watson actually uses to reach his push-the-button decision would fit on a 1TB drive. So much for big data?"
To better appreciate how difficult the challenge was, and how a small amount of data can answer a billion different questions, I thought I would cover Business Intelligence, Data Retrieval and Text Mining concepts.
Let's start with Business Intelligence.
[Seth Grimes] pointed me to this quote from [A Business Intelligence System], written by Hans Peter Luhn back in October 1958 IBM Journal.
"In this paper, business is a collection of activities carried
on for whatever purpose, be it science, technology,
commerce, industry, law, government, defense, et cetera.
The communication facility serving the conduct of a business
(in the broad sense) may be referred to as an intelligence
system. The notion of intelligence is also defined
here, in a more general sense, as the ability to apprehend
the interrelationships of presented facts in such a way as
to guide action towards a desired goal."
Ideally, when you need "Business Intelligence" to help you make a better decision, you perform data retrieval from a structured database for the specific information you are looking for. In other cases, you might be looking for insight, patterns or trends. In that case, you go "data mining" against your structured databases.
Here's a simple example. John runs a fruit stand. One day, he kept track of how many apples and oranges were bought by men and women. How many questions can we ask against this small set of data? Let's count them:
- How many apples were sold to men?
- How many apples were sold to women?
- How many oranges were sold to men?
- How many oranges were sold to women?
But wait! For each row and column, we can combine them into totals.
- How many apples were sold in total?
- How many oranges were sold in total?
- How many fruit in total were sold to men?
- How many fruit in total were sold to women?
- How many fruit in total were sold?
|Total||63||50%||63||50%||126||But wait, there's more! Each row and column can be evaluated for relative percentages, as well as percentages of each cell compared to the total. You could make five relevant pie-charts from this data. This results in 16 more questions, such as:
- Of the fruit purchased by men, what percentage for apples?
- Of all the apples purchased, what percentage by women?
And that's not including more ethereal questions, such as:
- Are there gender-specific preferences for different types of fruit?
- What type of fruit do men prefer?
This is just for a small set, two market segments (by gender) and two products (apples and oranges). However, if you have many market segments (perhaps by age group, zip code, etc.) and many products, the number of queries that can be supported is huge. For small sets of data, you can easily do this with a spreadsheet program like IBM Lotus Symphony or Microsoft Excel.
(Photo courtesy of [OLAP, Cubes and Multidimensional Analysis] by Andrew Fryer.)
But why limit yourself to two dimensions? The above example was just for one day's worth of activity, if John captures this data for every day for historical and seasonal trending, it can be represented as a three-dimensional cube. The number of queries becomes astronomical. This is the basis for Online Analytical Processing (OLAP), and three-dimensional tables are often referred to as [OLAP cubes].
Back in 1970, IBM invented the Structured Query Language [SQL], and today, nearly all modern relational databases support this, including IBM DB2, Informix, Microsoft SQL Server, and Oracle DB. SQL poses two challenges. First, you had to structure the data in advance to the way you expect to perform your ad-hoc queries. Deciding the groups and categories in advance can limit the way information is recorded and captured.
Second, you had to be skilled at SQL to phrase your queries correctly to retrieve the data you are after. What ended up happening was that skilled SQL programmers would develop "canned reports" with fixed SQL parameters, so that less-skilled business decision makers could base their decisions from these reports.
IBM has fully integrated stacks to help process structured data, combining servers, storage, and advanced analytics software into a complete appliance. IBM offers the [Smart Analytics System] for robust, customized deployments, and recently acquired [Netezza] for pre-configured, and more rapid deployments.
However, the bigger problem is that more than 80 percent of information is not structured!
Semi-structured data like email provides some searchable fields like From and Subject. The rest of the information is unstructured, such as text files, photographs, video and audio. To look for specific information in unstructured sources can be like looking for a needle in a haystack, and trying to get insight, patterns or trends involves text mining.
IBM is a leader in Business Analytics and has made great progress in dealing with unstructured data. This includes [IBM OmniFind Enterprise Edition], [IBM e-Discovery Manager] and [IBM Cognos Business Intelligence].
This, in effect, is what IBM Watson was able to perform so well this week. Finding the needle in the haystacks of unstructured data from 200 million pages of text stored in its system, combined with the ability to apprehend the interrelationships of meaning and subtle nuance, resulted in an impressive technology demonstration. Certainly, this new technology will be powerful for a variety of use cases across a broad set of industries!
To learn more, read the Arizona Daily Star's article [After 'Jeopardy!' win, IBM program steps out].
technorati tags: IBM, Watson, Jeopardy, Challenge, John Webster, CNET, BI, data mining, Text Mining, OLAP, Arizona, Daily Star
A reader of my blog asked me what seemed like a simple enough question:
Whatever happened to Lotus Approach? I loved that personal db. (thoughit's been awhile...)
Of course, researching an answer, I encountered some interesting new information. Interestingly, everyone tries to "read between the lines" and tries to determine what solution is best.
From a colleague from Lotus:
You can still get [Lotus Approach] as part of Smartsuite.
However, I have to assume his real question is ... "what is the quick and easy way for me to build a lightweight database app like Microsoft Access that I can distribute as a standalone executable?"
To which I would say "Lotus has a program called Approach, which is part of Lotus SmartSuite, which some people still use. However, a lot of the focus in IBM now centers around the lightweight Cloudscape database which IBM acquired from Informix, which is now known as the [open source project called Derby]. Many IBM and Lotus products, such as Lotus Expeditor use the JDBC connection to Derby, which allows you to use Windows, Linux, Flash, etc. ... with no vendor lock in".
I am familiar with Cloudscape, and I evaluated it as a potential database for IBM TotalStorage Productivity Center, when I was the lead architect defining the version 1 release. It runs entirely on Java, which is both a plus and minus. Plus in that it runs anywhere Java runs, but a minus in that it is not optimized for high performance or large scalability. Because of this, we decided instead on using the full commercial DB2 database instead for Productivity Center.
Not to be undone, my colleagues over at DB2 offered a different alternative, [DB2 Express-C], which runs on a variety of Windows, Linux-x86, and Linux on POWER platforms. It is "free" as in beer, not free as in speech, which means you can download and use it today at no charge, and even ship products with it included, but you are not allowed to modify and distribute altered versions of it, as you can with "free as in speech" open source code, as in the case of Derby above (see [Apache License 2.0"] for details).
(If you have no idea what I am talking about in my distinction between "free speech" and "free beer", see Simon Phipps' article[Perspective: Free speech, free beer and free software] orthe definition from the [Free Software Foundation].)
As I see it, DB2 Express-C has two key advantages. First, if you like the free version, you can purchase a "support contract" for those that need extra hand-holding, or are using this as part of a commercial business venture. Second,for those who do prefer vendor lock-in, it is easyto upgrade Express-C to the full IBM DB2 database product, so if you are developing a product intended for use with DB2, you can develop it first with DB2 Express-C, and migrate up to full DB2 commercial version when you are ready.
This is perhaps more information than you probably expected for such a simple question. Meanwhile, I am stilltrying to figure out MySQL as part of my [OLPC volunteer project].
technorati tags: IBM, Lotus, Approach, SmartSuite, TotalStorage, Productivity Center, Cloudscape, Apache, Derby, free, speech, beer, DB2, Express-C, Windows, Linux, POWER, open source
Last week, I presented IBM's strategic initiative, the IBM Information Infrastructure, which is part of IBM's New Enterprise Data Center vision. This week, I will try to get around to talking about some of theproducts that support those solutions.
There has been a lot of attention on XIV in the past few weeks, so I will start with that. Steve Duplessie, anIT industry analyst from Enterprise Strategy Group (ESG) had a post [Adaptec buys Aristos, Tom Cruise, XIV, and Logical Assumptions] with some interesting observations and some sage advice.Val Bercovici on his NetApp Exposed blog, has a post [Has Storage Swift-Blogging Finally Jumped the Shark?] which blasts EMC for their negativity.
(For those not in the USA, swift-blogging is a reference tofalse accusations and negative remarks made during the U.S. 2004 presidential election by the[Swift Boat Veterans], and ["jumping the shark"] is a reference to [a TV show that ran out of interesting and relevant topics].For movie sequels, the comparable phrase is ["nuke the fridge"] in reference to the most recent Indiana Jones' movie.)
I was going to set the record straight on a variety of misunderstandings, rumors or speculations, but I think most have been taken care of already. IBM blogger BarryW covered the fact that SVC now supports XIV storage systems, in his post[SVC and XIV],and addressed some of the FUD already. Here was my list:
- Now that IBM has an IBM-branded model of XIV, IBM will discontinue (insert another product here)
I had seen speculation that XIV meant the demise of the N series, the DS8000 or IBM's partnership with LSI.However, the launch reminded people that IBM announced a new release of DS8000 features, new models of N series N6000,and the new DS5000 disk, so that squashes those rumors.
- IBM XIV is a (insert tier level here) product
While there seems to be no industry-standard or agreement for what a tier-1, tier-2 or tier-3 disk system is, there seemed to be a lot of argument over what pigeon-hole category to put IBM XIV in. No question many people want tier-1 performance and functionality at tier-2 prices, and perhaps IBM XIV is a good step at giving them this. In some circles, tier-1 means support for System z mainframes. The XIV does not have traditional z/OS CKD volume support, but Linux on System z partitions or guests can attach to XIV via SAN Volume Controller (SVC), or through NFS protocol as part of the Scale-Out File Services (SoFS) implementation.
Whenever any radicalgame-changing technology comes along, competitors with last century's products and architectures want to frame the discussion that it is just yet another storage system. IBM plans to update its Disk Magic and otherplanning/modeling tools to help people determine which workloads would be a good fit with XIV.
- IBM XIV lacks (insert missing feature here) in the current release
I am glad to see that the accusations that XIV had unprotected, unmirrored cache were retracted. XIV mirrors all writes in the cache of two separate modules, with ECC protection. XIV allows concurrent code loadfor bug fixes to the software. XIV offers many of the features that people enjoy in other disksystems, such as thin provisioning, writeable snapshots, remote disk mirroring, and so on.IBM XIV can be part of a bigger solution, either through SVC, SoFS or GMAS that provide thebusiness value customers are looking for.
- IBM XIV uses (insert block mirroring here) and is not as efficient for capacity utilization
It is interesting that this came from a competitor that still recommends RAID-1 or RAID-10 for itsCLARiiON and DMX products.On the IBM XIV, each 1MB chunk is written on two different disks in different modules. When disks wereexpensive, how much usable space for a given set of HDD was worthy of argument. Today, we sell you abig black box, with 79TB usable, for (insert dollar figure here). For those who feel 79TB istoo big to swallow all at once, IBM offers "capacity on demand" pricing, where you can pay initially for as littleas 22TB, but get all the performance, usability, functionality and advanced availability of the full box.
- IBM XIV consumes (insert number of Watts here) of energy
For every disk system, a portion of the energy is consumed by the number of hard disk drives (HDD) andthe remainder to UPS, power conversion, processors and cache memory consumption. Again, the XIV is a bigblack box, and you can compare the 8.4 KW of this high-performance, low-cost storage one-frame system with thewattage consumed by competitive two-frame (sometimes called two-bay) systems, if you are willing to take some trade-offs. To getcomparable performance and hot-spot avoidance, competitors may need to over-provision or use faster, energy-consuming FC drives, and offer additional software to monitor and re-balance workloads across RAID ranks.To get comparable availability, competitors may need to drop from RAID-5 down to either RAID-1 or RAID-6.To get comparable usability, competitors may need more storage infrastructure management software to hide theinherent complexity of their multi-RAID design.
Of course, if energy consumption is a major concern for you, XIV can be part of IBM's many blended disk-and-tapesolutions. When it comes to being green, you can't get any greener storage than tape! Blended disk-and-tapesolutions help get the best of both worlds.
Well, I am glad I could help set the record straight. Let me know what other products people you would like me to focus on next.
technorati tags: IBM, XIV, disk, storage, system, Steve Duplessie, ESG, Val Bercovici, NetApp, BarryW, SVC, DS8000, N6000, DS5000, mainframe, z/OS, CKD, SoFS, NFS, ECC, HDD, RAID, UPS, availability, reliability, performance, usability, blended disk-and-tape, green
It's official! My "blook" Inside System Storage - Volume I
is now available.
|This blog-based book, or “blook”, comprises the first twelve months of posts from this Inside System Storage blog,165 posts in all, from September 1, 2006 to August 31, 2007. Foreword by Jennifer Jones. 404 pages.|
- IT storage and storage networking concepts
- IBM strategy, hardware, software and services
- Disk systems, Tape systems, and storage networking
- Storage and infrastructure management software
- Second Life, Facebook, and other Web 2.0 platforms
- IBM’s many alliances, partners and competitors
- How IT storage impacts society and industry
You can choose between hardcover (with dust jacket) or paperback versions:
This is not the first time I've been published. I have authored articles for storage industry magazines, written large sections of IBM publications and manuals, submitted presentations and whitepapers to conference proceedings, and even had a short story published with illustrations by the famous cartoon writer[Ted Rall].
But I can say this is my first blook, and as far as I can tell, the first blook from IBM's many bloggers on DeveloperWorks, and the first blook about the IT storage industry.I got the idea when I saw [Lulu Publishing] run a "blook" contest. The Lulu Blooker Prize is the world's first literary prize devoted to "blooks"--books based on blogs or other websites, including webcomics. The [Lulu Blooker Blog] lists past year winners. Lulu is one of the new innovative "print-on-demand" publishers. Rather than printing hundredsor thousands of books in advance, as other publishers require, Lulu doesn't print them until you order them.
I considered cute titles like A Year of Living Dangerously, orAn Engineer in Marketing La-La land, or Around the World in 165 Posts, but settled on a title that matched closely the name of the blog.
In addition to my blog posts, I provide additional insights and behind-the-scenes commentary. If you go to the Luluwebsite above, you can preview an entire chapter in its entirety before purchase. I have added a hefty 56-page Glossary of Acronyms and Terms (GOAT) with over 900 storage-related terms defined, which also doubles as an index back to the post (or posts) that use or further explain each term.
So who might be interested in this blook?
- Business Partners and Sales Reps looking to give a nice gift to their best clients and colleagues
- Managers looking to reward early-tenure employees and retain the best talent
- IT specialists and technicians wanting a marketing perspective of the storage industry
- Mentors interested in providing motivation and encouragement to their proteges
- Educators looking to provide books for their classroom or library collection
- Authors looking to write a blook themselves, to see how to format and structure a finished product
- Marketing personnel that want to better understand Web 2.0, Second Life and social networking
- Analysts and journalists looking to understand how storage impacts the IT industry, and society overall
- College graduates and others interested in a career as a storage administrator
And yes, according to Lulu, if you order soon, you can have it by December 25.
technorati tags: IBM, blook, Volume I, Jennifer Jones, system, storage, strategy, hardware, software, services, disk, tape, networking, SAN, secondlife, Web2.0, facebook, Lulu, publishing, Blooker Prize, articles, magazines, proceedings, Ted Rall, insights, glossary, early-tenure, mentors, library, classroom, administrator, print, publish, on demand