Wrapping up my week's theme of storage optimization, I thought I would help clarify the confusion between data reduction and storage efficiency. I have seen many articles and blog posts that either use these two terms interchangeably, as if they were synonyms for each other, or as if one is merely a subset of the other.
- Data Reduction is LOSSY
By "Lossy", I mean that reducing data is an irreversible process. Details are lost, but insight is gained. In his paper, [Data Reduction Techniques", Rajana Agarwal defines this simply:
"Data reduction techniques are applied where the goal is to aggregate or amalgamate the information contained in large data sets into manageable (smaller) information nuggets."
Data reduction has been around since the 18th century.
Take for example this histogram from [SearchSoftwareQuality.com]. We have reduced ninety individual student scores, and reduced them down to just five numbers, the counts in each range. This can provide for easier comprehension and comparison with other distributions.
The process is lossy. I cannot determine or re-create an individual student's score from these five histogram values.
This next example, complements of [Michael Hardy], represents another form of data reduction known as ["linear regression analysis"]. The idea is to take a large set of data points between two variables, the x axis along the horizontal and the y axis along the vertical, and find the best line that fits. Thus the data is reduced from many points to just two, slope(a) and intercept(b), resulting in an equation of y=ax+b.
The process is lossy. I cannot determine or re-create any original data point from this slope and intercept equation.
In this last example, from [Yahoo Finance], reduces millions of stock trades to a single point per day, typically closing price, to show the overall growth trend over the course of the past year.
The process is lossy. Even if I knew the low, high and closing price of a particular stock on a particular day, I would not be able to determine or re-create the actual price paid for individual trades that occurred.
- Storage Efficiency is LOSSLESS
By contrast, there are many IT methods that can be used to store data in ways that are more efficient, without losing any of the fine detail. Here are some examples:
- Thin Provisioning: Instead of storing 30GB of data on 100GB of disk capacity, you store it on 30GB of capacity. All of the data is still there, just none of the wasteful empty space.
- Space-efficient Copy: Instead of copying every block of data from source to destination, you copy over only those blocks that have changed since the copy began. The blocks not copied are still available on the source volume, so there is no need to duplicate this data.
- Archiving and Space Management: Data can be moved out of production databases and stored elsewhere on disk or tape. Enough XML metadata is carried along so that there is no loss in the fine detail of what each row and column represent.
- Data Deduplication: The idea is simple. Find large chunks of data that contain the same exact information as an existing chunk already stored, and merely set a pointer to avoid storing the duplicate copy. This can be done in-line as data is written, or as a post-process task when things are otherwise slow and idle.
When data deduplication first came out, some lawyers were concerned that this was a "lossy" approach, that somehow documents were coming back without some of their original contents. How else can you explain storing 25PB of data on only 1PB of disk?
(In some countries, companies must retain data in their original file formats, as there is concern that converting business documents to PDF or HTML would lose some critical "metadata" information such as modificatoin dates, authorship information, underlying formulae, and so on.)
Well, the concern applies only to those data deduplication methods that calculate a hash code or fingerprint, such as EMC Centera or EMC Data Domain. If the hash code of new incoming data matches the hash code of existing data, then the new data is discarded and assumed to be identical. This is rare, and I have only read of a few occurrences of unique data being discarded in the past five years. To ensure full integrity, IBM ProtecTIER data deduplication solution and IBM N series disk systems chose instead to do full byte-for-byte comparisons.
- Compression: There are both lossy and lossless compression techniques. The lossless Lempel-Ziv algorithm is the basis for LTO-DC algorithm used in IBM's Linear Tape Open [LTO] tape drives, the Streaming Lossless Data Compression (SLDC) algorithm used in IBM's [Enterprise-class TS1130] tape drives, and the Adaptive Lossless Data Compression (ALDC) used by the IBM Information Archive for its disk pool collections.
Last month, IBM announced that it was [acquiring Storwize. It's Random Access Compression Engine (RACE) is also a lossless compression algorithm based on Lempel-Ziv. As servers write files, Storwize compresses those files and passes them on to the destination NAS device. When files are read back, Storwize retrieves and decompresses the data back to its original form.
To read independent views on IBM's acquisition, read Lauren Whitehouse (ESG) post [Remote Another Chair, Chris Mellor (The Register) article [Storwize Swallowed], or Dave Raffo (SearchStorage.com) article [IBM buys primary data compression].
As with tape, the savings from compression can vary, typically from 20 to 80 percent. In other words, 10TB of primary data could take up from 2TB to 8TB of physical space. To estimate what savings you might achieve for your mix of data types, try out the free [Storwize Predictive Modeling Tool].
So why am I making a distinction on terminology here?
Data reduction is already a well-known concept among specific industries, like High-Performance Computing (HPC) and Business Analytics. IBM has the largest marketshare in supercomputers that do data reduction for all kinds of use cases, for scientific research, weather prediction, financial projections, and decision support systems. IBM has also recently acquired a lot of companies related to Business Analytics, such as Cognos, SPSS, CoreMetrics and Unica Corp. These use data reduction on large amounts of business and marketing data to help drive new sources of revenues, provide insight for new products and services, create more focused advertising campaigns, and help understand the marketplace better.
There are certainly enough methods of reducing the quantity of storage capacity consumed, like thin provisioning, data deduplication and compression, to warrant an "umbrella term" that refers to all of them generically. I would prefer we do not "overload" the existing phrase "data reduction" but rather come up with a new phrase, such as "storage efficiency" or "capacity optimization" to refer to this category of features.
IBM is certainly quite involved in both data reduction as well as storage efficiency. If any of my readers can suggest a better phrase, please comment below.
technorati tags: IBM, data reduction, storage efficiency, histogram, linear regression, thin provisioning, data deduplication, lossy, lossless, EMC, Centera, hash collisions, Information Archive, LTO, LTO-DC, SLDC, ALDC, compression, deduplication, Storwize, supercomputers, HPC, analytics
IBM had its big launch yesterday of the [IBM Storwize V7000 midrange disk system], and already some have discussed IBM's choice of the name. Fellow blogger Stephen Foskett has an excellent post titled
[IBM’s Storwize V7000: 100% SVC; 0% Storwize]. On The Register, Chris Mellor writes [IBM's Midrange Storage Blast - Storwize. But Without Compression]. In his latest [Friday Rant], fellow blogger Chuck Hollis (EMC) feels "the new name is cool, if a bit misleading."
In the spirit of the [HP Product Line Decoder Ring] and [Microsoft Codename Tracker], here is your quick IBM product name decoder ring:
|In English||Protocols||Which company|
|What IBM decided|
to call it
|Intelligent block-level disk array that virtualizes both internal and external disk storage||8 Gbps FCP and 1GbE iSCSI||IBM||IBM Storwize V7000 disk system|
|Real-time compression appliance for files||10GbE/1GbE CIFS and NFS||Storwize, now an IBM company||IBM Real-time Compression STN-6800 appliance|
|1GbE CIFS and NFS||IBM Real-time Compression STN-6500 appliance|
If you think this is the first time a company like IBM has pulled shenanigans with product names like this, think again. Here are a few posts that might refresh your memory:
- In my September 2006 post, [A brand by any other name...] I explain that I started blogging specifically to promote the new "IBM System Storage" product line name, part of the "IBM Systems" brand resulting from merging the "eServer" and "TotalStorage' brands.
- In my January 2007 post, [When Names Change], I explain our naming convention for our disk products, including our DS family, SAN Volume Controller and N series.
- In my February 2008 post, [Getting Off the Island], I cover how the x/p/i/z designations came about for our various IBM server product lines.
But what about acquisitions? When [IBM acquired Lotus Development Corporation], it kept the "Lotus" brand. New products that fit the "collaboration" function were put under the Lotus brand. I think most people can accept this approach.
But have we ever seen an existing product renamed to an acquired name?
In my post January 2009 post
[Congratulations to Ken on your QCC Milestone], I mentioned that my colleague Ken Hannigan worked on an internal project initially called "Workstation Data Save Facility" (WDSF) which was changed to "Data Facility Distributed Storage Manager" (DFDSM), then renamed to "ADSTAR Distributed Storage Manager" (ADSM), and finally renamed to the name it has today: IBM Tivoli Storage Manager (TSM).
Readers reminded me that [IBM acquired Tivoli Systems, Inc.] in 1996, so TSM could not have been an internally developed product. Ha! Wrong! Let's take a quick history lesson on how this came about:
- In the late 1980s, IBM Almaden research had developed a project to backup personal computers and workstations, which they called "Workstation Data Save Facility" or WDSF.
This was turned over to our development team, which immediately discarded the code, and wrote from scratch its replacmeent, called Data Facility Distributed Storage Manager (DFDSM), named similar to the Data Facility products on the mainframe (DFP, DFHSM, DFDSS). As a member of the Data Facility family, DFDSM didn't really fit. The rest processed mainframe data sets, but DFDSM processed Windows and UNIX files. That a version of DFDSM server was available to run on the mainframe was the only connection.
Then, in the early 1990s, there were discussions of possibly splitting IBM into a bunch of smaller "Baby Blues", similar to how [AT&T was split into "Baby Bells"], and how Forbes and Goldman Sachs now want to split Microsoft into [Baby Bills]. IBM considered naming the storage spin-off as ADSTAR, which stood for "Advanced Storage and Retrieval."
Pre-emptively, IBM renamed DFDSM to "ADSTAR Distributed Storage Manager" or ADSM.
- Fortunately, in 1993, IBM brought a new sheriff to town, Lou Gerstner, who quickly squashed any plans to split up IBM. He quickly realized that IBM's core strength was building integrated stacks, combining systems, software and services to solve business problems.
- In 1996, IBM acquired Tivoli Systems, Inc. to expand its "Systems Management" portfolio, and renamed ADSM over to IBM Tivoli Storage Manager, since "storage management" is an essential part of "systems management". Later, IBM TotalStorage Productivity Center would be renamed to "IBM Tivoli Storage Productivity Center."
I participated in five months of painful meetings to figure out what to name our new internally-developed midrange disk system. Since it ran SAN Volume Controller software, I pushed for keeping the SVC designation somehow. We considered DS naming convention, but the new midrange product would not fit between our existing DS5000 and DS6000 numbering scheme. A marketing agency we hired came up with nonsensical names, in the spirit of product names like Celerra, Centera and CLARiiON, using name generators like [Wordoid]. Luckily, in the nick of time, IBM acquired Storwize for its compression technology, and decided that Storwize as a name was way better fit than any of the names we came up with already.
However, the new IBM Storwize V7000 midrange product had nothing in common with the appliances acquired from Storwize, the company, so to avoid confusion, the latter products were renamed to [IBM Real-time Compression]. Fellow blogger Steven Kenniston, the Storage Alchemist from Storwize fame now part of IBM from the acquisition, gives his perspective on this in his post [Storwize – What is in a Name, Really?]. While I am often critical of the names and terms IBM uses, I have to say this last set of naming decisions makes a lot of sense to me and I support it wholeheartedly.
To learn more about the IBM Storwize V7000 midrange disk system, watch the latest videos on the IBM Virtual Briefing Center (VBC). We have a [short summary version for CFO executives] as well as a
[longer version for IT technical professionals].
technorati tags: IBM, Storwize, Storwize V7000, Stephen Foskett, decoder+ring, real-time+compression, microsoft, codename, Lou Gerstner, ADSM, TSM, SVC
Continuing my coverage of the [IBM Storage Innovation Executive Summit], that occurred May 9 in New York City, this is my third in a series of blog posts on this event.
During lunch, people were able to take a look at our solutions. Here are Dan Thompson and Brett Cooper striking a pose.
- Hyper-Efficient Backup and Recovery
The afternoon was kicked off by Dr. Daniel Sabbah, IBM General Manager of Tivoli software. He started with some shocking statistics: 42 percent of small companies have experienced data loss, 32 percent have lost data forever. IBM has a solution that offers "Unified Recovery Management". This involves a combination of periodic backups, frequent snapshots, and remote mirroring.
IBM Tivoli Storage Manager (TSM) was introduced in 1993, and was the first backup software solution to support backup to disk storage pools. Today, TSM is now also part of Cloud Computing services, including IBM Information Protection Services. IBM announced today a new bundle called IBM Storwize Rapid Application Backup, which combines IBM Storwize V7000 midrange disk system, Tivoli FlashCopy Manager, implementation services, with a full three-year hardware and software warranty. This could be used, for example, to protect a Microsoft Exchange email system with 9000 mailboxes.
IBM also announced that its TS7600 ProtecTIER data deduplication solutions have been enhanced to support many-to-many bi-direction remote mirroring. Last year, University of Pittsburgh Medical Center (UPMC) reported that they were average 24x data deduplication factor in their environment using IBM ProtecTIER.
"You are out of your mind if you think you can live without tape!"
-- Dick Crosby, Director of System Administration, Estes
The new IBM TS1140 enterprise class tape drive process 2.3 TB per hour, and provides a density of 1.2 PB per square foot. The new 3599 tape media can hold 4TB of data uncompressed, which could hold up to 10TB at a 2.5x compression ratio.
The United States Golfers Association [USGA] uses IBM's backup cloud, which manages over 100PB of data from 750 locations across five continents.
- Customer Testimonial - Graybar
Randy Miller, Manager of Technical System Administration at Graybar, provided the next client testimonial. Graybar is an employee-owned company focused on supply-chain management, serving as a distributor for electical, lighting, security, power and cooling equipment.
Their problem was that they had 240 different locations, and expecting local staff to handle tape backups was not working out well. They centralized their backups to their main data center. In the event that a system fails in one of their many remote locations, they can rebuild a new machine at their main data center across high-speed LAN, and then ship overnight to the remote location. The result, the remote location has a system up and running by 10:30am, faster than they would have had from local staff trying to figure out how to recover from tape. In effect, Graybar had implemented a "private cloud" for backup in the 1990s, long before the concept was "cool" or "popular".
In 2001, they had an 18TB SAP ERP application data repository. To back this up, they took it down for 1 minute per day, six days a week, and 15 minutes down on Sundays. The result was less than 99.8 percent availability. To fix this, they switched to XIV, and use Snapshots that are non-disruptive and do not impact application performance.
Over 85 percent of the servers at Graybar are virtualized.
Their next challenge is Disaster Recovery. Currently, they have two datacenters, one in St. Louis and the other in Kansas City. However, in the aftermath of Japan's earthquakes, they realize there is a nuclear power plan between their two locations, so a single incident could impact both data centers. They are working with IBM, their trusted advisors, to investigate a three-site solution.
This week, May 15-22, I am in Auckland, New Zealand teaching IBM Storage Top Gun sales class. Next week, I will be in Sydney, Australia.
technorati tags: IBM, summit, NYC, Daniel Sabbah, TSM, Storwize, , TS7600, ProtecTIER, TS1140, tape, USGA, Graybar, Randy Miller, SAP, ERP, Disaster Recovery, New Zealand, Australia, Top Gun
Continuing my coverage of the IT Security and Storage Expo in Brussels, Belgium, we had some great storage solutions on display at the IBM and I.R.I.S-ICT booth.
Here my IBM colleague Tom Provost is showing the front of the "Smarter Office" solution. The second photo gives the view from behind. While I always explained the solution from the front of the box, many of the more technical attendees at this conference wanted to inspect the ports in the back.
This sound-isolated 11U solution combines the following:
- The [IBM Storwize V3700] with 300GB small-form-factor (SFF) drives provides shared storage for the servers.
- Two [IBM System x3550 M4 servers] that can run VMware, Hyper-V or Linux KVM server hypervisor software for your Windows and/or Linux applications. These are two socket servers that can have up to 16 x86 cores each.
- An [IBM System x3650 M4 server] pre-installed with backup software and an integrated [IBM RDX] removable disk cartridge system. (see 2010 Sep 27 for RDX reference)
- A Juniper EX2200 switch to network the servers and storage together.
- A Local Console Manager (LCM) with rackable keyboard, video, and mouse.
In this next example, the IBM team combined a BladeCenter S chassis that can hold six blade servers, with a Storwize V7000 Unified which offers FCP, iSCSI, FCoE, NFS, CIFS, HTTPS, SCP and FTP block and file protocols.
If those configurations are too small for your needs, consider the Flex System chassis or full PureFlex system frame. The rack-mountable 10U chassis can hold the Flex System V7000 and 10 compute notes. The PureFlex frame can hold up to four of these chasses.
IBM and I.R.I.S-ICT also had an IBM XIV Gen3 and a TS3500 Tape library on display.
technorati tags: IBM, I.R.I.S.-ICT, Belgium, Storage, Expo, Tom Provost, SFF, VMware, Hyper-V, Linux KVM, RDX, Veeam, Juniper Networks, LCM, , FCP, iSCSI, FCoE, NFS, CIFS, HTTPS, SCP, FTP, Flex System, PureFlex, PureSystems, XIV Gen3, TS3500, tape library
Modified by TonyPearson
IBM Cloud announcements at Pulse 2014
Well it's Tuesday again, and you know what that means? IBM announcements! Many of the announcements were made by IBM Executives at the [IBM Pulse 2014 conference].
IBM BlueMix is the newest cloud offering from IBM, providing Platform-as-a-Service (PaaS) offering based on the Cloud Foundry open source project that promises to deliver enterprise-level features and services that are easy to integrate into cloud applications.
In partnership with Pivotal and others, [IBM is a founding member of the Cloud Foundry foundation] to create an open platform that avoids vendor lock-in. Many PaaS stacks, such as [LAMP] or [Microsoft IIS], are typically limited to a single programming language, database and web application server, but not Cloud Foundry! Here is what is supported:
Development Frameworks: Cloud Foundry supports Java™ code, Spring, Ruby, Node.js, and custom frameworks.
Application Services: Cloud Foundry offers support for MySQL, MongoDB, PostgreSQL, Redis, RabbitMQ, and custom services.
Clouds: Developers and organizations can choose to run Cloud Foundry in Public, Private, Hybrid, VMware-based and OpenStack-based clouds.
To learn more, see this article on developerWorks [What is BlueMix?]
POWER and PureApplication Patterns of Expertise on SoftLayer
By the end of 2014, IBM is investing over $1.2B to have [40 Cloud centers across five continents] for SoftLayer.
This week, my fifth-line manager Tom Rosamilia, IBM Senior Vice President IBM Systems & Technology Group and Integrated Supply Chain made two announcements at Pulse. First, in additional to x86-based servers, SoftLayer will also offer POWER-based servers to run AIX, IBM i and [Linux on POWER] applications.
Second, SoftLayer will support PureApplication Patterns of Expertise. What is a pattern of expertise? It can be as simple as a virtual machine encapsulated in [Open Virtual Format], to more dynamic architectures, packaged with required platform services, that are deployed and managed by the system according to a set of policies.
Patterns simplify and automate tasks across the lifecycle of the application. Customers and partners alike are [seeing significant reductions in cost and time] across the application lifecycle with the deployment of a PureApplication System.
Also, this week at Pulse, Robert LaBlanc, IBM Senior Vice President of Software and Cloud Solutions, announced [IBM plans to Acquire Cloudant] which offers an open, cloud Database-as-a-Service (DBaaS) that helps organizations simplify mobile, web app and big data development efforts.
Why not just use a Relational Database Management System [RDBMS], like [IBM DB2 database softwareSQL], CouchDB is known as NOSQL. DB-Engines has a great side-by-side comparison [CouchDB vs. DB2].
IBM SmartCloud Virtual Storage Center offerings
When I introduced [SmartCloud Virtual Storage Center] back in October 2012, I mentioned that it was a great solution for large enterprise that have all of their disk behind SAN Volume Controller (SVC).
To reach smaller accounts, IBM has announced two new offerings:
IBM SmartCloud Virtual Storage Entry for customers that have less than 250TB of disk behind two or four SVC nodes. It is priced per terabyte, by the amount of capacity that is virtualized.
IBM SmartCloud Virtual Storage for Storwize Family for customers that have other Storwize family products (Storwize V7000 or V5000, for example). It is priced per the number of storage enclosures that are managed by the Storwize family hardware.
To learn more about Virtual Storage Center, see the [IBM Announcement page]
I am not at Pulse 2014 this year, but I managed to watch many of these announcements on the [IBM Pulse livestream].