Safe Harbor Statement: The information on IBM products is intended to outline IBM's general product direction and it should not be relied on in making a purchasing decision. The information on the new products is for informational purposes only and may not be incorporated into any contract. The information on IBM products is not a commitment, promise, or legal obligation to deliver any material, code, or functionality. The development, release, and timing of any features or functionality described for IBM products remains at IBM's sole discretion.
Tony Pearson is an active participant in local, regional, and industry-specific interests, and does not receive any special payments to mention them on this blog.
Tony Pearson receives part of the revenue proceeds from sales of books he has authored listed in the side panel.
Tony Pearson is a Master Inventor and Senior IT Specialist for the IBM System Storage product line at the IBM Executive Briefing Center in Tucson, Arizona, and featured contributor to IBM's developerWorks. In 2011, Tony celebrated his 25th anniversary with IBM Storage on the same day as IBM's Centennial. He is author of the Inside System Storage series of books. This blog is for the open exchange of ideas relating to storage and storage networking hardware, software and services. You can also follow him on Twitter @az990tony.
(Short URL for this blog: ibm.co/Pearson)
Wrapping up my week's theme of storage optimization, I thought I would help clarify the confusion between data reduction and storage efficiency. I have seen many articles and blog posts that either use these two terms interchangeably, as if they were synonyms for each other, or as if one is merely a subset of the other.
Data Reduction is LOSSY
By "Lossy", I mean that reducing data is an irreversible process. Details are lost, but insight is gained. In his paper, ["Data Reduction Techniques"], Rajana Agarwal defines this simply:
"Data reduction techniques are applied where the goal is to aggregate or amalgamate the information contained in large data sets into manageable (smaller) information nuggets."
Data reduction has been around since the 18th century.
Take for example this histogram from [SearchSoftwareQuality.com]. We have reduced ninety individual student scores down to just five numbers, the counts in each range. This can provide for easier comprehension and comparison with other distributions.
The process is lossy. I cannot determine or re-create an individual student's score from these five histogram values.
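To make the point concrete, here is a minimal sketch in Python. The bin width, the five ranges, and the two sets of scores are my own assumptions for illustration, not the data behind the histogram above. Two different classes reduce to the same five counts, so the counts alone cannot tell you which scores produced them.

```python
def histogram(scores, bin_width=20, num_bins=5):
    """Reduce a list of 0-100 scores to a count per equal-width range."""
    counts = [0] * num_bins
    for score in scores:
        # A score of exactly 100 belongs in the top range, not a sixth one.
        index = min(score // bin_width, num_bins - 1)
        counts[index] += 1
    return counts


# Hypothetical score lists, chosen so that they differ but bin identically.
class_a = [55, 62, 71, 78, 85, 91]
class_b = [59, 60, 79, 79, 80, 99]

print(histogram(class_a))
print(histogram(class_b))
print(class_a == class_b)
```

Both calls print the same five counts even though the underlying scores differ, which is exactly why the original scores cannot be recovered.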
This next example, compliments of [Michael Hardy], represents another form of data reduction known as ["linear regression analysis"]. The idea is to take a large set of data points between two variables, the x axis along the horizontal and the y axis along the vertical, and find the best line that fits. Thus the data is reduced from many points to just two numbers, slope (a) and intercept (b), resulting in an equation of y=ax+b.
The process is lossy. I cannot determine or re-create any original data point from this slope and intercept equation.
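As a rough illustration, the sketch below fits a line using ordinary least squares, one common way to find the "best" line. The function name and the five sample points are made up for this example.

```python
def fit_line(points):
    """Ordinary least-squares fit of y = a*x + b to a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Slope is the covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


# Hypothetical data points that do not lie exactly on any one line.
points = [(1, 2), (2, 3), (3, 5), (4, 4), (5, 6)]
a, b = fit_line(points)
print(f"y = {a:.2f}x + {b:.2f}")

# Evaluating the line at each x gives the fitted values, not the originals.
residuals = [y - (a * x + b) for x, y in points]
print(residuals)
```

The non-zero residuals are the detail that is thrown away: many different sets of points would produce the same slope and intercept.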
This last example, from [Yahoo Finance], reduces millions of stock trades to a single point per day, typically the closing price, to show the overall growth trend over the course of the past year.
The process is lossy. Even if I knew the low, high and closing price of a particular stock on a particular day, I would not be able to determine or re-create the actual price paid for individual trades that occurred.
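A small sketch of the same idea, using invented trade prices rather than real market data: two days with different individual trades collapse to the same low, high and closing price.

```python
def summarize_day(trade_prices):
    """Reduce one day's trades (in time order) to low, high and closing price."""
    if not trade_prices:
        raise ValueError("no trades to summarize")
    return {
        "low": min(trade_prices),
        "high": max(trade_prices),
        "close": trade_prices[-1],  # the last trade of the day
    }


# Hypothetical trade prices for two different days.
day_1 = [10.0, 12.5, 9.5, 11.0]
day_2 = [12.5, 9.5, 10.25, 10.75, 11.0]

print(summarize_day(day_1))
print(summarize_day(day_2))
```

Both summaries are identical, so the three numbers say nothing about how many trades occurred or at what prices.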
Storage Efficiency is LOSSLESS
By contrast, there are many IT methods that can be used to store data in ways that are more efficient, without losing any of the fine detail. Here are some examples:
Thin Provisioning: Instead of storing 30GB of data on 100GB of disk capacity, you store it on 30GB of capacity. All of the data is still there, just none of the wasteful empty space.
Space-efficient Copy: Instead of copying every block of data from source to destination, you copy over only those blocks that have changed since the copy began. The blocks not copied are still available on the source volume, so there is no need to duplicate this data.
Archiving and Space Management: Data can be moved out of production databases and stored elsewhere on disk or tape. Enough XML metadata is carried along so that there is no loss in the fine detail of what each row and column represent.
Data Deduplication: The idea is simple. Find large chunks of data that contain the same exact information as an existing chunk already stored, and merely set a pointer to avoid storing the duplicate copy. This can be done in-line as data is written, or as a post-process task when things are otherwise slow and idle.
When data deduplication first came out, some lawyers were concerned that this was a "lossy" approach, that somehow documents were coming back without some of their original contents. How else can you explain storing 25PB of data on only 1PB of disk?
(In some countries, companies must retain data in their original file formats, as there is concern that converting business documents to PDF or HTML would lose some critical "metadata" information such as modification dates, authorship information, underlying formulae, and so on.)
Well, the concern applies only to those data deduplication methods that calculate a hash code or fingerprint, such as EMC Centera or EMC Data Domain. If the hash code of new incoming data matches the hash code of existing data, then the new data is discarded and assumed to be identical. This is rare, and I have only read of a few occurrences of unique data being discarded in the past five years. To ensure full integrity, IBM ProtecTIER data deduplication solution and IBM N series disk systems chose instead to do full byte-for-byte comparisons.
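Here is a minimal sketch of the difference, written in Python. It is not how any IBM or EMC product is implemented; the class name, the fixed 4KB chunks and the choice of SHA-256 as the fingerprint are all assumptions for illustration. The fingerprint is used only to find candidate duplicates, and a chunk is treated as a duplicate only after a byte-for-byte comparison succeeds.

```python
import hashlib


class DedupStore:
    """Toy chunk store: fingerprint lookup plus byte-for-byte verification."""

    def __init__(self):
        self._chunks = {}       # fingerprint -> list of distinct chunks
        self.logical_bytes = 0  # bytes written by clients
        self.stored_bytes = 0   # bytes actually kept

    def write(self, chunk):
        """Store a chunk and return a pointer of the form (fingerprint, slot)."""
        self.logical_bytes += len(chunk)
        fingerprint = hashlib.sha256(chunk).hexdigest()
        candidates = self._chunks.setdefault(fingerprint, [])
        for slot, existing in enumerate(candidates):
            if existing == chunk:  # byte-for-byte check, not just a hash match
                return (fingerprint, slot)
        # New data (or a hash collision): keep it rather than discard it.
        candidates.append(chunk)
        self.stored_bytes += len(chunk)
        return (fingerprint, len(candidates) - 1)

    def read(self, pointer):
        """Return the exact bytes that were written for this pointer."""
        fingerprint, slot = pointer
        return self._chunks[fingerprint][slot]


store = DedupStore()
chunks = [b"A" * 4096, b"B" * 4096, b"A" * 4096]
pointers = [store.write(chunk) for chunk in chunks]
print(store.logical_bytes, store.stored_bytes)
```

Every chunk reads back exactly as written, yet the repeated chunk is stored only once.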
Compression: There are both lossy and lossless compression techniques. The lossless Lempel-Ziv algorithm is the basis for the LTO-DC algorithm used in IBM's Linear Tape Open [LTO] tape drives, the Streaming Lossless Data Compression (SLDC) algorithm used in IBM's [Enterprise-class TS1130] tape drives, and the Adaptive Lossless Data Compression (ALDC) used by the IBM Information Archive for its disk pool collections.
Last month, IBM announced that it was [acquiring Storwize]. Its Random Access Compression Engine (RACE) is also a lossless compression algorithm based on Lempel-Ziv. As servers write files, Storwize compresses those files and passes them on to the destination NAS device. When files are read back, Storwize retrieves and decompresses the data back to its original form.
As with tape, the savings from compression can vary, typically from 20 to 80 percent. In other words, 10TB of primary data could take up from 2TB to 8TB of physical space. To estimate what savings you might achieve for your mix of data types, try out the free [Storwize Predictive Modeling Tool].
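The sketch below shows both points using Python's standard zlib module. zlib implements DEFLATE, which combines LZ77 (a Lempel-Ziv method) with Huffman coding; it stands in here for the general idea and is not LTO-DC, SLDC, ALDC or RACE. The sample text and the helper name physical_tb are my own.

```python
import zlib


def physical_tb(logical_tb, savings_percent):
    """Physical space needed after compression saves the given percentage."""
    return logical_tb * (100 - savings_percent) / 100


# Highly repetitive sample data; real data will compress less well.
original = b"storage efficiency is lossless. " * 1000
compressed = zlib.compress(original)
restored = zlib.decompress(compressed)

print(len(original), len(compressed))
print(restored == original)  # lossless: every byte comes back

# The 20 to 80 percent range applied to 10TB of primary data.
print(physical_tb(10, 20), physical_tb(10, 80))
```

The round trip returns the original bytes exactly, and the last line reproduces the 8TB and 2TB figures for 10TB of primary data.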
So why am I making a distinction on terminology here?
Data reduction is already a well-known concept among specific industries, like High-Performance Computing (HPC) and Business Analytics. IBM has the largest marketshare in supercomputers that do data reduction for all kinds of use cases, for scientific research, weather prediction, financial projections, and decision support systems. IBM has also recently acquired a lot of companies related to Business Analytics, such as Cognos, SPSS, CoreMetrics and Unica Corp. These use data reduction on large amounts of business and marketing data to help drive new sources of revenues, provide insight for new products and services, create more focused advertising campaigns, and help understand the marketplace better.
There are certainly enough methods of reducing the quantity of storage capacity consumed, like thin provisioning, data deduplication and compression, to warrant an "umbrella term" that refers to all of them generically. I would prefer we do not "overload" the existing phrase "data reduction" but rather come up with a new phrase, such as "storage efficiency" or "capacity optimization" to refer to this category of features.
IBM is certainly quite involved in both data reduction as well as storage efficiency. If any of my readers can suggest a better phrase, please comment below.
The new [IBM System Storage Tape Controller 3592 Model C07] is an upgrade to the previous C06 controller. Like the C06, the new 3592-C07 can have up to four FICON (4Gbps) ports, four FC ports, and connect up to 16 drives. The difference is that the C07 supports 8Gbps speed FC ports, and can support the [new TS1140 tape drives that were announced on May 9]. A cool feature of the C07 is that it has a built-in library manager function for the mainframe. On the previous models, you had to have a separate library manager server.
Crossroads ReadVerify Appliance (3222-RV1)
IBM has entered an agreement to resell [Crossroads ReadVerify Appliance], or "RV1" for short. The RV1 is a 1U-high server with software that gathers information on the utilization, performance and health for a physical tape environment, such as an IBM TS3500 Tape Library. The RV1 also offers a feature called "ArchiveVerify" which validates long-term retention archive tapes, providing an audit trail on the readability of tape media. This can be useful for tape libraries attached behind IBM Information Archive compliance storage solution, or the IBM Scale-Out Network Attached Storage (SONAS).
As an added bonus, Crossroads has great videos! Here's one, titled [Tape Sticks]
Linear Tape File System (LTFS) Library Edition Version 2.1
While the hardware is all refreshed, the overall "scale-out" architecture is unchanged. Kudos to the XIV development team for designing a system that is based entirely on commodity hardware, allowing new hardware generations to be introduced with minimal changes to the vast number of field-proven software features like thin provisioning, space-efficient read-only and writeable snapshots, synchronous and asynchronous mirroring, and Quality of Service (QoS) performance classes.
The new XIV Gen3 features an Infiniband interconnect, faster 8Gbps FC ports, more iSCSI ports, faster motherboard and processors, SAS-NL 2TB drives, 24GB cache memory per XIV module, all in a single frame IBM rack that supports the IBM Rear Door Heat Exchanger. The results are a 2x to 4x boost in performance for various workloads. Here are some example performance comparisons:
Disclaimer: Performance is based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput that any user will experience will vary depending upon considerations such as the amount of multiprogramming in the user's job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve throughput improvements equivalent to the performance ratios stated here. Your mileage may vary.
In a Statement of Direction, IBM also has designed the Gen3 modules to be "SSD-ready" which means that you can insert up to 500GB of Solid-State drive capacity per XIV module, up to 7.5TB in a fully-configured 15 module frame. This SSD would act as an extension of DRAM cache, similar to how Performance Accelerator Modules (PAM) work on IBM N series.
IBM will continue to sell XIV Gen2 systems for the next 12-18 months, as some clients like the smaller 1TB disk drives. The new Gen3 only comes with 2TB drives. There are some clients that love the XIV so much, that they also use it for less stringent Tier 2 workloads. If you don't need the blazing speed of the new Gen3, perhaps the lower cost XIV Gen2 might be a great fit!
As if I haven't said this enough times already, the IBM XIV is a Tier-1, high-end, enterprise-class disk storage system, optimized for use with mission critical workloads on Linux, UNIX and Windows operating systems, and is the ideal cost-effective replacement for EMC Symmetrix VMAX, HDS USP-V and VSP, and HP P9000 series disk systems. Like the XIV Gen2, the XIV Gen3 can be used with IBM System i using VIOS, and with IBM System z mainframes running Linux, z/VM or z/VSE. If you run z/OS or z/TPF with Count-Key-Data (CKD) volumes and FICON attachment, go with the IBM System Storage DS8000 instead, IBM's other high-end disk system.
By combining multiple components into a single "integrated system", IBM can offer a blended disk-and-tape storage solution. This provides the best of both worlds, high speed access using disk, while providing lower costs and more energy efficiency with tape. According to a study by the Clipper Group, tape can be 23 times less expensive than disk over a 5 year total cost of ownership (TCO).
I've also covered Hierarchical Storage Management, such as my post [Seven Tiers of Storage at ABN Amro], and my role as lead architect for DFSMS on z/OS in general, and DFSMShsm in particular.
However, some explanation might be warranted in the use of these two terms in regards to SONAS. In this case, ILM refers to policy-based file placement, movement and expiration on internal disk pools. This is actually a GPFS feature that has existed for some time, and was tested to work in this new configuration. Files can be individually placed on either SAS (15K RPM) or SATA (7200 RPM) drives. Policies can be written to move them from SAS to SATA based on size, age and days non-referenced.
HSM is also a form of ILM, in that it moves data from SONAS disk to external storage pools managed by IBM Tivoli Storage Manager. A small stub is left behind in the GPFS file system indicating the file has been "migrated". Any reference to read or update this file will cause the file to be "recalled" back from TSM to SONAS for processing. The external storage pools can be disk, tape or any other media supported by TSM. Some estimate that as much as 60 to 80 percent of files on NAS have low reference and should be stored on tape instead of disk, and now SONAS with HSM makes that possible.
This distinction allows the ILM movement to be done internally, within GPFS, and the HSM movement to be done externally, via TSM. Both ILM and HSM movement take advantage of the GPFS high-speed policy engine, which can process 10 million files per node, run in parallel across all interface nodes. Note that TSM is not required for ILM movement. In effect, SONAS brings the policy-based management features of DFSMS for z/OS mainframe to all the rest of the operating systems that access SONAS.
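To illustrate how a policy might choose between the internal SAS and SATA pools (ILM) and an external pool (HSM), here is a rough Python sketch. It is not GPFS policy syntax, and the pool names, field names and thresholds are all hypothetical values I picked for the example.

```python
from dataclasses import dataclass


@dataclass
class FileInfo:
    name: str
    size_mb: int
    age_days: int
    days_since_access: int


def choose_pool(info):
    """Pick a storage pool from size, age and days non-referenced.

    All thresholds below are made-up illustrations, not SONAS defaults.
    """
    if info.days_since_access >= 180:
        # HSM: migrate to an external pool and leave a stub behind.
        return "external"
    if info.size_mb >= 1024 or info.age_days >= 90 or info.days_since_access >= 30:
        # ILM: move within the file system to the slower internal tier.
        return "sata"
    return "sas"


files = [
    FileInfo("report.doc", 2, 10, 3),
    FileInfo("video.mpg", 4096, 10, 3),
    FileInfo("old.log", 1, 400, 365),
]
for info in files:
    print(info.name, choose_pool(info))
```

In a real deployment the rules would be written in the policy language of the file system and evaluated by its policy engine; the point here is only that placement is decided per file from a few attributes.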
HTTP and NIS support
In addition to NFS v2, NFS v3, and CIFS, the SONAS v1.1.1 adds the HTTP protocol. Over time, IBM plans to add more protocols in subsequent releases. Let me know which protocols you are interested in, so I can pass that along to the architects designing future releases!
SONAS v1.1.1 also adds support for Network Information Service (NIS), a client/server based model for user administration. In SONAS, NIS is used for netgroup and ID mapping only. Authentication is done via Active Directory, LDAP or Samba PDC.
SONAS already had synchronous replication, which was limited in distance. Now, SONAS v1.1.1 provides asynchronous replication, using rsync, at the file level. This is done over Wide Area Network (WAN) across to any other SONAS at any distance.
Interface modules can now be configured with either 64GB or 128GB of cache. Storage now supports both 450GB and 600GB SAS (15K RPM) and both 1TB and 2TB SATA (7200 RPM) drives. However, at this time, an entire 60-drive drawer must be either all one type of SAS or all one type of SATA. I have been pushing the architects to allow each 10-pack RAID rank to be independently selectable. For now, a storage pod can have 240 drives, 60 drives of each type of disk, to provide four different tiers of storage. You can have up to 30 storage pods per SONAS, for a total of 7200 drives.
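The drive counts above work out as follows. This is just the arithmetic from the paragraph written as a short Python sketch, with constant names of my own choosing.

```python
DRIVES_PER_DRAWER = 60
# One drawer of each supported drive type gives four tiers per pod.
DRIVE_TYPES = ["450GB SAS", "600GB SAS", "1TB SATA", "2TB SATA"]
MAX_PODS = 30

drives_per_pod = DRIVES_PER_DRAWER * len(DRIVE_TYPES)
max_drives = drives_per_pod * MAX_PODS

print(drives_per_pod)
print(max_drives)
```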
An alternative to internal drawers of disk is a new "Gateway" iRPQ that allows the two storage nodes of a SONAS storage pod to connect via Fibre Channel to one or two XIV disk systems. You cannot mix and match: a storage pod is either all internal disk, or all external XIV. A SONAS gateway combined with external XIV is referred to as a "Smart Business Storage Cloud" (SBSC), which can be configured off premises and managed by third-party personnel so your IT staff can focus on other things.
See the Announcement Letters for the SONAS [hardware] and [software] for more details.
For those who are wondering how this positions against IBM's other NAS solution, the IBM System Storage N series, the rule of thumb is simple. If your capacity needs can be satisfied with a single N series box per location, use that. If not, consider SONAS instead. For those with non-IBM NAS filers that realize now that SONAS is a better approach, IBM offers migration services.
Both the Information Archive and the SONAS can be accessed from z/OS or Linux on System z mainframe, from "IBM i", AIX and Linux on POWER systems, all x86-based operating systems that run on System x servers, as well as any non-IBM server that has a supported NAS client.
Wrapping up my week's coverage of the IBM Pulse 2011 conference, I have had several people ask me to explain IBM's latest initiative, Smarter Computing, which IBM launched this week at this conference. Having led the IT industry through the Centralized Computing era and the Distributed Computing era, IBM is now well-positioned to help companies, governments and non-profit organizations to enter the new Smarter Computing era, focused on insight and discovery.
Centralized Computing (1952 to 1980): Thousands of IT professionals. Efficient, but only the largest companies and governments had them.
Distributed Computing (1981 to 2010): Millions of office workers. Personal computers (PC). Innovative, extending the reach to small and medium-sized businesses, but resulted in server sprawl and increased TCO.
Smarter Computing (2011 and beyond): Billions of people. Smart phones and other handheld devices. Efficient and Innovative, combining the best of centralized and distributed computing.
To help clients with this transition, IBM's Smarter Computing initiative has three main components. This is a corporate-wide strategy, with systems, software and services all working together to realize results.
The first component is Big Data. This combines three different sources of data:
Traditional structured data in OLTP databases and OLAP data warehouses, using data management solutions like DB2 and IBM Netezza.
Unstructured data, including text documents, images, audio, and video, processed with massive parallelism using IBM BigInsights and Apache Hadoop.
Real-Time Analytics Processing (RTAP) of incoming data, including video surveillance, social media, RFID chips, smart meters, and traffic control systems, processed with IBM InfoSphere Streams
Of course, Big Data will bring new opportunities on the storage front, which I will save for a future post!
Rather than general purpose IT equipment, we now have the scale and scope to specialize with systems optimized for particular workloads, the second component of the Smarter Computing initiative. Of course, IBM has been delivering integrated stacks of systems, software and services for decades now, but it is important to remind people of this, as IBM now has a spate of competitors all trying to follow IBM's lead in this arena.
As with Big Data, the focus on Optimized Systems has impacted IBM's strategy on storage as well. I'll save that discussion for a future post as well!
I am glad that nearly all of the storage vendors have standardized to a common definition for Cloud, the third component of Smarter Computing, which shows that this concept has matured:
Cloud computing is a pay-per-use model for enabling network access to a pool of computing resources that can be provisioned and released rapidly with minimal management effort or service provider interaction. -- U.S. National Institute of Standards and Technology [nist.gov]
Of course, Cloud is just an evolution of IBM's Service Bureau business of the 1960s and 1970s, renting out time-sharing on mainframe systems, Grid Computing of the 1980s, and Application Service Providers that popped up in the 1990s. While the [butchers, bakers and candlestick makers] that IBM competes against might focus their efforts on just private cloud or just public cloud, IBM recognizes the reality is that different clients will need different solutions. Rather than rip-and-replace, IBM will help clients transition to cloud via inclusive solutions that adopt a hybrid approach:
Traditional enterprise with private cloud deployments, using solutions like IBM CloudBurst, SONAS and Information Archive
Traditional enterprise with public cloud services to handle seasonal peaks, providing offsite resiliency, and solutions for a mobile workforce
Hybrid clouds that blend private and public cloud services, to handle seasonal peak workloads, remote and branch offices
IBM's emphasis on IT Infrastructure Library (ITIL), Tivoli and Maximo products will play well in this space to provide integrated service management across traditional and cloud deployments. This is why IBM decided to launch the Smarter Computing initiative at the Pulse 2011 conference, the industry's premier conference on integrated service management.
The IBM Watson that competed on Jeopardy! is an excellent example of all three components of Smarter Computing at work.
IBM Watson was able to respond to Jeopardy! clues within three seconds, processing a combination of database searches with DB2 and text-mining analytics of unstructured data with IBM BigInsights.
IBM Watson combined servers, software and storage into an integrated supercomputer that was optimized for one particular workload: playing Jeopardy!
IBM Watson used many technologies prevalent in private and public cloud computing systems, storing its data on a modified version of SONAS for storage, using xCat administration tools, networking across 10GbE Ethernet, and massive parallel processing through lots of PowerVM guest images.
In less than a month, I will be presenting at the annual IBM Storage Technical University, July 18-22, at the Hilton in Orlando, Florida. This is one of my favorite conferences! You can sign up for this at their [Online Registration Page].
I will be covering a variety of topics:
IBM Storage Strategy in the Era of Smarter Computing - After IBM has led the IT industry through the "Centralized Computing" era, and then later the "Distributed Computing" era, we are now entering the third era, that of Smarter Computing. Come learn IBM's strategy for Storage to address today's big challenges, including Big Data, Integrated Workload-optimized systems, and Cloud service delivery models.
IBM Information Archive for Email, Files and eDiscovery - This session will cover the latest announcement for our non-erasable, non-rewriteable compliance storage, the Information Archive (IA), how this can be used to protect your emails and files, and provide indexed search to assist with eDiscovery.
IBM Tivoli Storage Productivity Center Overview and Update - I was one of the original lead architects for Productivity Center. Come learn what this software is all about, and how the latest features and functions can help you manage your IT environment.
IBM SONAS and the Smart Business Storage Cloud - Confused about Cloud Computing and Cloud Storage? I will explain everything you need to know, including how the integrated SONAS appliance operates, IBM's customized solutions for private cloud deployments, and IBM's public cloud offerings.
BOF on Social Media - BOF stands for "Birds of a Feather", and is normally an after-hours discussion on a single theme. This BOF will be a four-expert Q&A panel, including myself, John Sing, Rich Swain and Ian Wright. We will discuss how we got started in Social Media, and how it has boosted our careers and our ability to get work done.