Safe Harbor Statement: The information on IBM products is intended to outline IBM's general product direction and it should not be relied on in making a purchasing decision. The information on the new products is for informational purposes only and may not be incorporated into any contract. The information on IBM products is not a commitment, promise, or legal obligation to deliver any material, code, or functionality. The development, release, and timing of any features or functionality described for IBM products remains at IBM's sole discretion.
Tony Pearson is a an active participant in local, regional, and industry-specific interests, and does not receive any special payments to mention them on this blog.
Tony Pearson receives part of the revenue proceeds from sales of books he has authored listed in the side panel.
Tony Pearson is a Master Inventor and Senior IT Specialist for the IBM System Storage product line at the
IBM Executive Briefing Center in Tucson Arizona, and featured contributor
to IBM's developerWorks. In 2011, Tony celebrated his 25th year anniversary with IBM Storage on the same day as the IBM's Centennial. He is
author of the Inside System Storage series of books. This blog is for the open exchange of ideas relating to storage and storage networking hardware, software and services. You can also follow him on Twitter @az990tony.
(Short URL for this blog: ibm.co/Pearson
Since Clod Barrera introduced IBM's Smarter Computing initiative during yesterday's keynote session, I took it to the next lower level, with a presentation on how IBM's Storage Strategy aligns with the Smarter Computing approach.
Deduplication -- It's Not Magic, It's Math!
Local IBMer Paul Rizio presented this high-level session on the concepts of data deduplication, and how it is implemented in IBM's N series, TSM and ProtecTIER virtual tape libraries. I first met Paul earlier this year when we were both instructors at Top Gun classes we held in Auckland, New Zealand and Sydney, Australia.
IBM Information Archive for files, email and eDiscovery
This was a reprise of my presentation that I gave last July in Orlando, Florida (see my blog post [IBM Storage University - Day 1]). I explained the differences between backup and archive, the differences between Tivoli Storage Manager and System Storage Archive Manager, and the Information Archive (IA) The Information Archive for files, email and eDiscovery bundle combines IA hardware with content collectors for files and email, eDiscovery analyzer and eDiscovery manager software.
What are Industry Consultants saying about IBM Storage?
Vic Peltz, from our IBM Almaden Research Center, presented this lively presentation on how IT industry analysts gather their information and structure their findings into various models. For many in the audience, this would be their first exposure to concepts like a "Magic Quadrant", "MarketScope" and the various stages of the "Hype Cycle".
IBM SONAS and the Smart Business Storage Cloud
The title of this session just rolls off my tongue, similar to "James and the Giant Peach" or "Harold and the Purple Crayon". I had presented this back in July (see my blog post [IBM Storage University - Cloud Storage]). This time, I had updated the materials to reflect the new SONAS R1.3 release, and the new IBM SmartCloud offerings announced last month.
Of course the big news is that U.S. President Barack Obama is here in Australia, with a stop in Canberra (not far from Melbourne), followed by a stop in Darwin on the north side of this country. This is his first official visit to Australia as president.
An exciting new addition to the IBM storage line, the Storwize V7000 is a very versatile and solid choice as a midrange storage device. This session will cover a technical overview of the controller as well as its positioning within the overall IBM storage line.
xST04 - XIV Implementation, Migration and Optimization
Attend this session to learn how to integrate the IBM XIV Storage System in your IT environment. After this session, you should understand where the IBM XIV Storage system fits, and understand how to take full advantage of the performance capabilities of XIV Storage by using the massive parallelism of its grid architecture. You will learn how to migrate data onto the XIV and hear about real world client experiences.
xST05 - IBM's Storage Strategy in the Smarter Computing Era
Want to understand IBM's storage strategy better? This session will cover the three key themes of IBM's Smarter Computing initiative: Big Data, Optimized Systems, and Cloud. IBM System Storage strategy has been aligned to meet the storage efficiency, data protection and retention required to meet these challenges.
IBM offers encryption in a variety of ways. Data can be encrypted on the server, in the SAN switch, or on the disk or tape drive. This session will explain how encryption works, and explain the pros and cons with each encryption option.
sAC01 - IBM Information Archive for email, Files and eDiscovery
IBM has focused on data protection and retention, and the IBM Information Archive is the ideal product to achieve it. Come to this session to discuss archive solutions, compliance regulations, and support for full-text indexing and eDiscovery to support litigation.
sGE04 - IBM's Storage Strategy in the Smarter Computing Era
Want to understand IBM's storage strategy better? This session will cover the three key themes of IBM's Smarter Computing initiative: Big Data, Optimized Systems, and Cloud. IBM System Storage strategy has been aligned to meet the storage efficiency, data protection and retention required to meet these challenges.
sSM03 - IBM Tivoli Storage Productivity Center – Overview and Update
IBM's latest release of IBM Tivoli Storage Productivity Center is v4.2.2, a storage resource management tool that manages both IBM and non-IBM storage devices, including disk systems, tape libraries, and SAN switches. This session will give an overview of the various components of Tivoli Storage Productivity Center and provide an update on what's new in this product.
sSN06 - SONAS and the Smart Business Storage Cloud (SBSC)
Confused over IBM's Cloud strategy? Trying to figure out how IBM Storage plays in private, hybrid or public cloud offerings? This session will cover both the SONAS integrated appliance and the Smart Business Storage Cloud customized solution, and will review available storage services on the IBM Cloud.
sTA01 - Tape Storage Reinvented: What's New and Exciting in the Tape World?
This very informative session will keep you up to date with the latest tape developments. These include the TS3500 tape library connector Model SC1 (Shuttle). The shuttle enables extreme scalability of over 300,000 tape cartridges in a single library image by interconnecting multiple tape libraries with a unique, high speed transport system. The world's fastest tape drive, the TS1140 3592-E07, will also be presented. The performance and functionality of the new TS1140 as well as the new 4TB tape media will be discussed. Also, the IBM System Storage Linear Tape File System (LTFS), including the Library Edition, will be presented. LTFS allows a disk-like, drag-and-drop interface for tape. This is a not-to-be-missed session for all you tape lovers out there!
In December, I will be going to Gartner's Data Center Conference in Las Vegas, but the agenda has not been finalized, so I will save that for another post.
IBM Information Archive for email, files and eDiscovery
Not too many people have heard of IBM's Smart Archive strategy and the storage products IBM offers to meet compliance regulations. This session covered the following:
The differences between backup and archive, including a few of my own personal horror stories helping companies who had foolishly thought that keeping backup copies for years would adequately serve as their archive strategy
The differences between optical media, Write-Once Read-Many (WORM) media, and Non-Erasable, Non-Rewriteable (NENR) storage options.
Why putting a [space heater] on your data center floor is a bad idea, driving up power and cooling costs for little business value to the enterprise once the unit is full of rarely accessed read-only data.
An overview of the [IBM Information Archive], an integrated stack of servers, storage and software that replaces previous offerings such as the IBM System Storage DR550 and the IBM Grid Medical Archive Solution (GMAS).
The marketing bundle known as the [Information Archive for Email, Files and eDiscovery] that combines the Information Archive storage appliance with Content Collectors for email and file systems, as well as eDiscovery tools, and implementation services for a solution that can support a small or medium size business, up to 1400 employees.
IBM Tivoli Storage Productivity Center v4.2 Overview and Update
Many of the concerns raised when I [presented v4.1 at this conference last year] were addressed this year in v4.2, including full performance statistics for IBM XIV storage system, storage resource agent support for HP-UX and Solaris, and a variety of other issues.
I presented this overview in stages:
"Productivity Center Basic Edition" that comes pre-installed on the IBM System Storage Productivity Center hardware console, that provides discover of devices, basic configuration, and a clever topology viewer of what is connected to what.
"Productivity Center for Disk" and "Productivity Center for Disk Midrange Edition (MRE)" that provides real-time and historical performance monitoring, asset and capacity reporting.
"Productivity Center for Replication" which supports monitoring, failover and failback for FlashCopy, Metro Mirror and Global Mirror on the SVC, Storwize V7000, DS8000, DS6000 and ESS 800.
"Productivity Center for Data" which supports reporting on files, file systems and databases on DAS, SAN and NAS attached storage from a Operating System viewpoint.
"Productivity Center Standard Edition" which includes all of the above except "Replication", and adds performance monitoring of SAN Fabric gear, and some very clever analytics to improver performance and problem determination.
One of the questions that came up was "How big does my company have to be to consider using Productivity Center?" which I answered as follows:
"If you are a small company, and the "IT Person" has responsibilities outside the IT, and managing the few pieces of kit is just part of his job, then consider just using the web-based GUI through a Firefox or similar browser. If you are a medium sized company with dedicated IT personnel, but mostly run by system admins or database admins that manage storage and networks on the side, you might want to consider the "Storage Control" plug-in for IBM Systems Director. But if you are larger shop, and there are employees with the title "Storage Administrator" and/or "SAN Administrator", then Tivoli Storage Productivity Center is for you."
Tivoli Storage Productivity Center has matured into a fine piece of software that truly can help medium and large sized data centers manage their storage and storage networking infrastructure.
I like speaking the first day of these events. Often people come in just to hear the keynote speakers, and stay the afternoon to hear a few break-out sessions before they leave Tuesday or Wednesday for other meetings.
The new [IBM System Storage Tape Controller 3592 Model C07] is an upgrade to the previous C06 controller. Like the C06, the new 3592-C07 can have up to four FICON (4Gbps) ports, four FC ports, and connect up to 16 drives. The difference is that the C07 supports 8Gbps speed FC ports, and can support the [new TS1140 tape drives that were announced on May 9]. A cool feature of the C07 is that it has a built-in library manager function for the mainframe. On the previous models, you had to have a separate library manager server.
Crossroads ReadVerify Appliance (3222-RV1)
IBM has entered an agreement to resell [Crossroads ReadVerify Appliance], or "RV1" for short. The RV1 is a 1U-high server with software that gathers information on the utilization, performance and health for a physical tape environment, such as an IBM TS3500 Tape Library. The RV1 also offers a feature called "ArchiveVerify" which validates long-term retention archive tapes, providing an audit trail on the readability of tape media. This can be useful for tape libraries attached behind IBM Information Archive compliance storage solution, or the IBM Scale-Out Network Attached Storage (SONAS).
As an added bonus, Crossroads has great videos! Here's one, titled [Tape Sticks]
Linear Tape File System (LTFS) Library Edition Version 2.1
While the hardware is all refreshed, the overall "scale-out" architecture is unchanged. Kudos to the XIV development team for designing a system that is based entirely on commodity hardware, allowing new hardware generations to be introduced with minimal changes to the vast number of field-proven software features like thin provisioning, space-efficient read-only and writeable snapshots, synchronous and asynchronous mirroring, and Quality of Service (QoS) performance classes.
The new XIV Gen3 features an Infiniband interconnect, faster 8Gbps FC ports, more iSCSI ports, faster motherboard and processors, SAS-NL 2TB drives, 24GB cache memory per XIV module, all in a single frame IBM rack that supports the IBM Rear Door Heat Exchanger. The results are a 2x to 4x boost in performance for various workloads. Here are some example performance comparisons:
Disclaimer: Performance is based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput that any user will experience will vary depending upon considerations such as the amount of multiprogramming in the user's job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve throughput improvements equivalent to the performance ratios stated here. Your mileage may vary.
In a Statement of Direction, IBM also has designed the Gen3 modules to be "SSD-ready" which means that you can insert up to 500GB of Solid-State drive capacity per XIV module, up to 7.5TB in a fully-configured 15 module frame. This SSD would act as an extension of DRAM cache, similar to how Performance Accelerator Modules (PAM) on IBM N series.
IBM will continue to sell XIV Gen2 systems for the next 12-18 months, as some clients like the smaller 1TB disk drives. The new Gen3 only comes with 2TB drives. There are some clients that love the XIV so much, that they also use it for less stringent Tier 2 workloads. If you don't need the blazing speed of the new Gen3, perhaps the lower cost XIV Gen2 might be a great fit!
As if I haven't said this enough times already, the IBM XIV is a Tier-1, high-end, enterprise-class disk storage system, optimized for use with mission critical workloads on Linux, UNIX and Windows operating systems, and is the ideal cost-effective replacement for EMC Symmetrix VMAX, HDS USP-V and VSP, and HP P9000 series disk systems, . Like the XIV Gen2, the XIV Gen3 can be used with IBM System i using VIOS, and with IBM System z mainframes running Linux, z/VM or z/VSE. If you run z/OS or z/TPF with Count-Key-Data (CKD) volumes and FICON attachment, go with the IBM System Storage DS8000 instead, IBM's other high-end disk system.
In less than a month, I will be presenting at the annual IBM Storage Technical University, July 18-22, at the Hilton in Orlando, Florida. This is one of my favorite conferences! You can sign up for this at their [Online Registration Page].
I will be covering a variety of topics:
IBM Storage Strategy in the Era of Smarter Computing - After IBM has led the IT industry through the "Centralized Computing" era, and then later the "Distributed Computing" era, we are now entering the third era, that of Smarter Computing. Come learn IBM's strategy for Storage to address today's big challenges, including Big Data, Integrated Workload-optimized systems, and Cloud service delivery models.
IBM Information Archive for Email, Files and eDiscovery - This session will cover the latest announcement for our non-erasable, non-rewriteable compliance storage, the Information Archive (IA), how this can be used to protect your emails and files, and provide indexed search to assist with eDiscovery.
IBM Tivoli Storage Productivity Center Overview and Update - I was one of the original lead architects for Productivity Center. Come learn what this software is all about, and how the latest features and functions can help you manager your IT environment.
IBM SONAS and the Smart Business Storage Cloud - Confused about Cloud Computing and Cloud Storage? I will explain everything you need to know, including how the integrated SONAS appliance operates, IBM's customized solutions for private cloud deployments, and IBM's public cloud offerings.
BOF on Social Media - BOF stands for "Birds of a Feather", and his normally an after-hours discussion on a single theme. This BOF will be a four-expert Q&A panel, including myself, John Sing, Rich Swain and Ian Wright. We will discuss how we got started in Social Media, and how it has boosted our careers and our ability to get work done.
Doug Balog, IBM VP and Business Level Executive for Storage, presented Smart Archiving. Citing research by Jon Toigo, Doug indicated that 40 percent of data on disk should be archived. Sadly, a vast majority of companies continue to use their backups as archives. There is a better way to do archives, to address the needs of four use cases:
The IBM Information Archive for email, files and eDiscovery offers full text indexing. A well-deployed archive strategy can save up to 60 percent in backup costs, and reduce backup times by 80 percent. IBM offers advanced analytics and visualization for archive data.
An analysis of a global insurance company found that they kept, on average, 120 copies of every email sent. This was the combination of an average of 12 copies of the email, multipled by 10 backups of the email repository.
Banjercito, a bank in Mexico, has a 10-year retention requirement from government regulations.
The new LTFS Library Edition allows Library-based access to files stored on tape cartridges. The new TS3500 Library Connector means that a single system of connected tape libraries can hold up to 2.7 Exabytes (EB) of data.
Archive Industry Perspectives
Steve Duplessie from Enterprise Strategy Group [ESG] gave his views on the challenges of volume, access and cost. His definition for archive: the long term retention of information on a separate environment for compliance, eDiscovery and business reference purposes. Steve advocates a purpose-built solutiion for archive. There are three major challenges for implementing an archive solution:
Getting Participation -- Steve feels that key stakeholders have inappropriate expectations of what archive is, or can be.
Define Tasks -- Steve argues that archive is very much a process-oriented approach, and tasks must fit business process and procedures
Prepare for Future Content Types -- the frequent change of standard and proprietary data types poses a real challenge for long term retention of data
For example, the Financial Industry Regulatory Authority [FINRA] oversee 4,000 brokerage firms, and 600,000 broker/dealers. They have mandated the storing of digital data related to stock trades, and this can include text messages, voice messages, and emails. They continue to expand this definition, so soon this could include tweets on Twitter, for example.
Steve feels there are four key requirements for archive:
Support for email, such as an email application plug-in
Off-line access to archived data
Support for mobile devices, such as smartphones
Basic search capabilities
Companies are starting to take archive seriously. About 35 percent of firms surveyed have adopted archive, and another 36 percent plan to in the next 12-24 months. Enterprise archive has grown over 200 percent from 2007 to 2009. Steve agrees that not everything needs to be stored on disk. Retention periods greater than six years dictates the need for tape.
Current systems may not meet today's requirements. Data loss and downtime costs have skyrocketed. Data Protection and Retention projects can represent a gold mine of savings, new capabilities can greatly lower costs, allowing companies to shift resources over to revenue generation.
Big Data, New Physics and Geospatial Super-Food
I would vote this the best session of the day! For all those confused on what the heck "Big Data" means, Jeff has the best explanation. Jeff Jonas is an IBM Distinguished Engineer and the Chief Scientist of Entity Analytics. He had just finished his 17th marathon on Saturday, and his fingers were bandaged.
Jeff had founded the Systems Research and Design (SR&D) company, known for creating NORA (non-obvious relationship awareness) used by Las Vegas casinos to identify fraud. SR&D was acquired by IBM back in 2005. Jeff is focused on sensemaking of streams. He feels many companies are suffering from "Enterprise Amnesia".
"The data must find the data .. and the relevance must find the user."
-- Jeff Jonas
Jeff's metaphor to Big Data is a jigsaw puzzle without the picture on the outside of the box. To demonstrate his point, he presented a pile of jigsaw puzzle pieces and asked four teenagers to put the puzzle together without the advantage of the picture on the box. What he had not told them was that he mixed four different puzzles together, removing out 10 to 20 percent of the pieces from each puzzle. He also added some duplicate pieces from a second identical puzzle, and just to make things fun, included a dozen pieces from a sixth puzzle just to mess with their heads. Within a few hours, the kids had managed to figure out that there were four puzzles, that there were duplicate pieces, and that there were some pieces that did not fit any of the four puzzles.
"You can't squeeze knowledge from a pixel."
-- Jeff Jonas
This approach favors false negatives. New observations reverse out old conceptions. As the picture emerges, this provides added focus on new information. More data can provide better predictions. "Bad" data, including misspelled words and mis-coded categories, was often discarded or corrected on the basis of "Garbage-In, Garbage Out", but can now be useful in a Big Data perspective.
Take for example the 600 billion recordings of the "location data" captured on cell phones every day. With regular triangulation of cell phone towers, the information can pinpoint you within 60 meters, add GPS and this improved to within 20 meters, and add Wi-Fi is further improved to 10 meters. While this data is "de-identified" so as not to identify individual users, the process of re-identification is relatively trivial. Jeff's system is able to predict a person will be next Thursday at 5:35pm with 87 percent accuracy.
Thus, Big Data represents an asset, accumulation of context. Real-time analytics can be a competitive advantage. These streams of data will need persistent storage and massive I/O capabilities. In one example, Jeff processed 4,200 separate sources of information and was able to identify "dead votes". These are votes cast by people that died in years prior, indicating voter fraud.
Jeff's latest project, codenamed G2, will tackle not just people, but everything from proteins to asteroids.
Normally, the worst time slot is the hour after lunch, but these presentations kept people's attention.
Wrapping up my week's coverage of the IBM Pulse 2011 conference, I have had several people ask me to explain IBM's latest initiative, Smarter Computing, which IBM launched this week at this conference. Having led the IT industry through the Centralized Computing era and the Distributed Computing era, IBM is now well-positioned to help companies, governments and non-profit organizations to enter the new Smarter Computing era, focused on insight and discovery.
Thousands of IT professionals
Effiicent, but only the largest companies and governments had them
Millions of office workers
Personal computers (PC)
Innovative, extending the reach to small and medium-sized businesses, but resulted in server sprawl and increased TCO
Billions of people
Smart phones and other handheld devices
Efficient and Innovative, combining the best of centralized and distributed computing
1952 to 1980
1981 to 2010
2011 and beyond
To help clients with this transition, IBM's Smarter Computing initiative has three main components. This is a corporate-wide strategy, with systems, software and services all working together to realize results.
The first component is Big Data. This combines three different sources of data:
Traditional structured data in OLTP databases and OLAP data warehouses, using data management solutions like DB2 and IBM Netezza.
Unstructured data, including text documents, images, audio, and video, processed with massive parallelism using IBM BigInsights and Apache Hadoop.
Real-Time Analytics Processing (RTAP) of incoming data, including video surveillance, social media, RFID chips, smart meters, and traffic control systems, processed with IBM InfoSphere Streams
Of course, Big Data will bring new opportunities on the storage front, which I will save for a future post!
Rather than general purpose IT equipment, we have now the scale and scope to specialize with systems optimized for particular workloads, the second component of the Smarter Computing initiative. Of course, IBM has been delivering integrated stacks of systems, software and services for decades now, but it is important to remind people of this, as IBM now has a spate of competitors all trying to follow IBM's lead in this arena.
As with Big Data, the focus on Optimized Systems has impacted IBM's strategy on storage as well. I'll save that discussion for a future post as well!
I am glad that nearly all of the storage vendors have standardized to a common definition for Cloud, the third component of Smarter Computing, which shows that this concept has matured:
Cloud computing is a pay-per-use model for enabling network access to a pool of computing resources that can be provisioned and released rapidly with minimal management effort or service provider interaction. -- U.S. National Institute of Standards and Technology [nist.gov]
Of course, Cloud is just an evolution of IBM's Service Bureau business of the 1960s and 1970s, renting out time-sharing on mainframe systems, Grid Computing of the 1980s, and Application Service Providers that popped up in the 1990s. While the [butchers, bakers and candlestick makers] that IBM competes against might focus their efforts on just private cloud or just public cloud, IBM recognizes the reality is that different clients will need different solutions. Rather than rip-and-replace, IBM will help clients transition to cloud via inclusive solutions that adopt a hybrid approach:
Traditional enterprise with private cloud deployments, using solutions like IBM CloudBurst, SONAS and Information Archive
Traditional enterprise with public cloud services to handle seasonable peaks, providing offsite resiliency, and solutions for a mobile workforce
Hybrid clouds that blend private and public cloud services, to handle seasonal peak workloads, remote and branch offices
IBM's emphasis on IT Infrastructure Library (ITIL), Tivoli and Maximo products will play well in this space to provide integrated service management across traditional and cloud deployments. This is why IBM decided to launch Smarter Computing initiative at Pulse 2011 conference, the industry's premiere conference on intergrated service management.
The IBM Watson that competed on Jeopardy! is an excellent example of all three components of Smarter Computing at work.
IBM Watson was able to respond to Jeopardy! clues within three seconds, processing a combination of database searches with DB2 and text-mining analytics of unstructured data with IBM BigInsights.
IBM Watson combined servers, software and storage into an integrated supercomputer that was optimized for one particular workload: playing Jeopardy!
IBM Watson used many technologies prevalent in private and public cloud computing systems, storing its data on a modified version of SONAS for storage, using xCat administration tools, networking across 10GbE Ethernet, and massive parallel processing through lots of PowerVM guest images.
It's Tuesday again, and that means one thing.... IBM Announcements! On the heels of [last week's announcements], IBM announced some additional products of interest to storage administrators.
IBM Information Archive
Back in 2008, IBM [unveiled the Information Archive]. This storage solution provides automated policy-based tiering between disk and tape, with non-erasable non-rewriteable enforcement to protect against unethical tampering of data. The initial release supported [both files and object storage], with support for different collections, each with its own set of policies for management. However, it only supported NFS initially for the file protocol. Today, IBM announces the addition of CIFS protocol support, which will be especially helpful in healthcare and life sciences, as much of the medical equipment is designed for CIFS protocol storage.
Also, Information Archive will now provide a full index and search feature capability to help with e-Discovery. Searches and retrievals can be done in the background without disrupting applications or the archiving operations.
IBM Tivoli Storage Manager for Virtual Environments V6.2 extends capabilities that currently exist in IBM Tivoli Storage Manager. TSM backup/archive clients run fine on guest operating systems, but now this new extension improves backup for VMware environments. TSM provides incremental block-level backups utilizing VMware's vStorage APIs for Data Protection and Changed Block Tracking features.
To minimize impact to the VMware host, TSM for VE make use of non-disruptive snapshots and offload the backup processing to a vStorage backup server. This supports file-level recovery, volume-level recovery, and full VM recovery. Of course, since it is based on TSM v6, you get advanced storage efficiency features such as compression and deduplication to minimize consumption of disk storage pools.
IBM Tivoli Monitor has been extended to support virtual servers, including VMware, Linux KVM, and Citrix XenServer. This can help with capacity planning, performance monitoring, and availability. Tivoli Monitor will help you understand the relationships between physical and virtual resources to help isolate problems to the correct resource, reducing the time it takes for debug issues between servers and storage. See the
Next week is [IBM Pulse2011 Conference] in Las Vegas, February 27 to March 2. Sorry, I don't plan to be there this year. It is looking to be a great conference, with fellow inventor Dean Kamen as the keynote speaker. For a blast from the past, read my blog posts from Pulse2008 [Main Tent sessions] and [Breakout sessions].
Wrapping up my week's theme of storage optimization, I thought I would help clarify the confusion between data reduction and storage efficiency. I have seen many articles and blog posts that either use these two terms interchangeably, as if they were synonyms for each other, or as if one is merely a subset of the other.
Data Reduction is LOSSY
By "Lossy", I mean that reducing data is an irreversible process. Details are lost, but insight is gained. In his paper, [Data Reduction Techniques", Rajana Agarwal defines this simply:
"Data reduction techniques are applied where the goal is to aggregate or amalgamate the information contained in large data sets into manageable (smaller) information nuggets."
Data reduction has been around since the 18th century.
Take for example this histogram from [SearchSoftwareQuality.com]. We have reduced ninety individual student scores, and reduced them down to just five numbers, the counts in each range. This can provide for easier comprehension and comparison with other distributions.
The process is lossy. I cannot determine or re-create an individual student's score from these five histogram values.
This next example, complements of [Michael Hardy], represents another form of data reduction known as ["linear regression analysis"]. The idea is to take a large set of data points between two variables, the x axis along the horizontal and the y axis along the vertical, and find the best line that fits. Thus the data is reduced from many points to just two, slope(a) and intercept(b), resulting in an equation of y=ax+b.
The process is lossy. I cannot determine or re-create any original data point from this slope and intercept equation.
In this last example, from [Yahoo Finance], reduces millions of stock trades to a single point per day, typically closing price, to show the overall growth trend over the course of the past year.
The process is lossy. Even if I knew the low, high and closing price of a particular stock on a particular day, I would not be able to determine or re-create the actual price paid for individual trades that occurred.
Storage Efficiency is LOSSLESS
By contrast, there are many IT methods that can be used to store data in ways that are more efficient, without losing any of the fine detail. Here are some examples:
Thin Provisioning: Instead of storing 30GB of data on 100GB of disk capacity, you store it on 30GB of capacity. All of the data is still there, just none of the wasteful empty space.
Space-efficient Copy: Instead of copying every block of data from source to destination, you copy over only those blocks that have changed since the copy began. The blocks not copied are still available on the source volume, so there is no need to duplicate this data.
Archiving and Space Management: Data can be moved out of production databases and stored elsewhere on disk or tape. Enough XML metadata is carried along so that there is no loss in the fine detail of what each row and column represent.
Data Deduplication: The idea is simple. Find large chunks of data that contain the same exact information as an existing chunk already stored, and merely set a pointer to avoid storing the duplicate copy. This can be done in-line as data is written, or as a post-process task when things are otherwise slow and idle.
When data deduplication first came out, some lawyers were concerned that this was a "lossy" approach, that somehow documents were coming back without some of their original contents. How else can you explain storing 25PB of data on only 1PB of disk?
(In some countries, companies must retain data in their original file formats, as there is concern that converting business documents to PDF or HTML would lose some critical "metadata" information such as modificatoin dates, authorship information, underlying formulae, and so on.)
Well, the concern applies only to those data deduplication methods that calculate a hash code or fingerprint, such as EMC Centera or EMC Data Domain. If the hash code of new incoming data matches the hash code of existing data, then the new data is discarded and assumed to be identical. This is rare, and I have only read of a few occurrences of unique data being discarded in the past five years. To ensure full integrity, IBM ProtecTIER data deduplication solution and IBM N series disk systems chose instead to do full byte-for-byte comparisons.
Compression: There are both lossy and lossless compression techniques. The lossless Lempel-Ziv algorithm is the basis for LTO-DC algorithm used in IBM's Linear Tape Open [LTO] tape drives, the Streaming Lossless Data Compression (SLDC) algorithm used in IBM's [Enterprise-class TS1130] tape drives, and the Adaptive Lossless Data Compression (ALDC) used by the IBM Information Archive for its disk pool collections.
Last month, IBM announced that it was [acquiring Storwize. It's Random Access Compression Engine (RACE) is also a lossless compression algorithm based on Lempel-Ziv. As servers write files, Storwize compresses those files and passes them on to the destination NAS device. When files are read back, Storwize retrieves and decompresses the data back to its original form.
As with tape, the savings from compression can vary, typically from 20 to 80 percent. In other words, 10TB of primary data could take up from 2TB to 8TB of physical space. To estimate what savings you might achieve for your mix of data types, try out the free [Storwize Predictive Modeling Tool].
So why am I making a distinction on terminology here?
Data reduction is already a well-known concept among specific industries, like High-Performance Computing (HPC) and Business Analytics. IBM has the largest marketshare in supercomputers that do data reduction for all kinds of use cases, for scientific research, weather prediction, financial projections, and decision support systems. IBM has also recently acquired a lot of companies related to Business Analytics, such as Cognos, SPSS, CoreMetrics and Unica Corp. These use data reduction on large amounts of business and marketing data to help drive new sources of revenues, provide insight for new products and services, create more focused advertising campaigns, and help understand the marketplace better.
There are certainly enough methods of reducing the quantity of storage capacity consumed, like thin provisioning, data deduplication and compression, to warrant an "umbrella term" that refers to all of them generically. I would prefer we do not "overload" the existing phrase "data reduction" but rather come up with a new phrase, such as "storage efficiency" or "capacity optimization" to refer to this category of features.
IBM is certainly quite involved in both data reduction as well as storage efficiency. If any of my readers can suggest a better phrase, please comment below.
By combining multiple components into a single "integrated system", IBM can offer a blended disk-and-tape storage solutions. This provides the best of both worlds, high speed access using disk, while providing lower costs and more energy efficiency with tape. According to a study by the Clipper Group, tape can be 23 times less expensive than disk over a 5 year total cost of ownership (TCO).
I've also covered Hierarchical Storage Management, such as my post [Seven Tiers of Storage at ABN Amro], and my role as lead architect for DFSMS on z/OS in general, and DFSMShsm in particular.
However, some explanation might be warranted in the use of these two terms in regards to SONAS. In this case, ILM refers to policy-based file placement, movement and expiration on internal disk pools. This is actually a GPFS feature that has existed for some time, and was tested to work in this new configuration. Files can be individually placed on either SAS (15K RPM) or SATA (7200 RPM) drives. Policies can be written to move them from SAS to SATA based on size, age and days non-referenced.
HSM is also a form of ILM, in that it moves data from SONAS disk to external storage pools managed by IBM Tivoli Storage Manager. A small stub is left behind in the GPFS file system indicating the file has been "migrated". Any reference to read or update this file will cause the file to be "recalled" back from TSM to SONAS for processing. The external storage pools can be disk, tape or any other media supported by TSM. Some estimate that as much as 60 to 80 percent of files on NAS have low reference and should be stored on tape instead of disk, and now SONAS with HSM makes that possible.
This distinction allows the ILM movement to be done internally, within GPFS, and the HSM movement to be done externally, via TSM. Both ILM and HSM movement take advantage of the GPFS high-speed policy engine, which can process 10 million files per node, run in parallel across all interface nodes. Note that TSM is not required for ILM movement. In effect, SONAS brings the policy-based management features of DFSMS for z/OS mainframe to all the rest of the operating systems that access SONAS.
HTTP and NIS support
In addition to NFS v2, NFS v3, and CIFS, the SONAS v1.1.1 adds the HTTP protocol. Over time, IBM plans to add more protocols in subsequent releases. Let me know which protocols you are interested in, so I can pass that along to the architects designing future releases!
SONAS v1.1.1 also adds support for Network Information Service (NIS), a client/server based model for user administration. In SONAS, NIS is used for netgroup and ID mapping only. Authentication is done via Active Directory, LDAP or Samba PDC.
SONAS already had synchronous replication, which was limited in distance. Now, SONAS v1.1.1 provides asynchronous replication, using rsync, at the file level. This is done over Wide Area Network (WAN) across to any other SONAS at any distance.
Interface modules can now be configured with either 64GB or 128GB of cache. Storage now supports both 450GB and 600GB SAS (15K RPM) and both 1TB and 2TB SATA (7200 RPM) drives. However, at this time, an entire 60-drive drawer must be either all one type of SAS or all one type of SATA. I have been pushing the architects to allow each 10-pack RAID rank to be independently selectable. For now, a storage pod can have 240 drives, 60 drives of each type of disk, to provide four different tiers of storage. You can have up to 30 storage pods per SONAS, for a total of 7200 drives.
An alternative to internal drawers of disk is a new "Gateway" iRPQ that allows the two storage nodes of a SONAS storage pod to connect via Fibre Channel to one or two XIV disk systems. You cannot mix and match, a storage pod is either all internal disk, or all external XIV. A SONAS gateway combined with external XIV is referred to as a "Smart Business Storage Cloud" (SBSC), which can be configured off premises and managed by third-party personnel so your IT staff can focus on other things.
See the Announcement Letters for the SONAS [hardware] and [software] for more details.
For those who are wondering how this positions against IBM's other NAS solution, the IBM System Storage N series, the rule of thumb is simple. If your capacity needs can be satisfied with a single N series box per location, use that. If not, consider SONAS instead. For those with non-IBM NAS filers that realize now that SONAS is a better approach, IBM offers migration services.
Both the Information Archive and the SONAS can be accessed from z/OS or Linux on System z mainframe, from "IBM i", AIX and Linux on POWER systems, all x86-based operating systems that run on System x servers, as well as any non-IBM server that has a supported NAS client.
Continuing my coverage of the Data Center Conference, held Dec 1-4 in Las Vegas, an analyst presented the challenges of managing the rapid growth in storage capacity. Administrators ability to manage storage is not keeping up with the growth. His recommendations:
Aim to just meet but not exceed service level agreements (SLAs)
Revisit past IT decisions. This includes evaluating your SAN to NAS ratio.
Embrace new technologies when they are effective, this includes cloud storage, solid state drives, and interconnect technologies like FCoCEE.
Follow vendor management best practices, update your vendor "short list".
A survey of the audience found:
20 percent have a single external storage vendor
39 percent have two external storage vendors
18 percent have three external storage vendors
23 percent have four or more external storage vendors
Throughout the industry, storage vendors are following IBM's example of using commodity hardware parts. This is because custom ASICs are expensive, and changes take a minimum of three months development time. Software-based implementations can be updated more quickly.
In terms of technologies deployed of SAN, NAS, Compliance Archive (such as the IBM Information Archive), and Virtual Tape Library (VTL) such as the IBM TS7650 ProtecTIER data deduplications solution, here was the survey of the audience:
8 percent: SAN only
14 percent: SAN and NAS
23 percent: SAN, NAS and Compliance Archive
9 percent: SAN and VTL
14 percent: SAN, NAS and VTL
32 percent: SAN, NAS, Compliance Archive and VTL
Cost reduction techniques including thin provisioning, compression, data deduplication, Quality of Service tiers, and archiving. To reduce power and cooling requirements, switch from FC to SATA disk wherever possible, and move storage out of the data center, such as on tape cartridges or cloud storage.
For emerging technologies, the following survey:
16 percent have already implemented a new emerging technology (IBM XIV, Pillar, 3PAR, etc.)
30 percent plan to do so in 12-24 months
4 percent plan to do so in 24-48 months
50 percent have no plans, and will continue to stick with traditional storage technologies
As for adopting Cloud storage, here was the survey:
14 percent already have
31 percent plan to use Cloud storage in 12-24 months
13 percent plan to use Cloud storage in 24-48 months
42 percent have no plans to adopt Cloud storage
My take-away from this is that many companies are still "exploring" into different options available to them. Fortunately, IBM offers a broad portfolio of complete end-to-end solutions to make acquiring the right mix of technologies that are optimized for your workloads possible.