Continuing my coverage of the Data Center Conference, held Dec 1-4 in Las Vegas, this post covers an analyst's presentation on the challenges of managing the rapid growth in storage capacity. Administrators' ability to manage storage is not keeping up with that growth. His recommendations:
- Aim to just meet but not exceed service level agreements (SLAs)
- Revisit past IT decisions. This includes evaluating your SAN to NAS ratio.
- Embrace new technologies when they are effective. This includes cloud storage, solid state drives, and interconnect technologies like FCoCEE.
- Follow vendor management best practices, and update your vendor "short list".
A survey of the audience found:
- 20 percent have a single external storage vendor
- 39 percent have two external storage vendors
- 18 percent have three external storage vendors
- 23 percent have four or more external storage vendors
Throughout the industry, storage vendors are following IBM's example of using commodity hardware parts. This is because custom ASICs are expensive, and changes take a minimum of three months of development time. Software-based implementations can be updated more quickly.
Here was the audience survey of technologies deployed, among SAN, NAS, Compliance Archive (such as the IBM Information Archive), and Virtual Tape Library (VTL, such as the IBM TS7650 ProtecTIER data deduplication solution):
- 8 percent: SAN only
- 14 percent: SAN and NAS
- 23 percent: SAN, NAS and Compliance Archive
- 9 percent: SAN and VTL
- 14 percent: SAN, NAS and VTL
- 32 percent: SAN, NAS, Compliance Archive and VTL
Cost reduction techniques include thin provisioning, compression, data deduplication, Quality of Service tiers, and archiving. To reduce power and cooling requirements, switch from FC to SATA disk wherever possible, and move storage out of the data center, such as onto tape cartridges or cloud storage.
For emerging technologies, here was the survey:
- 16 percent have already implemented a new emerging technology (IBM XIV, Pillar, 3PAR, etc.)
- 30 percent plan to do so in 12-24 months
- 4 percent plan to do so in 24-48 months
- 50 percent have no plans, and will continue to stick with traditional storage technologies
As for adopting Cloud storage, here was the survey:
- 14 percent already have
- 31 percent plan to use Cloud storage in 12-24 months
- 13 percent plan to use Cloud storage in 24-48 months
- 42 percent have no plans to adopt Cloud storage
My take-away from this is that many companies are still exploring the different options available to them. Fortunately, IBM offers a broad portfolio of complete end-to-end solutions, making it possible to acquire the right mix of technologies optimized for your workloads.
technorati tags: , XIV, Cloud Storage, SAN, NAS, VTL, ProtecTIER, Information Archive
Continuing this week's discussion on IBM announcements, today I'll cover our integrated systems.
The problem with spreading out these announcements across several days' worth of blog posts is that others beat you to the punch. Fellow blogger Richard Swain (IBM) has his post [Move that File], and TechTarget's Dave Raffo has an article titled ["IBM SONAS gains policy-driven tiering, gateway to IBM XIV Storage System"].
By combining multiple components into a single "integrated system", IBM can offer blended disk-and-tape storage solutions. This provides the best of both worlds: high-speed access using disk, with the lower cost and better energy efficiency of tape. According to a study by the Clipper Group, tape can be 23 times less expensive than disk over a 5-year total cost of ownership (TCO).
The two we introduced recently were the [IBM Information Archive] and the Scale-Out Network Attached Storage (SONAS). This week, IBM announced some enhancements as the SONAS v1.1.1 release. SONAS is the productized version of IBM's Scale-Out File Services (SoFS), which I discussed in my posts [Area Rugs versus Wall-to-Wall Carpeting] and [More details about IBM's Clustered Scalable NAS].
- ILM and HSM data movement
I have covered Information Lifecycle Management several times on this blog, including my posts [ILM for my iPod], [Times a Million], and [Using ILM to Save Trees], to name a few.
I've also covered Hierarchical Storage Management, such as my post [Seven Tiers of Storage at ABN Amro], and my role as lead architect for DFSMS on z/OS in general, and DFSMShsm in particular.
However, some explanation might be warranted on the use of these two terms with regard to SONAS. In this case, ILM refers to policy-based file placement, movement and expiration on internal disk pools. This is actually a GPFS feature that has existed for some time, and was tested to work in this new configuration. Files can be individually placed on either SAS (15K RPM) or SATA (7200 RPM) drives. Policies can be written to move them from SAS to SATA based on size, age and days non-referenced.
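To make the idea of policy-based movement concrete, here is a minimal sketch in Python. It is not GPFS policy syntax, and the names (`FileInfo`, `should_migrate`, `apply_policy`) and thresholds are hypothetical, chosen only to illustrate moving files from a SAS pool to a SATA pool based on size, age and days non-referenced.

```python
from dataclasses import dataclass

@dataclass
class FileInfo:
    name: str
    size_bytes: int
    age_days: int            # days since the file was created
    days_unreferenced: int   # days since the file was last accessed
    pool: str = "SAS"        # current internal disk pool

# Hypothetical thresholds, for illustration only.
MAX_SAS_SIZE = 1 * 1024**3   # files larger than 1 GiB leave SAS
MAX_SAS_AGE = 90             # days
MAX_SAS_IDLE = 30            # days non-referenced

def should_migrate(f: FileInfo) -> bool:
    """True if a file on SAS qualifies for movement to SATA."""
    return f.pool == "SAS" and (
        f.size_bytes > MAX_SAS_SIZE
        or f.age_days > MAX_SAS_AGE
        or f.days_unreferenced > MAX_SAS_IDLE
    )

def apply_policy(files):
    """Move qualifying files to SATA; return the names of files that moved."""
    moved = []
    for f in files:
        if should_migrate(f):
            f.pool = "SATA"
            moved.append(f.name)
    return moved

files = [
    FileInfo("hot.db", 10_000, 5, 1),
    FileInfo("old.log", 10_000, 200, 150),
    FileInfo("big.iso", 5 * 1024**3, 2, 0),
]
print(apply_policy(files))  # ['old.log', 'big.iso']
```

A real policy engine evaluates rules like these against file metadata in bulk rather than one file at a time, but the decision logic per file is the same kind of test.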
HSM is also a form of ILM, in that it moves data from SONAS disk to external storage pools managed by IBM Tivoli Storage Manager. A small stub is left behind in the GPFS file system indicating the file has been "migrated". Any reference to read or update this file will cause the file to be "recalled" back from TSM to SONAS for processing. The external storage pools can be disk, tape or any other media supported by TSM. Some estimate that as much as 60 to 80 percent of files on NAS have low reference and should be stored on tape instead of disk, and now SONAS with HSM makes that possible.
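The migrate-and-recall behavior can be modeled with a toy in-memory sketch. This is not how TSM or GPFS implement it; the class `ToyHsm` and its methods are hypothetical and only show the stub being left behind and the transparent recall on read.

```python
class ToyHsm:
    """In-memory model: 'disk' holds file data or a stub, 'external' is the external pool."""

    STUB = object()  # marker left on disk for a migrated file

    def __init__(self):
        self.disk = {}
        self.external = {}

    def write(self, name, data: bytes):
        self.disk[name] = data

    def is_migrated(self, name) -> bool:
        return self.disk[name] is self.STUB

    def migrate(self, name):
        """Move the data to the external pool and leave a stub behind."""
        if not self.is_migrated(name):
            self.external[name] = self.disk[name]
            self.disk[name] = self.STUB

    def read(self, name) -> bytes:
        """Reading a migrated file recalls it to disk first."""
        if self.is_migrated(name):
            self.disk[name] = self.external.pop(name)
        return self.disk[name]

hsm = ToyHsm()
hsm.write("report.doc", b"quarterly numbers")
hsm.migrate("report.doc")
print(hsm.is_migrated("report.doc"))  # True
print(hsm.read("report.doc"))         # b'quarterly numbers'
print(hsm.is_migrated("report.doc"))  # False
```

The application never sees the difference other than the recall delay, which is why low-reference files are good candidates for migration.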
This distinction allows the ILM movement to be done internally, within GPFS, and the HSM movement to be done externally, via TSM. Both ILM and HSM movement take advantage of the GPFS high-speed policy engine, which can process 10 million files per node, run in parallel across all interface nodes. Note that TSM is not required for ILM movement. In effect, SONAS brings the policy-based management features of DFSMS for z/OS mainframe to all the rest of the operating systems that access SONAS.
- HTTP and NIS support
In addition to NFS v2, NFS v3, and CIFS, SONAS v1.1.1 adds the HTTP protocol. IBM plans to add more protocols in subsequent releases. Let me know which protocols you are interested in, so I can pass that along to the architects designing future releases!
SONAS v1.1.1 also adds support for Network Information Service (NIS), a client/server based model for user administration. In SONAS, NIS is used for netgroup and ID mapping only. Authentication is done via Active Directory, LDAP or Samba PDC.
- Asynchronous Replication
SONAS already had synchronous replication, which was limited in distance. Now, SONAS v1.1.1 provides asynchronous replication, using rsync, at the file level. This is done over a Wide Area Network (WAN) to any other SONAS at any distance.
- Hardware enhancements
Interface modules can now be configured with either 64GB or 128GB of cache. Storage now supports both 450GB and 600GB SAS (15K RPM) and both 1TB and 2TB SATA (7200 RPM) drives. However, at this time, an entire 60-drive drawer must be either all one type of SAS or all one type of SATA. I have been pushing the architects to allow each 10-pack RAID rank to be independently selectable. For now, a storage pod can have 240 drives, 60 drives of each type of disk, to provide four different tiers of storage. You can have up to 30 storage pods per SONAS, for a total of 7200 drives.
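The drive counts above follow from simple multiplication, sketched here. The raw-capacity figure at the end assumes a hypothetical configuration in which every drive is a 2TB SATA drive, and it ignores RAID and file system overhead.

```python
DRIVES_PER_DRAWER = 60
DRAWERS_PER_POD = 4   # one drawer per drive type, giving four tiers
MAX_PODS = 30

drives_per_pod = DRIVES_PER_DRAWER * DRAWERS_PER_POD
max_drives = drives_per_pod * MAX_PODS

def raw_capacity_tb(num_drives, drive_tb):
    """Raw capacity in TB, before any RAID or file system overhead."""
    return num_drives * drive_tb

print(drives_per_pod)                  # 240
print(max_drives)                      # 7200
print(raw_capacity_tb(max_drives, 2))  # 14400
```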
An alternative to internal drawers of disk is a new "Gateway" iRPQ that allows the two storage nodes of a SONAS storage pod to connect via Fibre Channel to one or two XIV disk systems. You cannot mix and match: a storage pod is either all internal disk, or all external XIV. A SONAS gateway combined with external XIV is referred to as a "Smart Business Storage Cloud" (SBSC), which can be configured off premises and managed by third-party personnel so your IT staff can focus on other things.
See the Announcement Letters for the SONAS [hardware] and [software] for more details.
For those who are wondering how this positions against IBM's other NAS solution, the IBM System Storage N series, the rule of thumb is simple. If your capacity needs can be satisfied with a single N series box per location, use that. If not, consider SONAS instead. For those with non-IBM NAS filers that realize now that SONAS is a better approach, IBM offers migration services.
Both the Information Archive and the SONAS can be accessed from z/OS or Linux on System z mainframe, from "IBM i", AIX and Linux on POWER systems, all x86-based operating systems that run on System x servers, as well as any non-IBM server that has a supported NAS client.
technorati tags: , IBM, Announcements, SONAS, SoFS, Information+Archive, Richard Swain, TechTarget, ILM, HSM, storage tiers, GPFS, TSM, HTTP, NIS, TSM, NAS, iRPQ, XIV, SBSC, z/OS, Linux, AIX
Wrapping up my week's theme of storage optimization, I thought I would help clarify the confusion between data reduction and storage efficiency. I have seen many articles and blog posts that either use these two terms interchangeably, as if they were synonyms for each other, or treat one as merely a subset of the other.
- Data Reduction is LOSSY
By "Lossy", I mean that reducing data is an irreversible process. Details are lost, but insight is gained. In his paper, [Data Reduction Techniques], Rajana Agarwal defines this simply:
"Data reduction techniques are applied where the goal is to aggregate or amalgamate the information contained in large data sets into manageable (smaller) information nuggets."
Data reduction has been around since the 18th century.
Take for example this histogram from [SearchSoftwareQuality.com]. We have reduced ninety individual student scores down to just five numbers, the counts in each range. This can provide for easier comprehension and comparison with other distributions.
The process is lossy. I cannot determine or re-create an individual student's score from these five histogram values.
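A small sketch shows both the reduction and why it is lossy. The scores and range edges below are made up for illustration (twelve scores rather than ninety), and `histogram` is a hypothetical helper, not taken from the cited article.

```python
def histogram(scores, edges):
    """Count scores per range [edges[i], edges[i+1]); the last range also includes its upper edge."""
    counts = [0] * (len(edges) - 1)
    for s in scores:
        for i in range(len(counts)):
            is_last = (i == len(counts) - 1)
            if edges[i] <= s < edges[i + 1] or (is_last and s == edges[-1]):
                counts[i] += 1
                break
    return counts

edges = [50, 60, 70, 80, 90, 100]   # five ranges
class_a = [55, 62, 68, 71, 74, 77, 79, 83, 85, 88, 91, 97]
class_b = [59, 60, 61, 70, 70, 72, 78, 80, 81, 89, 90, 100]

print(histogram(class_a, edges))  # [1, 2, 4, 3, 2]
print(histogram(class_b, edges))  # [1, 2, 4, 3, 2]
```

Two different sets of scores produce identical counts, so the counts alone cannot tell you which scores you started with.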
This next example, compliments of [Michael Hardy], represents another form of data reduction known as ["linear regression analysis"]. The idea is to take a large set of data points between two variables, the x axis along the horizontal and the y axis along the vertical, and find the best line that fits. Thus the data is reduced from many points to just two numbers, slope (a) and intercept (b), resulting in an equation of y=ax+b.
The process is lossy. I cannot determine or re-create any original data point from this slope and intercept equation.
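For readers who want to see the reduction itself, here is a minimal ordinary least-squares fit in plain Python. The function name `fit_line` and the sample points are my own, for illustration.

```python
def fit_line(points):
    """Ordinary least-squares fit of y = a*x + b; returns (a, b)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    a = sxy / sxx             # slope
    b = mean_y - a * mean_x   # intercept
    return a, b

# Made-up sample data: many points go in, two numbers come out.
points = [(0, 1.2), (1, 2.9), (2, 5.1), (3, 6.8), (4, 9.1)]
a, b = fit_line(points)
print(f"y = {a:.2f}x + {b:.2f}")
```

Any number of different point sets can yield the same slope and intercept, which is why the original points cannot be recovered from the equation.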
This last example, from [Yahoo Finance], reduces millions of stock trades to a single point per day, typically the closing price, to show the overall growth trend over the course of the past year.
The process is lossy. Even if I knew the low, high and closing price of a particular stock on a particular day, I would not be able to determine or re-create the actual price paid for individual trades that occurred.
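The same kind of reduction can be sketched in a few lines. The trade data and the function `daily_summary` are hypothetical, and trades are assumed to arrive in time order.

```python
def daily_summary(trades):
    """Reduce (day, price) trades, given in time order, to {day: (low, high, close)}."""
    summary = {}
    for day, price in trades:
        if day not in summary:
            summary[day] = (price, price, price)
        else:
            low, high, _ = summary[day]
            summary[day] = (min(low, price), max(high, price), price)
    return summary

trades = [
    ("Mon", 10.0), ("Mon", 12.5), ("Mon", 11.0),
    ("Tue", 11.5), ("Tue", 9.0),
]
print(daily_summary(trades))
# {'Mon': (10.0, 12.5, 11.0), 'Tue': (11.5, 11.5, 9.0)} would be wrong: low is tracked too
```

The actual result is `{'Mon': (10.0, 12.5, 11.0), 'Tue': (9.0, 11.5, 9.0)}`: three numbers per day, regardless of how many trades took place.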
- Storage Efficiency is LOSSLESS
By contrast, there are many IT methods that can be used to store data in ways that are more efficient, without losing any of the fine detail. Here are some examples:
- Thin Provisioning: Instead of storing 30GB of data on 100GB of disk capacity, you store it on 30GB of capacity. All of the data is still there, just none of the wasteful empty space.
- Space-efficient Copy: Instead of copying every block of data from source to destination, you copy over only those blocks that have changed since the copy began. The blocks not copied are still available on the source volume, so there is no need to duplicate this data.
- Archiving and Space Management: Data can be moved out of production databases and stored elsewhere on disk or tape. Enough XML metadata is carried along so that there is no loss in the fine detail of what each row and column represent.
- Data Deduplication: The idea is simple. Find large chunks of data that contain the same exact information as an existing chunk already stored, and merely set a pointer to avoid storing the duplicate copy. This can be done in-line as data is written, or as a post-process task when things are otherwise slow and idle.
When data deduplication first came out, some lawyers were concerned that this was a "lossy" approach, that somehow documents were coming back without some of their original contents. How else can you explain storing 25PB of data on only 1PB of disk?
(In some countries, companies must retain data in their original file formats, as there is concern that converting business documents to PDF or HTML would lose some critical "metadata" information such as modification dates, authorship information, underlying formulae, and so on.)
Well, the concern applies only to those data deduplication methods that calculate a hash code or fingerprint, such as EMC Centera or EMC Data Domain. If the hash code of new incoming data matches the hash code of existing data, then the new data is discarded and assumed to be identical. Hash collisions between different data are rare, and I have only read of a few occurrences of unique data being discarded in the past five years. To ensure full integrity, IBM ProtecTIER data deduplication solution and IBM N series disk systems chose instead to do full byte-for-byte comparisons.
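The difference between trusting a fingerprint and verifying the bytes can be sketched as follows. This is a generic illustration with fixed-size chunks and SHA-256, not the method used by any of the products named above; `DedupStore` and its `verify` flag are hypothetical.

```python
import hashlib

class DedupStore:
    """Toy chunk store: identical chunks are stored once and referenced by fingerprint."""

    def __init__(self, chunk_size=4, verify=True):
        self.chunk_size = chunk_size
        self.verify = verify   # if True, compare bytes when fingerprints match
        self.chunks = {}       # fingerprint -> chunk bytes

    def put(self, data: bytes):
        """Store data; return the list of fingerprints (pointers) that describe it."""
        pointers = []
        for i in range(0, len(data), self.chunk_size):
            chunk = data[i:i + self.chunk_size]
            fp = hashlib.sha256(chunk).hexdigest()
            existing = self.chunks.get(fp)
            if existing is None:
                self.chunks[fp] = chunk
            elif self.verify and existing != chunk:
                # Same fingerprint, different bytes: refuse to discard unique data.
                raise ValueError("fingerprint collision: chunks differ")
            pointers.append(fp)
        return pointers

    def get(self, pointers) -> bytes:
        """Rebuild the original data from its pointers."""
        return b"".join(self.chunks[fp] for fp in pointers)

store = DedupStore()
data = b"ABCDABCDABCDXYZ!"
ptrs = store.put(data)
print(len(ptrs), len(store.chunks))  # 4 2
print(store.get(ptrs) == data)       # True
```

Four chunks were written but only two are stored, and the data comes back byte-for-byte. With `verify=False`, a matching fingerprint alone would be taken as proof that the chunks are identical.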
- Compression: There are both lossy and lossless compression techniques. The lossless Lempel-Ziv algorithm is the basis for the LTO-DC algorithm used in IBM's Linear Tape Open [LTO] tape drives, the Streaming Lossless Data Compression (SLDC) algorithm used in IBM's [Enterprise-class TS1130] tape drives, and the Adaptive Lossless Data Compression (ALDC) used by the IBM Information Archive for its disk pool collections.
Last month, IBM announced that it was [acquiring Storwize]. Its Random Access Compression Engine (RACE) is also a lossless compression algorithm based on Lempel-Ziv. As servers write files, Storwize compresses those files and passes them on to the destination NAS device. When files are read back, Storwize retrieves and decompresses the data back to its original form.
To read independent views on IBM's acquisition, read Lauren Whitehouse's (ESG) post [Remote Another Chair], Chris Mellor's (The Register) article [Storwize Swallowed], or Dave Raffo's (SearchStorage.com) article [IBM buys primary data compression].
As with tape, the savings from compression can vary, typically from 20 to 80 percent. In other words, 10TB of primary data could take up from 2TB to 8TB of physical space. To estimate what savings you might achieve for your mix of data types, try out the free [Storwize Predictive Modeling Tool].
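As a quick illustration of lossless compression and of the savings arithmetic, here is a sketch using Python's standard `zlib` module. zlib's DEFLATE is itself in the Lempel-Ziv family (LZ77 plus Huffman coding), but it is only a stand-in here and is not RACE, LTO-DC, SLDC or ALDC. The helper names are my own.

```python
import zlib

def savings_percent(original: bytes, compressed: bytes) -> float:
    """Percentage of space saved by compression."""
    return 100.0 * (1 - len(compressed) / len(original))

def physical_tb(logical_tb, savings_pct):
    """Physical space needed for logical_tb of data at a given savings percentage."""
    return logical_tb * (100 - savings_pct) / 100

# Highly repetitive sample text, so it compresses far better than typical data.
text = b"storage efficiency is lossless. " * 200
packed = zlib.compress(text)
restored = zlib.decompress(packed)

print(restored == text)       # True: every byte comes back
print(physical_tb(10, 20))    # 8.0
print(physical_tb(10, 80))    # 2.0
print(f"{savings_percent(text, packed):.1f}% saved on this sample")
```

Real-world savings depend heavily on the mix of data types, which is why the 20 to 80 percent range is so wide.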
So why am I making a distinction on terminology here?
Data reduction is already a well-known concept among specific industries, like High-Performance Computing (HPC) and Business Analytics. IBM has the largest market share in supercomputers that do data reduction for all kinds of use cases, including scientific research, weather prediction, financial projections, and decision support systems. IBM has also recently acquired a lot of companies related to Business Analytics, such as Cognos, SPSS, CoreMetrics and Unica Corp. These use data reduction on large amounts of business and marketing data to help drive new sources of revenues, provide insight for new products and services, create more focused advertising campaigns, and help understand the marketplace better.
There are certainly enough methods of reducing the quantity of storage capacity consumed, like thin provisioning, data deduplication and compression, to warrant an "umbrella term" that refers to all of them generically. I would prefer we do not "overload" the existing phrase "data reduction" but rather come up with a new phrase, such as "storage efficiency" or "capacity optimization" to refer to this category of features.
IBM is certainly quite involved in both data reduction as well as storage efficiency. If any of my readers can suggest a better phrase, please comment below.
technorati tags: IBM, data reduction, storage efficiency, histogram, linear regression, thin provisioning, data deduplication, lossy, lossless, EMC, Centera, hash collisions, Information Archive, LTO, LTO-DC, SLDC, ALDC, compression, deduplication, Storwize, supercomputers, HPC, analytics
It's Tuesday again, and that means one thing.... IBM Announcements! On the heels of [last week's announcements], IBM announced some additional products of interest to storage administrators.
- IBM Information Archive
Back in 2008, IBM [unveiled the Information Archive]. This storage solution provides automated policy-based tiering between disk and tape, with non-erasable non-rewriteable enforcement to protect against unethical tampering of data. The initial release supported [both files and object storage], with support for different collections, each with its own set of policies for management. However, it only supported NFS initially for the file protocol. Today, IBM announces the addition of CIFS protocol support, which will be especially helpful in healthcare and life sciences, as much of the medical equipment is designed for CIFS protocol storage.
Also, Information Archive will now provide a full index and search feature capability to help with e-Discovery. Searches and retrievals can be done in the background without disrupting applications or the archiving operations.
To learn more, read the [announcement letter].
- IBM Tivoli Storage Manager
IBM Tivoli Storage Manager for Virtual Environments V6.2 extends capabilities that currently exist in IBM Tivoli Storage Manager. TSM backup/archive clients run fine on guest operating systems, but now this new extension improves backup for VMware environments. TSM provides incremental block-level backups utilizing VMware's vStorage APIs for Data Protection and Changed Block Tracking features.
To minimize impact to the VMware host, TSM for VE makes use of non-disruptive snapshots and offloads the backup processing to a vStorage backup server. This supports file-level recovery, volume-level recovery, and full VM recovery. Of course, since it is based on TSM v6, you get advanced storage efficiency features such as compression and deduplication to minimize consumption of disk storage pools.
To learn more, see the [announcement letter].
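The idea behind block-level incremental backup with changed block tracking can be modeled with a toy sketch. It does not use VMware's vStorage APIs or any TSM interface; `TrackedVolume` and `restore` are hypothetical names for illustration.

```python
class TrackedVolume:
    """Toy volume that remembers which blocks were written since the last backup."""

    def __init__(self, num_blocks):
        self.blocks = [b""] * num_blocks
        self.changed = set()

    def write(self, index, data: bytes):
        self.blocks[index] = data
        self.changed.add(index)

    def incremental_backup(self):
        """Return only the blocks changed since the previous backup, then reset tracking."""
        delta = {i: self.blocks[i] for i in sorted(self.changed)}
        self.changed.clear()
        return delta

def restore(num_blocks, backups):
    """Rebuild a volume by applying incremental backups in order."""
    blocks = [b""] * num_blocks
    for delta in backups:
        for i, data in delta.items():
            blocks[i] = data
    return blocks

vol = TrackedVolume(4)
vol.write(0, b"a")
vol.write(2, b"c")
first = vol.incremental_backup()    # blocks 0 and 2
vol.write(2, b"C")
second = vol.incremental_backup()   # block 2 only
print(sorted(first), sorted(second))          # [0, 2] [2]
print(restore(4, [first, second]) == vol.blocks)  # True
```

Only the changed blocks travel to the backup server each time, yet applying the increments in order reproduces the full volume.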
- IBM Tivoli Monitoring for Virtual Servers V6.2.3
IBM Tivoli Monitor has been extended to support virtual servers, including VMware, Linux KVM, and Citrix XenServer. This can help with capacity planning, performance monitoring, and availability. Tivoli Monitor will help you understand the relationships between physical and virtual resources to help isolate problems to the correct resource, reducing the time it takes to debug issues between servers and storage. See the
Next week is the [IBM Pulse2011 Conference] in Las Vegas, February 27 to March 2. Sorry, I don't plan to be there this year. It is looking to be a great conference, with fellow inventor Dean Kamen as the keynote speaker. For a blast from the past, read my blog posts from Pulse2008 [Main Tent sessions] and [Breakout sessions].
technorati tags: IBM, #ibmpulse, Information Archive, Tivoli, TSM, Tivoli Monitor, VMware, Linux, KVM, Citrix, XenServer
Wrapping up my week's coverage of the IBM Pulse 2011 conference, I have had several people ask me to explain IBM's latest initiative, Smarter Computing, which IBM launched this week at this conference. Having led the IT industry through the Centralized Computing era and the Distributed Computing era, IBM is now well-positioned to help companies, governments and non-profit organizations to enter the new Smarter Computing era, focused on insight and discovery.
|Centralized Computing|Distributed Computing|Smarter Computing|
|Thousands of IT professionals|Millions of office workers|Billions of people|
|Mainframe servers|Personal computers (PC)|Smart phones and other handheld devices|
|Efficient, but only the largest companies and governments had them|Innovative, extending the reach to small and medium-sized businesses, but resulted in server sprawl and increased TCO|Efficient and Innovative, combining the best of centralized and distributed computing|
|1952 to 1980|1981 to 2010|2011 and beyond|
To help clients with this transition, IBM's Smarter Computing initiative has three main components. This is a corporate-wide strategy, with systems, software and services all working together to realize results.
- Big Data
The first component is Big Data. This combines three different sources of data:
- Traditional structured data in OLTP databases and OLAP data warehouses, using data management solutions like DB2 and IBM Netezza.
- Unstructured data, including text documents, images, audio, and video, processed with massive parallelism using IBM BigInsights and Apache Hadoop.
- Real-Time Analytics Processing (RTAP) of incoming data, including video surveillance, social media, RFID chips, smart meters, and traffic control systems, processed with IBM InfoSphere Streams.
Of course, Big Data will bring new opportunities on the storage front, which I will save for a future post!
- Optimized Systems
Rather than general purpose IT equipment, we now have the scale and scope to specialize with systems optimized for particular workloads, the second component of the Smarter Computing initiative. Of course, IBM has been delivering integrated stacks of systems, software and services for decades now, but it is important to remind people of this, as IBM now has a spate of competitors all trying to follow IBM's lead in this arena.
As with Big Data, the focus on Optimized Systems has impacted IBM's strategy on storage as well. I'll save that discussion for a future post as well!
- Cloud
I am glad that nearly all of the storage vendors have standardized to a common definition for Cloud, the third component of Smarter Computing, which shows that this concept has matured:
Cloud computing is a pay-per-use model for enabling network access to a pool of computing resources that can be provisioned and released rapidly with minimal management effort or service provider interaction.
-- U.S. National Institute of Standards and Technology [nist.gov]
Of course, Cloud is just an evolution of IBM's Service Bureau business of the 1960s and 1970s, renting out time-sharing on mainframe systems, Grid Computing of the 1980s, and Application Service Providers that popped up in the 1990s. While the [butchers, bakers and candlestick makers] that IBM competes against might focus their efforts on just private cloud or just public cloud, IBM recognizes the reality is that different clients will need different solutions. Rather than rip-and-replace, IBM will help clients transition to cloud via inclusive solutions that adopt a hybrid approach:
- Traditional enterprise with private cloud deployments, using solutions like IBM CloudBurst, SONAS and Information Archive
- Traditional enterprise with public cloud services to handle seasonal peaks, provide offsite resiliency, and deliver solutions for a mobile workforce
- Hybrid clouds that blend private and public cloud services, to handle seasonal peak workloads, remote and branch offices
IBM's emphasis on IT Infrastructure Library (ITIL), Tivoli and Maximo products will play well in this space to provide integrated service management across traditional and cloud deployments. This is why IBM decided to launch the Smarter Computing initiative at the Pulse 2011 conference, the industry's premier conference on integrated service management.
The IBM Watson that competed on Jeopardy! is an excellent example of all three components of Smarter Computing at work.
- IBM Watson was able to respond to Jeopardy! clues within three seconds, processing a combination of database searches with DB2 and text-mining analytics of unstructured data with IBM BigInsights.
- IBM Watson combined servers, software and storage into an integrated supercomputer that was optimized for one particular workload: playing Jeopardy!
- IBM Watson used many technologies prevalent in private and public cloud computing systems, storing its data on a modified version of SONAS for storage, using xCat administration tools, networking across 10GbE Ethernet, and massive parallel processing through lots of PowerVM guest images.
technorati tags: IBM, Pulse, ibmpulse, Centralized Computing, Distributed Computing, Smarter Computing, Big Data, Optimized Systems, Cloud Computing, SONAS, Netezza, DB2, InfoSphere, BigInsights, SPSS, Data Warehouse, Structured Data, Unstructured Data, Watson, CloudBurst, Information Archive