This blog is for the open exchange of ideas relating to IBM Systems, storage and storage networking hardware, software and services.
(Short URL for this blog: ibm.co/Pearson )
Tony Pearson is a Master Inventor, Senior IT Architect and Event Content Manager for [IBM Systems Technical University] events. With over 30 years with IBM Systems, Tony is a frequent traveler, speaking to clients at events throughout the world.
Lloyd Dean is an IBM Senior Certified Executive IT Architect in Infrastructure Architecture. Lloyd has held numerous senior technical roles at IBM during his 19 plus years at IBM. Lloyd most recently has been leading efforts across the Communication/CSI Market as a senior Storage Solution Architect/CTS covering the Kansas City territory. In prior years Lloyd supported the industry accounts as a Storage Solution architect and prior to that as a Storage Software Solutions specialist during his time in the ATS organization.
Lloyd currently supports North America storage sales teams in his Storage Software Solution Architecture SME role in the Washington Systems Center team. His current focus is IBM Cloud Private, and he will be delivering and supporting sessions at Think2019 and Storage Technical University on the value of IBM storage in this high-value IBM solution, which is part of the IBM Cloud strategy. Lloyd maintains a Subject Matter Expert status across the IBM Spectrum Storage Software solutions. You can follow Lloyd on Twitter @ldean0558 and LinkedIn Lloyd Dean.
Tony Pearson's books are available on Lulu.com! Order your copies today!
Safe Harbor Statement: The information on IBM products is intended to outline IBM's general product direction and it should not be relied on in making a purchasing decision. The information on the new products is for informational purposes only and may not be incorporated into any contract. The information on IBM products is not a commitment, promise, or legal obligation to deliver any material, code, or functionality. The development, release, and timing of any features or functionality described for IBM products remains at IBM's sole discretion.
Tony Pearson is an active participant in local, regional, and industry-specific interests, and does not receive any special payments to mention them on this blog.
Tony Pearson receives part of the revenue proceeds from sales of books he has authored listed in the side panel.
Tony Pearson is not a medical doctor, and this blog does not reference any IBM product or service that is intended for use in the diagnosis, treatment, cure, prevention or monitoring of a disease or medical condition, unless otherwise specified on individual posts.
Optimizing Storage Infrastructure for Growth and Innovation
This session started off with my former boss, Brian Truskowski, IBM General Manager of System Storage and Networking.
We've come a long way in storage. In 1973, the "Winchester Drive" was named after the famous Winchester 30-30 rifle. The disk drive was originally planned to have two 30MB platters, hence the name. When it finally launched, it had two 35MB platters, for a total raw capacity of 70MB.
Today, IBM announced version 6.2 of SAN Volume Controller with support for 10GbE iSCSI. Since 2003, IBM has sold over 30,000 SAN Volume Controllers. An SVC cluster can now manage up to 32PB of disk storage.
IBM also announced a new 4TB tape drive (TS1140), LTFS Library Edition, the TS3500 Library Connector, improved TS7600 and TS7700 virtual tape libraries, enhanced Information Archive for email, files and eDiscovery, new Storwize V7000 hardware, new Storwize Rapid Application bundles, new firmware for SONAS and DS8000 disk systems, and Real-Time Compression support for EMC disk systems. I plan to cover each of these in follow-on posts, but if you can't wait, here are [links to all the announcements].
Customer Testimonial - CenterPoint Energy
"CenterPoint is transforming its business from being an energy distribution company that uses technology, to a technology company that distributes energy."
-- Dr. Steve Pratt, CTO of CenterPoint Energy
The next speaker was Dr. Steve Pratt, CTO of [CenterPoint Energy]. CenterPoint is a 110-year-old (older than IBM!) energy company that is involved in electricity, gasoline distribution, and natural gas pipelines. CenterPoint serves Houston, Texas (the fourth largest city in the USA) and the surrounding area.
CenterPoint is transforming to a Smart Grid involving smart meters, and this requires the best IT infrastructure you can buy, including IBM DS8000, XIV and SAN Volume Controller disk systems, IBM Smart Analytics System, Stream Analytics, IBM Virtual Tape Library, IBM Tivoli Storage Manager, and IBM Tivoli Storage Productivity Center.
Dr. Pratt has seen the transition of information over the years:
Data Structure, deciding how to code data to record it in a structured manner
Information Reporting, reporting to upper management what happened
Intelligence Aggregation, finding patterns and insight from the data
Predictive Analytics, monitoring real-time data to take pro-active steps
Autonomics, where automation and predictive analysis allows the system to manage itself
What does the transition to a Smart Grid mean for their storage environment? They will go from 80,000 meter reads, to 230,400,000 reads per day. Ingestion of this will go from MB/day to GB/sec. Reporting will transition to real-time analytics.
Dr. Pratt prefers to avoid trade-offs. Don't lose something to get something else. He also feels that language of the IT department can help. For example, he uses "Factor" like 25x rather than percent reduction (96 percent reduced). He feels this communicates the actual results more effectively.
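To make the "Factor" versus "percent reduction" comparison concrete, here is a minimal Python sketch of the conversion in both directions. The function names are my own, chosen for illustration; only the 25x and 96 percent figures come from Dr. Pratt's example.

```python
def factor_to_percent_reduction(factor: float) -> float:
    """Convert an 'N times smaller' factor into a percent reduction."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    return (1.0 - 1.0 / factor) * 100.0


def percent_reduction_to_factor(percent: float) -> float:
    """Convert a percent reduction (0 <= percent < 100) into a factor."""
    if not 0 <= percent < 100:
        raise ValueError("percent must be in the range [0, 100)")
    return 100.0 / (100.0 - percent)


if __name__ == "__main__":
    # Print a few factors next to their percent-reduction equivalents.
    for factor in (2, 10, 25, 50):
        percent = factor_to_percent_reduction(factor)
        print(f"{factor}x smaller = {percent:.1f} percent reduced")
```

Note how quickly the percentages bunch up near 100: the jump from 25x to 50x is only two percentage points, which is likely why the factor communicates the result more clearly.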
Today's smarter consumers are driving the need for smarter technologies. Individual consumers and small businesses can make use of intelligent meters to help reduce their energy costs. Everything from smart cars to smart grids will need real-time analytics to deal with the millions of events that occur every day.
IBM's Data Protection and Retention Story
Brian Truskowski came back to provide the latest IBM messaging for Data Protection and Retention (DP&R). The key themes were:
Stop storing so much
Store more with what's on the floor
Move data to the right place
IBM announced today that the IBM Real-Time Compression Appliances now support EMC gear, such as EMC Celerra. While some EMC equipment has built-in compression features, these often come at the cost of performance degradation. By contrast, IBM Real-Time Compression can offer improved performance as well as a 3x to 5x reduction in storage capacity.
Over 70 percent of data on disk has not been accessed in the last 90 days. IBM Easy Tier on the DS8700 and DS8800 now supports FC-to-SATA automated tiering.
IBM is projecting that backup and archive storage will grow at over 50 percent per year. To help address this, IBM is launching a new "Storage Infrastructure Optimization" assessment. All attendees at today's summit are eligible for a free assessment.
Analytics are increasing the value of information, and making it more accessible to the average knowledge worker. The cost of losing data, as well as the effort spent searching for information, has skyrocketed. Users have grown to expect 100 percent uptime availability.
An analysis of IT environments found that only 55 percent was spent on revenue-producing workloads. The remaining 45 percent was spent on Data Protection and Retention. That means that for every IT dollar spent on projects to generate revenue, you are spending roughly another 82 cents (45 divided by 55) to protect it. Imagine spending over 80 percent of your house payments on homeowners' insurance, or over 80 percent of your car's purchase price on car insurance.
IBM has organized its solutions into three categories:
Hyper-Efficient Backup and Recovery
Continuous Data Availability
What would it mean to your business if you could shift some of the money spent on DP&R over to revenue-producing projects instead? That was the teaser question posed at the end of these morning sessions for us to discuss during lunch.
Normally, IBM has its announcements on Tuesdays, but this week it was on Monday!
I am here in New York City, at the Kaufmann Theater of the American Museum of Natural History, for the [IBM Storage Innovation Executive Summit]. We have about 250 clients here, as well as many bloggers and storage analysts.
My day started out being interviewed by Lynda from Stratecast, a division of [Frost & Sullivan]. This interview will be part of a video series that Stratecast is doing about the storage industry.
(About the venue: The American Museum of Natural History was founded in 1869. It was featured in the film "Night at the Museum". In keeping with IBM's focus on scalability and preservation, the museum here boasts skeletons of the largest dinosaurs. The five-story building takes up several city blocks, and the Kaufmann theater is buried deep in the bottom level, well shielded from cell phone or Wi-Fi signals, allowing me to focus on taking notes the traditional way, with pen and paper.)
Deon Newman, IBM VP of Marketing for North America, was our Master of Ceremonies. Today would be filled with market insight, best practices, thought leadership, and testimonials of powerful results.
This is my first in a series of blog posts on this event.
Information Explosion on a Smarter Planet
Bridget van Kralingen, IBM General Manager for North America, indicated that storage is finally having its day in the sun, moving from the "back office" to the "front office". According to Google's Eric Schmidt, we now create, capture and replicate more data in two days than all of the information recorded from the dawn of time to the year 2003.
1928: IBM's innovative 80-column punch card stored nearly twice as much as its 50-column predecessor.
1947: Bing Crosby decided to do his radio show by recording it at his convenience on magnetic tape, rather than doing it live. This was the motivation for IBM researchers to investigate tape media, delivering the first commercial tape drive in 1952. One tape reel could hold the equivalent of 30,000 punch cards.
1956: the IBM RAMAC mainframe was the first computer to access data randomly with an externally-attached disk system, the "350 Disk Unit", which stored 5 million 7-bit characters (about 5MB) and weighed over 500 pounds. Compare that to today's cell phone that can store several GB of data in a handheld device.
1978: IBM invented Redundant Array of Independent Disks (RAID) through a collaboration with the University of California, Berkeley.
1993: IBM introduces the [IBM 9337 Disk Storage Array], the first external disk storage system for distributed operating systems. This was based on the Serial Storage Architecture [SSA] protocol.
1995: IBM launches products that support Storage Area Networks (SAN), based on the Fibre Channel Protocol. IBM's internal codenames for disk products were all names of sharks, and so our internal mantra was that a healthy storage diet was comprised of "Plenty of Fish and Fibre".
2010: IBM ships Easy Tier, the world's easiest-to-use sub-LUN automated tiering capability, for the IBM System Storage DS8700 disk system.
Storage is growing (in capacity) at 40 percent per year, but IT budgets are only growing (in dollars) by a measly 1 to 5 percent. She cited the success at [Sprint], presented at the October 2010 launch. By combining IBM SAN Volume Controller with a three-tier storage architecture, Sprint lowered their raw capacity from 10PB to 8.4PB, increasing utilization from 35 to 78 percent. This involved shrinking from six storage vendors to three, and reducing the total number of disk arrays from 166 down to 96. The resulting system has only 38 percent of their data on their most expensive Tier-1 storage; the rest now lives on less expensive Tier-2 and Tier-3 storage.
Companies are entering the era of Big Data with an insatiable appetite for collecting and analyzing data for marketplace insights. IBM [InfoSphere BigInsights], based on Apache Hadoop, has helped customers make sense of it all. Innovative technology, expertise and marketplace insight will provide the competitive path forward in the coming decade.
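The gap between 40 percent capacity growth and single-digit budget growth compounds quickly. The following Python sketch works through it, and also the Sprint figures. The five-year horizon and the 5 percent budget rate are illustrative choices of mine, and the sketch assumes that "utilization" means used capacity divided by raw capacity, which the presentation did not spell out.

```python
def compound(start: float, annual_rate: float, years: int) -> float:
    """Value after compounding `annual_rate` once per year for `years` years."""
    return start * (1.0 + annual_rate) ** years


def used_capacity_pb(raw_pb: float, utilization: float) -> float:
    """Capacity actually holding data, assuming utilization = used / raw."""
    return raw_pb * utilization


if __name__ == "__main__":
    years = 5  # illustrative horizon, not a figure from the presentation
    capacity = compound(1.0, 0.40, years)  # capacity growing 40 percent per year
    budget = compound(1.0, 0.05, years)    # budget growing 5 percent per year
    print(f"capacity after {years} years: {capacity:.2f}x")
    print(f"budget after {years} years:   {budget:.2f}x")
    print(f"cost per TB must fall to {budget / capacity:.0%} of today's level")
    print(f"Sprint before: {used_capacity_pb(10.0, 0.35):.2f} PB in use")
    print(f"Sprint after:  {used_capacity_pb(8.4, 0.78):.2f} PB in use")
```

Under that assumed definition of utilization, Sprint went from about 3.5PB of data on 10PB raw to about 6.55PB of data on 8.4PB raw, so they stored more data on less hardware.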
Storage Challenges and Opportunities in 2011 and Beyond
I always enjoy hearing Stan Zaffos, Gartner Research VP, present at the annual [Data Center Conference] in Las Vegas every December. His analysis and research focuses on storage systems and emerging storage technologies.
Stan provided his perspective on the storage industry. He suggested a top-down approach, based on the market trends that Gartner is closely monitoring. He suggests focusing heavily on managing data growth, using SLAs to improve efficiency, and following Gartner's recommended actions. His statement, "If something is not sustainable, then it is unsustainable," resonated well with the audience. His three key points:
Design to meet but not exceed Service Level Agreements (SLAs)
Re-evaluate your ratio of SAN versus NAS based on growth of unstructured data content
Explore the variety of Cloud options available.
Those of us who have been in this business a long time recognize that the problems haven't changed, just the dimensions. When in the past three decades were IT budgets generous and plentiful? When was there more than enough IT staff to handle all the requests in a timely manner? When hasn't there been a period of information growth? Gartner's analysis shows that external controller-based (ECB) disk storage (RAID-protected disk systems) is growing revenue at 8.7 percent. Raw TBs of disk capacity are growing at 55 percent, and are expected to reach 100 Exabytes by 2015.
SAN has four times more revenue than NAS today, but NAS is growing faster. NAS was only 9 percent marketshare in 2010, but is projected to grow to 32 percent by 2015. SAN can offer higher price/performance for traditional OLTP and database workloads, but NAS is better suited for unstructured data, backups and archives, assisted by storage efficiency features like real-time compression and data deduplication. Which industries create the most unstructured data? The ones involved in filling out forms! This includes government, insurance agencies, manufacturing, mining and pharmaceuticals.
The phrase "good enough" should no longer be considered an insult. Too often IT departments design solutions that far exceed negotiated Service Level Agreements (SLAs), and they should instead focus on just meeting them. Modular storage systems are often sufficient for most workloads. Slower 7200RPM SATA disks can be one third the price of faster 15K RPM Fibre Channel drives, and often offer sufficient performance for the tasks required. Unified storage, such as IBM N series, can help simplify capacity planning, as storage can be re-purposed if different workloads grow at different rates. The key is to focus on meeting SLAs based on the price-vs-risk factor. Take a minimalist approach with fewer SLAs, fewer management classes, and fewer storage vendors.
Stan suggests a two-pronged approach: Capacity management through content analytics and classification, and Efficient Utilization through Thin Provisioning, storage virtualization, Quality of Service (QoS), compression and deduplication capabilities. These features will be ubiquitous by 2013. If you are worried that these technologies mean more information packed onto fewer devices, Stan's response was "If it's not there, it can't break." Storing data on fewer disks or tape cartridges means less chance something will fail.
Stan feels IT shops using Thin Provisioning should continue to charge their end-users for what they ask for (the full allocation request) rather than for what the thin-provisioned amount actually is on the storage devices themselves. For example, if someone asks for a 100GB LUN to be allocated to their system, but this only takes up 30GB of actual data space, charge back the full 100GB!
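Stan's chargeback recommendation can be sketched in a few lines of Python. This is only an illustration of the policy: the `ThinVolume` class, the volume name and the rate of 0.50 per GB are hypothetical, while the 100GB and 30GB figures come from his example.

```python
from dataclasses import dataclass


@dataclass
class ThinVolume:
    name: str
    allocated_gb: float  # what the end-user asked for
    used_gb: float       # what is physically consumed on the storage device


def chargeback(volume: ThinVolume, rate_per_gb: float) -> float:
    """Bill on the full allocation request, not on physical usage."""
    return volume.allocated_gb * rate_per_gb


def physical_cost(volume: ThinVolume, rate_per_gb: float) -> float:
    """What billing on actual consumption would have produced instead."""
    return volume.used_gb * rate_per_gb


if __name__ == "__main__":
    lun = ThinVolume(name="app01_lun", allocated_gb=100, used_gb=30)
    rate = 0.50  # hypothetical price per GB per month
    print(f"billed on allocation: {chargeback(lun, rate):.2f}")
    print(f"billed on usage:      {physical_cost(lun, rate):.2f}")
```

Billing on the allocation keeps the incentive where Stan wants it: end-users still pay for what they request, so thin provisioning does not encourage inflated requests.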
It can take five years for new technology to get 50 percent adopted. The Romans took eight years to build the [Colosseum]. His research on "network convergence" found that 42 percent planned to use iSCSI, 32 percent planned to use Fibre Channel over Ethernet (FCoE) or other Top-of-Rack (TOR) converged switches, and 16 percent were looking for full convergence of servers, switches and storage. Features like IBM Easy Tier automatic sub-LUN tiering were introduced later, and so have not been adopted as widely as other features like Thin Provisioning that have been around since the 1990's IBM RAMAC Virtual Array.
Stan felt that Public and Private clouds were two different approaches. Public clouds offer reservation-less provisioning. Private clouds offer improved agility, but can be more complex to set up, and carry the risk of idle capacity similar to traditional IT datacenter deployments. Storage and File virtualization should be considered a pre-req for adopting Cloud technologies.
Storage IT teams need to adopt more than just technical skills. They need to learn about legal and government regulatory compliance issues, financial considerations, and would even benefit from doing some "marketing". Why marketing? Because often IT departments need end-users to change their attitudes and behaviours, and this can be accomplished through internal marketing campaigns.
Over on the Tivoli Storage Blog, there is an exchange over the concept of a "Storage Hypervisor". This started with fellow IBMer Ron Riffe's blog post [Enabling Private IT for Storage Cloud -- Part I], with a promise to provide parts 2 and 3 in the next few weeks. Here's an excerpt:
"Storage resources are virtualized. Do you remember back when applications ran on machines that really were physical servers (all that “physical” stuff that kept everything in one place and slowed all your processes down)? Most folks are rapidly putting those days behind them.
In August, Gartner published a paper [Use Heterogeneous Storage Virtualization as a Bridge to the Cloud] that observed “Heterogeneous storage virtualization devices can consolidate a diverse storage infrastructure around a common access, management and provisioning point, and offer a bridge from traditional storage infrastructures to a private cloud storage environment” (there’s that “cloud” language). So, if I’m going to use a storage hypervisor as a first step toward cloud enabling my private storage environment, what differences should I expect? (good question, we get that one all the time!)
The basic idea behind hypervisors (server or storage) is that they allow you to gather up physical resources into a pool, and then consume virtual slices of that pool until it’s all gone (this is how you get the really high utilization). The kicker comes from being able to non-disruptively move those slices around. In the case of a storage hypervisor, you can move a slice (or virtual volume) from tier to tier, from vendor to vendor, and now, from site to site all while the applications are online and accessing the data. This opens up all kinds of use cases that have been described as “cloud”. One of the coolest is inter-site application migration.
A good storage hypervisor helps you be smart.
Application owners come to you for storage capacity because you’re responsible for the storage at your company. In the old days, if they requested 500GB of capacity, you allocated 500GB off of some tier-1 physical array – and there it sat. But then you discovered storage hypervisors! Now you tell that application owner he has 500GB of capacity… What he really has is a 500GB virtual volume that is thin provisioned, compressed, and backed by lower-tier disks. When he has a few data blocks that get really hot, the storage hypervisor dynamically moves just those blocks to higher tier storage like SSD’s. His virtual disk can be accessed anywhere across vendors, tiers and even datacenters. And in the background you have changed the vendor storage he is actually sitting on twice because you found a better supplier. But he doesn’t know any of this because he only sees the 500GB virtual volume you gave him. It’s 'in the cloud'."
"Let’s start with a quick walk down memory lane. Do you remember what your data protection environment looked like before virtualization? There was a server with an operating system and an application… and that thing had a backup agent on it to capture backup copies and send them someplace (most likely over an IP network) for safe keeping. It worked, but it took a lot of time to deploy and maintain all the agents, a lot of bandwidth to transmit the data, and a lot of disk or tapes to store it all. The topic of data protection has modernized quite a bit since then.
Fast forward to today. Modernization has come from three different sources – the server hypervisor, the storage hypervisor and the unified recovery manager. The end result is a data protection environment that captures all the data it needs in one coordinated snapshot action, efficiently stores those snapshots, and provides for recovery of just about any slice of data you could want. It’s quite the beautiful thing."
At this point, you might scratch your head and ask "Does this Storage Hypervisor exist, or is this just a theoretical exercise?" The answer of course is "Yes, it does exist!" Just like VMware offers vSphere and vCenter, IBM offers block-level disk virtualization through the SAN Volume Controller (SVC) and Storwize V7000 products, with full management support from Tivoli Storage Productivity Center Standard Edition.
SVC has supported every release of VMware since the 2.5 version. IBM is the leading reseller of VMware, so it makes sense for IBM and VMware development to collaborate and make sure all the products run smoothly together. SVC presents volumes that can be formatted with the VMFS file system to hold your VMDK files, accessible via FCP protocol. IBM and VMware have some key synergies:
Management integration with Tivoli Storage Productivity Center and VMware vCenter plug-in
VAAI support: Hardware-assisted locking, hardware-assisted zeroing, and hardware-assisted copying. Some of the competitors, like EMC VPLEX, don't have this!
Space-efficient FlashCopy. Let's say you need 250 VM images, all running a particular level of Windows. A boot volume of 20GB each would consume 5000GB (5 TB) of capacity. Instead, create a Golden Master volume. Then, take 249 copies with space-efficient FlashCopy, which only consumes space for the modified portions of the new volumes. For each copy, make the necessary changes like unique hostname and IP address, changing only a few blocks of data each. The end result? 250 unique VM boot volumes in less than 25GB of space, a 200:1 reduction!
Support for VMware's Site Recovery Manager using SVC's Metro Mirror or Global Mirror features for remote-distance replication.
Data center federation. SVC allows you to seamlessly do vMotion from one datacenter to another using its "stretched cluster" capability. Basically, SVC makes a single image of the volume available to both locations, and stores two physical copies, one in each location. You can lose either datacenter and still have uninterrupted access to your data. VMware's HA or Fault Tolerance features can kick in, same as usual.
But unlike tools that work only with VMware, IBM's storage hypervisor works with a variety of server virtualization technologies, including Microsoft Hyper-V, Xen, OracleVM, Linux KVM, PowerVM, z/VM and PR/SM. This is important, as a recent poll on the Hot Aisle blog indicates that [44 percent run 2 or more server hypervisors]!
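The space-efficient FlashCopy arithmetic in the list above can be checked with a short Python sketch. The 250 images and 20GB boot volume come from the example; the 20MB of changed blocks per copy and the decimal units (1000MB per GB) are my assumptions, picked so the total stays under the 25GB quoted.

```python
MB_PER_GB = 1000  # decimal units, an assumption for this sketch


def full_copy_gb(images: int, boot_gb: float) -> float:
    """Capacity needed when every image gets its own full boot volume."""
    return images * boot_gb


def space_efficient_gb(images: int, boot_gb: float, changed_mb_per_copy: float) -> float:
    """One Golden Master plus only the changed blocks of every other copy."""
    copies = images - 1
    return boot_gb + copies * changed_mb_per_copy / MB_PER_GB


if __name__ == "__main__":
    full = full_copy_gb(250, 20)
    thin = space_efficient_gb(250, 20, changed_mb_per_copy=20)  # assumed delta
    print(f"full copies:      {full:.0f} GB")
    print(f"space-efficient:  {thin:.2f} GB")
    print(f"reduction ratio:  {full / thin:.0f}:1")
```

The ratio is sensitive to how much each copy diverges from the Golden Master, so treat 200:1 as the best case for nearly identical boot volumes.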
Join the conversation! The virtual dialogue on this topic will continue in a [live group chat] this Friday, September 23, 2011 from 12 noon to 1pm EDT. Join me and about 20 other top storage bloggers, key industry analysts and IBM Storage subject matter experts to discuss storage hypervisors and get questions answered about improving your private storage environment.
Last week, fellow IBMer Ron Riffe started his three-part series on the Storage Hypervisor. I discussed Part I already in my previous post [Storage Hypervisor Integration with VMware]. We wrapped up the week with a Live Chat with over 30 IT managers, industry analysts, independent bloggers, and IBM storage experts.
"The idea of shopping from a catalog isn’t new and the cost efficiency it offers to the supplier isn’t new either. Public storage cloud service providers seized on the catalog idea quickly as both a means of providing a clear description of available services to their clients, and of controlling costs. Here’s the idea… I can go to a public cloud storage provider like Amazon S3, Nirvanix, Google Storage for Developers, or any of a host of other providers, give them my credit card, and get some storage capacity. Now, the “kind” of storage capacity I get depends on the service level I choose from their catalog.
Most of today’s private IT environments represent the complete other end of the pendulum swing – total customization. Every application owner, every business unit, every department wants to have complete flexibility to customize their storage services in any way they want. This expectation is one of the reasons so many private IT environments have such a heavy mix of tier-1 storage. Since there is no structure around the kind of requests that are coming in, the only way to be prepared is to have a disk array that could service anything that shows up. Not very efficient… There has to be a middle ground.
Private storage clouds are a little different. Administrators we talk to aren’t generally ready to let all their application owners and departments have the freedom to provision new storage on their own without any control. In most cases, new capacity requests still need to stop off at the IT administration group. But once the request gets there, life for the IT administrator is sweet!
Here comes the request from an application owner for 500GB of new “Database” capacity (one of the options available in the storage service catalog) to be attached to some server. After appropriate approvals, the administrator can simply enter the three important pieces of information (type of storage = “Database”, quantity = 500GB, name of the system authorized to access the storage) and click the “Go” button (in TPC SE it’s actually a “Run now” button) to automatically provision and attach the storage. No more complicated checklists or time consuming manual procedures.
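The three-input request described in the excerpt above (type of storage, quantity, authorized system) can be sketched as a lookup against a service catalog. To be clear, this is not the Tivoli Storage Productivity Center interface. The catalog entries, attribute names and the host name `dbserver01` are all hypothetical, meant only to show how a catalog turns a short request into a full provisioning plan.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceClass:
    tier: str
    thin_provisioned: bool
    mirrored: bool


# Hypothetical storage service catalog; a real one is defined by the IT group.
CATALOG = {
    "Database": ServiceClass(tier="tier-1", thin_provisioned=True, mirrored=True),
    "Archive": ServiceClass(tier="tier-3", thin_provisioned=True, mirrored=False),
}


def build_provisioning_plan(storage_type: str, quantity_gb: int, host: str) -> dict:
    """Expand the three request inputs into a complete provisioning plan."""
    if storage_type not in CATALOG:
        raise KeyError(f"'{storage_type}' is not in the service catalog")
    if quantity_gb <= 0:
        raise ValueError("quantity_gb must be positive")
    service = CATALOG[storage_type]
    return {
        "host": host,
        "size_gb": quantity_gb,
        "tier": service.tier,
        "thin_provisioned": service.thin_provisioned,
        "mirrored": service.mirrored,
    }


if __name__ == "__main__":
    print(build_provisioning_plan("Database", 500, "dbserver01"))
```

The point of the catalog is that every decision beyond the three inputs was made once, in advance, rather than negotiated per request.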
A storage hypervisor increases the utilization of storage resources, and optimizes what is most scarce in your environment. For Linux, UNIX and Windows servers, you typically see utilization rates of 20 to 35 percent, and this can be raised to 55 to 80 percent with a storage hypervisor. But what is most scarce in your environment? Time! In a competitive world, it is not big animals eating smaller ones as much as fast ones eating the slow.
Want faster time-to-market? A storage hypervisor can help reduce the time it takes to provision storage, from weeks down to minutes. If your business needs to react quickly to changes in the marketplace, you certainly don't want your IT infrastructure to slow you down like a boat anchor.
Want more time with your friends and family? A storage hypervisor can migrate the data non-disruptively, during the week, during the day, during normal operating hours, instead of scheduling down-time on evenings and weekends. As companies adopt a 24-by-7 approach to operations, there are fewer and fewer opportunities in the year for scheduled outages. Some companies get stuck paying maintenance after their warranty expires, because they were not able to move the data off in time.
Want to take advantage of the new Solid-State Drives? Most admins don't have time to figure out what applications, workloads or indexes would best benefit from this new technology. Let your storage hypervisor's automated tiering do this for you! In fact, a storage hypervisor can gather enough performance and usage statistics to determine the characteristics of your workload in advance, so that you can predict whether solid-state drives are right for you, and how much benefit you would get from them.
Want more time spent on strategic projects? A storage hypervisor allows any server to connect to any storage. This eliminates the time wasted to determine when and how, and lets you focus on the what and why of your more strategic transformational projects.
If this sounds all too familiar, it is similar to the benefits that one gets from a server hypervisor -- better utilization of CPU resources, optimizing the management and administration time, with the agility and flexibility to deploy new technologies in and decommission older ones out.
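The idea of using gathered statistics to predict the benefit of solid-state drives, mentioned above, can be illustrated with a simple sketch. This is not how IBM Easy Tier or any particular product decides placement. It is a naive ranking of extents by I/O count, with made-up extent names and counts, to show the kind of estimate such statistics make possible.

```python
from typing import Dict, List, Tuple


def plan_ssd_placement(io_counts: Dict[str, int], ssd_slots: int) -> Tuple[List[str], float]:
    """Pick the hottest extents that fit on SSD and estimate the I/O they absorb.

    io_counts maps an extent name to its observed I/O count.
    Returns the chosen extents and the fraction of all I/O they account for.
    """
    ranked = sorted(io_counts, key=lambda extent: io_counts[extent], reverse=True)
    hot = ranked[:ssd_slots]
    total = sum(io_counts.values())
    served = sum(io_counts[extent] for extent in hot)
    fraction = served / total if total else 0.0
    return hot, fraction


if __name__ == "__main__":
    # Hypothetical statistics: one extent receives most of the I/O.
    observed = {"extent0": 900, "extent1": 50, "extent2": 30, "extent3": 20}
    hot, fraction = plan_ssd_placement(observed, ssd_slots=1)
    print(f"move to SSD: {hot}")
    print(f"share of I/O served from SSD: {fraction:.0%}")
```

When a small number of extents carries most of the I/O, a small amount of solid-state capacity goes a long way; when the I/O is evenly spread, the same estimate tells you not to bother.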
"Server virtualization is a fairly easy concept to understand: Add a layer of software that allows processing capability to work across multiple operating environments. It drives both efficiency and performance because it puts to good use resources that would otherwise sit idle.
Storage virtualization is a different animal. It doesn't free up capacity that you didn't know you had. Rather, it allows existing storage resources to be combined and reconfigured to more closely match shifting data requirements. It's a subtle distinction, but one that makes a lot of difference between what many enterprises expect to gain from the technology and what it actually delivers."
Jon Toigo on his DrunkenData blog brings back the sanity with his post [Once More Into the Fray]. Here is an excerpt:
"What enables me to turn off certain value-add functionality is that it is smarter and more efficient to do these functions at a storage hypervisor layer, where services can be deployed and made available to all disk, not to just one stand bearing a vendor’s three letter acronym on its bezel. Doesn’t that make sense?
I think of an abstraction layer. We abstract away software components from commodity hardware components so that we can be more flexible in the delivery of services provided by software rather than isolating their functionality on specific hardware boxes. The latter creates islands of functionality, increasing the number of widgets that must be managed and requiring the constant inflation of the labor force required to manage an ever expanding kit. This is true for servers, for networks and for storage.
Can we please get past the BS discussion of what qualifies as a hypervisor in some guy’s opinion and instead focus on how we are going to deal with the reality of cutting budgets by 20% while increasing service levels by 10%. That, my friends, is the real challenge of our times."
Did you miss out on last Friday's Live Chat? We are doing it again this Friday, covering parts I and II of Ron's posts, so please join the conversation! The virtual dialogue on this topic will continue in another [Live Chat] on September 30, 2011 from 12 noon to 1pm Eastern Time.
Wrapping up my week's theme of storage optimization, I thought I would help clarify the confusion between data reduction and storage efficiency. I have seen many articles and blog posts that either use these two terms interchangeably, as if they were synonyms for each other, or as if one is merely a subset of the other.
Data Reduction is LOSSY
By "Lossy", I mean that reducing data is an irreversible process. Details are lost, but insight is gained. In his paper, [Data Reduction Techniques], Rajana Agarwal defines this simply:
"Data reduction techniques are applied where the goal is to aggregate or amalgamate the information contained in large data sets into manageable (smaller) information nuggets."
Data reduction has been around since the 18th century.
Take for example this histogram from [SearchSoftwareQuality.com]. We have reduced ninety individual student scores down to just five numbers, the counts in each range. This can provide for easier comprehension and comparison with other distributions.
The process is lossy. I cannot determine or re-create an individual student's score from these five histogram values.
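To make the point concrete, here is a minimal sketch of that kind of reduction. The scores, the bin width of 20 and the function name are my own illustrative assumptions, not the data behind the histogram above. Two different sets of scores collapse to the same counts, which is why the original scores cannot be recovered.

```python
def histogram(scores, bin_width=20, max_score=100):
    """Reduce a list of scores to a count per range (a lossy reduction)."""
    n_bins = max_score // bin_width
    counts = [0] * n_bins
    for s in scores:
        # A perfect score lands in the top bin rather than a new one.
        idx = min(s // bin_width, n_bins - 1)
        counts[idx] += 1
    return counts

# Hypothetical scores, made up for illustration only.
scores_a = [55, 62, 71, 88, 93]
scores_b = [59, 60, 79, 80, 99]

print(histogram(scores_a))
print(histogram(scores_b))
```

Both calls print the same five counts even though the two lists differ, so nothing in the counts tells you which list you started with.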
This next example, compliments of [Michael Hardy], represents another form of data reduction known as ["linear regression analysis"]. The idea is to take a large set of data points between two variables, the x axis along the horizontal and the y axis along the vertical, and find the line that fits best. Thus the data is reduced from many points to just two numbers, slope (a) and intercept (b), resulting in an equation of y=ax+b.
The process is lossy. I cannot determine or re-create any original data point from this slope and intercept equation.
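As a rough sketch of the idea, the ordinary least-squares formulas below reduce a handful of points to a slope and an intercept. The four data points are invented for illustration, and fit_line is a hypothetical helper name of my own.

```python
def fit_line(points):
    """Ordinary least squares: reduce (x, y) points to slope a and intercept b."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b

# Hypothetical, slightly noisy data points.
points = [(0, 1.2), (1, 2.8), (2, 5.2), (3, 6.8)]
a, b = fit_line(points)

# The fitted line only approximates each point; the residuals are what was lost.
residuals = [y - (a * x + b) for x, y in points]
print(a, b, residuals)
```

The residuals are not zero, so the two numbers a and b alone cannot give back the four original points.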
This last example, from [Yahoo Finance], reduces millions of stock trades to a single point per day, typically the closing price, to show the overall growth trend over the course of the past year.
The process is lossy. Even if I knew the low, high and closing price of a particular stock on a particular day, I would not be able to determine or re-create the actual price paid for individual trades that occurred.
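The same thing can be sketched for a day of trading. The timestamps and prices here are invented, and daily_summary is a hypothetical helper, not anything Yahoo Finance provides.

```python
def daily_summary(trades):
    """Reduce one day's (time, price) trades, in time order, to low/high/close."""
    prices = [price for _, price in trades]
    return {"low": min(prices), "high": max(prices), "close": prices[-1]}

# Two hypothetical trading days with different individual trades.
trades_a = [("09:30", 10.0), ("11:00", 12.0), ("15:59", 11.0)]
trades_b = [("09:30", 12.0), ("10:15", 10.0), ("13:00", 10.5), ("15:59", 11.0)]

print(daily_summary(trades_a))
print(daily_summary(trades_b))
```

Different trades, even a different number of trades, produce an identical low, high and close.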
Storage Efficiency is LOSSLESS
By contrast, there are many IT methods that can be used to store data in ways that are more efficient, without losing any of the fine detail. Here are some examples:
Thin Provisioning: Instead of storing 30GB of data on 100GB of disk capacity, you store it on 30GB of capacity. All of the data is still there, just none of the wasteful empty space.
Space-efficient Copy: Instead of copying every block of data from source to destination, you copy over only those blocks that have changed since the copy began. The blocks not copied are still available on the source volume, so there is no need to duplicate this data.
Archiving and Space Management: Data can be moved out of production databases and stored elsewhere on disk or tape. Enough XML metadata is carried along so that there is no loss in the fine detail of what each row and column represent.
Data Deduplication: The idea is simple. Find large chunks of data that contain the same exact information as an existing chunk already stored, and merely set a pointer to avoid storing the duplicate copy. This can be done in-line as data is written, or as a post-process task when things are otherwise slow and idle.
When data deduplication first came out, some lawyers were concerned that this was a "lossy" approach, that somehow documents were coming back without some of their original contents. How else can you explain storing 25PB of data on only 1PB of disk?
(In some countries, companies must retain data in their original file formats, as there is concern that converting business documents to PDF or HTML would lose some critical "metadata" information such as modification dates, authorship information, underlying formulae, and so on.)
Well, the concern applies only to those data deduplication methods that calculate a hash code or fingerprint, such as EMC Centera or EMC Data Domain. If the hash code of new incoming data matches the hash code of existing data, then the new data is assumed to be identical and is discarded. Such false matches are rare, and I have only read of a few occurrences of unique data being discarded in the past five years. To ensure full integrity, the IBM ProtecTIER data deduplication solution and IBM N series disk systems chose instead to do full byte-for-byte comparisons.
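Here is a toy sketch of the two approaches side by side. It is purely illustrative and is not how any of the products named above are implemented. The class name, the SHA-256 fingerprint, the 4096-byte chunks and the verify_bytes switch are all my own assumptions.

```python
import hashlib

class DedupStore:
    """Toy chunk store: one stored copy per unique chunk, pointers for repeats."""

    def __init__(self, verify_bytes=True):
        self.chunks = {}  # fingerprint -> list of distinct chunks with that fingerprint
        self.verify_bytes = verify_bytes

    def put(self, chunk):
        fp = hashlib.sha256(chunk).hexdigest()
        bucket = self.chunks.setdefault(fp, [])
        for i, existing in enumerate(bucket):
            # Hash-only mode trusts the fingerprint match;
            # verify mode also compares byte for byte.
            if not self.verify_bytes or existing == chunk:
                return (fp, i)  # pointer to the copy already stored
        bucket.append(chunk)
        return (fp, len(bucket) - 1)

    def get(self, ref):
        fp, i = ref
        return self.chunks[fp][i]

    def stored_bytes(self):
        return sum(len(c) for bucket in self.chunks.values() for c in bucket)

data = [b"A" * 4096, b"B" * 4096, b"A" * 4096]
store = DedupStore()
refs = [store.put(chunk) for chunk in data]
restored = [store.get(ref) for ref in refs]
print(store.stored_bytes(), restored == data)
```

Three chunks go in, two are stored, and all three come back byte for byte. That is the sense in which deduplication is lossless.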
Compression: There are both lossy and lossless compression techniques. The lossless Lempel-Ziv algorithm is the basis for the LTO-DC algorithm used in IBM's Linear Tape Open [LTO] tape drives, the Streaming Lossless Data Compression (SLDC) algorithm used in IBM's [Enterprise-class TS1130] tape drives, and the Adaptive Lossless Data Compression (ALDC) used by the IBM Information Archive for its disk pool collections.
Last month, IBM announced that it was [acquiring Storwize]. Its Random Access Compression Engine (RACE) is also a lossless compression algorithm based on Lempel-Ziv. As servers write files, Storwize compresses those files and passes them on to the destination NAS device. When files are read back, Storwize retrieves and decompresses the data back to its original form.
As with tape, the savings from compression can vary, typically from 20 to 80 percent. In other words, 10TB of primary data could take up from 2TB to 8TB of physical space. To estimate what savings you might achieve for your mix of data types, try out the free [Storwize Predictive Modeling Tool].
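For a quick feel of both points, the sketch below uses Python's zlib module as a stand-in. Its DEFLATE format belongs to the Lempel-Ziv family, but it is not LTO-DC, SLDC, ALDC or RACE. The sample text and the physical_tb helper are my own assumptions, and highly repetitive sample text compresses far better than real data would.

```python
import zlib

def physical_tb(logical_tb, savings_pct):
    """Physical capacity needed after a given percentage of compression savings."""
    return logical_tb * (100 - savings_pct) / 100

# Lossless round trip: decompressing gives back exactly what was written.
original = b"storage efficiency is lossless. " * 1000
compressed = zlib.compress(original)
restored = zlib.decompress(compressed)
print(len(original), len(compressed), restored == original)

# The arithmetic behind the 2TB to 8TB range for 10TB of primary data.
print(physical_tb(10, 20), physical_tb(10, 80))
```

Savings of 20 percent leave 8TB on disk and savings of 80 percent leave 2TB, which matches the range quoted above.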
So why am I making a distinction on terminology here?
Data reduction is already a well-known concept among specific industries, like High-Performance Computing (HPC) and Business Analytics. IBM has the largest market share in supercomputers that do data reduction for all kinds of use cases, including scientific research, weather prediction, financial projections, and decision support systems. IBM has also recently acquired a lot of companies related to Business Analytics, such as Cognos, SPSS, CoreMetrics and Unica Corp. These use data reduction on large amounts of business and marketing data to help drive new sources of revenue, provide insight for new products and services, create more focused advertising campaigns, and help understand the marketplace better.
There are certainly enough methods of reducing the quantity of storage capacity consumed, like thin provisioning, data deduplication and compression, to warrant an "umbrella term" that refers to all of them generically. I would prefer we do not "overload" the existing phrase "data reduction" but rather come up with a new phrase, such as "storage efficiency" or "capacity optimization" to refer to this category of features.
IBM is certainly quite involved in both data reduction and storage efficiency. If any of my readers can suggest a better phrase, please comment below.
IBM is doing a bit of year-end housekeeping. The Storage Community (storagecommunity.org) will be discontinued as of January 1, 2017.
IBM will continue to host a community for all of its followers and contributors to share insights on the latest trends in storage at [ibm.co/StorageSolutions].
All of the most recent IBM content from storagecommunity.org will now be available at this new domain. IBM hopes that you will continue to engage in its community of storage industry thought leaders.
If you would like to contribute to the new community, please [register here]. Simply click the silhouette icon in the top right-hand corner of the page and select "register." Input your email address and create a password, then sign in. You will receive an email from IBM with further instructions to get you set up.
IBM's twitter handle (@SmarterStorage) will also be sunset as of January 1, 2017, but I encourage you to follow @IBMStorage, or my own twitter handle @az990tony, for the latest storage news and announcements from IBM.
Rich Bourdeau has written a nice article on InfoStor titled [Software as a Service (SaaS) meets Storage]. Last year, IBM acquired Arsenal Digital, and he mentions both in this article. It is interesting how this has evolved over the years.
Rent warehouse space for tapes
I remember when various companies offered remote storage for tapes. These would be temperature and humidity-controlled rooms, with access lists on who could bring tapes in, who could take tapes out, and so on. In the event of a disaster, someone would collect the appropriate tapes and take them to a recovery site location.
Rent online/nearline storage from a Storage Service Provider (SSP)
SSPs rented storage space on disk, or provided automated tape libraries that could be written to, with tapes being ejected and stored in temperature/humidity-controlled vaults. Electronic vaulting eliminates a lot of the issues with cartridge handling and transportation, and is more secure and faster. Rented disk space, based on a Gigabytes-per-month rate, could be used for whatever the customer wanted. If this was for backups or archives, then the customer had to have their own software, do their own processing at their own location, send the data to the remote storage as appropriate, and manage their own administration.
Backup-as-a-Service and Archive-as-a-Service
We are now seeing the SaaS model applied to mundane and routine storage management tasks. New providers can offer the software to send backups, the disk to write them to, and as needed the tape libraries and cartridges to roll over to when the disk space is full. Disk capacity can be sized so that the most recent backups are immediately accessible on disk for fast recovery.
The same concept can be applied to archives. The key difference between a backup and an archive is that backups are version-based. You might keep three versions of a backup, the most recent and two older copies, so that if something is wrong with the most recent copy, you can go back to an older one. The problem could be undetected corruption of the data itself, or a fault in the disk or tape media. An archive, on the other hand, is time-based. You want this data to be kept for a specific period of time, based on an event or fixed period of years.
Since BaaS and AaaS providers know what the data is, and have some idea of what the policies and usage patterns will be, they can optimize a storage solution that best meets service level agreements.
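The version-based versus time-based distinction can be sketched as two small policy functions. The function names, the three-version default and the seven-year retention period in the example are illustrative assumptions of mine, not any provider's actual policy engine.

```python
from datetime import date

def prune_backups(backup_dates, keep=3):
    """Version-based: keep the newest `keep` copies, however old they are."""
    return sorted(backup_dates, reverse=True)[:keep]

def expiry_date(archived_on, retention_years):
    """Time-based: the date on which an archived object may be deleted."""
    try:
        return archived_on.replace(year=archived_on.year + retention_years)
    except ValueError:
        # Archived on Feb 29 and the expiry year is not a leap year.
        return archived_on.replace(year=archived_on.year + retention_years, day=28)

def archive_expired(archived_on, retention_years, today):
    return today >= expiry_date(archived_on, retention_years)

backups = [date(2008, 1, 1), date(2008, 1, 2), date(2008, 1, 3), date(2008, 1, 4)]
kept = prune_backups(backups)
print(kept)
print(archive_expired(date(2001, 3, 15), 7, today=date(2008, 3, 14)))
```

A fourth backup pushes the oldest version out regardless of its age, while the archived object stays until its calendar date arrives no matter how many newer objects exist.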
Continuing this week's theme on Storage Area Networks, today I thought I would talk about the various terms we use for our equipment.
One area of confusion is the adjectives "entry-level", "midrange" and "enterprise-class". What do these mean? Well, as in the case of disk and tape, these three are all relative terms that are a combination of "small, medium, large" as well as "good, better, best".
Entry-level switches typically have a maximum of only 8-16 ports. Ports can connect the switch to a server, a storage device, or another switch. These are sometimes called "edge" switches, as they might be found in the most remote sections of an office campus, remote branches, or other isolated areas outside the primary data center.
Midrange switches typically have a maximum of 32-64 ports. More ports on a single switch means fewer switches (and fewer cables) to manage.
Enterprise-class switches are called "directors" to distinguish them from entry-level and midrange offerings. Directors have a maximum of 140-528 ports, and because so many devices or switches can be connected to them, they need to be extremely reliable. Directors are designed for 24x7 operation, with the ability to make most upgrades and configuration changes while the box is running (often referred to as "non-disruptive upgrades"). Availability is typically better than "five nines", or 99.999 percent, which means that the box will be up 99.999 percent of the time, or conversely, will be down no more than about 5.3 minutes per year.
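The "nines" arithmetic is easy to check. This small sketch assumes an average year of 365.25 days, and the function name is my own.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # 525,960 minutes in an average year

def downtime_minutes_per_year(availability_pct):
    """Maximum yearly downtime implied by an availability percentage."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)

for pct in (99.9, 99.99, 99.999):
    print(pct, round(downtime_minutes_per_year(pct), 2))
```

Each extra nine cuts the allowed downtime by a factor of ten, from roughly nine hours a year at three nines to a little over five minutes at five nines.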
If you are asking yourself "which size is right for my company?" or "is my company big enough for a director?" you are asking the wrong questions! Instead, determine a SAN configuration that meets your workload, and then decide the components for that design.
McData coined a phrase called "core/edge" design that is considered today as "Best Practice" throughout the industry. A good write-up can be found here at SearchStorage.com. Basically, you put your big beefy "core" directors in the center of the room, and then surround them with midrange switches, which in turn connect to "edge" switches, which then connect to the servers and storage near them. As you grow, this design can easily scale to grow with you.
So, if you need help implementing a SAN for the first time, or upgrading the one you have, call IBM, we can help!
I would fall into the "not for me" category, at least at this time. The iPhone is a GSM-capable phone with the ability to store 4GB or 8GB of music, photos and video, and it incorporates a 2 megapixel camera. Currently, I have separate components:
A cell phone that is GSM plus CDMA, with features like "speakerphone" which I use quite a lot, but NO camera.
A 7 megapixel camera, also very small, with removable memory cards.
A 60GB iPod, with music and photos. My model is older and doesn't handle videos.
Since I visit government agencies, research and development labs, and other places that don't allow cameras, I have to either choose a cell phone that does not have camera capability in it, or have a camera phone that I leave behind in the car or at the front desk. I have chosen to get cell phones with NO camera. So, NOT having a camera is a primary feature I look for, but this is getting harder and harder these days. I don't know if Apple plans to have a non-camera version of their iPhone, but the lack of one would be a deal-breaker for me.
I do carry a separate camera, and where it is permissible, use it separately. This is especially useful if you do a lot of whiteboard or flipchart presentations, and want to capture what you have written for later. (For a great example of how effectively whiteboards can be used, check out these videos from UPS.) A picture is worth a thousand words, and it is easier to convey an idea with pictures, especially in countries that may not speak English. Last month, I got a 7 megapixel camera to replace my 5 megapixel. For my work, 2 megapixel as found in the iPhone is not detailed enough.
As for my iPod, I enjoy that I can carry 60GB of music and photos. When I go on vacations, I can bring my camera and iPod, and connect the two, transferring and viewing the pictures that I take. I can easily free up 5-10 GB of space on my iPod for photos in preparation for a trip, then replace that with music when I am back at home. I also use my iPod as a remote disk drive for my laptop on business trips. Again, the 4GB and 8GB may not be enough for what I need.
Printers were never converged into Personal Computers, but they did have their own convergence. I have a multi-function printer/scanner/fax machine. I used to have separate printer, scanner and fax machines, but now the technology is so inexpensive that it got all combined into one solution.
The same is happening for Storage Area Networking gear.
Thanks to Fibre Channel, switches and directors can handle both SCSI commands (FCP) and CCW commands (FICON). This allows the mainframe and distributed systems to converge their traffic onto a single network, and is less expensive than trying to maintain one network for the mainframes, and another for the distributed platforms.
On the SCSI side, there are now switches that let you have pluggable ports of different flavors. For example, you can have some ports be Fibre Channel to receive FCP, and other ports to be Ethernet to carry iSCSI. iSCSI is a protocol co-developed between IBM and Cisco to carry SCSI commands over Ethernet. Since most computers already have Ethernet "network interface cards" and most buildings are already wired with an Ethernet infrastructure, this provides a less expensive alternative to Fibre Channel.
Routers, and combination Router/Switches, can send all the FCP/FICON/iSCSI traffic over various long distances to remote data centers, using either iFCP or FCIP protocols. This is a less expensive alternative to dropping your own private "dark fiber" between the two locations, which often involves negotiating access rights to dig trenches through other people's property.
Which brings me back to Apple's iPhone. One device can make calls, watch video, and download webpages all because the networks have converged into sending all data in "packets". The network just routes packets from one place to another. It doesn't care that a packet is a voice packet, a video packet or a webpage packet. It doesn't matter.
"Users can pay for groceries and other purchases by swiping a phone over a reader that electronically communicates with a microchip on the phone. Phone owners confirm the purchase with the push of a button and the deal is complete.
The platform is the result of many years of trials around the world and will enable mobile contactless payments, remote payments, person-to-person payments, and mobile coupons."
Continuing on my theme of storage area networking, today I thought I would cover storage networking at home.
Before the PC, corporate end-users had dumb terminals (displays) connected to mainframes (servers) that were then connected to external disk and tape (storage devices). This was all done with direct cable connections, then later through networks. The PC solved this by putting the display, server and storage into one unit, making it more accessible to smaller businesses and individuals.
Many years ago, Microsoft started out with the vision "A PC on every desktop". The primary reason we even have networks is that while everyone might have had their own PC, not everyone had their own printer. (Printers used to be part of IBM's storage division, which we explained as "storage on PAPER"!) Maybe if Microsoft's vision had been "A PC and printer on every desktop", history might have turned out differently.
Disclaimer: IBM has close business relationships with both Apple and Microsoft and others, providing the chips inside some of their products. I discuss them here not only because I am trying to get you to buy their products, and let IBM benefit indirectly from their success, but because they are newsworthy, and relevant to the topic at hand.
The "Apple TV" is not a TV at all, but rather a server, one that lets your television (your dumb terminal) access the video, audio and photos stored on your Mac or iPod (the storage device), all through a home network. (Sound familiar?)
Bill Gates from Microsoft gave the keynote, and this is probably his last appearance, as he is retiring in 2008, as we are reminded by this funny video, to move on to bigger, and better things. It is perhaps fitting that his retirement aligns with the end of the era for the PC.
Microsoft unveiled their Microsoft Windows Home Server, again a server that connects your television (dumb terminal) with your PC or Zune (storage device) all through your home network. (Sound familiar, again?)
Whereas Apple above pretty much shunned the gaming community, Microsoft embraced it with their internet-enabled Xbox360. Microsoft sold 10.4 million of these last year, which was 400 thousand more than they projected.
Our SAN technology partner Cisco wants to get in on this "home networking craze", as written about in InfoWorld and Cnet.
My take on all this...the consumer electronics industry is taking cues from IBM's mainframe business. Not the first time this has happened, and probably not the last.
I already access photos and audio with my Tivo, from both my Mac AND my PC, so not much new here for me. Getting my home network connected was one of my tech highlights of 2006 and organizing my audio content was done with ILM for my iPod.
Bypassing the PC, by being able to have your television, handheld or phone access data directly, will greatly increase the demand for storage from businesses that provide information and content, and for storage networking technology in the home. It will be interesting to see how this all plays out in 2007.