Originating Author: David Vellante
Co-author: David Floyer
There has been significant discussion in the industry about storage optimization and making better use of storage capacity. A number of storage vendors have successfully marketed data de-duplication for offline/backup applications, reducing the volume of backup data by a factor of 5-15:1, according to Wikibon user input.
Data de-duplication as applied to backup use cases is different from compression. Compression actually changes the data, using algorithms to create a computational byproduct and write fewer bits. With de-duplication, data is not changed; rather, copies 2 through N are deleted and replaced with pointers to a 'master' instance of the data. Single-instancing can be thought of as synonymous with de-duplication.
Traditional data de-duplication technologies, however, are generally unsuitable for online or primary storage applications, because the overhead of the algorithms required to de-duplicate data unacceptably lengthens response times. For example, popular data de-duplication solutions such as those from Data Domain, ProtecTIER (Diligent/IBM), FalconStor and EMC/Avamar are not used for reducing capacities of online storage.
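The pointer-to-master idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `deduplicate` and `rehydrate` are hypothetical, blocks are keyed by a SHA-256 digest (a choice made here for brevity, not something the products above are known to use), and hash collisions are ignored.

```python
import hashlib

def deduplicate(blocks):
    """Store each distinct block once; return (store, pointers)."""
    store = {}      # digest -> the single 'master' copy of a block
    pointers = []   # one digest per logical block, in original order
    for block in blocks:
        # Assumption: SHA-256 as the content key, collisions ignored.
        digest = hashlib.sha256(block).hexdigest()
        store.setdefault(digest, block)  # copies 2..N are not stored again
        pointers.append(digest)
    return store, pointers

def rehydrate(store, pointers):
    """Rebuild the original block sequence by following the pointers."""
    return [store[p] for p in pointers]

blocks = [b"alpha", b"beta", b"alpha", b"alpha", b"beta"]
store, pointers = deduplicate(blocks)
print(len(blocks), "logical blocks,", len(store), "stored")
```

Note that the stored blocks themselves are untouched; only the redundant copies disappear, which is the distinction from compression drawn above.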
There are three primary approaches to optimizing online storage, reducing capacity requirements and improving overall storage efficiencies. Generally, Wikibon refers to these in the broad category of on-line or primary data compression, although the industry will often use terms like de-duplication (e.g. NetApp A-SIS) and single instancing. These data reduction technologies are illustrated by the following types of solutions: array-based data reduction (e.g. NetApp A-SIS), host-managed offline data compression (e.g. Ocarina), and in-line data compression (e.g. IBM Real-time Compression).
Unlike some data reduction solutions for backup, these three approaches use lossless data compression algorithms, meaning that the original bits can always be reconstructed exactly.
Each of these approaches has certain benefits and drawbacks. The obvious benefit is reduced storage costs. However, each solution places another technology layer in the network and increases complexity and risk.
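As a small illustration of what "lossless" means in practice, the following sketch uses Python's standard `zlib` module as a stand-in compressor (an assumption; it is not the algorithm of any product discussed here). The sample data is deliberately repetitive, so the ratio it prints is not representative of real workloads.

```python
import zlib

# Highly repetitive sample data, chosen only to make the effect visible.
original = b"storage optimization " * 200

compressed = zlib.compress(original, 9)   # level 9 = strongest zlib setting
restored = zlib.decompress(compressed)

# Lossless: the round trip returns exactly the bytes that went in.
assert restored == original

ratio = len(original) / len(compressed)
print(f"{len(original)} bytes -> {len(compressed)} bytes ({ratio:.1f}:1)")
```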
Array-based data reduction
Array-based data reduction technologies such as A-SIS operate in-line as data is being written to reduce primary storage capacity. The de-duplication feature of WAFL (NetApp’s Write Anywhere File Layout) identifies duplicates of a 4K block: at write time, a weak 32-bit digital signature of the 4K block is created and placed into a signature file in the metadata, and blocks with matching signatures are then compared bit-by-bit to ensure that a hash collision is not mistaken for a duplicate. The work of identifying the duplicates is similar to the snap technology and is done in the background if controller resources are sufficient. By default, it runs once every 24 hours and every time the percentage of changes reaches 20%.
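The signature-then-verify step can be sketched as follows. This is a minimal illustration, not NetApp's implementation: CRC-32 from Python's `zlib` module stands in for the weak 32-bit signature (the source does not say which function WAFL uses), and `find_duplicates` is a hypothetical name.

```python
import zlib

BLOCK_SIZE = 4096  # 4K blocks, as described in the text

def find_duplicates(data):
    """Return {duplicate_block_index: index_of_master_block}."""
    by_signature = {}  # 32-bit signature -> indices of distinct blocks seen
    duplicates = {}
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    for idx, block in enumerate(blocks):
        sig = zlib.crc32(block)  # assumption: CRC-32 as the weak signature
        for candidate in by_signature.get(sig, []):
            # A matching signature is only a hint; the full comparison
            # rules out a hash collision before declaring a duplicate.
            if blocks[candidate] == block:
                duplicates[idx] = candidate
                break
        else:
            # No verified match: this block becomes a new master.
            by_signature.setdefault(sig, []).append(idx)
    return duplicates

data = b"A" * BLOCK_SIZE + b"B" * BLOCK_SIZE + b"A" * BLOCK_SIZE
print(find_duplicates(data))  # {2: 0}
```

A 32-bit signature is cheap to compute and store but collides far too often to be trusted on its own, which is why the full block comparison is needed before a block is replaced by a pointer.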
In addition, there are three main disadvantages of an A-SIS solution, including:
IT Managers should note that A-SIS is included as a no-charge standard offering within NetApp's Nearline component of ONTAP, the company's storage OS.
Host-managed offline data compression solutions
Ocarina is an example of a host-managed data reduction offering, or what it calls 'split-path.' It consists of an offline process that reads files through an appliance, compresses those files and writes them back to disk. When a file is requested, another appliance re-hydrates the data and delivers it to the application. The advantage of this approach is much higher levels of compression, because the process is offline and can use more robust algorithms. A reasonable planning assumption is a reduction ratio of 3-6:1, and sometimes higher for initial ingestion and read-only Web environments. However, because of the need to re-hydrate when new data is written, classical production environments may see lower ratios.
In the case of Ocarina, the company has developed proprietary algorithms that can improve reduction ratios on many existing file types (e.g. JPEG, PDF, MPEG), which is unique in the industry.
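The offline-compress, re-hydrate-on-read flow can be sketched on a local directory. This is a toy model under stated assumptions: the real product runs on separate appliances, the standard-library `lzma` module merely stands in for its proprietary algorithms, and `compress_offline` and `read_file` are hypothetical names.

```python
import lzma
import tempfile
from pathlib import Path

def compress_offline(directory):
    """Offline pass: replace each regular file with an .xz-compressed copy."""
    for path in list(Path(directory).iterdir()):
        if path.is_file() and path.suffix != ".xz":
            packed = path.with_name(path.name + ".xz")
            packed.write_bytes(lzma.compress(path.read_bytes()))
            path.unlink()  # keep only the compressed copy on disk

def read_file(directory, name):
    """Read path: re-hydrate a compressed file on demand."""
    packed = Path(directory) / (name + ".xz")
    return lzma.decompress(packed.read_bytes())

with tempfile.TemporaryDirectory() as tmp:
    payload = b"log line\n" * 1000
    (Path(tmp) / "app.log").write_bytes(payload)
    compress_offline(tmp)
    assert read_file(tmp, "app.log") == payload
    stored = (Path(tmp) / "app.log.xz").stat().st_size
    print(f"{len(payload)} bytes stored as {stored} bytes")
```

Even in this toy form, the cost structure is visible: every read pays for decompression, and any update would require re-hydrating and recompressing the file, which is why update-heavy workloads fit this model poorly.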
The main drawbacks of host-managed data reduction solutions are:
On balance, solutions such as Ocarina are highly suitable and cost-effective for infrequently accessed data and read-intensive applications. They should be avoided in high-update environments.
In-line data compression
IBM Real-time Compression offers in-line data compression whereby a device sits between servers and the storage network (see Shopzilla's architecture). Wikibon members indicate a compression ratio of 1.5-2:1 is a reasonable rule-of-thumb.
The main advantage of the IBM Real-time Compression approach is very low latency (i.e. microseconds) and improved performance. Storage performance is improved because compression occurs before data hits the storage network. As a result, all data in the storage network is compressed, meaning less data is sent through the SAN, cache, internal array, and disk devices, minimizing resource requirements and backup windows by 40% or more, according to Wikibon estimates.
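Reduction ratios and percentage savings are two views of the same quantity: a ratio of r:1 leaves 1/r of the original data, so the saving is 1 - 1/r. The short sketch below applies that arithmetic to the ratio ranges quoted in this article; the function name `capacity_saved` is just an illustrative choice.

```python
def capacity_saved(ratio):
    """Fraction of capacity saved for a reduction ratio of `ratio`:1."""
    return 1 - 1 / ratio

cases = [
    ("in-line, low end", 1.5),
    ("in-line, high end", 2.0),
    ("offline, low end", 3.0),
    ("offline, high end", 6.0),
]
for label, ratio in cases:
    print(f"{label}: {ratio}:1 -> {capacity_saved(ratio):.0%} less capacity")
```

On this arithmetic, a 1.5-2:1 compression ratio corresponds to roughly 33-50% less data, which is in line with the 40%-or-more estimate above.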
There are two main drawbacks of the IBM Real-time Compression approach, including:
On balance, the advantage of an Ocarina or IBM Real-time Compression approach is that it can be applied to any file-based storage (i.e. heterogeneous devices). NetApp and other array-based solutions lock customers into a particular storage vendor but have certain advantages as well. For example, they are simpler to implement because they are already integrated.
An Ocarina approach is best applied in read-intensive environments, where it will achieve better reduction ratios due to its post-process/batch ingestion methodology. IBM Real-time Compression will achieve the highest levels of compression and ROI in general-purpose enterprise data centers of 30 TB or greater.
Action Item: On-line data reduction is rapidly coming to mainstream storage devices in your neighborhood. Storage executives should familiarize themselves with the various technologies in this space and demand that storage vendors apply capacity optimization techniques to control storage costs.