Co-author: David Floyer
There has been significant discussion in the industry about
storage optimization and making better use of storage capacity. A number
of storage vendors have successfully marketed data de-duplication for offline/backup applications, reducing the volume of backup data by a factor of 5-15:1, according to Wikibon user input.
Data de-duplication as applied to backup use cases is different
from compression: compression actually changes the data, using
algorithms to encode the same information in fewer bits.
With de-duplication, data is not changed; rather, copies 2-N are deleted
and pointers are inserted to a 'master' instance of the data.
Single-instancing can be thought of as synonymous with de-duplication.
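To make the pointer-to-master concept concrete, the sketch below (a hypothetical Python illustration, not any vendor's code) stores each unique 4K block once and records only a pointer for every subsequent copy:

```python
import hashlib

class SingleInstanceStore:
    """Toy single-instancing store: one 'master' copy per unique block,
    pointers for every duplicate, and the data itself is never altered."""

    def __init__(self):
        self.blocks = {}      # fingerprint -> the single 'master' copy
        self.pointers = []    # logical block sequence -> fingerprint (pointer)

    def write(self, block: bytes) -> None:
        fp = hashlib.sha256(block).hexdigest()
        self.blocks.setdefault(fp, block)   # copies 2-N are not stored again
        self.pointers.append(fp)            # only a pointer is recorded

    def read(self, index: int) -> bytes:
        return self.blocks[self.pointers[index]]

store = SingleInstanceStore()
for b in (b"A" * 4096, b"B" * 4096, b"A" * 4096):   # third block is a duplicate
    store.write(b)
print(len(store.pointers), "logical blocks,", len(store.blocks), "physical copies")  # 3 vs. 2
```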
Traditional data de-duplication technologies, however, are
generally unsuitable for online or primary storage applications because
the overheads associated with the algorithms required to de-duplicate
data will unacceptably elongate response times. As an example, popular
data de-duplication solutions such as those from Data Domain, ProtecTier
(Diligent/IBM), FalconStor, and EMC/Avamar are not used for reducing
the capacity of online storage.
There are three primary approaches to optimizing online storage,
reducing capacity requirements and improving overall storage
efficiencies. Generally, Wikibon refers to these in the broad category
of on-line or primary data compression, although the industry will often
use terms like de-duplication (e.g. NetApp A-SIS) and single
instancing. These data reduction technologies are illustrated by the
following types of solutions:
- NetApp A-SIS and EMC Celerra which employ either “data de-duplication light” or single-instance technology embedded into the storage array;
- Host-managed offline data reduction solutions such as Ocarina Networks;
- In-line data compression appliances available from IBM Real-time Compression.
Unlike some data reduction solutions for backup, these three approaches use lossless data compression algorithms, meaning that, mathematically, the original bits can always be reconstructed exactly.
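The lossless property simply means that a compress/decompress round trip returns the original bits exactly, as this short example using the generic zlib algorithm (chosen for illustration only, not any vendor's codec) shows:

```python
import zlib

original = b"The quick brown fox jumps over the lazy dog. " * 100
compressed = zlib.compress(original)
restored = zlib.decompress(compressed)

assert restored == original                       # bit-for-bit identical
print(len(original), "->", len(compressed), "bytes")
```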
Each of these approaches has certain benefits and drawbacks. The
obvious benefit is reduced storage costs; however, each solution places
another technology layer in the network and increases complexity and
risk.
Array-based data reduction
Array-based data reduction technologies such as A-SIS reduce primary
storage capacity within the array itself. The de-duplication feature of
WAFL (NetApp's Write Anywhere File Layout) creates a weak 32-bit digital
signature for each 4K block at write time and places it into a signature
file in the metadata; blocks whose signatures match are then compared
bit-by-bit to ensure that there is no hash collision. The work of
identifying the duplicates is similar to the snap technology and is done
in the background if controller resources are sufficient. The default
schedule is once every 24 hours and whenever the percentage of changed
data reaches 20%.
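To illustrate the two-stage check described above, the following sketch (a hypothetical Python illustration, not NetApp's code) records a weak 32-bit signature for each 4K block at write time and lets a background pass verify candidate matches bit-by-bit before collapsing them to a single master copy:

```python
import zlib

BLOCK_SIZE = 4096
signature_file = {}   # weak 32-bit signature -> offsets of blocks that produced it
storage = {}          # offset -> block contents (stand-in for blocks on disk)
pointers = {}         # de-duplicated offset -> offset of the master block

def write_block(offset: int, block: bytes) -> None:
    """At write time, store the block and record its weak 32-bit signature."""
    assert len(block) == BLOCK_SIZE
    storage[offset] = block
    signature_file.setdefault(zlib.crc32(block), []).append(offset)

def dedupe_pass() -> int:
    """Background job: blocks that share a signature are compared bit-by-bit
    before being collapsed, so a hash collision can never merge distinct data."""
    freed = 0
    for offsets in signature_file.values():
        master = offsets[0]
        for off in offsets[1:]:
            if off in storage and storage[off] == storage[master]:
                del storage[off]           # release the duplicate block
                pointers[off] = master     # keep only a pointer to the master
                freed += 1
    return freed

def read_block(offset: int) -> bytes:
    return storage[pointers.get(offset, offset)]

write_block(0, b"x" * BLOCK_SIZE)
write_block(1, b"y" * BLOCK_SIZE)
write_block(2, b"x" * BLOCK_SIZE)
print(dedupe_pass(), "duplicate block(s) collapsed")   # -> 1
```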
There are, however, several disadvantages of an A-SIS solution, including:
- With A-SIS, de-duplication can only occur within a single
flexible volume (not a traditional volume), meaning candidate blocks must be
co-resident within the same volume to be eligible for comparison. The
de-duplication is based on fixed 4K blocks, rather than the variable
block sizes of (say) IBM/Diligent. This limits the de-duplication potential.
- There is a complicated set of constraints when A-SIS is used
together with different snaps, depending on the level of software. Snaps
made before de-duplication will overrule de-duplication candidacy in
order to maintain data integrity. This limits the space-savings
potential of de-dupe; specifically, NetApp's de-dupe is not cumulative
with space-efficient snapshots. See (technical description);
- The performance overheads of de-duplication as described above
mean that A-SIS should not be applied to a highly utilized controller
(where the most benefit is likely to be achieved);
- There is a metadata overhead (up to 6%);
- To exploit this feature, users are locked-in to NetApp storage.
IT Managers should note that A-SIS is included as a no-charge
standard offering within NetApp's Nearline component of ONTAP, the
company's storage OS.
Host-managed offline data compression solutions
Ocarina
is an example of a host-managed data reduction offering or what it
calls 'split-path.' It consists of an offline process that reads files
through an appliance, compresses those files and writes them back to
disk. When a file is requested, another appliance re-hydrates data and
delivers it to the application. The advantage of this approach is much
higher levels of compression, because the process is offline and can use
more robust algorithms. A reasonable planning assumption is that
reduction ratios will range from 3-6:1, and sometimes higher for initial
ingestion and read-only Web environments. However, because of the need
to re-hydrate when new data is written, classical production
environments may see lower ratios.
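A minimal sketch of the split-path flow is shown below, assuming a generic gzip codec and an illustrative file-name marker; Ocarina's actual algorithms and metadata handling are proprietary and differ from this:

```python
import gzip
from pathlib import Path

SUFFIX = ".packed"   # illustrative marker for files the offline pass has compressed

def offline_compress(directory: str) -> None:
    """Offline/batch pass: read each file, write a compressed copy, remove the original."""
    for path in Path(directory).iterdir():
        if path.is_file() and not path.name.endswith(SUFFIX):
            data = path.read_bytes()
            Path(str(path) + SUFFIX).write_bytes(gzip.compress(data))
            path.unlink()

def read_file(path: str) -> bytes:
    """Read path: re-hydrate compressed data before handing it to the application."""
    packed = Path(path + SUFFIX)
    if packed.exists():
        return gzip.decompress(packed.read_bytes())
    return Path(path).read_bytes()
```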
In the case of Ocarina, the company has developed proprietary
algorithms that can improve reduction ratios on many existing file types
(e.g. jpeg, pdf, mpeg, etc), which is unique in the industry.
The main drawbacks of host-managed data reduction solutions are:
- The expense of the solution is not insignificant due to
appliance and server costs needed to perform compression. In
infrequently accessed, read-only or write-light environments, these
costs will be justified.
- To achieve these benefits, all files must be ingested, which is
a slow process. Picking the right use cases will minimize this issue.
- After a file is read and modified, it is written back to disk
uncompressed. To achieve savings, files must be re-compressed, again
limiting use cases to infrequently accessed files.
- Ocarina currently supports only files, unlike NetApp A-SIS,
which supports both file and block-based storage. However, Ocarina's
implementation offers several advantages over A-SIS (remember that A-SIS is
free).
- The solution is not highly scalable because the processes related to backup, re-hydration, and data movement are complicated.
On balance, solutions such as Ocarina are highly suitable and
cost-effective for infrequently accessed data and read-intensive
applications. High update environments should be avoided.
In-line data compression
IBM Real-time Compression offers in-line data compression whereby a device sits between servers and the storage network (see Shopzilla's architecture). Wikibon members indicate a compression ratio of 1.5-2:1 is a reasonable rule-of-thumb.
The main advantages of the IBM Real-time Compression approach are
very low latency (i.e. microseconds) and improved performance. Storage
performance is improved because compression occurs before data hits the
storage network. As a result, all data in the storage network is
compressed, meaning less data is sent through the SAN, cache, internal
array, and disk devices, reducing resource requirements and shrinking
backup windows by 40% or more, according to Wikibon estimates.
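Conceptually, the write path resembles the sketch below: data is compressed before it is handed to the storage network, so every downstream component carries fewer bytes. This is a hypothetical illustration using a fast zlib setting, not IBM's codec or appliance logic:

```python
import zlib

class InlineCompressingWriter:
    """Compress each write before it is handed downstream, so the SAN, cache,
    array, and backup stream all carry fewer bytes."""

    def __init__(self, downstream):
        self.downstream = downstream    # stand-in for the path to the storage network
        self.bytes_in = 0
        self.bytes_out = 0

    def write(self, data: bytes) -> None:
        packed = zlib.compress(data, 1)     # fast compression level keeps added latency low
        self.bytes_in += len(data)
        self.bytes_out += len(packed)
        self.downstream(packed)

sent = []
writer = InlineCompressingWriter(sent.append)
writer.write(b"2009-06-01 12:00:00 user=alice action=login status=ok\n" * 1000)
print(f"on-the-wire reduction: {writer.bytes_in / writer.bytes_out:.1f}:1")
```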
There are two main drawbacks of the IBM Real-time Compression approach:
- Costs of appliances and network re-design to exploit the compression devices. The Wikibon community estimates that a clear ROI will be realized in shops with greater than 30TB;
- Complexity of recovery; specifically, users need to plan for
re-hydration of data when performing recovery of backed-up files (i.e.
they need to have a Storwize engine or software present to recover from
a data loss).
On balance, the advantages of an Ocarina or IBM Real-time Compression
approach are that they can be applied to any file-based storage (i.e.
heterogeneous devices). NetApp and other array-based solutions lock
customers into a particular storage vendor but have certain advantages
as well. For example, they are simpler to implement because they are
already integrated.
An Ocarina approach is best applied in read-intensive
environments where it will achieve better reduction ratios due to its
post-process/batch ingestion methodology. IBM Real-time Compression will
achieve the highest levels of compression and ROI in general-purpose
enterprise data centers of 30TB or greater.
Action Item: On-line data reduction is rapidly coming to
mainstream storage devices in your neighborhood. Storage executives
should familiarize themselves with the various technologies in this
space and demand that storage vendors apply capacity optimization
techniques to control storage costs.