Data deduplication is a streamlining process in which redundant data is reduced by eliminating extra copies of the same information. The goal of data deduplication, or “dedupe” as it’s commonly shortened, is to lessen an organization’s ongoing storage needs.
Organizations can implement data deduplication processes and techniques to make sure that only one unique instance of data exists within their storage system. Duplicate or redundant data is removed, and users are pointed to a single instance of the data.
When data deduplication is successful, it can improve an organization’s overall storage utilization and help reduce costs.
So, why would a company create duplicate data in the first place? There are any number of valid reasons, including the following:
Another key reason for data duplication is simply that it is what happens in most multidepartment organizations. Data is regularly created or re-created as an accepted, organic part of doing business in a modern context. Data creation or replication, therefore, is not the actual problem; excessive data proliferation is.
If there were no extra financial burdens associated with it, data proliferation might seem like less of a problem than it is. An organization could simply store data at various locations within its IT architecture and not worry about the redundancies.
But the fact is that a company does incur financial penalties by maintaining a large number of data redundancies in the form of extra storage costs. Organizations that can’t stop creating data redundancies need to allocate more labor and budget to implement new storage solutions and data management, be they based on new hardware purchases or added cloud storage.
The most obvious benefit of data deduplication techniques is that weeding out extraneous data lessens the total amount of data that an organization must store and manage. This effectively frees up storage capacity, because less data occupies the available space.
Aside from reduced storage costs, data deduplication offers other key advantages, such as streamlining data backup plans and supporting disaster recovery efforts.
Another plus is improved data integrity: removing “deadweight” data helps ensure that the remaining data is properly cleansed. Deduplicated data has also been shown to perform better and consume less energy.
Another benefit of data deduplication is how well it works with virtual desktop infrastructure (VDI) deployments, because the virtual hard disks behind the VDI’s remote desktops operate identically. Popular Desktop as a Service (DaaS) products include Microsoft’s Azure Virtual Desktop and its Windows VDI. These products create the virtual machines (VMs) during the server virtualization process, and those VMs in turn power the VDI technology.
At its most basic level, data deduplication operates through automated functions that identify duplications in data blocks and then remove them. Working at this block level, the software analyzes chunks of unique data and marks them for preservation. When it detects a repetition of the same data block, it removes the repetition and inserts a reference to the original data in its place.
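To make the idea concrete, here is a minimal, hypothetical sketch of block-level deduplication in Python. It assumes fixed-size blocks and SHA-256 fingerprints; the class and variable names are illustrative and do not correspond to any particular product, and real systems typically use variable-size chunking and persistent indexes.

```python
import hashlib

BLOCK_SIZE = 4096  # illustrative fixed block size; real systems often use variable-size chunks


class BlockStore:
    """Toy block-level deduplication store: each unique block is kept only once."""

    def __init__(self):
        self.blocks = {}      # fingerprint -> block bytes, stored a single time
        self.references = []  # ordered fingerprints that reconstruct the written stream

    def write(self, data: bytes) -> None:
        for offset in range(0, len(data), BLOCK_SIZE):
            block = data[offset:offset + BLOCK_SIZE]
            fingerprint = hashlib.sha256(block).hexdigest()
            # Store the block only if this fingerprint is new; otherwise
            # record just a reference to the existing copy.
            if fingerprint not in self.blocks:
                self.blocks[fingerprint] = block
            self.references.append(fingerprint)

    def read(self) -> bytes:
        # Rebuild the original stream by following the stored references.
        return b"".join(self.blocks[fp] for fp in self.references)


store = BlockStore()
store.write(b"A" * 8192 + b"B" * 4096 + b"A" * 4096)  # the "A" blocks repeat
assert store.read() == b"A" * 8192 + b"B" * 4096 + b"A" * 4096
print(len(store.references), "blocks written,", len(store.blocks), "unique blocks stored")
```

In this toy example, four blocks are written but only two unique blocks are stored; the repeated blocks survive only as references.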
An alternate method of data deduplication operates at the file level. Single-instance storage compares full copies of files within the file system rather than chunks or blocks of data. Like its block-level counterpart, file deduplication depends on keeping the original file and removing extra copies.
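A file-level equivalent can be sketched the same way: hash each file’s full contents and treat later files with the same hash as removable duplicates. The directory path and function name below are hypothetical, and a real system would replace the duplicates with references to the retained original (for example, hard links) rather than just reporting them.

```python
import hashlib
from pathlib import Path


def group_files_by_content(root):
    """Group files under `root` by the hash of their full contents.

    The first file seen for each hash is treated as the retained original;
    any later file with the same hash is a duplicate that a real system
    would replace with a reference to that original.
    """
    seen = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            seen.setdefault(digest, []).append(path)
    return seen


# Usage: report which files are redundant copies of an earlier original.
for digest, paths in group_files_by_content(Path("/data")).items():
    original, *duplicates = paths
    for dup in duplicates:
        print(f"{dup} duplicates {original}")
```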
Deduplication techniques do not work in quite the same manner as data compression algorithms (for example, LZ77, LZ78), although it’s true that both pursue the same general goal of reducing data redundancies. Deduplication techniques achieve this on a larger, macro scale than compression algorithms, whose goal is less about replacing identical files with shared copies and more about efficiently encoding data redundancies.
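The distinction can be illustrated with a small, contrived comparison: compression shrinks redundancy within each stream but still stores both copies, while deduplication collapses the duplicate copy into a reference to a single retained instance. The sample data below is arbitrary and stands in for a file; it is not a benchmark.

```python
import os
import zlib

file_a = os.urandom(64 * 1024)  # incompressible sample data standing in for a file
file_b = file_a                 # an exact duplicate stored elsewhere in the system

# Compression encodes redundancy *within* each stream: two copies still get stored,
# and data with little internal redundancy barely shrinks at all.
compressed_total = len(zlib.compress(file_a)) + len(zlib.compress(file_b))

# Deduplication works across copies: the duplicate collapses into a reference
# to the single retained instance.
deduplicated_total = len(file_a)  # plus the negligible cost of one reference

print("stored after compression:", compressed_total, "bytes")
print("stored after deduplication:", deduplicated_total, "bytes")
```

In practice the two approaches are complementary; many storage systems deduplicate first and then compress the unique data that remains.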
There are two basic types of data deduplication that depend on when the processes occur.
Inline deduplication occurs in real time, as data flows within the system. The system carries less data traffic because it neither transfers nor stores duplicated data, which can reduce the total amount of bandwidth the organization needs.
Post-processing deduplication takes place after data has been written to some type of storage device.
Both types of data deduplication depend on the hash calculations inherent to deduplication. These cryptographic calculations identify repeated patterns in data. During inline deduplication, the calculations are performed on the fly, which can temporarily overwhelm compute resources. In post-processing deduplication, the hash calculations can be performed at any time after the data is written.
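The timing difference can be sketched in a few lines of Python. Both paths below compute the same SHA-256 fingerprints; the inline path hashes each block before it ever reaches storage, while the post-processing path scans data that was already written in full. The function names and the in-memory stand-ins for storage are assumptions made for the example.

```python
import hashlib


def fingerprint(block):
    return hashlib.sha256(block).hexdigest()


def inline_write(block, index, references):
    """Inline path: hash the block at write time, so only
    previously unseen blocks are ever kept."""
    fp = fingerprint(block)
    index.setdefault(fp, block)
    references.append(fp)


def post_process(raw_storage):
    """Post-processing path: the data was already written in full;
    a later scan computes the hashes and collapses the duplicates."""
    index, references = {}, []
    for block in raw_storage:
        fp = fingerprint(block)
        index.setdefault(fp, block)
        references.append(fp)
    return index, references
```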
The subtle differences between deduplication types don’t end there. A second way to classify deduplication types is based on where such processes occur.
Source deduplication takes place close to where new data is generated. The system scans that area, detects new copies of files and removes them.
Target deduplication is basically an inversion of source deduplication. In target deduplication, the system deduplicates any copies that are found in areas other than where the original data was created.
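As an illustration, the hypothetical sketch below contrasts the two placements: in source deduplication the client hashes its blocks and transmits only those the backup target does not already hold, while in target deduplication every block travels to the target, which then discards the copies it already has. The function names and data structures are assumptions for the example, not part of any specific product.

```python
import hashlib


def fingerprint(block):
    return hashlib.sha256(block).hexdigest()


def source_side_send(blocks, target_fingerprints):
    """Source deduplication: hash at the source and transmit only
    blocks whose fingerprints the target does not already hold."""
    to_send = []
    for block in blocks:
        fp = fingerprint(block)
        if fp not in target_fingerprints:
            target_fingerprints.add(fp)
            to_send.append(block)
    return to_send


def target_side_ingest(blocks, target_store):
    """Target deduplication: every block arrives at the backup target,
    which discards any copy it already holds."""
    for block in blocks:
        target_store.setdefault(fingerprint(block), block)
```

Source deduplication saves network bandwidth as well as storage; target deduplication keeps the processing load off the clients at the cost of transferring every block.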
Because different deduplication methods are available, organizations must make careful, considered decisions about which type they choose, balancing that method against their particular needs.
In many use cases, an organization’s deduplication method of choice may very well come down to various internal variables, such as the following: