Client-side data deduplication

Data deduplication is a method of reducing storage needs by eliminating redundant data.

Overview

Two types of data deduplication are available: client-side data deduplication and server-side data deduplication.

Client-side data deduplication is a data deduplication technique that is used on the backup-archive client to remove redundant data during backup and archive processing before the data is transferred to the IBM Spectrum Protect™ server. Using client-side data deduplication can reduce the amount of data that is sent over a local area network.

Server-side data deduplication is a data deduplication technique that is performed by the server. The IBM Spectrum Protect administrator can specify the data deduplication location (client or server) with the DEDUPLICATION parameter on the REGISTER NODE or UPDATE NODE server command.

Enhancements

With client-side data deduplication, you can:
  • Exclude specific files on a client from data deduplication.
  • Enable a data deduplication cache that reduces network traffic between the client and the server. The cache contains extents that were sent to the server in previous incremental backup operations. Instead of querying the server for the existence of an extent, the client queries its cache.

    Specify a size and location for a client cache. If an inconsistency between the server and the local cache is detected, the local cache is removed and repopulated.

    Note: For applications that use the IBM Spectrum Protect API, the data deduplication cache must not be used because of the potential for backup failures caused by the cache being out of sync with the IBM Spectrum Protect server. If multiple, concurrent backup-archive client sessions are configured, there must be a separate cache configured for each session.
  • Enable both client-side data deduplication and compression to reduce the amount of data that is stored by the server. Each extent is compressed before it is sent to the server. The trade-off is between storage savings and the processing power that is required to compress client data. In general, if you compress and deduplicate data on the client system, you are using approximately twice as much processing power as data deduplication alone.

    The server can work with deduplicated, compressed data. In addition, backup-archive clients earlier than V6.2 can restore deduplicated, compressed data.
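The cache behavior described in the list above can be illustrated with a small Python model. This is a sketch under stated assumptions, not the client's actual implementation: the class name `ExtentCache`, the `server_has` callback that stands in for a server query, and the use of SHA-256 digests as extent identifiers are all hypothetical.

```python
import hashlib


class ExtentCache:
    """Hypothetical local cache of digests for extents already sent to the server."""

    def __init__(self, server_has):
        # server_has: callable(digest) -> bool; stands in for a query to the server.
        self._server_has = server_has
        self._known = set()
        self.server_queries = 0

    def is_duplicate(self, extent: bytes) -> bool:
        digest = hashlib.sha256(extent).hexdigest()
        if digest in self._known:
            # Cache hit: answered locally, no network round trip.
            return True
        self.server_queries += 1
        found = self._server_has(digest)
        # Assumption: if the server lacks the extent, the caller sends it now,
        # so later lookups for the same extent can be answered from the cache.
        self._known.add(digest)
        return found

    def reset(self):
        """Discard all entries, as when the cache is found to disagree with the server."""
        self._known.clear()


# Usage: the second lookup of the same extent does not query the server.
server_digests = set()
cache = ExtentCache(lambda digest: digest in server_digests)
first = cache.is_duplicate(b"block")
second = cache.is_duplicate(b"block")
queries_after_two_lookups = cache.server_queries
```

In this model, a reset simply empties the set, and the cache is repopulated as later lookups fall through to the server.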

Client-side data deduplication uses the following process:

  • The client creates extents. Extents are parts of files that are compared with other file extents to identify duplicates.
  • The client and server work together to identify duplicate extents. The client sends non-duplicate extents to the server.
  • Subsequent client data-deduplication operations create new extents. Some or all of those extents might match the extents that were created in previous data-deduplication operations and sent to the server. Matching extents are not sent to the server again.
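
The three steps above can be sketched in Python as follows. This is a simplified illustration, not the product's algorithm: it assumes fixed-size extents and SHA-256 digests, and the names `make_extents`, `backup`, `restore`, and `server_store` are hypothetical.

```python
import hashlib


def make_extents(data: bytes, extent_size: int):
    """Split data into fixed-size extents (fixed size is an assumption of this sketch)."""
    return [data[i:i + extent_size] for i in range(0, len(data), extent_size)]


def backup(data: bytes, server_store: dict, extent_size: int = 4):
    """Send only extents the server does not hold; return (recipe, bytes_sent)."""
    recipe = []      # ordered digests that describe the whole file
    bytes_sent = 0
    for extent in make_extents(data, extent_size):
        digest = hashlib.sha256(extent).hexdigest()
        if digest not in server_store:
            # Non-duplicate extent: transfer it to the server.
            server_store[digest] = extent
            bytes_sent += len(extent)
        # Duplicate or not, the file refers to the extent by its digest.
        recipe.append(digest)
    return recipe, bytes_sent


def restore(recipe, server_store: dict) -> bytes:
    """Reassemble a file from the extents that its recipe points to."""
    return b"".join(server_store[digest] for digest in recipe)


server_store = {}
recipe1, sent1 = backup(b"AAAABBBBCCCC", server_store)  # all three extents are new
recipe2, sent2 = backup(b"AAAABBBBDDDD", server_store)  # only the last extent is new
```

The second backup transfers a single extent because the first two match extents that were sent earlier, yet `restore(recipe2, server_store)` still returns the complete second file.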

Benefits

Client-side data deduplication provides several advantages:

  • It can reduce the amount of data that is sent over the local area network (LAN).
  • The processing power that is required to identify duplicate data is offloaded from the server to client nodes. Server-side data deduplication is always enabled for deduplication-enabled storage pools. However, files that are in deduplication-enabled storage pools and that were deduplicated by the client do not require additional processing.
  • The processing power that is required to remove duplicate data on the server is eliminated, allowing space savings on the server to occur immediately.

Client-side data deduplication has a possible disadvantage. The server does not have whole copies of client files until you back up the primary storage pools that contain client extents to a non-deduplicated copy storage pool. (Extents are parts of a file that are created during the data-deduplication process.) During storage pool backup to a non-deduplicated storage pool, client extents are reassembled into contiguous files.

By default, primary sequential-access storage pools that are set up for data deduplication must be backed up to non-deduplicated copy storage pools before they can be reclaimed and before duplicate data can be removed. The default ensures that the server has copies of whole files at all times, in either a primary storage pool or a copy storage pool.

Important: For further data reduction, you can enable client-side data deduplication and compression together. Each extent is compressed before it is sent to the server. Compression saves space, but it increases the processing time on the client workstation.

In a storage pool that is enabled for data deduplication (a file pool), only one instance of a data extent is retained. Other instances of the same data extent are replaced with a pointer to the retained instance.

When client-side data deduplication is enabled and the server runs out of storage in the destination pool, the server stops the transaction if a next pool is defined. The backup-archive client then retries the transaction without client-side data deduplication. To recover, the IBM Spectrum Protect administrator must add more scratch volumes to the original file pool, or retry the operation with deduplication disabled.

For client-side data deduplication, the IBM Spectrum Protect server must be Version 6.2 or higher.

Prerequisites

When configuring client-side data deduplication, the following requirements must be met:

  • The client and server must be at version 6.2.0 or later. The latest maintenance version should always be used.
  • When a client backs up or archives a file, the data is written to the primary storage pool that is specified by the copy group of the management class that is bound to the data. To deduplicate the client data, the primary storage pool must be a sequential-access disk (FILE) storage pool that is enabled for data deduplication.
  • The value of the DEDUPLICATION option on the client must be set to YES. You can set the DEDUPLICATION option in the client options file, in the preference editor of the backup-archive client GUI, or in the client option set on the IBM Spectrum Protect server. Use the DEFINE CLIENTOPT command to set the DEDUPLICATION option in a client option set. To prevent the client from overriding the value in the client option set, specify FORCE=YES.
  • Client-side data deduplication must be enabled on the server. To enable client-side data deduplication, use the DEDUPLICATION parameter on the REGISTER NODE or UPDATE NODE server command. Set the value of the parameter to CLIENTORSERVER.
  • Ensure files on the client are not excluded from client-side data deduplication processing. By default, all files are included. You can optionally exclude specific files from client-side data deduplication with the exclude.dedup client option.
  • Files on the client must not be encrypted. Encrypted files and files from encrypted file systems cannot be deduplicated.
  • Files must be larger than 2 KB, and transactions must be below the value that is specified by the CLIENTDEDUPTXNLIMIT option. Files that are 2 KB or smaller are not deduplicated.

The server can limit the maximum transaction size for data deduplication by setting the CLIENTDEDUPTXNLIMIT option on the server. For more information about this option, see the IBM Spectrum Protect server documentation.
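
The per-file requirements in the list above can be summarized as a simple check. The Python sketch below is illustrative only: `FileInfo` and `eligible_for_client_dedup` are hypothetical names, 2 KB is assumed to mean 2048 bytes, and the transaction-size limit and the node and storage pool settings are left out.

```python
from dataclasses import dataclass

MIN_DEDUP_SIZE = 2 * 1024  # assumption: 2 KB is taken as 2048 bytes


@dataclass
class FileInfo:
    path: str
    size: int                # size in bytes
    encrypted: bool = False  # encrypted file or encrypted file system
    excluded: bool = False   # matched by an exclude.dedup rule


def eligible_for_client_dedup(f: FileInfo) -> bool:
    """Return True if the file passes the per-file checks listed above."""
    return f.size > MIN_DEDUP_SIZE and not f.encrypted and not f.excluded


candidates = [
    FileInfo("report.doc", 4096),
    FileInfo("tiny.cfg", 2048),                  # 2 KB or smaller
    FileInfo("vault.bin", 8192, encrypted=True),
    FileInfo("disk.img", 8192, excluded=True),
]
eligible = [f.path for f in candidates if eligible_for_client_dedup(f)]
```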

The following operations take precedence over client-side data deduplication:

  • LAN-free data movement
  • Simultaneous-write operations
  • Data encryption

Important: Do not schedule or enable any of these operations during client-side data deduplication. If any of these operations occur during client-side data deduplication, client-side data deduplication is turned off, and a message is written to the error log.

The setting on the server ultimately determines whether client-side data deduplication is enabled. See Table 1.

Table 1. Data deduplication settings: Client and server

Value of the client DEDUPLICATION option   Setting on the server                Data deduplication location
Yes                                        On either the server or the client   Client
Yes                                        On the server only                   Server
No                                         On either the server or the client   Server
No                                         On the server only                   Server
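
Table 1 reduces to a single rule: data is deduplicated on the client only when the client option is Yes and the server setting permits either location. A minimal Python sketch of that rule follows; the function and parameter names are hypothetical.

```python
def dedup_location(client_option_yes: bool, server_allows_client: bool) -> str:
    """Return where data deduplication occurs, following Table 1.

    client_option_yes: the client DEDUPLICATION option is set to Yes.
    server_allows_client: the server setting is "on either the server or the client".
    """
    if client_option_yes and server_allows_client:
        return "client"
    return "server"


# One entry per row of Table 1, in the same order.
rows = [
    dedup_location(True, True),
    dedup_location(True, False),
    dedup_location(False, True),
    dedup_location(False, False),
]
```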

Encrypted files

The IBM Spectrum Protect server and the backup-archive client cannot deduplicate encrypted files. If an encrypted file is encountered during data deduplication processing, the file is not deduplicated, and a message is logged.

Tip: You do not have to process encrypted files separately from files that are eligible for client-side data deduplication. Both types of files can be processed in the same operation. However, they are sent to the server in different transactions.

As a security precaution, you can take one or more of the following steps:

  • Enable storage-device encryption together with client-side data deduplication.
  • Use client-side data deduplication only for nodes that are secure.
  • If you are uncertain about network security, enable Secure Sockets Layer (SSL).
  • If you do not want certain objects (for example, image objects) to be processed by client-side data deduplication, you can exclude them on the client. If an object is excluded from client-side data deduplication and it is sent to a storage pool that is set up for data deduplication, the object is deduplicated on the server.
  • Use the SET DEDUPVERIFICATIONLEVEL command to detect possible security attacks on the server during client-side data deduplication. Using this command, you can specify a percentage of client extents for the server to verify. If the server detects a possible security attack, a message is displayed.
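
The idea of verifying only a percentage of client extents can be illustrated with a sampling sketch in Python. How the server actually verifies extents is not described here, so everything in this example is an assumption: the function name `verify_sample`, the use of SHA-256 digests, and the rule that an extent is suspicious when its recomputed digest differs from the digest that the client claimed.

```python
import hashlib
import random


def verify_sample(extents, percent, rng=None):
    """Check roughly `percent` percent of (claimed_digest, data) pairs.

    Returns the claimed digests whose data does not hash to the claimed value.
    """
    rng = rng or random.Random()
    suspicious = []
    for claimed_digest, data in extents:
        # Select each extent for verification with probability percent/100.
        if rng.random() * 100 < percent:
            if hashlib.sha256(data).hexdigest() != claimed_digest:
                suspicious.append(claimed_digest)
    return suspicious


good = (hashlib.sha256(b"payload").hexdigest(), b"payload")
forged = ("0" * 64, b"something else")  # claimed digest does not match the data

all_checked = verify_sample([good, forged], percent=100)
none_checked = verify_sample([good, forged], percent=0)
```

A higher percentage catches more mismatches at the cost of more hashing work on the server, which is the trade-off that a verification level expresses.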