Tuning client-side data deduplication

The performance of client-side data deduplication can be affected by processor requirements and deduplication configuration.

About this task

Data deduplication is a method of reducing storage needs by eliminating redundant data. Client-side data deduplication is the process of removing the redundant data during a backup operation on the client system. Client-side data deduplication is especially effective when you want to conserve bandwidth between the IBM Spectrum Protect client and server.

Procedure

To help you enhance the performance of client-side data deduplication, take the following actions based on the task that you want to complete.
Table 1. Actions for tuning client-side data deduplication performance
Action Explanation
Action: Ensure that the client system meets the minimum hardware requirements for client-side data deduplication.
Explanation: Before you decide to use client-side data deduplication, verify that the client system has adequate resources available during the backup window to run the deduplication processing.

The preferred minimum processor requirement is the equivalent of one 2.2 GHz processor core per backup process with client-side data deduplication. For example, a system with a single-socket, quad-core, 2.2-GHz processor that is used 75% or less during the backup window is a good candidate for client-side data deduplication.
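
As a rough aid for applying this guideline, the following Python sketch estimates how many backup processes with client-side data deduplication a system might support. It is only one reading of the guideline: the function names are hypothetical, and the sketch assumes that capacity scales linearly with core count and clock speed, which is an approximation.

```python
import math

REFERENCE_GHZ = 2.2  # core equivalent per backup process, from the guideline above


def spare_core_equivalents(cores, clock_ghz, busy_fraction):
    """Estimate idle capacity in units of one 2.2 GHz core.

    Assumption: capacity scales linearly with core count and clock speed.
    busy_fraction is the processor usage during the backup window (0.0 to 1.0).
    """
    if not 0.0 <= busy_fraction <= 1.0:
        raise ValueError("busy_fraction must be between 0 and 1")
    return cores * (clock_ghz / REFERENCE_GHZ) * (1.0 - busy_fraction)


def max_dedup_processes(cores, clock_ghz, busy_fraction):
    """Round the spare capacity down to a whole number of backup processes."""
    # The small epsilon guards against floating-point noise near whole numbers.
    return math.floor(spare_core_equivalents(cores, clock_ghz, busy_fraction) + 1e-9)


# The example system from the text: quad-core, 2.2 GHz, 75% busy.
print(max_dedup_processes(cores=4, clock_ghz=2.2, busy_fraction=0.75))  # prints 1
```

Under these assumptions, the example system has about one spare core equivalent, which is enough for one backup process with client-side data deduplication. Measure actual processor usage during the backup window before you rely on an estimate of this kind.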

Action: Use a combination of deduplication and compression to obtain significant data reduction.
Explanation: Compressing data after it is deduplicated yields greater data reduction than running data deduplication alone. When data deduplication and compression are both enabled during a backup operation on the backup-archive client, the client applies them in the preferred order: data deduplication first, followed by compression.
Action: Avoid running client compression in combination with server-side data deduplication.
Explanation: Client compression combined with server-side data deduplication is typically slower and reduces data volume less than either of the preferred alternatives: server-side data deduplication alone, or client-side data deduplication combined with client-side compression.
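
The effect of ordering can be illustrated with a toy experiment. The following Python sketch is not the IBM Spectrum Protect algorithm: it assumes fixed-size 64 KiB chunks, SHA-256 fingerprints, and zlib compression, and all function names are hypothetical. It compares the bytes that remain when duplicate chunks are removed before compression with the bytes that remain when the whole stream is compressed first.

```python
import hashlib
import random
import zlib

CHUNK = 64 * 1024  # toy fixed-size chunk; an assumption for this sketch only


def make_stream(unique_blocks=8, repeats=4, seed=0):
    """Build compressible sample data in which every block occurs `repeats` times."""
    rng = random.Random(seed)
    words = [b"backup", b"client", b"server", b"extent", b"volume", b"session"]
    blocks = []
    for _ in range(unique_blocks):
        buf = bytearray()
        while len(buf) < CHUNK:
            buf += rng.choice(words) + b" "
        blocks.append(bytes(buf[:CHUNK]))
    return b"".join(blocks * repeats)


def split(data, size=CHUNK):
    """Cut the data into consecutive fixed-size chunks."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def unique_chunks(chunks):
    """Keep the first copy of each chunk, identified by its SHA-256 digest."""
    seen = {}
    for chunk in chunks:
        seen.setdefault(hashlib.sha256(chunk).digest(), chunk)
    return list(seen.values())


def dedup_then_compress(data):
    """Remove duplicate chunks first, then compress each remaining chunk."""
    return sum(len(zlib.compress(c)) for c in unique_chunks(split(data)))


def compress_then_dedup(data, size=4096):
    """Compress the whole stream first, then look for duplicate chunks in the result."""
    compressed = zlib.compress(data)
    return sum(len(c) for c in unique_chunks(split(compressed, size)))


data = make_stream()
print("original bytes:      ", len(data))
print("dedup then compress: ", dedup_then_compress(data))
print("compress then dedup: ", compress_then_dedup(data))
```

For this sample data, the first total is smaller because compressing the stream first hides the repeated blocks from the chunk comparison, so deduplication finds little to remove. Real workloads and the product's own extent handling can behave differently.
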
Action: Increase the number of parallel sessions as an effective way to improve overall throughput when you are using client-side deduplication.
Explanation: This action applies to client systems that have sufficient processor resources and whose client application is configured to perform parallel backups. For example, when you use IBM Spectrum Protect for Virtual Environments, it might be possible to use up to 30 parallel VMware backup sessions before a 1 Gb network becomes saturated. Rather than immediately configuring numerous parallel sessions, increase the number of sessions gradually, and stop when throughput no longer improves.

For information about optimizing parallel backups, see Optimizing parallel backups of virtual machines.
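
The gradual approach that is described for parallel sessions can be sketched as a simple search loop. The following Python code is a minimal sketch, not a product feature: `measure_throughput` is a hypothetical callable that you would supply to run a test backup with a given number of sessions, and the 5% improvement threshold and the cap of 30 sessions are assumptions for illustration.

```python
def find_session_count(measure_throughput, start=1, step=1, max_sessions=30,
                       min_gain=0.05):
    """Raise the session count step by step until throughput stops improving.

    measure_throughput: hypothetical callable that runs a test backup with the
        given number of sessions and returns throughput in any consistent unit.
    min_gain: assumed relative improvement below which a step is not worthwhile.
    """
    best_sessions = start
    best_rate = measure_throughput(start)
    sessions = start + step
    while sessions <= max_sessions:
        rate = measure_throughput(sessions)
        if rate < best_rate * (1.0 + min_gain):
            break  # no meaningful improvement: keep the previous setting
        best_sessions, best_rate = sessions, rate
        sessions += step
    return best_sessions, best_rate


def simulated_throughput(sessions):
    """Stand-in for a real measurement: 12 MB/s per session, capped at 110 MB/s."""
    return min(sessions * 12.0, 110.0)


print(find_session_count(simulated_throughput))  # prints (9, 108.0)
```

With the simulated measurement, the loop stops at 9 sessions because a tenth session adds less than 5% throughput. In practice, each measurement is a real backup run, so results vary with the workload and with the load on the network.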

Action: Configure the client data deduplication cache with the enablededupcache option.
Explanation: The client must query the server for each extent of data that is processed. You can reduce the processor usage that is associated with this query process by configuring the cache on the client. With the data deduplication cache, the client can identify previously discovered extents during a backup session without querying the IBM Spectrum Protect server.

The following guidelines apply when you configure the client data deduplication cache:

  • For the backup-archive client, including VMware virtual machine backups, always configure the cache for client-side data deduplication.
  • For IBM Spectrum Protect for Virtual Environments operations, if you configure multiple client sessions to back up a vStorage backup server, you must configure a separate cache for each session.
  • For networks with low latency that process a large amount of deduplicated data daily, disable the client deduplication cache for faster performance.
Restriction:
  • For applications that use the IBM Spectrum Protect API, do not use the client data deduplication cache because backup failures can occur if the cache becomes out of sync with the IBM Spectrum Protect server. This restriction applies to the IBM Spectrum Protect Data Protection applications. Do not configure the client data deduplication cache when you are using the data protection products.
  • If you use image backups, do not configure the client data deduplication cache.
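
The cache guidelines and restrictions above can be summarized as a small decision helper. The following Python sketch is illustrative only: the `Workload` fields and the function name are hypothetical, and the order in which the rules are checked (restrictions first) is an assumption about how to reconcile the guidelines.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical description of a backup workload."""
    uses_api: bool = False  # application uses the IBM Spectrum Protect API
    image_backup: bool = False  # workload consists of image backups
    low_latency_high_volume: bool = False  # low-latency network, large daily deduplicated volume


def use_dedup_cache(workload: Workload) -> bool:
    """Return True if the guidelines suggest enabling the client deduplication cache."""
    # Restrictions take priority: the cache must not be used in these cases.
    if workload.uses_api or workload.image_backup:
        return False
    # On low-latency networks with large daily volumes, the cache is disabled for speed.
    if workload.low_latency_high_volume:
        return False
    # Otherwise, the backup-archive client guideline is to configure the cache.
    return True


print(use_dedup_cache(Workload()))  # prints True
print(use_dedup_cache(Workload(image_backup=True)))  # prints False
```

The helper does not cover the requirement for a separate cache for each session in IBM Spectrum Protect for Virtual Environments operations, which must be handled in the client configuration.
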
Action: Decide whether to use client-side data deduplication or server-side data deduplication.
Explanation: Whether you choose to use client-side data deduplication depends on your system environment. In a network-constrained environment, you can run data deduplication on the client to improve the elapsed time for backup operations. If the environment is not network-constrained, running data deduplication on the client can result in longer elapsed backup times.
To evaluate whether to use client-side or server-side data deduplication, see the information in Table 2.

Use the following checklist to help you choose whether to implement client-side or server-side data deduplication.

Table 2. Checklist for choosing client-side versus server-side data deduplication
Question Response
Question: Does the speed of your backup network result in long backup times?
Response:
  • Yes: Use client-side data deduplication to obtain both faster backups and increased storage savings on the IBM Spectrum Protect server.
  • No: Determine the importance of storage savings versus a faster backup process.
Question: What is more important to your business: the amount of storage savings that you achieve through data reduction technologies, or how quickly backups complete?
Response: Consider the trade-offs between having the fastest elapsed backup times and gaining the maximum amount of storage pool savings:
  • For the fastest backups in an unconstrained network, choose server-side data deduplication.
  • For the largest storage savings, choose client-side data deduplication that is combined with compression.
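
The checklist in Table 2 can also be expressed as a short decision function. The following Python sketch only restates the checklist; the function name, parameter names, and return strings are hypothetical.

```python
def choose_dedup_location(network_constrained: bool, priority: str = "storage") -> str:
    """Restate the Table 2 checklist.

    network_constrained: True if the backup network speed causes long backup times.
    priority: "speed" for the fastest backups, "storage" for the largest savings.
        The priority matters only when the network is not constrained.
    """
    if network_constrained:
        return "client-side data deduplication"
    if priority == "speed":
        return "server-side data deduplication"
    if priority == "storage":
        return "client-side data deduplication with compression"
    raise ValueError("priority must be 'speed' or 'storage'")


print(choose_dedup_location(network_constrained=False, priority="speed"))
```

For an unconstrained network where backup speed matters most, the function returns server-side data deduplication, which matches the first bullet in the checklist.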

What to do next

For more information about using IBM Spectrum Protect deduplication, see https://www.ibm.com/developerworks/community/wikis/home/wiki/Tivoli%20Storage%20Manager/page/Container%20Pool%20Best%20Practices.