IBM Tivoli Storage Manager, Version 7.1

Performance considerations for data deduplication

Finding duplicate data is a processor-intensive process. When you use client-side data deduplication, the processor consumption occurs on the client system during the backup. With server-side data deduplication, the processor consumption occurs on the server during the duplicate identification process. Consider factors such as processor usage, network bandwidth, restore performance, and compression when you decide to use data deduplication.

Processor usage

The amount of processor resources that are used depends on how many client sessions or server processes are simultaneously active. Processor usage is also affected by other factors, such as the size of the files that are backed up. When I/O bandwidth is available and the files are large, for example 1 MB, finding duplicates can use an entire processor during a session or process. When files are smaller, other bottlenecks can occur, such as reading files from the client disk or updating the Tivoli® Storage Manager server database. In these situations, data deduplication might not use all of the available processor resources.

You can control processor usage by limiting or increasing the number of client sessions or server duplicate-identification processes. To take advantage of your processors and to complete data deduplication faster, you can increase the number of identification processes or client sessions, up to the number of processors that are on the system. The number can be higher if the processors support multiple hardware-assisted threads per core, such as with simultaneous multithreading. Plan for at least eight 2.2 GHz (or equivalent) processor cores in any Tivoli Storage Manager server that is configured for data deduplication.
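For example, an administrator can start a specific number of server duplicate-identification processes with the IDENTIFY DUPLICATES command. In this illustrative command, the storage pool name FILEPOOL, the process count, and the duration (in minutes) are sample values only:

   identify duplicates FILEPOOL numprocess=4 duration=120

On the client, the number of sessions that a backup uses can be influenced with the resourceutilization option in the client options file.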

Client-side data deduplication can use a large amount of processor resources. Therefore, verify that the additional workload does not affect the primary workload of the client system.

Compressing the data, in addition to deduplicating it on the client, uses additional processor resources. However, if the data is compressible, compression further lowers the network bandwidth that is required.
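For example, client-side data deduplication and compression are both enabled in the client options file (dsm.opt on Windows, or dsm.sys on UNIX and Linux). The following entries are a minimal illustration:

   deduplication yes
   compression   yes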

Network bandwidth

A primary reason to use client-side data deduplication is to reduce the bandwidth that is required to transfer data to a Tivoli Storage Manager server. Client compression can reduce this bandwidth further. The bandwidth reduction is directly related to how much of the data duplicates data that is already stored on the server, and to how compressible the data is.

Network bandwidth for the queries that the Tivoli Storage Manager client sends to the server can be reduced by using the enablededupcache client option. The cache stores information about extents that were previously sent to the server. If an extent was previously sent, the client does not need to query the server again for that extent. Therefore, query traffic is reduced and backup performance improves.
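For example, the cache is enabled in the client options file. The related options that control its location and size (in MB) are shown here with illustrative values:

   enablededupcache yes
   dedupcachepath   /opt/tivoli/tsm/client/ba/bin/dedupcache
   dedupcachesize   256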

Restore performance

During a restore operation, a restore from a deduplicated storage pool can be slower than a restore from a non-deduplicated pool. When data deduplication is used, the extents for a given file can be spread across multiple volumes on the server. This spreading makes the reads from the volumes more random in nature, and slower than a sequential operation. In addition, more database operations are required.

Compression

Data deduplication is not performed on directories or file metadata, whereas compression can be performed on these types of data. Therefore, the deduplication and compression reduction percentages do not typically add up to the total data-reduction percentage. When client-side data deduplication is used, the compression-reduction percentage is calculated differently: it includes only the actual data reduction that results from use of the compression engine. When client-side data deduplication is performed, it occurs before compression.
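For example, with illustrative figures only: if a client backs up 100 GB and client-side data deduplication removes 40 GB of duplicate extents, compression is then applied to the remaining 60 GB. If compression reduces that data to 45 GB, the compression-reduction percentage is 25% (15 GB of 60 GB). The total data reduction is 55%, which is not the sum of the two percentages (40% + 25% = 65%), because compression is measured against the already deduplicated data.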

For the quickest backups on an unconstrained network, choose server-side data deduplication. For the largest storage savings, choose client-side data deduplication combined with compression. Avoid client compression in combination with server-side data deduplication, because compressed data typically does not deduplicate well.
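For example, server-side data deduplication is enabled on a sequential-access FILE storage pool with the DEDUPLICATE parameter. In this illustrative command, the pool name and the number of duplicate-identification processes are sample values:

   update stgpool FILEPOOL deduplicate=yes identifyprocess=4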
