Checklist for data deduplication

Data deduplication requires more processing resources on the server or client. Use this checklist to verify that your hardware and your IBM Spectrum Protect™ configuration have the characteristics that are key to good performance.

Each checklist item consists of a question; the tasks, characteristics, options, or settings to review; and a pointer to more information where one is available.
Are you using fast disk storage for the IBM Spectrum Protect database as measured in terms of input/output operations per second (IOPS)?

Use a high-performance disk for the IBM Spectrum Protect database. At a minimum, use 10,000 rpm drives for smaller databases that are 200 GB or less. For databases over 500 GB, use 15,000 rpm drives or solid-state drives.

Ensure that the IBM Spectrum Protect database has a minimum capability of 3000 IOPS. For each TB of data that is backed up daily (before data deduplication), add an extra 1000 IOPS to this minimum.

For example, an IBM Spectrum Protect server that ingests 3 TB of data per day needs 6000 IOPS for the database disks:
3000 IOPS minimum + 3000 IOPS (3 TB x 1000 IOPS per TB) = 6000 IOPS
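
This sizing rule reduces to simple arithmetic. The following Python sketch is an illustration of the rule only, not an IBM-provided tool:

    def required_database_iops(daily_ingest_tb):
        # 3000 IOPS baseline, plus 1000 IOPS for each TB of data
        # that is backed up daily (before data deduplication).
        return 3000 + 1000 * daily_ingest_tb

    print(required_database_iops(3))  # 6000, matching the example above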
More information: Checklist for server database disks

For more information about IOPS, see the IBM Spectrum Protect Blueprints.

Do you have enough memory for the size of your database?
Use a minimum of 64 GB of system memory for IBM Spectrum Protect servers that are deduplicating data. If the retained capacity of backup data grows, the memory requirement might need to be higher.

Monitor memory usage regularly to determine whether more memory is required.

Use more system memory to improve caching of database pages. The following memory size guidelines are based on the daily amount of new data that you back up (a small sketch of these guidelines follows the list):
  • 128 GB of system memory for daily backups of data, where the database size is 1 - 2 TB
  • 192 GB of system memory for daily backups of data, where the database size is 2 - 4 TB
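
As a rough illustration only (the thresholds are taken directly from the guidelines above), the sizing rule can be expressed as a Python lookup:

    def suggested_memory_gb(database_size_tb):
        # Minimum of 64 GB for any deduplicating server; more memory
        # as the database grows, per the guidelines above.
        if database_size_tb <= 1:
            return 64
        if database_size_tb <= 2:
            return 128
        return 192  # databases of 2 - 4 TB

    print(suggested_memory_gb(1.5))  # 128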
More information: Memory requirements
Have you properly sized the storage capacity for the database active log and archive log?
The suggested starting size for the active log is 16 GB.

Configure the server to have a maximum active log size of 128 GB by setting the ACTIVELOGSIZE server option to a value of 131072 (the value is specified in megabytes).

The suggested starting size for the archive log is 48 GB. The size of the archive log is limited by the size of the file system on which it is located, and not by a server option. Make the archive log at least as large as the active log.

Use a directory for the database archive logs with an initial free capacity of at least 500 GB. Specify the directory by using the ARCHLOGDIRECTORY server option.

Define space for the archive failover log by using the ARCHFAILOVERLOGDIRECTORY server option.
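
Together, these settings might appear in the server options file (dsmserv.opt) as in the following excerpt. The directory paths are placeholders for your own file systems; ACTIVELOGSIZE is specified in megabytes, so 131072 corresponds to 128 GB:

    * dsmserv.opt excerpt (example values; adjust paths for your system)
    ACTIVELOGSIZE 131072
    ARCHLOGDIRECTORY /tsm/archlog
    ARCHFAILOVERLOGDIRECTORY /tsm/archfailoverlog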

Are the IBM Spectrum Protect database and logs on separate disk volumes (LUNs)?

Is the disk that is used for the database configured according to best practices for a transactional database?

The database must not share disk volumes with IBM Spectrum Protect database logs or storage pools, or with any other application or file system.

More information: Server database and recovery log configuration and tuning
Are you using a minimum of eight (2.2 GHz or equivalent) processor cores for each IBM Spectrum Protect server that you plan to use with data deduplication?
If you plan to use client-side data deduplication, verify that client systems have adequate resources available during a backup operation to complete data deduplication processing. Use a processor that is at least the equivalent of one 2.2 GHz processor core per backup process with client-side data deduplication.

More information: Effective Planning and Use of IBM Tivoli Storage Manager V6 and V7 Deduplication

Have you properly sized disk space for storage pools?
For a rough estimate, plan for 100 GB of database storage for every 10 TB of data that is to be protected in deduplicated storage pools. Protected data is the amount of data before deduplication, including all versions of the objects that are stored. For example, protecting 50 TB of data calls for roughly 500 GB of database storage.

As a best practice, define a new container storage pool exclusively for data deduplication. Data deduplication occurs at the storage-pool level, and all data within a storage pool, except encrypted data, is deduplicated.

More information: Checklist for container storage pools
Have you estimated storage pool capacity to configure enough space for the size of your environment?
You can estimate capacity requirements for a deduplicated storage pool by using the following technique (a worked example follows the list):
  1. Estimate the base size of the source data.
  2. Estimate the daily backup size by using an estimated change and growth rate.
  3. Determine retention requirements.
  4. Estimate the total amount of source data by factoring in the base size, daily backup size, and retention requirements.
  5. Apply the deduplication ratio factor.
  6. Round up the estimate to consider transient storage pool usage.
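
The following Python sketch walks through the six steps with assumed example values (100 TB base, 2% daily change and growth, 30-day retention, 4:1 data reduction, 20% transient headroom); substitute measurements from your own environment:

    base_tb = 100.0          # 1. base size of the source data
    daily_change = 0.02      # 2. estimated daily change and growth rate
    retention_days = 30      # 3. retention requirement

    daily_backup_tb = base_tb * daily_change

    # 4. total source data: base plus retained daily backups
    total_source_tb = base_tb + daily_backup_tb * retention_days

    # 5. apply the deduplication ratio factor (assumed 4:1)
    pool_tb = total_source_tb / 4.0

    # 6. round up for transient storage pool usage (assumed 20%)
    print(f"Estimated pool capacity: {pool_tb * 1.2:.0f} TB")  # ~48 TB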

More information: Effective Planning and Use of IBM Tivoli Storage Manager V6 and V7 Deduplication

Have you distributed disk I/O over many disk devices and controllers?
Use arrays that consist of as many disks as possible, which is sometimes referred to as wide striping.

When I/O bandwidth is available and the files are large (for example, 1 MB), the process of finding duplicates can occupy the resources of an entire processor during a session or process. When files are smaller, other bottlenecks can occur.

Specify eight or more file systems for the deduplicated storage pool device class so that I/O is distributed across as many LUNs and physical devices as possible.

More information: Checklist for storage pools on DISK or FILE
Have you scheduled data deduplication processing based on your backup strategy?
If you are not creating a secondary copy of backup data, or if you are using node replication for the second copy, client backup and duplicate identification can be overlapped. Overlapping can reduce the total elapsed time for these operations, but might increase the time that is required for client backup.

If you are using storage pool backup, do not overlap client backup and duplicate identification. The best-practice sequence of operations is client backup, then storage pool backup, and then duplicate identification.

For data that is not stored with client-side data deduplication, schedule storage-pool backup operations to complete before you start data deduplication processing. Scheduling in this order avoids reconstructing deduplicated objects when a non-deduplicated copy is made to a different storage pool.

Consider doubling the time that you allow for backups when you use client-side data deduplication in an environment that is not limited by the network.

Ensure that you schedule data deduplication before you schedule compression.

More information: Scheduling data deduplication and node replication processes
Are the processes for identifying duplicates able to handle all new data that is backed up each day?
If the process completes or goes into an idle state before the next scheduled operation begins, all new data is being processed.

The duplicate identification (IDENTIFY) processes can increase the workload on the processor and system memory.

If you use a container storage pool for data deduplication, duplicate identification processing is not required.

If you update an existing storage pool, you can specify 0 - 20 duplicate identification processes to start automatically. If you do not specify any duplicate identification processes, you must start and stop processes manually.
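
For example, for a FILE storage pool that uses legacy server-side data deduplication, the number of automatic duplicate identification processes is controlled by the IDENTIFYPROCESS parameter of the UPDATE STGPOOL command. The pool name and value here are examples only:

    update stgpool DEDUPPOOL identifyprocess=4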

Is reclamation able to run to a sufficiently low threshold?
If a low threshold cannot be reached, consider the following actions (a command sketch follows the list):
  • Increase the number of processes that are used for reclamation.
  • Upgrade to faster hardware.
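
For a sequential-access storage pool, the reclamation threshold and the number of reclamation processes correspond to the RECLAIM and RECLAIMPROCESS parameters of the UPDATE STGPOOL command; the pool name, threshold, and process count below are examples only:

    update stgpool DEDUPPOOL reclaim=40 reclaimprocess=4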
Do you have enough storage to manage the DB2 lock list?
If you deduplicate data that includes large files or many files concurrently, the process can exhaust the storage space for the lock list. When lock list storage is insufficient, backup failures, data management process failures, or server outages can occur.

Files larger than 500 GB that are processed by data deduplication are the most likely to deplete lock list storage. However, if many backup operations use client-side data deduplication, this problem can also occur with smaller files.

For information about tuning the DB2® LOCKLIST parameter, see Tuning server-side data deduplication.
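
The lock list is a DB2 database configuration parameter that is sized in 4 KB pages. The following sketch assumes the server database has the conventional name TSMDB1 and uses an example value only; see the linked topic for sizing guidance:

    db2 get db cfg for TSMDB1 | grep -i locklist
    db2 update db cfg for TSMDB1 using LOCKLIST 1200000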
Is deduplication cleanup processing able to clean out the dereferenced extents to free disk space before the start of the next backup cycle?
Run the SHOW DEDUPDELETE command. The output shows that all threads are idle when the workload is complete.
If cleanup processing cannot complete, consider the following actions:
  • Increase the number of processes that are used for duplicate identification.
  • Upgrade to faster hardware.
  • Determine whether the IBM Spectrum Protect server is ingesting more data than it can process with data deduplication and consider deploying an extra IBM Spectrum Protect server.
Is sufficient bandwidth available to transfer data to an IBM Spectrum Protect server?
Use client-side data deduplication and compression to reduce the bandwidth that is required to transfer data to the server. For more information, see the enablededupcache client option.
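
On the client, these features are controlled in the client options file (dsm.sys on AIX and Linux, dsm.opt on Windows). A minimal sketch, assuming that the node is authorized for client-side data deduplication:

    * client options file excerpt (example values)
    deduplication yes
    enablededupcache yes
    compression yes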