Integrated support for data deduplication devices in DB2 for Linux, UNIX, and Windows

Learn how data deduplication offers benefits such as reducing storage requirements, improving backup performance, and reducing traffic across your network. This article explains the basic technology of deduplication, discusses the various methodologies that you can use, and shows how DB2® for Linux®, UNIX®, and Windows® has been changed to support various deduplication devices. Finally, the article includes the results of benchmark testing to illustrate how effective the Tivoli® Storage Manager server data deduplication feature can be with DB2.

Dale McInnis (dmcinnis@ca.ibm.com), Senior Technical Staff Member (STSM), IBM  

Dale McInnis is a Senior Technical Staff Member (STSM) at the IBM Toronto Canada lab. He has a B.Sc.(CS) from the University of New Brunswick and a Master of Engineering from the University of Toronto. Dale joined IBM in 1988, and has been working on the DB2 development team since 1992. Dale's area of expertise includes DB2 for LUW kernel development, where he led teams that designed the current backup and recovery architecture and other key high availability and disaster recovery technologies. Dale is a popular speaker at the International DB2 Users Group (IDUG) conferences worldwide, as well as DB2 regional user groups and IBM's Information On Demand (IOD) conference. His expertise in the DB2 availability area is well known in the information technology industry. Dale currently fills the role of DB2 Availability Architect at the IBM Toronto Canada Lab.



14 February 2012


What is data deduplication and how does it work?

Data deduplication is a specialized data compression technique for eliminating coarse-grained redundant data. This technique is used to improve storage utilization. In the deduplication process, unique chunks of data, or byte patterns, are identified and stored during a process of analysis. As the analysis continues, other chunks are compared to the stored copy and whenever a match occurs, the redundant chunk is replaced with a small reference that points to the stored chunk. Given that the same byte pattern may occur dozens, hundreds, or even thousands of times (the match frequency is dependent on the chunk size), the amount of data that must be stored or transferred can be greatly reduced. The concept is depicted in Figure 1.

Figure 1. Data deduplication
shows unique blocks and duplicate blocks, and how they appear in the data source and the target data store

For example, a typical email system might contain 100 instances of the same one megabyte (MB) file attachment. If the email platform is backed up or archived, all 100 instances are saved, requiring 100 MB of storage space. With data deduplication, only one instance of the attachment is actually stored. Each subsequent instance is just referenced back to the one saved copy. In this example, a 100 MB storage demand could be reduced to only one MB plus some nominal overhead for references to the deduplicated data.

Data deduplication offers other benefits. Lower storage space requirements will save money on disk expenditures. The more efficient use of disk space also allows for longer disk retention periods, which provides better recovery time objectives (RTO) for a longer time and reduces the need for tape backups. Data deduplication also reduces the data that must be sent across a WAN for remote backups, replication, and disaster recovery.

Data deduplication can generally operate at the file or block level. File deduplication eliminates duplicate files (as in the example above), but it is not a very efficient means of deduplication. Block deduplication looks within a file and saves a single copy of each unique block. Each chunk of data is processed using a hash algorithm such as MD5 or SHA-1. This process generates a unique number for each chunk, which is then stored in an index. If a file is updated, only the changed data is saved: if only a few bytes of a document or presentation change, only the changed blocks are saved rather than an entirely new file. This behavior makes block deduplication far more efficient. However, block deduplication takes more processing power and uses a much larger index to track the individual pieces.
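To make the mechanics concrete, here is a minimal Python sketch of block-level deduplication (an illustration only, not code from any product; the 4 KB block size and the SHA-1 choice are assumptions):

import hashlib

BLOCK_SIZE = 4096  # fixed-size blocks; real devices often use variable-size chunking

def deduplicate(data):
    """Store each unique block once; describe the data as a list of references."""
    store = {}   # the index: SHA-1 digest -> block contents
    recipe = []  # ordered digests needed to rebuild the original data
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        digest = hashlib.sha1(block).hexdigest()
        store.setdefault(digest, block)  # a duplicate becomes just a reference
        recipe.append(digest)
    return store, recipe

def rebuild(store, recipe):
    """Reassemble the original data by following the references."""
    return b"".join(store[d] for d in recipe)

data = b"A" * 8192 + b"B" * 4096 + b"A" * 4096  # the "A" block occurs three times
store, recipe = deduplicate(data)
assert rebuild(store, recipe) == data
print(len(recipe), "logical blocks,", len(store), "unique blocks stored")  # 4 logical, 2 unique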

Typically, deduplication is performed using one of two methodologies: in-line or post-process. With in-line deduplication, hash calculations and lookups are performed before data is written to disk. Consequently, in-line deduplication significantly reduces the raw disk capacity needed since not-yet-deduplicated data is never written to disk. For this reason, in-line deduplication is often considered the most efficient and economic deduplication method available. However, because it takes time to perform hash calculations and lookups, in-line deduplication can slow some operations down, although certain in-line deduplication solution vendors have been able to achieve performance that is comparable to that of post-process deduplication.

With post-process deduplication, all data is written to storage before the deduplication process is initiated. The advantage to this approach is that there is no need to wait for hash calculations and lookups to complete before data is stored. The drawback is that a greater amount of available storage is needed initially since duplicate data must be written to storage for a brief period of time. This method also increases the lag time before deduplication is complete.
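The trade-off between the two methodologies can be sketched in the same style (again a hypothetical illustration, reusing the hashing idea from the previous example): in-line deduplication pays for the hash lookup on the write path, while post-process deduplication pays for it later, plus the temporary raw capacity.

import hashlib

def inline_write(chunk, index, disk):
    """In-line: hash and look up BEFORE writing, so duplicates never reach disk."""
    digest = hashlib.sha1(chunk).hexdigest()
    if digest not in index:
        index[digest] = len(disk)
        disk.append(chunk)  # only unique chunks consume raw capacity
    return digest

def post_process(disk):
    """Post-process: everything was already written; deduplicate in a later pass."""
    index, kept = {}, []
    for chunk in disk:
        digest = hashlib.sha1(chunk).hexdigest()
        if digest not in index:
            index[digest] = len(kept)
            kept.append(chunk)
    return kept  # storage is reclaimed only after this pass completes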

How a traditional DB2 backup operation works

The DB2 backup utility was designed for maximum performance. When the utility is invoked, three types of threads are created:

  • DB2 agent (db2agent) thread
  • Buffer manipulator (db2bm) thread
  • Media controller (db2med) thread

The db2agent thread is responsible for all communications between the user and its subordinate threads. The db2bm thread is responsible for reading data from an assigned table space and placing the data in shared memory buffers. The db2med thread is responsible for reading data from the shared memory buffers and writing it to the target device.

The number of db2bm threads used is controlled by the PARALLELISM option of the BACKUP DATABASE command. The number of db2med threads used is controlled by the OPEN n SESSIONS option or by the number of target devices. This process can be seen in Figure 2.

Figure 2. DB2's backup process model
Data flows in parallel from various tablespaces, using the buffer manipulator threads to go to backup buffers, and then using media controller threads to go to storage media
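For example, the following invocation (the database name SAMPLE and the counts are illustrative) creates four db2bm threads and two db2med threads, one per TSM session:

db2 backup db SAMPLE use tsm open 2 sessions parallelism 4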

Each db2bm thread reads data from its assigned table space and places the address of each filled buffer on a shared memory queue. The db2med threads then pull backup buffers from that queue in first in, first out (FIFO) order. Because there is only one shared memory queue of full buffers, a db2med thread cannot predict which table space the data in its buffer came from, and the data from all of the table spaces ends up multiplexed across all of the output streams. This behavior is illustrated in Figure 3.

As a result, when the output streams are directed to a deduplication device, the device thrashes in an attempt to identify chunks of data that have already been backed up because the data is arriving in a different order each and every time the database is backed up.

Figure 3. Default database backup behavior
shows 3 tablespaces being backed up using TSM with 3 sessions, going to 3 backup image output streams in random chunks

Note that the metadata for a table space will appear in an output stream before any of its data and that empty extents are never placed in an output stream.


How was DB2 modified to support data deduplication devices?

Data deduplication devices work best when data is sent to them in a consistent and predictable manner; that is, when the data arrives in the same order each time. If the data arrives in random order, the data deduplication device must search its entire index looking for matches, increasing the work the device has to do. For this reason, a new option (DEDUP_DEVICE) was added to the backup utility to indicate that the target is a data deduplication-enabled device. This option ensures that data is sent to the target device in a consistent and predictable manner by altering the way the data is processed by the media controllers: the data from a single table space is sent, in order, to one and only one media controller (db2med) thread.

All data for a particular table space is always written in table space page order, from lowest to highest. This predictable and deterministic pattern of the data in each output stream makes it easy for a deduplication device to identify chunks of data that have been backed up previously. Figure 4 illustrates this change in backup behavior when the DEDUP_DEVICE option of the BACKUP DATABASE command is used.

Figure 4. Default database backup behavior when DEDUP_DEVICE option is specified
shows 3 tablespaces being backed up using TSM with 3 sessions, going to 3 backup image output streams, a separate stream for each tablespace
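On the command line, the option is simply appended to the backup invocation; the database name and session count below are illustrative:

db2 backup db SAMPLE use tsm open 4 sessions dedup_device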

One of the first customers to use the DEDUP_DEVICE option on DB2 backup reported the following results. The customer's 4 TB backups had been taking more than 6.5 hours and were achieving poor deduplication ratios of 2:1 or 3:1. With the DEDUP_DEVICE option on the backup invocation, the backup elapsed time decreased to 5.5 hours and the deduplication ratios rose to between 11:1 (90% savings) and 15:1 (93% savings). Note that any result depends on the volatility of the data: the less the data changes, the higher the data deduplication ratio will be.


How does DB2 incremental backup compare to a data deduplicated backup?

A DB2 incremental backup reads all of the pages in a table space but sends only the changed pages to the backup image. The exception is LOB and long field data, which is always added to the backup image in its entirety because it lacks a fixed page format. As a result, a DB2 incremental backup produces a backup object very similar in size to a data deduplicated backup image: essentially, only the new pages consume space.

One advantage of a data deduplicated backup over an incremental backup is the way LOBs are handled; as previously mentioned, an incremental backup always includes the entire LOB. One disadvantage of a data deduplicated backup is that it sends the entire contents of each table space over the LAN/SAN to the data deduplication device, consuming bandwidth that a DB2 incremental backup does not.
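For reference, a minimal incremental backup sequence looks like the following (the database name is illustrative). Incremental backups require the trackmod database configuration parameter to be on, and a full backup must be taken first to serve as the baseline:

db2 update db cfg for SAMPLE using TRACKMOD ON
db2 backup db SAMPLE use tsm
db2 backup db SAMPLE incremental use tsm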


Is compression compatible with data deduplication devices?

There are several forms of compression available for DB2 DBAs to explore, namely:

  • Row compression (also called table compression)
  • Adaptive compression (also called page compression)
  • DB2 backup compression
  • Tivoli Storage Manager (TSM) client compression

The previous rule of thumb was that any form of compression is incompatible with data deduplication. Testing has revealed that this assumption is false and that there are circumstances where compression and data deduplication are completely compatible. The key factor that must be determined is this: if the data remains unchanged, will the physical binary representation of the data change between backups if compression is used?

For the first two items on the list above, row and adaptive compression, the answer is no. Once the data is compressed on disk, the binary format of the data will not change between backups unless the data has been modified. This is referred to as static compression: as long as the data does not change, its representation remains the same. This type of compression is compatible with data deduplication, because the data deduplication device can easily detect the pattern.
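For example, the following statements enable row compression on a table and apply it to the existing rows (the table name is illustrative; substitute COMPRESS YES ADAPTIVE for adaptive compression, which requires DB2 10.1):

ALTER TABLE SALES COMPRESS YES STATIC
REORG TABLE SALES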

The other two forms of compression on the list, DB2 backup compression and TSM client compression, are referred to as dynamic compression. Each time the database is backed up, the binary representation of the data may change depending on where in the data stream the data falls. Both compression techniques use a "sliding window" to detect patterns; if the alignment of the window is not identical between backups, the same data will produce different compressed output, reducing the chance that the data deduplication device can find a pattern match.
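The effect is easy to demonstrate with a small, self-contained Python sketch (zlib stands in for any sliding-window compressor here; this is not the actual DB2 or TSM compression code). The same unchanged payload is compressed in two backup runs at slightly different stream offsets, and afterward the two compressed streams share no common chunks, even though a content-aware chunker could easily have matched the uncompressed payload:

import hashlib
import random
import zlib

# 64 KB of repetitive, compressible "database" data that never changes.
random.seed(0)
page = bytes(random.randrange(256) for _ in range(4096))
payload = page * 16

# In two backup runs the unchanged payload lands at slightly different
# positions in the stream (simulated by prefixes of different lengths),
# as can happen when backup output is multiplexed.
stream1 = zlib.compress(b"run1:" + payload)
stream2 = zlib.compress(b"run-2:" + payload)

def chunk_digests(stream, size=512):
    """Fingerprint fixed-size chunks, the way a deduplication index would."""
    return {hashlib.sha1(stream[i:i + size]).hexdigest()
            for i in range(0, len(stream), size)}

shared = chunk_digests(stream1) & chunk_digests(stream2)
print("compressed chunks shared between the two runs:", len(shared))  # prints 0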


Performance recommendations

The tuning parameters that let DB2 backup perform optimally against a data deduplication device are somewhat different from those used when backing up to a non-deduplication device. Specifically, data deduplication devices perform better with larger buffer sizes (the BUFFER value is in 4 KB pages; for example, 8192 or 16384) as well as more target sessions. The additional target sessions are required because DB2 backup no longer multiplexes the data across the target devices; instead, each target device receives the data from a single table space.

The default behavior of DB2 backup is optimized for throughput, so it multiplexes the data from all table spaces across all sessions to TSM. The result can be a poor factoring ratio on the data deduplication device. To counter this effect, use the largest buffer size possible, namely 16384, together with additional target sessions. To obtain the optimal data deduplication ratio, lower the number of sessions and the parallelism, although at the cost of a longer elapsed time for the backup to complete.

Some basic rules of thumb, illustrated in the sketch below, are:

  • Change the logarchmeth1 to ensure that the archived logs are not stored on a data deduplication device.
  • Increase util_heap_sz to at least 200000.
  • Change the buffer size (buffer) to 16384.
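A minimal sketch of the first two changes (the database name SAMPLE and the archive path are illustrative):

db2 update db cfg for SAMPLE using LOGARCHMETH1 DISK:/db2/archive
db2 update db cfg for SAMPLE using UTIL_HEAP_SZ 200000

The buffer size is specified on the BACKUP DATABASE command itself.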

Here's an example DB2 backup invocation:

db2 backup db databasename use tsm open 10 sessions dedup_device buffer 16384

Note: The operation above requires 1.3 GB of memory. If that is too much, use buffer 8192 instead of buffer 16384.


Tivoli Storage Manager native data deduplication capabilities

TSM provides two options for performing deduplication: client-side and server-side deduplication. Both methods use the same algorithm to identify redundant data; however, when and where the deduplication processing occurs differs.

TSM server-side deduplication

With server-side deduplication, all of the processing of redundant data occurs on the TSM server, after the data has been backed up. Server-side deduplication is also called "target-side" deduplication. The key characteristics of server-side deduplication are:

  • Duplicate data is identified after backup data has been transferred to the storage pool volume.
  • The duplicate identification processing must run regularly on the server, and will consume TSM server CPU and TSM database resources.
  • Storage pool data reduction is not realized until data from the deduplication storage pool is moved to another storage pool volume, usually through a reclamation process, although it can also occur during a TSM "MOVE DATA" process (see the command sketch after this list).
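A hedged sketch of that server-side flow, as TSM administrative commands (the storage pool name DEDUPPOOL matches the occupancy queries later in this article; the FILE device class name FILEDEV and the parameter values are illustrative):

define stgpool DEDUPPOOL FILEDEV maxscratch=100 deduplicate=yes
identify duplicates DEDUPPOOL duration=60
reclaim stgpool DEDUPPOOL threshold=40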

TSM client-side deduplication

Client-side deduplication processes the redundant data during the backup process on the host system where the source data is located. The net results are virtually the same as with server-side deduplication, except that the storage savings are realized immediately, since only unique data needs to be sent to the server in its entirety; duplicate data requires only a small signature to be sent to the TSM server. Client-side deduplication is especially effective when it is important to conserve bandwidth between the TSM client and server.

TSM deduplication

The Tivoli Storage Manager administrator can specify the data deduplication location (client or server) to use with the DEDUP parameter on the REGISTER NODE or UPDATE NODE server command.
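For example, the following command (the node name DB2NODE is illustrative) permits a node to deduplicate on either side:

update node DB2NODE deduplication=clientorserver

The backup-archive client then turns client-side processing on with the DEDUPLICATION YES option in its client options file.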

IBM Tivoli Storage Manager server-side and client-side data deduplication can share data extents: extents generated on the source (client) can be reused on the target (server), and vice versa. A user can switch the deduplication location between client and server, and the location can be selected at the node level; for example, node A can use client-side data deduplication while node B uses server-side data deduplication. Deduplication can also be applied at the file level. In each case, extents can be reused between the two nodes and between different files. Note that if you are using a LAN-free configuration, TSM client-side data deduplication is not supported.

Where to enable TSM data deduplication

After you decide on an architecture using deduplication for your TSM server, you need to decide whether you will perform deduplication on the TSM clients, the TSM server, or using a combination of the two. The TSM deduplication implementation allows storage pools to manage deduplication performed by both clients and the TSM server. The server is optimized to only perform deduplication on data that has not been deduplicated by the TSM clients. Furthermore, duplicate data can be identified across objects regardless of whether the deduplication is performed on the client or server.

These benefits allow for hybrid configurations that efficiently apply client-side deduplication to a subset of clients, and use server-side deduplication for the remaining clients. Typically a combination of both client-side and server-side data deduplication is the most appropriate. Here are some further points to consider:

  • Server-side deduplication is a two-step process of duplicate data identification followed by reclamation to remove the duplicate data. Client-side deduplication stores the data directly in a deduplicated format, reducing the need for the extra reclamation processing.
  • Deduplication on the client can be combined with compression to provide the largest possible storage savings.
  • Client-side deduplication processing can increase backup durations: when network bandwidth is not the limiting factor, expect longer backups, and a doubling of backup duration is a reasonable estimate. In addition, if you create a secondary copy using storage pool backup and the copy storage pool is not deduplicated, the data movement will take longer because the deduplicated data must be reconstructed.
  • Client-side deduplication can place a significant load on the TSM server in cases where a large number of clients are simultaneously driving deduplication processing. The load is a result of the TSM server processing duplicate chunk queries from the clients. Server-side deduplication, on the other hand, typically has a relatively small number of identification processes running in a controlled fashion.
  • Client-side deduplication cannot be combined with LAN-free data movement using the Tivoli Storage Manager for SAN feature. If you are implementing one of TSM's supported LAN-free to disk solutions, then you can still consider using server-side deduplication.

For more information on how to optimize TSM for data deduplication, refer to the deduplication topic in the Tivoli Storage Manager wiki in the developerWorks community.

Performance results utilizing TSM 6.2 server data deduplication

The following sets of tests were performed on a DB2 V10.1 database consisting of 16 table spaces. Each backup was a full database offline backup with no user data change between iterations. The intent of these tests was to see how effective the TSM server data deduplication feature could be.

Test #1

Settings:
Util_heap_sz=300000
Network: 1 GB Ethernet

Table 1. Command: DB2 backup db PERF use TSM open 8 sessions dedup_device with 18 buffers buffer 16384 parallelism 8
Start | End | Elapsed time (HH:MM:SS) | Actual DB size (bytes) | Data size stored on TSM server (bytes) | % savings
1:17:08 PM | 1:45:16 PM | 0:28:08 | 136,499,490,816 | 136,499,490,816 | 0%
1:45:16 PM | 2:11:42 PM | 0:26:26 | 136,499,490,816 | 136,499,490,816 | 0%
2:11:42 PM | 2:38:51 PM | 0:27:09 | 136,499,490,816 | 136,499,490,816 | 0%
2:38:51 PM | 3:05:51 PM | 0:27:00 | 136,499,490,816 | 136,499,490,816 | 0%
3:05:51 PM | 3:32:58 PM | 0:27:07 | 136,499,490,816 | 136,499,490,816 | 0%

After IDENTIFY DUPLICATES and RECLAIM STGPOOL were run on the TSM server, the duplicate data was removed. The query to determine the actual storage used is as follows:

select sum(reporting_mb) as "Before dedup (MB)", sum(logical_mb) as "After dedup (MB)",
 (sum(reporting_mb) - sum(logical_mb)) 
  as "Dedup savings (MB)" from tsmdb1.occupancy where STGPOOL_NAME='DEDUPPOOL'
Table 2. Deduplication savings
Before dedup (MB) | After dedup (MB) | Dedup savings (MB)
650,978.52 | 130,532.88 | 520,445.64

Total savings 80%

Network: 1 GB Ethernet

Table 3. Command: DB2 backup db PERF use TSM open 8 sessions with 18 buffers buffer 16384 parallelism 8
Start | End | Elapsed time (HH:MM:SS) | Actual DB size (bytes) | Data size stored on TSM server (bytes) | % savings
2:36:50 PM | 3:06:13 PM | 0:29:23 | 136,499,490,816 | 136,499,490,816 | 0%
3:06:13 PM | 3:33:13 PM | 0:27:00 | 136,499,490,816 | 136,499,490,816 | 0%
3:33:13 PM | 4:00:33 PM | 0:27:00 | 136,499,490,816 | 136,499,490,816 | 0%
4:00:33 PM | 4:28:11 PM | 0:27:38 | 136,499,490,816 | 136,499,490,816 | 0%
4:28:11 PM | 4:55:15 PM | 0:27:04 | 136,499,490,816 | 136,499,490,816 | 0%

After IDENTIFY DUPLICATES and RECLAIM STGPOOL were run on the TSM server, the duplicate data was removed. The query to determine the actual storage used is as follows:

select sum(reporting_mb) as "Before dedup (MB)", sum(logical_mb) as "After dedup (MB)",
  (sum(reporting_mb) - sum(logical_mb)) as "Dedup savings (MB)" from tsmdb1.occupancy 
  where STGPOOL_NAME='DEDUPPOOL'
Table 4. Deduplication savings
Before dedup (MB) | After dedup (MB) | Dedup savings (MB)
650,978.69 | 137,816.74 | 513,161.95

Total savings 78%

Note that without the dedup_device option, the IDENTIFY DUPLICATES process ran much longer and consumed more CPU resources on the TSM server.

Performance results utilizing TSM 6.2 client-side data deduplication

The following sets of tests were performed on a DB2 V10.1 database consisting of 16 table spaces. Each backup was a full database offline backup with no user data change between iterations. The intent of these tests was to see how effective the TSM client-side data deduplication feature could be.

Test #1

Settings:
Util_heap_sz=50000
Network: 100 MB Ethernet

Table 5. Command: DB2 backup db PERF use TSM open 8 sessions dedup_device with 18 buffers buffer 2465 parallelism 8
Start | End | Elapsed time (HH:MM:SS) | Actual DB size (GB) | Dedup size sent to server (GB) | % savings in bandwidth
12:01:31 PM | 12:52:07 PM | 0:50:36 | 126.57 | 126.42 | n/a
12:52:07 PM | 1:19:13 PM | 0:27:06 | 126.57 | 2.76 | 97.8%
1:19:13 PM | 1:47:15 PM | 0:28:02 | 126.57 | 1.85 | 98.5%
1:47:15 PM | 2:15:17 PM | 0:28:02 | 126.57 | 1.23 | 99.0%
2:15:17 PM | 2:45:00 PM | 0:29:43 | 126.57 | 0.89 | 99.3%

After IDENTIFY DUPLICATES and RECLAIM STGPOOL were run on the TSM server, the duplicate data was removed. The query to determine the actual storage used is as follows:

select sum(reporting_mb) as "Before dedup (MB)", sum(logical_mb) as "After dedup (MB)",
  (sum(reporting_mb) - sum(logical_mb)) as "Dedup savings (MB)" 
  from tsmdb1.occupancy where STGPOOL_NAME='DEDUPPOOL'
Table 6. Deduplication savings
Before dedup (MB) | After dedup (MB) | Dedup savings (MB)
648,080.26 | 136,409.84 | 511,670.42

Total savings 79.0%

Settings:
Util_heap_sz=50000
Network: 1 GB Ethernet

Table 7. Command: DB2 backup db PERF use TSM open 8 sessions dedup_device with 18 buffers buffer 2465 parallelism 8
Start | End | Elapsed time (HH:MM:SS) | Actual DB size (bytes) | Dedup size sent to server (bytes) | % savings in bandwidth
7:38:00 AM | 8:29:43 AM | 0:51:43 | 135,900,835,840 | 135,745,616,612 | n/a
8:29:43 AM | 8:57:02 AM | 0:27:19 | 135,900,835,840 | 2,718,913,869 | 98%
8:57:02 AM | 9:24:31 AM | 0:27:29 | 135,900,835,840 | 1,746,985,403 | 98.7%
9:24:31 AM | 9:53:46 AM | 0:29:15 | 135,900,835,840 | 1,415,564,403 | 99%
9:53:46 AM | 10:22:05 AM | 0:28:19 | 135,900,835,840 | 1,004,722,883 | 99.3%

After IDENTIFY DUPLICATES and RECLAIM STGPOOL were run on the TSM server, the duplicate data was removed. The query to determine the actual storage used is as follows:

select sum(reporting_mb) as "Before dedup (MB)", sum(logical_mb) as "After dedup (MB)",
  (sum(reporting_mb) - sum(logical_mb)) 
  as "Dedup savings (MB)" from tsmdb1.occupancy where STGPOOL_NAME='DEDUPPOOL'
Table 8. Deduplication savings
Before dedup (MB) | After dedup (MB) | Dedup savings (MB)
648,080.11 | 136,095.33 | 511,984.78

Total savings 79%

Network: 1 GB Ethernet

Table 9. Command: DB2 backup db PERF use TSM open 8 sessions dedup_device with 18 buffers buffer 4097 parallelism 8
Start | End | Elapsed time (HH:MM:SS) | Actual DB size (bytes) | Dedup size sent to server (bytes) | % savings in bandwidth
1:35:53 PM | 2:47:34 PM | 1:11:41 | 135,900,835,840 | 135,755,504,757 | n/a
2:47:34 PM | 3:28:58 PM | 0:41:24 | 135,900,835,840 | 5,729,869,039 | 96%
3:28:58 PM | 4:12:11 PM | 0:43:13 | 135,900,835,840 | 4,316,001,055 | 96.8%
4:12:11 PM | 4:55:55 PM | 0:43:44 | 135,900,835,840 | 3,504,225,974 | 97.4%
4:55:55 PM | 5:31:12 PM | 0:35:17 | 135,900,835,840 | 2,987,852,643 | 97.8%

After IDENTIFY DUPLICATES and RECLAIM STGPOOL were run on the TSM server, the duplicate data was removed. The query to determine the actual storage used is as follows:

select sum(reporting_mb) as "Before dedup (MB)", sum(logical_mb) as "After dedup (MB)",
  (sum(reporting_mb) - sum(logical_mb)) as "Dedup savings (MB)" 
  from tsmdb1.occupancy where STGPOOL_NAME='DEDUPPOOL'
Table 10. Deduplication savings
Before dedup (MB) | After dedup (MB) | Dedup savings (MB)
648,084.24 | 145,314.66 | 502,769.58

Total savings 77.6%

Test #2

Settings:
Util_heap_sz=75000
Network: 100 MB Ethernet

Table 11. Command: DB2 backup db PERF use TSM open 8 sessions with 18 buffers buffer 2465 parallelism 8
Start | End | Elapsed time (HH:MM:SS) | Actual DB size (bytes) | Dedup size sent to server (bytes) | % savings in bandwidth
1:01:57 PM | 1:44:46 PM | 0:42:49 | 135,945,469,952 | 135,723,707,795 | n/a
1:44:46 PM | 2:12:23 PM | 0:27:37 | 135,945,469,952 | 1,712,561,325 | 98.7%
2:12:23 PM | 2:39:29 PM | 0:27:06 | 135,945,469,952 | 1,166,734,151 | 99.1%
2:39:29 PM | 3:07:21 PM | 0:27:52 | 135,945,469,952 | 664,465,273 | 99.5%
3:07:21 PM | 4:38:42 PM | 0:31:21 | 135,945,469,952 | 777,251,038 | 99.4%

After IDENTIFY DUPLICATES and RECLAIM STGPOOL were run on the TSM server, the duplicate data was removed. The query to determine the actual storage used is as follows:

select sum(reporting_mb) as "Before dedup (MB)", sum(logical_mb) as "After dedup (MB)",
  (sum(reporting_mb) - sum(logical_mb)) as "Dedup savings (MB)" 
  from tsmdb1.occupancy where STGPOOL_NAME='DEDUPPOOL'
Table 12. Deduplication savings
Before dedup (MB) | After dedup (MB) | Dedup savings (MB)
648,291.89 | 133,626.74 | 514,665.15

Total savings 79.4%

Network: 1 GB Ethernet

Table 13. Command: DB2 backup db PERF use TSM open 8 sessions dedup_device with 18 buffers buffer 4097 parallelism 8
Start | End | Elapsed time (HH:MM:SS) | Actual DB size (bytes) | Dedup size sent to server (bytes) | % savings in bandwidth
4:01:02 PM | 4:45:04 PM | 0:44:02 | 135,945,469,952 | 135,723,196,925 | n/a
4:45:04 PM | 5:13:36 PM | 0:28:32 | 135,945,469,952 | 1,629,647,932 | 98.8%
5:13:36 PM | 5:42:11 PM | 0:28:35 | 135,945,469,952 | 1,181,549,733 | 99.1%
5:42:11 PM | 6:10:25 PM | 0:28:14 | 135,945,469,952 | 886,384,149 | 99.3%
6:10:25 PM | 6:39:01 PM | 0:28:36 | 135,945,469,952 | 602,280,089 | 99.6%

After IDENTIFY DUPLICATES and RECLAIM STGPOOL were run on the TSM server, the duplicate data was removed. The query to determine the actual storage used is as follows:

select sum(reporting_mb) as "Before dedup (MB)", sum(logical_mb) as "After dedup (MB)",
  (sum(reporting_mb) - sum(logical_mb)) as "Dedup savings (MB)" 
  from tsmdb1.occupancy where STGPOOL_NAME='DEDUPPOOL'
Table 14. Deduplication savings
Before dedup (MB) | After dedup (MB) | Dedup savings (MB)
648,291.89 | 133,606.08 | 514,685.80

Total savings 79.4%

Network: 1 GB Ethernet

Table 15. Command: DB2 backup db PERF use TSM open 8 sessions with 18 buffers buffer 4097 parallelism 8
Start | End | Elapsed time (HH:MM:SS) | Actual DB size (bytes) | Dedup size sent to server (bytes) | % savings in bandwidth
5:20:37 AM | 6:08:14 AM | 0:47:37 | 135,945,469,952 | 135,740,137,892 | n/a
6:08:14 AM | 6:34:32 AM | 0:26:17 | 135,945,469,952 | 4,178,586,402 | 96.9%
6:34:32 AM | 7:01:44 AM | 0:27:12 | 135,945,469,952 | 3,106,065,730 | 97.7%
7:01:44 AM | 7:29:43 AM | 0:27:59 | 135,945,469,952 | 2,586,752,688 | 98.1%
7:29:43 AM | 8:10:32 AM | 0:40:49 | 135,945,469,952 | 2,336,290,243 | 98.3%

After IDENTIFY DUPLICATES and RECLAIM STGPOOL were run on the TSM server, the duplicate data was removed. The query to determine the actual storage used is as follows:

select sum(reporting_mb) as "Before dedup (MB)", sum(logical_mb) as "After dedup (MB)",
  (sum(reporting_mb) - sum(logical_mb)) as "Dedup savings (MB)" 
  from tsmdb1.occupancy where STGPOOL_NAME='DEDUPPOOL'
Table 16. Deduplication savings
Before dedup (MB) | After dedup (MB) | Dedup savings (MB)
648,295.32 | 141,168.09 | 507,127.23

Total savings 78.2%

Test #3

Settings:
Util_heap_sz=300000
Network: 100 MB Ethernet

Table 17. Command: DB2 backup db PERF use TSM open 8 sessions dedup_device with 18 buffers buffer 16384 parallelism 8
Start | End | Elapsed time (HH:MM:SS) | Actual DB size (bytes) | Dedup size sent to server (bytes) | % savings in bandwidth
11:33:02 PM | 12:24:22 AM | 0:51:20 | 136,499,490,816 | 135,729,252,986 | n/a
12:24:22 AM | 12:54:33 AM | 0:30:11 | 136,499,490,816 | 420,317,177 | 99.7%
12:54:33 AM | 1:23:07 AM | 0:28:24 | 136,499,490,816 | 154,837,967 | 99.9%
1:23:07 AM | 1:49:59 AM | 0:26:52 | 136,499,490,816 | 119,557,230 | 99.9%
1:49:59 AM | 2:17:42 AM | 0:27:43 | 136,499,490,816 | 311,728,733 | 99.8%

After IDENTIFY DUPLICATES and RECLAIM STGPOOL were run on the TSM server, the duplicate data was removed. The query to determine the actual storage used is as follows:

select sum(reporting_mb) as "Before dedup (MB)", sum(logical_mb) as "After dedup (MB)",
  (sum(reporting_mb) - sum(logical_mb)) as "Dedup savings (MB)" 
  from tsmdb1.occupancy where STGPOOL_NAME='DEDUPPOOL'
Table 18. Deduplication savings
Before dedup (MB) | After dedup (MB) | Dedup savings (MB)
650,832.30 | 130,469.26 | 520,463.04

Total savings 80.0%

Network: 1 GB Ethernet

Table 19. Command: DB2 backup db PERF use TSM open 8 sessions dedup_device with 18 buffers buffer 16384 parallelism 8
Start | End | Elapsed time (HH:MM:SS) | Actual DB size (bytes) | Dedup size sent to server (bytes) | % savings in bandwidth
8:41:58 AM | 9:40:35 AM | 0:58:37 | 136,499,490,816 | 135,729,352,990 | n/a
9:40:35 AM | 10:07:13 AM | 0:26:38 | 136,499,490,816 | 712,739,399 | 99.5%
10:07:13 AM | 10:32:49 AM | 0:25:36 | 136,499,490,816 | 305,983,554 | 99.8%
10:32:49 AM | 11:01:43 AM | 0:28:54 | 136,499,490,816 | 270,186,307 | 99.8%
11:01:43 AM | 11:28:44 AM | 0:27:01 | 136,499,490,816 | 152,066,125 | 99.9%

After IDENTIFY DUPLICATES and RECLAIM STGPOOL were run on the TSM server, the duplicate data was removed. The query to determine the actual storage used is as follows:

select sum(reporting_mb) as "Before dedup (MB)", sum(logical_mb) as "After dedup (MB)",
  (sum(reporting_mb) - sum(logical_mb)) as "Dedup savings (MB)" 
  from tsmdb1.occupancy where STGPOOL_NAME='DEDUPPOOL'
Table 20. Deduplication savings
Before dedup (MB) | After dedup (MB) | Dedup savings (MB)
650,932.50 | 130,884.00 | 520,048.50

Total savings 79.9%

Network: 1 GB Ethernet

Table 21. Command: DB2 backup db PERF use TSM open 8 sessions with 18 buffers buffer 16384 parallelism 8
Start | End | Elapsed time (HH:MM:SS) | Actual DB size (bytes) | Dedup size sent to server (bytes) | % savings in bandwidth
1:56:22 PM | 2:33:08 PM | 0:36:46 | 136,499,490,816 | 135,747,381,965 | n/a
2:33:08 PM | 2:59:27 PM | 0:25:19 | 136,499,490,816 | 1,804,037,081 | 98.7%
2:59:27 PM | 3:27:15 PM | 0:28:48 | 136,499,490,816 | 1,263,417,260 | 99.1%
3:27:15 PM | 3:55:49 PM | 0:28:34 | 136,499,490,816 | 1,169,134,370 | 99.1%
3:55:49 PM | 4:24:23 PM | 0:28:34 | 136,499,490,816 | 1,215,591,626 | 99.1%

After IDENTIFY DUPLICATES and RECLAIM STGPOOL were run on the TSM server, the duplicate data was removed. The query to determine the actual storage used is as follows:

select sum(reporting_mb) as "Before dedup (MB)", sum(logical_mb) as "After dedup (MB)",
  (sum(reporting_mb) - sum(logical_mb)) as "Dedup savings (MB)" 
  from tsmdb1.occupancy where STGPOOL_NAME='DEDUPPOOL'
Table 22. Deduplication savings
Before dedup (MB) | After dedup (MB) | Dedup savings (MB)
650,934.29 | 134,728.84 | 516,205.45

Total savings 79.3%


Conclusion

Using the DEDUP_DEVICE option on the backup invocation results in a backup image that is optimized for data deduplication devices. The option itself may increase the elapsed time of the backup utility; however, several customers have demonstrated that the elapsed time remains approximately the same as without the option. These customers also reported a huge increase in the data deduplication device's ability to identify duplicate blocks when the DEDUP_DEVICE option is specified.

Where possible, use the TSM client-side data deduplication feature, as it will have the biggest positive impact on DB2 backup performance. The next best alternative is to use a target device that supports data deduplication, such as the IBM System Storage TS7650G ProtecTIER Deduplication Gateway.

Even though DB2 Backup Compression is very effective, it does have a dramatic negative effect on the overall elapsed time for backup if a data deduplication device is being used.

As the TSM client-side data deduplication results demonstrate, using the fastest possible network interface makes a substantial difference in the overall backup elapsed time.

