- The nature of data
- The alternative row format
- Row compression
- Revealing compression settings
- Building a compression dictionary
- Automatic Dictionary Creation (ADC)
- Identifying candidate tables for row compression
- Estimating storage savings from using row compression
- Index compression
- Identifying candidate indexes for index compression
- Estimating storage space savings for index compression
- Compression of temporary tables
- Adoption strategies
- Returning unused space to the file system
- Compression and the Load utility
- Compression of backup images and log archives
- Downloadable resources
- Related topics
Optimize storage with deep compression in DB2 10
Using deep compression to minimize storage space and improve database performance
The nature of data
Believe it or not, as the volume of data increases, the cardinality of that data tends to drop. Consequently, there just aren't that many truly unique things in the world. Things may be unique when used in combination, but the basic elements themselves aren't all that varied. Consider the periodic table of elements — everything in our world is made up of combinations of a rather small set of elements. Apply the concept to data, and you find the same is true. For instance, according to the last U.S. census, there are about 300 million people in living in the United States. But there are only about 78,800 unique last names, resulting in very low cardinality with huge "clumps" in certain name sets. First names are even worse, coming in at around 6,600 (4,400 unique first names for females and 2,200 for males). The names of cities, streets, and addresses, not to mention product names, descriptions, and attributes, also tend to be highly redundant with low cardinality.
How deep compression works
Deep compression is based on the principle that large amounts of data tend to be highly redundant. It works by searching for repeating patterns in the data and replacing those patterns with 12-bit symbols, which are stored along with the patterns they represent in a static dictionary. Once this dictionary is created, it's stored in the table (or page), along with the compressed data, and loaded into memory whenever data in the table is accessed (to aid in decompression). When deep compression is enabled, DB2 scans an entire table looking for repeating column values and repeating patterns that span multiple columns in a row or on a page. It also looks for repeating patterns that are substrings of a given column. However, just because a repeating pattern is found doesn't mean that the data is automatically compressed. Data is compressed only when storage savings will be realized. In any case, the entire row is stored as a set of 12-bit symbols; rows are never partially compressed.
Deep compression can be applied to primary user data (row data, XML data, and indices), system-generated data (temporary tables), and administrative data (backup images and archived transaction logs). Compression facilities for administrative data are available in all editions of the DB2 software; Advanced compression features for primary user and system-generated data are available through the DB2 Storage Optimization Feature.
The alternative row format
When it comes to creating user-defined tables, there are two row formats you can use —standard and alternative— and the format you choose determines how rows are packed when they are stored on disk. The alternative row format allows for more compact storage of NULL and system default values, as well as zero-length values in columns with variable-length data types. (The variable-length data types recognized by DB2 are VARCHAR, VARGRAPHIC, LONG VARCHAR, LONG VARGRAPHIC, BLOB, CLOB, DBCLOB, and XML.) Consequently, this format is often referred to as NULL and default value compression or just value compression.
With the standard row format, space for fixed-length column values is allocated even if the actual stored value is NULL. Similarly, zero-length values stored in columns with variable-length data types consume a small amount of space. However, with the alternative row format, NULL values in all columns and zero-length values in columns with variable-length data types consume no space.
When compared to the standard row format, the alternative row format also
reduces the storage overhead for all other values in columns with
variable-length data types. However, the storage consumption for all
non-NULL values that are stored in fixed-length columns increases.
Formulas for determining the byte counts for values of all supported data
types and row formats can be found in the reference material for the
CREATE TABLE statement.
Controlling row formating
You can control the row format used on a table-by-table basis. By
default, tables use the standard row format. To create a table that uses
the alternative row format, simply append the
VALUE COMPRESSION clause to the
CREATE TABLE statement that's used to
create the table. For example:
CREATE TABLE ...
To change the row format used by an existing table, issue one of the following statements instead:
ALTER TABLE ... ACTIVATE VALUE COMPRESSION ALTER TABLE ... DEACTIVATE VALUE COMPRESSION
If you activate or deactivate value compression for a table, existing data stored in the table is not modified. Instead, rows remain in their current row format unless you apply one or more of the measures presented in the Adoption strategies section of this article.
For tables that use the alternative row format, you can save additional
space by enabling default value compression for fixed-length numeric and
character columns. This results in system default values (0 for numeric
columns and blank for fixed-length character columns) not being
materialized in the on-disk representation of a row. You can enable
compression of system default values for each column individually by
COMPRESS SYSTEM DEFAULT column
option for the column in a
CREATE TABLE or
ALTER TABLE ... ALTER COLUMN SQL statement.
Deciding on which format to use
Although the standard row format is a good choice in most cases, there are some tables for which the alternative row format yields a more compact storage layout:
- Sparsely populated tables (i.e., tables that contain many rows with many NULL or system default values) should use the alternative row format. However, storage space requirements increase for a row whenever you update a NULL or system default value to a value other than NULL or the system default. This change generally causes overflow records to occur even in tables with no variable-length columns (i.e., tables in which all rows have the same space requirements when the standard row format is used).
- Tables having variable-length data types for the vast majority of their columns should use the alternative row format.
For some tables that do not have those characteristics, using the alternative row format can increase storage space requirements. A test in your environment might be worthwhile.
You can use the alternative row format regardless of whether you have a license for the DB2 Storage Optimization Feature. This flexibility allows you to choose the most compact storage layout for each table, even if you are not planning on using row compression. However, if you are planning on using row compression, choosing the more compact row format to start with yields a smaller on-disk footprint of the table in most cases, though the effect is less dramatic or even negligible. This is due to the fact that row compression can compress the materialized NULL and system default values in tables with the standard row format very well.
Row compression was introduced with DB2 9 for Linux, UNIX, and Windows, and significant improvements have been made to its functionality in every release since, culminating in the next-generation adaptive compression functionality available in DB2 10.1. Row compression requires a license for the DB2 Storage Optimization Feature.
Beginning with DB2 10.1, two types of row compression methods are available:
- Classic row compression— Refers to the compression technology that has been used:
- For user table data, since DB2 9 for Linux, UNIX, and Windows was introduced
- For XML and temporary data, since DB2 9.7 for Linux, UNIX, and Windows was introduced
- Adaptive row compression— A new compression mode introduced in DB2 10.1 that you can apply to user table data. Adaptive row compression is superior to classic row compression in that it generally achieves better compression and requires less database maintenance to keep the compression ratio near an optimal level.
Classic row compression
Classic row compression relies on a dictionary-based compression algorithm. There is one compression dictionary for each table object. The dictionary contains a mapping of patterns that frequently occur in rows throughout the whole table. This compression dictionary is referred to as the table-level compression dictionary. An example of how classic row compression works can be seen in Figure 1.
Figure 1. How classic row compression works
You can enable row compression on a table-by-table basis. For classic
row compression to occur, you must enable the table for compression, and a
dictionary must exist for the data or XML object. To enable classic row
compression for a table in DB2 10.1 at table creation time, issue
CREATE TABLE ... COMPRESS YES
On the other hand, to enable an existing table for classic row compression,
ALTER TABLE statement that looks
ALTER TABLE ... COMPRESS YES
In DB2 10.1, the
STATIC option in the
COMPRESS YES clause is mandatory in both cases.
In earlier versions of DB2, you use the
COMPRESS YES clause without any further
qualification, as follows:
CREATE TABLE ... COMPRESS YES ALTER TABLE ... COMPRESS YES
Enabling row compression results in space savings for the vast majority of practical tables; compression ratios are typically 50-80 percent or more. Furthermore, the storage footprint for a table that uses row compression never exceeds that of the uncompressed version of the same table.
Storage space savings from row compression typically translate into fewer physical I/O operations for reading the data in a compressed table, since the same number of rows is stored on fewer physical pages. (Compression allows more data rows to be packed into the same number of pages, and buffer pool hit ratios increase.) In many cases, I/O savings and improved buffer pool utilization results in higher throughput and faster query execution times.
Adaptive row compression
Like classic row compression, adaptive row compression relies on a dictionary-based compression algorithm. The difference is that there can be multiple compression dictionaries for each table object. Each dictionary contains a mapping of patterns that frequently occur in rows throughout a single page. Consequently, these compression dictionaries are referred to as page-level compression dictionaries. Figure 2 shows an example of how adaptive row compression works. Adaptive compression not only yields significantly better compression ratios in many cases but can also adapt to changing data characteristics.
Figure 2. How adaptive row compression works
As with classic row compression, you enable adaptive row compression on a table-by-table basis. In DB2 10.1, to create a table that has adaptive row compression enabled, you simply issue one of the following statements:
CREATE TABLE ... COMPRESS YES ADAPTIVE CREATE TABLE ... COMPRESS YES
On the other hand, to enable an existing table for adaptive row
compression, execute an
ALTER TABLE statement
that looks like this:
ALTER TABLE ... COMPRESS YES ADAPTIVE ALTER TABLE ... COMPRESS YES
In DB2 10.1, adaptive row compression is the default type of row
compression used. Therefore, the
option is the default value for the
COMPRESS YES clause.
When you upgrade a database from an earlier version of DB2 for Linux, UNIX,
and Windows, existing tables that had classic row compression enabled keep
their compression settings, and classic row compression is used.
Consequently, if you want to enable these tables for adaptive row
compression, you must use one of the
ALTER TABLE statements shown earlier.
As mentioned, adaptive row compression builds on classic row compression; table-level compression dictionaries are still used. A table-level dictionary is complemented by page-level compression dictionaries, which contain entries for frequently occurring patterns within a single page. The table-level dictionary helps eliminate repeating patterns in a global scope, while page-level dictionaries take care of locally repeating patterns found on individual pages. How this combination of both table-level and page-level dictionaries are used is illustrated in Figure 3.
Figure 3. How table-level and page-level compression dictionaries work together when adaptive compression is used
Page-level dictionaries are maintained automatically. When a page becomes filled with data, the DB2 database manager builds a page-level compression dictionary for the data in that page. Over time, the DB2 database manager automatically determines when to rebuild the dictionary for pages where data patterns have changed significantly. as a result, the use of adaptive compression not only yields higher overall compression savings, but it also ensures that compression ratios do not degenerate as much over time as with classic row compression. In many practical cases, the compression ratio remains nearly optimal over time. Thus, by using adaptive row compression, you can reduce the cost associated with monitoring compression ratios of tables and performing the necessary maintenance (classic, offline table reorganization) to improve storage utilization.
Revealing compression settings
By examining the
COMPRESSION column in the
SYSCAT.TABLES system catalog view, you can
determine whether a table has row compression enabled and which row format
it uses. Possible values for this column are as follows:
- V.— Uses the alternative row format and no row compression
- R.— Uses row compression and the standard row format
- B.— Uses the alternative row format and row compression
- N.— Does not use row compression but uses the standard row format
With DB2 10.1, you can determine which row compression type is used
on your tables by examining the
column in the
SYSCAT.TABLES system catalog
view. Possible values for this column:
- S.— Uses classic row compression
- A.— Uses adaptive row compression
- Blank.— Is not enabled for row compression
ROWCOMPMODE column of the
SYSCAT.TABLES system catalog view does not
exist in earlier versions of DB2. In this case, all tables that have row
compression enabled implicitly use classic row compression.
If you have DB2 10.1 for Linux, UNIX, and Windows installed, the following query can be used to obtain the compression settings for all user tables that have been defined.
Listing 1. Obtaining compression settings with a query
SELECT SUBSTR(TABSCHEMA, 1, 10) AS TABSCHEMA, SUBSTR(TABNAME, 1, 10) AS TABNAME, COMPRESSION, ROWCOMPMODE FROM SYSCAT.TABLES WHERE TABSCHEMA NOT LIKE 'SYS%'
When this query is executed, you should see results that look something like Listing 2.
Listing 2. Query results showing compression settings
Sample results are as follows: TABSCHEMA TABNAME COMPRESSION ROWCOMPMODE ---------- ---------- ----------- ----------- DB2INST1 ACCTCR R S DB2INST1 BKPF R A DB2INST1 BSIS B A DB2INST1 CDCLS N DB2INST1 CDHDR V DB2INST1 COSP B S 6 record(s) selected.
In this example, the ACCTCR, BKPF, BSIS, and COSP tables are all enabled for row compression. The ACCTCR and COSP tables use classic row compression, and the BKPF and BSIS tables use adaptive row compression. The BSIS and COSP tables also use the alternative row format, whereas the ACCTCR and BKPF tables use the standard row format. The CDHDR table uses the alternative row format without row compression, and the CDCLS table uses the standard row format without row compression.
Building a compression dictionary
Before classic row compression can take effect in a table that has been
enabled for compression, a dictionary must exist for the table. With DB2
9, you had to be conscious of whether a compression dictionary
existed and you had to initiate the creation of a
table-level dictionary if necessary. A table-level compression dictionary can be built
(and existing data in a table can be compressed) by performing an offline
(classic) table reorganization operation. Such an operation is initiated
by executing the
REORG command with the
KEEPDICTIONARY or the
RESETDICTIONARY option specified. If the
REORG command is executed with either option
specified, and a compression dictionary does not exist, a new dictionary
will be created. On the other hand, if the
REORG command is executed with either option
specified, and a compression dictionary already exists, the existing
dictionary will either be recreated
RESETDICTIONARY) or left as it is
KEEPDICTIONARY) and data in the table will be
reorganized and compressed.
To create a new compression dictionary for a table named EMPLOYEE that has been enabled for deep compression, (as well as reorganize and compress the table's data) you would execute the following command:
REORG TABLE employee RESETDICTIONARY
When this command is executed, data stored in the EMPLOYEE table will be analyzed, a compression dictionary will be constructed and stored at the beginning of the table, and all existing data will be compressed and written to the table directly behind the compression dictionary. Figure 4 shows how the EMPLOYEE table would look before and after deep compression is applied.
Figure 4. How data in a table is altered when a compression dictionary is built and classic row compression is applied
It is important to note that because records in a compressed table are moved between storage and memory in compressed form, records for compressed tables written to transaction log files are compressed as well. (The compression dictionary is loaded into memory when the table is accessed so compression and decompression can take place.)
Automatic Dictionary Creation (ADC)
As you might imagine, optimal compression ratios are achieved when a compression dictionary is built from an all-inclusive set of data. Consequently, when a compression dictionary is built by reorganizing a table, a high compression ratio results because every row in the table is used. However, testing has shown that a good compression ratio is also possible when just a small amount of representative data is analyzed. (In some cases, evaluating less than 1 percent of the total number of rows available has yielded a compression ratio of 45 percent.) This concept forms the basis for a compression feature that was added in DB2 9.5 called Automatic Dictionary Creation (ADC).
In DB2 9.5 and later, if a table is enabled for compression at the time it's created, ADC will cause a compression dictionary to be built automatically once a sufficient amount of data has been stored in the table. The threshold at which ADC kicks in and begins constructing the compression dictionary depends on the table's row size. Dictionary construction typically begins when 1-2 MB of pages have been allocated to the table. At that point, ADC will check to see how much user data is contained within the table; if at least 700 KB of data is present, a compression dictionary will be built. (Note that these values are set internally and can't be altered.) Operations that can trigger ADC include inserts, imports, loads, and the redistribution of data across partitions.
Unlike the compression dictionary built by a
REORG operation, the dictionary created by way
of ADC is stored in the table at the end of the existing data. The
table's pre-existing records remain uncompressed until an offline
table reorganization operation is performed or until the pre-existing
records are updated (in which case, each record modified is compressed when
the changes are saved). New records are compressed as they're added.
(One goal of ADC is to build a compression dictionary that will
yield a decent compression ratio without leaving a large amount of
uncompressed data in the table.) Figure 5 shows what a table
enabled for compression during creation would look like before, during,
and after a compression dictionary is built by ADC.
Figure 5. How ADC works
When compression is enabled for a table populated (by
COMPRESS attribute to
ON), a compression dictionary isn't
created automatically. Rather, the next time a table growth action occurs,
ADC will be triggered, and a small number of records at the beginning of
the table will be used to construct a compression dictionary for the
entire table. Once the dictionary is created, data added to the table by
subsequent insert, import, load, and redistribution operations will be
compressed; pre-existing data will remain uncompressed.
As you can see, the automatic creation of a compression dictionary is controlled, in part, by a table's compression attribute. To prevent ADC behavior, don't enable a table for compression until you're ready to manually build a compression dictionary and compress the data. On the other hand, if you elect to take advantage of ADC, remember that the compression ratio for the dictionary produced may not be as optimal as one created by an offline table reorganization. Also, because the table will remain online while the compression dictionary is built, the transaction that causes ADC to be initiated will experience a slight negative impact on performance when the threshold is crossed and ADC is triggered.
It is important to note that dictionaries built with ADC might, over time, yield lower storage space savings than those built using a classic table reorganization. Also, over time, the table-level dictionary for a table whose data is frequently updated might no longer contain the most efficient patterns for the changing data, leading to a degradation of the compression ratio. In such cases, regular classic table reorganizations may be required to maintain consistently high storage space savings.
Identifying candidate tables for row compression
If you are not already using row compression, you might want to examine your database to determine which tables might benefit from having their data compressed. Because data compression helps you initially save storage on existing uncompressed tables and optimize storage growth in the future, you can find your storage pain points by examining existing tables that contain a significant amount of data or tables for which you anticipate substantial growth over time.
Naturally, the largest tables are obvious candidates for compression, but don't overlook smaller tables. If you have hundreds or thousands of small tables, you may benefit from the aggregate effect of compression. Of course, "Large" and "small" are relative terms; your database design determines whether tables of a million or several million rows are considered large or small.
The following query uses the SQL administration function
ADMIN_GET_TAB_INFO() to return an ordered list
of all table names and table data object sizes for a particular schema. To
display the current compression settings and row compression modes in use,
the result set produced has been joined with the
SYSCAT.TABLES view, and data in the
ROWCOMPMODE columns has been returned.
Listing 3. Query to look for candidate tables for compression
SELECT SUBSTR(T.TABSCHEMA, 1, 10) AS TABSCHEMA, SUBSTR(T.TABNAME, 1, 10) AS TABNAME, SUM(TI.DATA_OBJECT_P_SIZE)/1024/1024 AS STORAGESIZE_GB, T.COMPRESSION AS COMPRESSION, T.ROWCOMPMODE AS ROWCOMPMODE FROM TABLE (SYSPROC.ADMIN_GET_TAB_INFO('DB2INST1', '')) TI JOIN SYSCAT.TABLES T ON T.TABSCHEMA = TI.TABSCHEMA AND T.TABNAME = TI.TABNAME GROUP BY T.TABSCHEMA, T.TABNAME, T.COMPRESSION, T.ROWCOMPMODE ORDER BY STORAGESIZE_GB DESC
This query identifies the top space consumers in your current database and gives you a list of candidate tables you should start working with. In many practical scenarios, the bulk of the storage space of a database is occupied by relatively few tables.
After you determine candidate tables based on their storage consumption, consider the typical SQL activity against the data in those tables:
- Tables that are read only are excellent candidates for row compression. Tables with a read/write ratio of 70 percent or more reads and 30 percent or less writes are good candidates for row compression.
- Tables that experience only a limited number of updates are likely to be good candidates for row compression.
- Tables that undergo heavy updates might not be as good candidates for row compression.
- Large tables accessed mostly through table scans rather than index scans are good candidates for compression. This generally includes large fact tables in data warehouses, where a significant number of queries perform heavy aggregations. The I/O savings and the increased buffer pool utilization that result from applying compression are likely to improve query performance for such tables.
Whether it is better to use classic or adaptive row compression depends less on the data access patterns than on the actual compression savings that you can achieve by using either type of compression. You make the decision about which row compression mode to choose later in the process.
Row compression performs best in I/O or memory-bound environments where the workload is not bottlenecked on the CPU. Extra CPU cycles are required to perform row compression and expansion of data rows whenever they are accessed or modified. This overhead can be offset by efficiencies that are gained in doing fewer I/O operations. Row compression works very well with decision-support workloads comprising complex analytic queries that perform massive aggregations where row access is mostly sequential and less random.
Estimating storage savings from using row compression
After you have created a list of candidate tables for compression, based on storage consumption and data access characteristics, the next step is to determine the storage space savings you can expect to achieve if those tables are indeed compressed. You can estimate the space savings for any table before you enable it for row compression. Furthermore, you can estimate storage space savings even if you do not have a license for the DB2 Storage Optimization Feature.
As with the compression functionality in general, the mechanisms for compression estimation have evolved over time, with the primary goal being to provide faster and simpler ways to perform this task. The mechanism of choice depends on which version of the DB2 software you are using. All the tools and functions available in older releases are also available in DB2 10.1. However, you might find that the improved counterparts in DB2 10.1 are easier to use or faster to respond with the data you are looking for.
Estimating row compression savings in DB2 10.1
In DB2 10.1, the preferred way to estimate compression ratios is
by using the
administrative function. You can use this to estimate compression
savings for a particular table, to estimate compression savings for all
tables in a particular schema, or to estimate compression savings for all
tables in a database. In addition, this function calculates current
compression savings, along with savings projections, for classic and
adaptive row compression.
The syntax for the
table function is
(TableSchema, TableName) where:
- TableSchema identifies by name the schema where the table that compression information is to be obtained for resides. (If this parameter contains an empty or NULL value, all schemas are assumed.)
- TableName identifies by name the table that compression information is to be obtained for. (If this parameter contains an empty or NULL value, all table names are assumed.)
If you do not specify a table name, this function computes estimates for all tables in a particular schema; if you do not specify a table name or a schema, it computes estimates for all tables in a database. (If your schema or database contains hundreds or thousands of tables, the processing time might be substantial; in this case, try restrict your queries to compute only estimates for those tables for which you are considering compression.)
The structure of the table that is returned by the
ADMIN_GET_TAB_COMPRESS_INFO() function can be
seen in Table 1.
Table 1. Table returned by the ADMIN_GET_TAB_COMPRESS_INFO() function
|Column Name||Data Type||Description|
|DBPARTITIONNUM||SMALLINT||Database partition number|
|OBJECT_TYPE||VARCHAR(4)||The type of object for which compression information is being
reported, which can be one of the following: |
|ROWCOMPMODE||CHAR(1)||The current row compression mode for the object, which can be one
of the following: |
|PCTPAGESSAVED_CURRENT||SMALLINT||Current percentage of pages saved from row compression|
|AVGROWSIZE_CURRENT||SMALLINT||Current average record length|
|PCTPAGESSAVED_STATIC||SMALLINT||Estimated percentage of pages saved from classic row compression|
|AVGROWSIZE_STATIC||SMALLINT||Estimated average record length from classic row compression|
|PCTPAGESSAVED_ADAPTIVE||SMALLINT||Estimated percentage of pages saved from adaptive row compression|
|AVGROWSIZE_ADAPTIVE||SMALLINT||Estimated average record length from adaptive row compression|
Thus, if you wanted to get an estimate of how much storage space will be saved if the data in every table in a schema named DB2INST1 is compressed, you could do so by executing a query that looks like Listing 4.
Listing 4. Query to estimate storage savings
SELECT SUBSTR(TABNAME,1,10) AS TABNAME, PCTPAGESSAVED_CURRENT, PCTPAGESSAVED_STATIC, PCTPAGESSAVED_ADAPTIVE FROM TABLE(SYSPROC.ADMIN_GET_TAB_COMPRESS_INFO('DB2INST1', ''))
Using the result of this query, you can easily determine whether to enable row compression for a particular table and which row compression mode you should use — for example, if the previous query returned a result set that looked like Listing 5.
Listing 5. Sample results
TABNAME PCTPAGESSAVED_CURRENT PCTPAGESSAVED_STATIC PCTPAGESSAVED_ADAPTIVE ---------- --------------------- -------------------- ---------------------- ACCTCR 0 68 72 BKPF 0 83 90 BSIS 0 82 90 CDCLS 0 11 17 CDHDR 0 70 73 COSP 0 87 91 6 record(s) selected.
Of the six tables for which compression estimates were computed, five show very good compression potential. The CDCLS table is the only table that may not benefit from compression. And in every case, it appears that adaptive row compression will yield better results than classic row compression.
Calculating the differences between the values of the PCTPAGESSAVED_STATIC and PCTPAGESSAVED_ADAPTIVE columns helps you determine the best choice for the row compression mode. However, be aware that the scale of the value of the PCTPAGESSAVED_CURRENT column is not linear. For example, consider the ACCTCR and COSP tables shown in the previous example. For the ACCTCR table, you can expect 68-percent space savings with classic row compression and 72-percent space savings with adaptive row compression. For the COSP table, the estimated savings are 87 percent with classic row compression and 91 percent with adaptive row compression. Although the absolute difference between both types is 4 percent for either table, the relative savings that adaptive row compression can achieve over classic row compression are different. Assume that each of the tables is 100 GB in size when compression is not enabled. For the ACCTCR table, the estimated sizes are 32 GB with classic row compression and 28 GB with adaptive row compression, which constitutes a difference of approximately 12.5 percent. However, the estimated sizes for the COSP table are 13 GB with classic row compression and 9 GB with adaptive row compression, which is a difference of approximately 30 percent.
In the previous example, the COSP table is a good candidate for adaptive row compression, but the ACCTCR table might not be a good candidate. Also, the estimated values indicate that adaptive compression would reduce the storage size for table CDHDR by only another 10 percent over classic row compression. Consequently, you might not want to enable adaptive compression for the CDHDR and ACCTCR tables unless you expect the data characteristics to change significantly or you expect lots of new data to be inserted. For the rest of the tables, adaptive compression might be the better choice because of the significant additional storage space savings of up to 45 percent.
Estimating row compression savings in DB2 9.7 and 9.5
In DB2 9.7 and 9.5, administrative functions that are very similar
that comes with DB2 10.1 are available, although they have
slightly different names and signatures. In DB2 9.7, the function
and in DB2 9.5, its name is
ADMIN_GET_TAB_COMPRESS_INFO(). Unlike the DB2
10.1 function that takes two input parameters, the DB2 9.5
and 9.7 functions take three. The third parameter is the execution
mode, and for compression estimation, you pass the string
'ESTIMATE' as the value for this parameter. The structure of the
tables returned by the DB2 9.5 and 9.7 functions also
differ significantly from the structure of the table produced by the DB2
10.1 function; the structure of the table returned by the
DB2 9.5 and 9.7 functions can be seen in Table 2.
Table 2. Table returned by the DB2 9.5 and 9.7 functions
|Column name||Data type||Description|
|DBPARTITIONNUM||SMALLINT||Database partition number|
|DATA_PARTITION_ID||INTEGER||Data partition number|
|COMPRESS_ATTR||CHAR(1)||The state of the COMPRESS attribute on the table,
which can be one of the following: |
|DICT_BUILDER||VARCHAR(30)||Code path taken to build the compression dictionary, which can be
one of the following: |
|DICT_BUILD_TIMESTAMP||TIMESTAMP||Date and time the compression dictionary was built; if no dictionary is available, the timestamp is NULL|
|COMPRESS_DICT_SIZE||BIGINT||Size of compression dictionary, measured in bytes|
|EXPAND_DICT_SIZE||BIGINT||Size of expansion dictionary, measured in bytes|
|ROWS_SAMPLED||INTEGER||Number of records that contributed to building the dictionary; migrated tables with compression dictionaries will return NULL in this column|
|PAGES_SAVED_PERCENT||SMALLINT||Percentage of pages saved from compression. This information is based on the record data in the sample buffer only. Migrated tables with compression dictionaries will return NULL in this column.|
|BYTES_SAVED_PERCENT||SMALLINT||Percentage of bytes saved from compression. This information is based on the record data in the sample buffer only. Migrated tables with compression dictionaries will return NULL in this column.|
|AVG_COMPRESS_REC_LENGTH||SMALLINT||Average compressed record length of the records contributing to building the dictionary. Migrated tables with compression dictionaries will return NULL in this column.|
Furthermore, the processing time for the administrative functions might be significantly longer than in DB2 10.1 because a full table scan is performed for each table to compute the compression estimates.
Estimating row compression savings in DB2 9
DB2 9, the first DB2 release to support row compression, does not
provide any of the administrative functions introduced in later
releases. However, if you are using this release, you can still determine
which tables will benefit the most from row compression by using the
Inspect utility. Invoked by executing the
INSPECT command with the
ROWCOMPESTIMATE option specified, the Inspect
utility will examine each row in a particular table, build a compression
dictionary from the data found, and use this dictionary to estimate
how much space will be saved if the data in the table is compressed.
For example, to estimate how much storage space will be saved if the data
in a table named EMPLOYEE is compressed, you would execute an
INSPECT command like this:
INSPECT ROWCOMPESTIMATE TABLE NAME employee RESULTS employee.rowcompestimate.out
Running this with the
specified will cause the Inspect utility to save the output produced in a
file that resides in the diagnostic data directory. However, before you
can examine the information stored in this file, you must format it using
The formatted file produced will contain information about the estimated compression savings that, when viewed, will look something like Listing 6.
Listing 6. Estimating row compression savings
DATABASE: TEST VERSION : SQL10010 2011-11-07-184.108.40.2065355 Action: ROWCOMPESTIMATE TABLE Schema name: DB2INST1 Table name: EMPLOYEE Tablespace ID: 4 Object ID: 8 Result file name: employee.rowcompestimate.out Table phase start [...] Data phase start. Object: 8 Tablespace: 4 Row compression estimate results: Percentage of pages saved from compression: 68 Percentage of bytes saved from compression: 68 Compression dictionary size: 32640 bytes. Expansion dictionary size: 32768 bytes. Data phase end. Table phase end. Processing has completed. 2011-11-07-18.40.08.349345
If a table is enabled for deep compression (i.e., the
COMPRESS attribute is set to
YES) before the
INSPECT command is executed, the compression
dictionary that is built and used to estimate space savings will be
written to the table, at the end of the existing data — provided a
compression dictionary doesn't already exist. (Otherwise, the
compression dictionary created will be destroyed.) Figure 6 shows a table
before and after the Inspect utility was used to estimate storage savings
if it was enabled for deep compression before the estimate was
Figure 6. How data in a table is altered when a table enabled for deep compression is evaluated by the Inspect utility
Regardless of the method used to create it, once a compression dictionary has been constructed and written to a table, new records added to that table will automatically be compressed. If the compression dictionary was created by the Inspect utility, pre-existing records in the table will remain uncompressed until an offline table reorganization operation is performed or the pre-existing records are updated (in which case each record modified will be compressed).
In later DB2 releases, you should not use the Inspect utility to build a table-level dictionary but should rely on ADC instead.
You can apply compression to indices if you are using DB2 9.7 or later and have a license for the DB2 Storage Optimization Feature. Compressing indices results in fewer leaf pages, which helps reduce the depth of the index tree, in many cases. This allows for fewer index page accesses to find a specific key and, as in the case of row compression, allows for a higher utilization of buffer pools and requires fewer physical I/O operations. In many practical cases, index compression can significantly improve query performance.
You can enable index compression for each index individually. When you create an index, the index inherits its compression setting from the underlying table. That is, all indices you create on tables enabled for classic or adaptive row compression are also compressed. Therefore, you may find it beneficial to create tables with compression enabled or alter tables to enable row compression before you create any indices on them.
Index compression uses a combination of three techniques to reduce the amount of data stored on disk. Space savings are achieved by dynamically adjusting the number of index keys, rather than by reserving space for the highest possible number of keys that can be stored in a page. The index entries, each of which consists of an index key and a RID list, are compressed by eliminating redundant prefixes from the keys in a single page. For duplicate keys, the associated RIDs in each list are compressed by applying a delta encoding.
Like adaptive row compression and unlike classic row compression, index compression is fully automatic. Optimal compression savings are maintained over time, and there is no need to monitor and potentially reorganize indexes to improve compression ratios.
Identifying candidate indexes for index compression
The best candidates for compression are indices with long keys (for example, multicolumn indexes or indexes on character data) and indices that contain many duplicate keys (for example, ones based on a single or a limited number of low-cardinality columns). In addition, indices that involve variable-length columns (for example, VARCHAR columns) might benefit from the better space utilization that dynamic index page space management offers.
Multicolumn indexes generally compress better if the leading key columns have low to medium cardinality, and the higher-cardinality or unique columns are toward the end of the key column list. Depending on the characteristics of the queries you expect your indices to support, column reordering might be possible without sacrificing query performance. It may also be worthwhile to check the order and cardinalities for the columns of your larger indices, especially in data warehousing scenarios. In these cases, the majority of queries aggregate values over a larger number of qualifying rows, and query characteristics are generally more predictable than in OLTP workloads.
Examples of indices that might benefit less from compression are those on a single unique numeric column or unique multicolumn ones that have the highest-cardinality column as the leading index column.
Estimating storage space savings for index compression
When deciding on which indices to compress, estimation of storage space savings plays a bigger role than it does for tables. The computational overhead associated with index compression is less than for row compression, and storage space savings generally correspond more directly to overall performance improvements. If you can achieve significant space savings for any of your indices in an OLTP workload, it is a good idea to compress them.
Similar to the
can be used to estimate storage space savings for index
compression. This can be used to estimate compression savings for
a single index, for all indices on a particular table, for all indices in
a particular schema, or for all indices in a database; the execution mode
is controlled by the first three input parameters. The syntax for the
ADMIN_GET_INDEX_COMPRESS_INFO() table function
ADMIN_GET_INDEX_COMPRESS_INFO (ObjectType, ObjectSchema, ObjectName,
- ObjectType identifies the type of object that compression
savings information is to be obtained for. This parameter must be
assigned one of the following case-sensitive values:
- 'T', NULL, or an empty string to indicate a table
- 'I' for an index
- ObjectSchema identifies by name the schema where the object resides.
- ObjectName identifies by name the object that compression information is to be obtained for resides.
- Member identifies a database member number. (When a value is provided for this parameter, information is only returned for indices that reside on the specified database member.)
- DataPartitionID identifies the data partition ID. (When a value is provided for this parameter, information is only returned for index partitions that have been defined on the data partition specified.)
If you want to estimate space savings for a single index or for all indices
in a particular schema, specify an object type of
'I'. To estimate space savings for all indices on a particular table or on
all tables in a schema or database, specify an object type of
'T', NULL, or an empty string. Depending on the value of the
object type parameter, you can use the object name and object schema
parameters to identify the table or index to analyze.
Alternatively, you can leave the values of the object name and object
schema parameters as NULL or empty strings to estimate compression savings
for a range of indices.
The structure of the table returned by the
ADMIN_GET_INDEX_COMPRESS_INFO() function is
shown in Table 3.
Table 3. Table returned by the
|Column name||Data type||Description|
|INDSCHEMA||VARCHAR(128)||Index schema name|
|TABSCHEMA||VARCHAR(128)||Table schema name|
|DBPARTITIONNUM||SMALLINT||Database partition number|
|DATAPARTITIONID||INTEGER||Data partition ID|
|COMPRESS_ATTR||CHAR(1)||The state of the |
|INDEX_COMPRESSED||CHAR(1)||The physical index format, which can be one of the following:
|PCT_PAGES_SAVED||SMALLINT||Percentage of pages saved from compression. If the index is not
physically compressed (|
|NUM_LEAF_PAGES_SAVED||SMALLINT||The number of leaf pages saved. If the index is not physically
Thus, if you wanted to get an estimate of how much storage space will be saved if the data in every index on a table named EMPLOYEE (in a schema named DB2INST1) is compressed, you could do so by executing a query that looks like Listing 7.
Listing 7. Example query to obtain index compression
SELECT SUBSTR(INDNAME, 1, 20) AS INDNAME, COMPRESS_ATTR, PCT_PAGES_SAVED FROM TABLE(SYSPROC.ADMIN_GET_INDEX_COMPRESS_INFO('T', 'DB2INST1', 'EMPLOYEE', NULL, NULL))
Using the result of this query, you can easily determine whether to enable compression for a particular index. For example, if the previous query returned a result set that looked like Listing 8:
Listing 8. Query results for index compression
INDNAME COMPRESS_ATTR PCT_PAGES_SAVED -------------------- ------------- --------------- EMPLOYEE~0 N 46 EMPLOYEE~1 N 57 EMPLOYEE~2 N 71 EMPLOYEE~3 N 71 EMPLOYEE~4 N 45 EMPLOYEE~5 N 71 EMPLOYEE~6 N 70 EMPLOYEE~BUT N 71 8 record(s) selected.
it would be easy to see that most of the indexes on the EMPLOYEE table compress very well, with as much as a 70-percent savings in storage space.
If you use the
function to analyze indices already compressed, the compression
savings returned reflect the actual savings gleaned, rather than
estimates. Alternatively, you can obtain information about the current
savings realized by index compression by examining the
PCTPAGESSAVED column of the SYSCAT.TABLES catalog view. (The RUNSTATS
command maintains the value of this column.)
Compression of temporary tables
Starting with DB2 9.7, if you have a license for the DB2 Storage Optimization Feature, you can apply compression to temporary tables. Unlike with row and index compression, you do not have to enable temporary tables for compression. Compression is done automatically and applies to user-defined and system temporary tables. (User-defined global temporary tables come in two variants: created global temporary tables (CGTTs) and declared global temporary tables (DGTTs). System temporary tables are used by some utilities and maintenance operations, such as table reorganization and data redistribution. During query processing, the DB2 database manager may also create and use temporary tables for operations that must accumulate intermediate results, such as sorts, hash joins, or table queues.)
The mechanism used to compress temporary tables is the same as that used for classic row compression with ADC, although the runtime behavior is slightly different from permanent tables. Most temporary tables, especially small ones, do not cause any physical I/O. Thus, the threshold at which the compression dictionary is built is 100 MB instead of 2 MB. The higher threshold ensures that small temporary tables that typically remain fully buffered are not compressed, but that larger temporary tables that might spill to disk contain compressed data. In addition to avoiding physical I/O, compression for temporary tables ensures that large temporary tables use buffer pools more efficiently, which helps to further avoid physical disk I/O.
After you determine which set of tables and indices to compress, the next step is to enable these tables and indices for compression. Changing the compression settings for existing tables already populated helps slow their growth rate, and data is being inserted benefits from the new compression settings. If your goal is to reduce the footprint of your existing data, perhaps because you want to reduce the physical space that is utilized by your database, you must apply compression to this data.
You can use the following strategies to help apply compression to large amounts of existing data, free up disk space currently consumed by that data, and release this space to the file system. You will find this information most useful when you first implement row or index compression or when you are upgrading to DB2 10.1 from earlier releases of DB2.
Applying compression to existing data
The most straightforward way to apply compression to existing data is by
performing a classic table reorganization. If you changed the settings for
row or value compression, execute the
REORG TABLE command with the
RESETDICTIONARY specified. If some of your
tables have XML columns, you should specify the
LONGLOBDATA clause as well to compress the
non-inlined XML documents. If you changed the compression settings only
for indexes, the
REORG INDEXES ALL command is
sufficient to compress the index data. A full reorganization of the
underlying table's data is not necessary. You might consider
performing an index reorganization, rather than a full reorganization of
all table data if, for example, you are upgrading from DB2 9.5 to
DB2 10.1 and already have row compression enabled.
If you cannot take your tables offline, consider using the
ADMIN_MOVE_TABLE() stored procedure, available
since DB2 9.7, as well. You can use this procedure
to move data stored in an active table into a new data table object with
the same name, that may or may not reside in the same storage location.
ADMIN_MOVE_TABLE() procedure provides a way
to effectively reorganize a table in the same way that a classic table
reorganization would, but without requiring any downtime — the data
remains online and accessible for the duration of the move.
Table-level compression dictionaries that are built (or rebuilt) by the
ADMIN_MOVE_TABLE() procedure are of the same
quality as those constructed as during a table reorganization. To
build or rebuild a compression dictionary, the
ADMIN_MOVE_TABLE() procedure performs the
- Collects a Bernoulli sample of all the rows in the table.
- Builds a new table-level compression dictionary from that sample.
- Inserts the dictionary into the target table before starting the copy phase, so rows are compressed when inserted into the target table during the copy phase.
The following example shows how you initiate an online table move operation in the most straightforward way.
CALL SYSPROC.ADMIN_MOVE_TABLE('DB2INST1','ACCTCR', '','','','','','','','','MOVE')
You can use the
procedure to exercise fine-grain control over parameters and the
behavior of an online table move operation, such as determining the size
of the sample used for dictionary building. However, even if you
invoke the function in the standard
'MOVE' execution mode, the
compression dictionary that is built is optimal.
Unlike for classic table reorganization, the default behavior for the
ADMIN_MOVE_TABLE() procedure is to rebuild the
dictionary. Because only a small random sample of the table data is used,
rebuilding the dictionary typically does not cause an excessive
performance overhead compared to not rebuilding the dictionary.
You might find the
particularly useful when you upgrade your database from an earlier DB2
release. You can use the procedure to move tables between table spaces and
change other parameters, such as column data types, at the same time.
Managing disk space consumption
Applying compression to existing tables reduces the number of pages allocated for those tables. However, applying compression does not immediately affect the disk space consumption of DMS or automatic storage table spaces. Those table spaces merely have a larger number of unused extents available to them.
When you first implement row and index compression, it is a recommended practice to reevaluate the utilization of your table space containers and the projected growth rate of your data. With compression enabled, table space utilization can be significantly reduced — it might be a long time before your database grows enough to consume the space initially freed up. Consequently, you might want to reduce table space container sizes and make more disk space available to other consumers.
To effectively reduce the on-disk footprint of a database, you must reduce the size of the table space containers used by the database. And the key factor that determines how much you can reduce the size of table space containers by is the high water mark. (The high-water mark represents the highest allocated page in a table space; it is a DMS-only concept, which also includes automatic storage table spaces.)
The high-water mark can increase when you create tables, extend existing tables, or perform a classic table reorganization that does not use a temporary table space. You can reduce the high-water mark by dropping or truncating the table that owns the extent at the high-water mark. However, dropping, truncating, or reorganizing other objects does not affect the high-water mark, and merely results in creating more unused extents below the high-water mark. Those unused extents below the high-water mark can be reused only when you create new objects or when objects grow. Only unused extents above the high-water mark can be reclaimed and returned to the operating system. Consequently, the key to reducing the on-disk footprint is to lower the high-water mark and truncate the table space containers. How you perform this task is dependant upon what kind of table spaces you are using, which DB2 release you used to create these table spaces, and which DB2 release you are using. (Techniques for lowering the high-water mark and freeing up space are presented in the Returning unused space to the file system section of this article.)
Verifying row compression and avoiding common pitfalls
As mentioned, you can obtain the current compression settings
for tables and indices by querying the system catalog. In the
SYSCAT.TABLES view, the two columns COMPRESSION and ROWCOMPMODE contain
information about the current compression settings for each table. The
COMPRESSION column in the SYSCAT.INDEXES view provides information about
the compression settings for each index. The
RUNSTATS command populates all these
Other columns in the SYSCAT.TABLES view provide deeper insight into how row compression performs:
- The most important metric is found in the PCTPAGESSAVED column. This indicates how much space you are currently saving on a table by means of row compression.
- From the PCTROWSCOMPRESSED column, you can determine how many rows in the table are compressed.
- The AVGCOMPRESSEDROWSIZE column provides the average size of all the compressed rows. In most practical scenarios, the percentage of compressed rows should be near 100 percent. In this case, the average size of compressed rows is identical to the average row size that is maintained in the AVGROWSIZE column.
If the percentage of compressed rows in a table is significantly lower than 100 percent, this situation might be due to one of a few common issues:
- Your table might be in a regular table space, and compression might
have cause the average row size to be very short. For a table in a
regular table space, the effective minimal row size is limited by the
fact that each page can hold at most 255 rows. If your average row
size is near or below that ratio, consider converting the regular
table space to a large table space. To do so, execute the
ALTER TABLESPACEstatement with the
CONVERT TO LARGEoption specified. Or use the
ADMIN_MOVE_TABLE()procedure to move your table to a large table space.
- If you enable row compression on an existing table with data in it,
the table-level dictionary is built automatically by means of ADC when
the table grows. Existing data remains in its uncompressed form until
you perform a classic table reorganization or you use the
- When you create a table with classic row compression enabled and start populating the table, the table-level dictionary is built automatically when the table size reaches the ADC threshold of 2 MB. However, the rows in the very beginning of the table remain uncompressed. If your table is only a few MB in size, this initial chunk of uncompressed rows can account for a significant fraction of the total rows that are stored in the table.
If none of the previous situations apply, determine whether the table contains random data. For example, a table might have a large CHAR FOR BIT DATA or inlined LOB column that contains binary data compressed in some form by the application layer. In such cases, row compression might not be able to compress the data any further and should not be enabled on that table.
Pay special attention when you apply row compression to MDC tables. Cell size and data density are important considerations for the physical design of MDC tables and play an important role in choosing appropriate dimensions. In some cases, clustering dimensions are chosen so that cells use a very small number of pages. Applying row compression reduces the size of each record and the average size of the cells. If row compression causes the average cell size to be near or below the size of a single block, you might not see storage space savings reach their full potential. This is because space allocation in MDC tables is done blockwise. Partially used blocks consume a full extent on disk but can contain a significant portion of free space. If you know that you will be using row compression, remember to factor in the compression ratio when you do the calculations for the cell size projection. If you decide to apply row compression on existing MDC tables, revalidate the cell size calculations. You might find it necessary to coarsify some of the dimensions or reduce the extent size of your table spaces.
Verify that index compression is enabled for all the indices you
considered for compression. Pay special attention to the system-generated
indices — for example indexes that are implicitly created for primary key
and unique columns. If you create a table without enabling row compression
and subsequently use the
ALTER TABLE statement
to turn row compression on, existing indices created before the
table was enabled for compression will remain uncompressed; you must
explicitly enable index compression for those indices by executing the
ALTER INDEX statement. For those indices that
are enabled for compression, the PCTPAGESSAVED column provides information
about the storage space savings.
Returning unused space to the file system
After reducing the amount of storage space needed to house a table's
data with deep compression, you may wish to resize the table space where
the table resides so any storage space reclaimed can be used
somewhere else. If SMS table spaces are used, any storage space reclaimed
will be returned to the file system automatically as part of the table
reorganization process. However, that's not the case when DMS table
spaces are used. Instead, DMS table spaces (and automatic storage table
spaces, which are, by default DMS table spaces) can be resized by
executing an appropriate form of the
ALTER TABLESPACE SQL statement. For example, to
reduce by 200 MB the amount of storage space allocated for each container
being used by a table space named TBSP1, you would execute an
ALTER TABLESPACE statement like this one:
ALTER TABLESPACE tbsp1 REDUCE (ALL CONTAINERS 200 M).
You could also reclaim space by dropping one or more storage containers as opposed to resizing all of them. However, the amount of storage space that can actually be returned to the file system for redistribution is dependant upon the location of the high-water mark of the table space being resized.
How the high-water mark affects space reallocation
Whenever a row of data is stored in a table, DB2 assigns that row a unique record identifier, known as a RID. Prior to DB2 9, RIDs consisted of a 3-byte page number and a 1-byte slot number. The slot number is an array entry into a slot directory, which in turn contains the offset into the data page where the row's data physically resides; the page number identifies the data page itself. DB2 9 and later has the ability to support 4-byte RIDs and 6-byte RIDs that consist of a 4-byte page number and a 2-byte slot number (tables that reside in a DMS LARGE table space use 6-byte RIDs). As a result, you can now have data pages that contain more than 255 rows (the old limit with 4-byte RIDs) and a single table partition can grow to 2 TB in size when a 4K page is used (16 TB when a 32K page is used). In DB2 9 and later, 6-byte RIDs are used by default for any new automatic storage table spaces and DMS table spaces created.
DB2 indices use table space-relative RIDs to reference base table records. (An index is an ordered set of pointers that refer to rows in a base table.) This means that every key in an index points to slot X in page Y in a table space, not to slot X in page Y of a base table. As a result, DB2 can't freely move extents around in a table space. If it did, RIDs stored in indices would point to the wrong data, and every index that referenced data in the table space would have to be rebuilt any time a move occurred.
Using table space-relative RIDs improves performance because the extra step of figuring out where a particular page of a table resides in a table space, which would have to be done by scanning the table's extent map pages, is eliminated. Because DB2 can't alter the placement of extents in a table space, only unused storage space that comes after the high-water mark can be freed.
How you perform the offline table reorganization operation required to compress a table's data can have a significant effect on a table space's high-water mark. If you use a temporary table space to reorganize and compress a table's data, the resulting version of the table is constructed in the temporary table space, its data pages are copied over the original table object in the original table space, and the storage space used by the table is truncated. Figure 7 shows how a table space containing four tables (named TABLE_1, TABLE_2, TABLE_3, and TABLE_4) might look before and after all four tables in it have been reorganized and compressed using a temporary table space.
Figure 7. How a table space's high-water mark is affected by table reorganization operations that use a temporary table space
Once the data in all four tables has been compressed, more free space is available; however, that free space is scattered throughout the table space and the high-water mark hasn't been lowered by any significant amount.
If a temporary table space isn't used to reorganize and compress a table's data, the new version of the table is created in the same table space as the original table, and the original table is eventually removed. Because of this behavior, the order in which tables are reorganized and compressed, as well as where these tables are laid out in the table space (and unfortunately, there is no easy way to determine this), will have a huge impact on table layout within the table space and the placement of the table space's high-water mark. Figure 8 shows how our original table space containing four tables (TABLE_1, TABLE_2, TABLE_3, and TABLE_4) might look before and after all four tables in it have been reorganized and compressed (without using a temporary table space) in the following order: TABLE_1, TABLE_2, TABLE_3, and TABLE_4.
Figure 8. How a table space's high-water mark is affected by table reorganization operations that do not use a temporary table space
In this case, we end up with a high-water mark higher than it was when we began the reorganization and compression operations. However, if we were to reorganize and compress table TABLE_1 again, the table would be moved to the beginning of the free space in the table space and the high-water mark would be lowered significantly. Figure 9 shows how the table space would look after a second reorganization and compression of TABLE_1 is performed.
Figure 9. How a table space's high-water mark is lowered by performing a second reorganization/compression operation on table TABLE_1
Note that the examples in Figures 7, 8, and 9 are based on the assumption that TABLE_1, TABLE_2, TABLE_3, and TABLE_4 were created and populated one at a time and that the table space TBSP1 has been fully compacted. In reality, this example represents the worst-case scenario. When more than one table exists in a single table space, it is more common to have extents from different tables interleaved throughout the table space. Assessing the degree of extent fragmentation that exists becomes more complex, but is probably less drastic than depicted in these illustrations. Therefore, the actual high-water mark impact made will most likely be significantly less.
When you start adding things like index objects and large objects (LOBs) to
the same table space, things get even more confusing, and it's
difficult to predict exactly what will happen. You can use the command
db2dart /DHWM to show the extent layout for all
tables in a table space and the command
db2dart /LHWM to interpret this layout and
generate a set of steps you can follow to lower the high-water mark.
However, the only way to guarantee that things are laid out as compactly
as possible is to rebuild the entire table space from scratch by unloading
and reloading its tables. Then, and only then, will you be able to return
all of the storage space reclaimed by compression to the operating system
Reclaimable storage in DB2 9.7 and later
Non-temporary DMS and automatic storage table spaces you create in DB2 9.7 or later support reclaimable storage. For those table spaces, the database manager can change the physical location of extents in the containers. This mechanism, called extent movement, can be used to consolidate unused space in table space containers to the physical ends of the containers. After consolidation, the high-water mark can then be lowered, and the table space containers can subsequently be shrunk to the smallest possible size. Figure 10 shows how a table space might look before and after extent movement is used to reclaim unused storage in an automatic storage table space.
Figure 10. How an automatic storage table space's high-water mark is lowered via extent movement
With reclaimable storage, you can use the
ALTER TABLESPACE statement to lower the
high-water mark and shrink table space containers. The process of lowering the
high-water mark involves extent movement and can take some time.
After the high-water mark is lowered, containers can be reduced in size by
returning unused extents above the high-water mark to the file system.
To determine how much the size of a DMS or automatic storage table space
can be reduced, execute the
MON_GET_TABLESPACE() table function, as shown
Listing 9. Query to estimate reduction of DMS or automatic storage space
SELECT SUBSTR(TBSP_NAME, 1, 15) AS TBSP_NAME, TBSP_FREE_PAGES, TBSP_PAGE_SIZE * TBSP_FREE_PAGES / 1024 / 1024 AS TBSP_RECLAIMABLE_MB FROM TABLE(MON_GET_TABLESPACE('TBSP_USR_D_1', NULL))
The following sample result set indicates that the table space TBSP_USR_D_1 can be reduced by about 23 million pages, worth 365 MB in the on-disk footprint.
Listing 10. Query results
TBSP_NAME TBSP_FREE_PAGES TBSP_RECLAIMABLE_MB --------------- -------------------- -------------------- TBSP_USR_D_1 23369760 365150 1 record(s) selected.
You can easily shrink automatic storage table spaces to their most compact
format by using the
ALTER TABLESPACE statement
REDUCE MAX clause:
ALTER TABLESPACE tbsp_usr_d_1 REDUCE MAX.
For DMS table spaces, lowering the high-water mark and reducing containers is a two-step process, as shown below:
ALTER TABLESPACE ts_dat_d LOWER HIGH WATER MARK; ALTER TABLESPACE ts_dat_d REDUCE (ALL CONTAINERS 100 G);
When you upgrade your database from an older DB2 release to DB2 9.7 or later, consider taking the following steps to get the benefits of reclaimable storage:
- Create new table spaces for all your automatic storage or DMS table spaces.
- Enable row and index compression on the tables.
ADMIN_MOVE_TABLE()to move your tables from the old to the new table spaces.
You should enable row and index compression before moving the tables, so they get compressed as part of the move process.
Compression and the Load utility
Another significant change introduced in DB2 9.5 is how load
operations behave when performed against tables enabled for deep
compression. In DB2 9, if a compression dictionary had
been created for a table, the Load utility would use that dictionary to
compress data as it was being loaded. However, if no compression
dictionary was present, the Load utility would not build one as part of
the load operation. Beginning with DB2 9.5, the Load utility can construct
a compression dictionary provided the table being loaded has been enabled
for compression and that a
operation is performed. Such an operation is initiated by executing the
LOAD command with either the
REPLACE KEEPDICTIONARY or the
REPLACE RESETDICTIONARY option specified. (A
LOAD INSERT operation can also result in the
creation of a compression dictionary if the table being loaded has been
configured for compression and the amount of data coming in triggers
LOAD command is executed with the
REPLACE KEEPDICTIONARY or the
REPLACE RESETDICTIONARY option specified, and a
compression dictionary does not exist, a new dictionary will be created.
KEEPDICTIONARY option is used instead,
the amount of data that will be required to build the compression
dictionary is subject to the policies of ADC. Therefore, some of the data
will be stored in the table uncompressed. Once the dictionary is created,
the remaining data that is loaded will be compressed using the new
compression dictionary. On the other hand, if the
RESETDICTIONARY option is specified, the amount
of data required to build the dictionary is not subject to the policies of
ADC and a compression dictionary can be built after loading just one
LOAD command is executed with the
REPLACE KEEPDICTIONARY or the
REPLACE RESETDICTIONARY option specified, and a
compression dictionary already exists, the existing dictionary will either
be recreated (
RESETDICTIONARY) or left as it is
KEEPDICTIONARY) and data in the table will be
compressed using the existing or new dictionary.
To create a new compression dictionary for a
compression-enabled table named EMPLOYEE while performing a load
operation, you would execute a command similar to this:
LOAD FROM datafile.del OF DEL REPLACE RESETDICTIONARY INTO
When this command is executed, assuming no compression dictionary exists for the EMPLOYEE table, a few records found in the file DATAFILE.DEL will be loaded into the EMPLOYEE table uncompressed. As soon as 1-2 MB of data is loaded, ADC will construct a compression dictionary using that data, and the remaining records will be compressed and written to the table directly behind the compression dictionary as they are loaded.
Compression of backup images and log archives
Since DB2 Universal Database 8, you have been able to compress your
backup images. Starting with DB2 10.1, you can also apply
compression to your log archives. These functions are available regardless
of whether you have a license for the DB2 Storage Optimization Feature.
(The default algorithm that is used to compress backup images and archived
logs is similar to the one used by the UNIX®
You can enable backup compression independently for each backup image. When
performing the backup operation, specify the
COMPRESS clause — for example:
BACKUP DATABASE TDB2 to /vol_aux/backups/
If you use row and index compression and you follow the guidelines to reduce the container sizes for your table spaces, the overall size of your database might be significantly reduced. Consequently, your backup images are automatically smaller than they would be if row and index compression were disabled. At the same time, the extra space savings you can achieve by applying compression on backup images might be greatly reduced. This particularly applies when you have many tables that use adaptive row compression. However, compression on backup images can also compress metadata, LOBs, the catalog tables, and other database objects that cannot be compressed by any other means.
When deciding whether to compress your backup images, the most important indicators are the CPU and disk utilization rates while the backup is in progress. Backup compression can cause a significant overhead in CPU consumption. You should use backup compression only if the backup process is bottlenecked on I/O. This can happen, for example, if you store your backup images on a volume that is not striped across different physical disks or if you back up to network-attached or non-local storage media. In those cases, you increase CPU utilization throughout the backup process but the savings in I/O overhead might effectively shorten the backup window.
If you store your backups on Tivoli® Storage Manager (TSM) software, you should use the compression and de-duplication functionality built into TSM. The de-duplication logic can lead to significantly better overall storage space savings if you keep a series of multiple recent backup images.
If only some of your tables use row and index compression, it is a good practice to separate the tables and indices into different table spaces, based on their compression settings. Also, consider taking backups at the table space level rather than at the database level. In this case, you can balance the compression settings and compress only the backups for those table spaces that contain tables and indices that are not compressed.
Archive log compression
If you configured your database for log archiving, you can enable
compression for those log archives through database configuration
parameters. You can enable or disable log archive compression
independently for the primary or secondary archiving method and the
corresponding archive locations. Log archive compression requires the
database manager to handle the archiving process, which is to say that you must set
your archiving method to
following example shows how to set up a primary, disk-based log archive
with compression enabled:
UPDATE DB CFG FOR TDB2 USING LOGARCHMETH1 DISK:/vol_aux/archive/tdb2; UPDATE DB CFG FOR TDB2 USING LOGARCHCOMPR1 ON;
For the secondary archiving method, you can enable compression in a similar fashion.
Log archive compression, after you enable it, is fully automatic. The
database manager automatically compresses log extents when moved
from the active log path to the archive location. Upon retrieval of log
files, which can happen during
ROLLFORWARD operations, the database manager automatically expands compressed log
files when moved from the archive into the active or overflow log
path. If, during recovery, the database manager encounters a compressed
log extent in the active or overflow log path, it automatically expands
that extent. This can happen, for example, if you manually retrieve log
extents from the archive location into the active or overflow log
When deciding whether to enable compression on your log archives, considerations similar to those for backup compression apply. Check the I/O bandwidth available for your archive location and see whether this is the bottleneck during the process of archiving.
Consider enabling compression if your log archives are on non-local volumes or volumes not striped across a number of disks. In addition, consider enabling compression if archiving your logs to an on-disk staging location for subsequent transfer onto tape through the db2tapemgr utility. In this case, compression helps reduce the time that it takes to transfer archived logs to tape.
If archiving to TSM, consider using its built-in compression facilities. The behavior for log archives is similar to that of the compression facilities in DB2 10.1 because the same algorithm is used. Unlike backup compression, there is generally no additional benefit from de-duplication.
When you use row and index compression, some of the data written to the transaction log is already in compressed format. Therefore, the amount of data being logged is less than would be necessary if you performed the same operations on uncompressed tables. However, even if you are widely using row and index compression, there is generally enough metadata and other uncompressed items in the transaction log that you can achieve significant storage space savings on almost any log archive.
The authors would like to thank Bill Minor for his contributions to this article.
- Get more details in "Best practices: Storage optimization with deep compression."
- Get an overview of changes to DB2 10 in "What's new in DB2 10 for Linux, UNIX, and Windows."
- Learn more about improving performance in DB2 10 in "DB2 V10.1 Query performance enhancements."
- Read "Test drive DB2 V10.1 using a virtual appliance" to learn more about the features of DB2 10.
- Now you can use DB2 for free. Download DB2 Express-C, a no-charge version of DB2 Express Edition for the community that offers the same core data features as DB2 Express Edition and provides a solid base to build and deploy applications.