Optimize storage with deep compression in DB2 10

Using deep compression to minimize storage space and improve database performance

Organizations are generating more data now than at any other time in history. And the need to comply with legal and governmental regulations means that they're keeping that data around for longer periods of time. As a result, databases are growing at an astonishing rate. In fact, according to industry analysts, enterprise databases are growing 125 percent annually. This explosion in data volume places tremendous pressure on enterprises to store, protect, distribute, and derive value from all the data being generated. In 2007, IBM® responded to this pressure by introducing new compression technology, known as deep compression, in DB2® 9. Since then, IBM has improved this technology in subsequent releases of DB2. This article describes the various compression methods available in DB2 10.1 and provides recommended best practices to help you achieve maximum storage space savings when you adopt any of the compression techniques available.


Roger E. Sanders (roger_e_sanders@yahoo.com), Senior Consultant Corporate Systems Engineer, EMC Corporation

Roger E. Sanders is the president of Roger Sanders Enterprises Inc. and a senior consultant corporate systems engineer with EMC Corp. He has been working with DB2 for Linux, UNIX, and Windows since DB2 was introduced on the IBM Personal Computer (as part of OS/2 1.3 Extended Edition) and has been working in the storage industry for over a decade. He has written articles for IDUG Solutions Journal and Certification Magazine, authored DB2 tutorials and articles for IBM developerWorks, presented at several International DB2 Users Group (IDUG) and regional DB2 Users Group (RUG) conferences/meetings, taught numerous classes on DB2 Fundamentals and DB2 for LUW database administration, and is the author of 21 books on DB2 and one book on ODBC. For the past nine years, he has authored the Distributed DBA column in IBM Data Management Magazine (formerly DB2 Magazine), and he has helped IBM develop 18 DB2 for LUW certification exams. From 2008 to 2012, he was recognized as an IBM Champion (for data management); in 2010, he was recognized as an IBM developerWorks Contributing Author; and in 2011, he was recognized as an IBM developerWorks Professional Author.



Thomas Fanghaenel (fanghaen@us.ibm.com), Senior Software Engineer, IBM

Thomas Fanghaenel has been with IBM for nine years and is a Senior Software Engineer with the DB2 Data Warehouse Storage and Indexing Development team. For the past five years he worked in DB2 for Linux, UNIX and Windows development, focusing on compression in both the relational and native XML storage engines. Thomas led the development team for Adaptive Row Compression in DB2 10. Prior to his involvement in DB2 for Linux, UNIX, and Windows, Thomas was the technical lead for DB2 Everyplace, IBM's original database offering for mobile and embedded systems. In the past, Thomas has worked at IBM locations in San Jose, CA and Beijing, China. He is currently based in Marlborough, MA.



17 May 2012


The nature of data

Believe it or not, as the volume of data increases, the cardinality of that data tends to drop. The reason is simple: there just aren't that many truly unique things in the world. Things may be unique when used in combination, but the basic elements themselves aren't all that varied. Consider the periodic table of elements — everything in our world is made up of combinations of a rather small set of elements. Apply the concept to data, and you find the same is true. For instance, according to the last U.S. census, there are about 300 million people living in the United States. But there are only about 78,800 unique last names, resulting in very low cardinality with huge "clumps" in certain name sets. First names are even worse, coming in at around 6,600 (4,400 unique first names for females and 2,200 for males). The names of cities, streets, and addresses, not to mention product names, descriptions, and attributes, also tend to be highly redundant with low cardinality.

How deep compression works

Deep compression is based on the principle that large amounts of data tend to be highly redundant. It works by searching for repeating patterns in the data and replacing those patterns with 12-bit symbols, which are stored along with the patterns they represent in a static dictionary. Once this dictionary is created, it's stored in the table (or page), along with the compressed data, and loaded into memory whenever data in the table is accessed (to aid in decompression). When deep compression is enabled, DB2 scans an entire table looking for repeating column values and repeating patterns that span multiple columns in a row or on a page. It also looks for repeating patterns that are substrings of a given column. However, just because a repeating pattern is found doesn't mean that the data is automatically compressed. Data is compressed only when storage savings will be realized. When a row is compressed, the entire row is stored as a set of 12-bit symbols; rows are never partially compressed.

Deep compression can be applied to primary user data (row data, XML data, and indices), system-generated data (temporary tables), and administrative data (backup images and archived transaction logs). Compression facilities for administrative data are available in all editions of the DB2 software; advanced compression features for primary user and system-generated data are available through the DB2 Storage Optimization Feature.


The alternative row format

When it comes to creating user-defined tables, there are two row formats you can use — standard and alternative — and the format you choose determines how rows are packed when they are stored on disk. The alternative row format allows for more compact storage of NULL and system default values, as well as zero-length values in columns with variable-length data types. (The variable-length data types recognized by DB2 are VARCHAR, VARGRAPHIC, LONG VARCHAR, LONG VARGRAPHIC, BLOB, CLOB, DBCLOB, and XML.) Consequently, this format is often referred to as NULL and default value compression or just value compression.

With the standard row format, space for fixed-length column values is allocated even if the actual stored value is NULL. Similarly, zero-length values stored in columns with variable-length data types consume a small amount of space. However, with the alternative row format, NULL values in all columns and zero-length values in columns with variable-length data types consume no space.

When compared to the standard row format, the alternative row format also reduces the storage overhead for all other values in columns with variable-length data types. However, the storage consumption for all non-NULL values that are stored in fixed-length columns increases. Formulas for determining the byte counts for values of all supported data types and row formats can be found in the reference material for the CREATE TABLE statement.

Controlling row formatting

You can control the row format used on a table-by-table basis. By default, tables use the standard row format. To create a table that uses the alternative row format, simply append the VALUE COMPRESSION clause to the CREATE TABLE statement that's used to create the table. For example: CREATE TABLE ... VALUE COMPRESSION.

To change the row format used by an existing table, issue one of the following statements instead:

  ALTER TABLE ... ACTIVATE VALUE COMPRESSION
  
  ALTER TABLE ... DEACTIVATE VALUE COMPRESSION

If you activate or deactivate value compression for a table, existing data stored in the table is not modified. Instead, rows remain in their current row format unless you apply one or more of the measures presented in the Adoption strategies section of this article.

For tables that use the alternative row format, you can save additional space by enabling default value compression for fixed-length numeric and character columns. This results in system default values (0 for numeric columns and blank for fixed-length character columns) not being materialized in the on-disk representation of a row. You can enable compression of system default values for each column individually by specifying the COMPRESS SYSTEM DEFAULT column option for the column in a CREATE TABLE or ALTER TABLE ... ALTER COLUMN SQL statement.
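As an illustrative sketch (the table and column names below are hypothetical, not from this article), the following statements create a table that uses the alternative row format and compresses system default values in one fixed-length character column, and then add the same column option to a numeric column of the existing table:

  CREATE TABLE inventory (
      item_id    INTEGER      NOT NULL,
      status     CHAR(10)     COMPRESS SYSTEM DEFAULT,
      quantity   INTEGER,
      notes      VARCHAR(500)
    ) VALUE COMPRESSION

  ALTER TABLE inventory ALTER COLUMN quantity COMPRESS SYSTEM DEFAULT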

Deciding on which format to use

Although the standard row format is a good choice in most cases, there are some tables for which the alternative row format yields a more compact storage layout:

  • Sparsely populated tables (i.e., tables that contain many rows with many NULL or system default values) should use the alternative row format. However, storage space requirements increase for a row whenever you update a NULL or system default value to a value other than NULL or the system default. This change generally causes overflow records to occur even in tables with no variable-length columns (i.e., tables in which all rows have the same space requirements when the standard row format is used).
  • Tables having variable-length data types for the vast majority of their columns should use the alternative row format.

For some tables that do not have those characteristics, using the alternative row format can increase storage space requirements. A test in your environment might be worthwhile.

You can use the alternative row format regardless of whether you have a license for the DB2 Storage Optimization Feature. This flexibility allows you to choose the most compact storage layout for each table, even if you are not planning on using row compression. However, if you are planning on using row compression, choosing the more compact row format to start with yields a smaller on-disk footprint for the table in most cases, although the effect is less pronounced and may even be negligible. This is because row compression can compress the materialized NULL and system default values in tables with the standard row format very well.


Row compression

Row compression was introduced with DB2 9 for Linux, UNIX, and Windows, and significant improvements have been made to its functionality in every release since, culminating in the next-generation adaptive compression functionality available in DB2 10.1. Row compression requires a license for the DB2 Storage Optimization Feature.

Beginning with DB2 10.1, two types of row compression methods are available:

  • Classic row compression— Refers to the compression technology that has been used:
    • For user table data, since DB2 9 for Linux, UNIX, and Windows was introduced
    • For XML and temporary data, since DB2 9.7 for Linux, UNIX, and Windows was introduced
  • Adaptive row compression— A new compression mode introduced in DB2 10.1 that you can apply to user table data. Adaptive row compression is superior to classic row compression in that it generally achieves better compression and requires less database maintenance to keep the compression ratio near an optimal level.

Classic row compression

Classic row compression relies on a dictionary-based compression algorithm. There is one compression dictionary for each table object. The dictionary contains a mapping of patterns that frequently occur in rows throughout the whole table. This compression dictionary is referred to as the table-level compression dictionary. An example of how classic row compression works can be seen in Figure 1.

Figure 1. How classic row compression works
Image shows how a table is stored with a table-level compression dictionary to reduce repeated patterns

You can enable row compression on a table-by-table basis. For classic row compression to occur, you must enable the table for compression, and a dictionary must exist for the data or XML object. To enable classic row compression for a table in DB2 10.1 at table creation time, issue CREATE TABLE ... COMPRESS YES STATIC.

On the other hand, to enable an existing table for classic row compression, execute an ALTER TABLE statement that looks like this: ALTER TABLE ... COMPRESS YES STATIC.

In DB2 10.1, the STATIC option in the COMPRESS YES clause is mandatory in both cases. In earlier versions of DB2, you use the COMPRESS YES clause without any further qualification, as follows:

  CREATE TABLE ... COMPRESS YES
  
  ALTER TABLE ... COMPRESS YES

Enabling row compression results in space savings for the vast majority of practical tables; compression ratios are typically 50-80 percent or more. Furthermore, the storage footprint for a table that uses row compression never exceeds that of the uncompressed version of the same table.

Storage space savings from row compression typically translate into fewer physical I/O operations for reading the data in a compressed table, since the same number of rows is stored on fewer physical pages. (Compression allows more data rows to be packed into the same number of pages, and buffer pool hit ratios increase.) In many cases, I/O savings and improved buffer pool utilization result in higher throughput and faster query execution times.

Adaptive row compression

Like classic row compression, adaptive row compression relies on a dictionary-based compression algorithm. The difference is that there can be multiple compression dictionaries for each table object. Each dictionary contains a mapping of patterns that frequently occur in rows throughout a single page. Consequently, these compression dictionaries are referred to as page-level compression dictionaries. Figure 2 shows an example of how adaptive row compression works. Adaptive compression not only yields significantly better compression ratios in many cases but can also adapt to changing data characteristics.

Figure 2. How adaptive row compression works
Image shows page-level compression dictionaries

As with classic row compression, you enable adaptive row compression on a table-by-table basis. In DB2 10.1, to create a table that has adaptive row compression enabled, you simply issue one of the following statements:

  CREATE TABLE ... COMPRESS YES ADAPTIVE
  
  CREATE TABLE ... COMPRESS YES

On the other hand, to enable an existing table for adaptive row compression, execute an ALTER TABLE statement that looks like this:

  ALTER TABLE ... COMPRESS YES ADAPTIVE
  
  ALTER TABLE ... COMPRESS YES

In DB2 10.1, adaptive row compression is the default type of row compression used. Therefore, the ADAPTIVE option is the default value for the COMPRESS YES clause.

When you upgrade a database from an earlier version of DB2 for Linux, UNIX, and Windows, existing tables that had classic row compression enabled keep their compression settings, and classic row compression is used. Consequently, if you want to enable these tables for adaptive row compression, you must use one of the ALTER TABLE statements shown earlier.
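For example, for a hypothetical upgraded table named SALES that still uses classic row compression, you could switch it to adaptive row compression as follows; newly inserted rows then use adaptive compression, and the REORG (described later in this article) applies it to the existing rows:

  ALTER TABLE sales COMPRESS YES ADAPTIVE

  REORG TABLE sales RESETDICTIONARY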

As mentioned, adaptive row compression builds on classic row compression; table-level compression dictionaries are still used. A table-level dictionary is complemented by page-level compression dictionaries, which contain entries for frequently occurring patterns within a single page. The table-level dictionary helps eliminate repeating patterns in a global scope, while page-level dictionaries take care of locally repeating patterns found on individual pages. Figure 3 illustrates how the combination of table-level and page-level dictionaries is used.

Figure 3. How table-level and page-level compression dictionaries work together when adaptive compression is used
Image shows combination of 2 types of compression dictionaries for global scope

Page-level dictionaries are maintained automatically. When a page becomes filled with data, the DB2 database manager builds a page-level compression dictionary for the data in that page. Over time, the DB2 database manager automatically determines when to rebuild the dictionary for pages where data patterns have changed significantly. As a result, the use of adaptive compression not only yields higher overall compression savings, but it also ensures that compression ratios do not degenerate as much over time as with classic row compression. In many practical cases, the compression ratio remains nearly optimal over time. Thus, by using adaptive row compression, you can reduce the cost associated with monitoring compression ratios of tables and performing the necessary maintenance (classic, offline table reorganization) to improve storage utilization.


Revealing compression settings

By examining the COMPRESSION column in the SYSCAT.TABLES system catalog view, you can determine whether a table has row compression enabled and which row format it uses. Possible values for this column are as follows:

  • V — Uses the alternative row format and no row compression
  • R — Uses row compression and the standard row format
  • B — Uses the alternative row format and row compression
  • N — Does not use row compression but uses the standard row format

With DB2 10.1, you can determine which row compression type is used on your tables by examining the ROWCOMPMODE column in the SYSCAT.TABLES system catalog view. Possible values for this column are as follows:

  • S — Uses classic row compression
  • A — Uses adaptive row compression
  • Blank — Is not enabled for row compression

The ROWCOMPMODE column of the SYSCAT.TABLES system catalog view does not exist in earlier versions of DB2; in those versions, all tables that have row compression enabled implicitly use classic row compression.

If you have DB2 10.1 for Linux, UNIX, and Windows installed, the following query can be used to obtain the compression settings for all user tables that have been defined.

Listing 1. Obtaining compression settings with a query
  SELECT SUBSTR(TABSCHEMA, 1, 10) AS TABSCHEMA,
         SUBSTR(TABNAME, 1, 10) AS TABNAME,
         COMPRESSION, ROWCOMPMODE
  FROM SYSCAT.TABLES
  WHERE TABSCHEMA NOT LIKE 'SYS%'

When this query is executed, you should see results that look something like Listing 2.

Listing 2. Query results showing compression settings
  TABSCHEMA  TABNAME    COMPRESSION ROWCOMPMODE
  ---------- ---------- ----------- -----------
  DB2INST1   ACCTCR     R           S
  DB2INST1   BKPF       R           A
  DB2INST1   BSIS       B           A
  DB2INST1   CDCLS      N
  DB2INST1   CDHDR      V
  DB2INST1   COSP       B           S

    6 record(s) selected.

In this example, the ACCTCR, BKPF, BSIS, and COSP tables are all enabled for row compression. The ACCTCR and COSP tables use classic row compression, and the BKPF and BSIS tables use adaptive row compression. The BSIS and COSP tables also use the alternative row format, whereas the ACCTCR and BKPF tables use the standard row format. The CDHDR table uses the alternative row format without row compression, and the CDCLS table uses the standard row format without row compression.


Building a compression dictionary

Before classic row compression can take effect in a table that has been enabled for compression, a dictionary must exist for the table. With DB2 9, you had to be conscious of whether a compression dictionary existed and you had to initiate the creation of a table-level dictionary if necessary. A table-level compression dictionary can be built (and existing data in a table can be compressed) by performing an offline (classic) table reorganization operation. Such an operation is initiated by executing the REORG command with the KEEPDICTIONARY or the RESETDICTIONARY option specified. If the REORG command is executed with either option specified, and a compression dictionary does not exist, a new dictionary will be created. On the other hand, if the REORG command is executed with either option specified, and a compression dictionary already exists, the existing dictionary will either be recreated (RESETDICTIONARY) or left as it is (KEEPDICTIONARY) and data in the table will be reorganized and compressed.

To create a new compression dictionary for a table named EMPLOYEE that has been enabled for deep compression, (as well as reorganize and compress the table's data) you would execute the following command:

  REORG TABLE employee RESETDICTIONARY

When this command is executed, data stored in the EMPLOYEE table will be analyzed, a compression dictionary will be constructed and stored at the beginning of the table, and all existing data will be compressed and written to the table directly behind the compression dictionary. Figure 4 shows how the EMPLOYEE table would look before and after deep compression is applied.

Figure 4. How data in a table is altered when a compression dictionary is built and classic row compression is applied
Diagram shows compressed data written to table

It is important to note that because records in a compressed table are moved between storage and memory in compressed form, records for compressed tables written to transaction log files are compressed as well. (The compression dictionary is loaded into memory when the table is accessed so compression and decompression can take place.)


Automatic Dictionary Creation (ADC)

As you might imagine, optimal compression ratios are achieved when a compression dictionary is built from an all-inclusive set of data. Consequently, when a compression dictionary is built by reorganizing a table, a high compression ratio results because every row in the table is used. However, testing has shown that a good compression ratio is also possible when just a small amount of representative data is analyzed. (In some cases, evaluating less than 1 percent of the total number of rows available has yielded a compression ratio of 45 percent.) This concept forms the basis for a compression feature that was added in DB2 9.5 called Automatic Dictionary Creation (ADC).

In DB2 9.5 and later, if a table is enabled for compression at the time it's created, ADC will cause a compression dictionary to be built automatically once a sufficient amount of data has been stored in the table. The threshold at which ADC kicks in and begins constructing the compression dictionary depends on the table's row size. Dictionary construction typically begins when 1-2 MB of pages have been allocated to the table. At that point, ADC will check to see how much user data is contained within the table; if at least 700 KB of data is present, a compression dictionary will be built. (Note that these values are set internally and can't be altered.) Operations that can trigger ADC include inserts, imports, loads, and the redistribution of data across partitions.
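As a sketch of this behavior (table and file names are hypothetical), a table that is created with compression enabled gets its table-level dictionary built by ADC as soon as enough data has been inserted or loaded:

  CREATE TABLE sales_hist (
      sale_date  DATE,
      store_id   INTEGER,
      amount     DECIMAL(9,2)
    ) COMPRESS YES

  -- ADC builds the dictionary automatically during this load, once the
  -- internal thresholds described above are crossed
  LOAD FROM sales_hist.del OF DEL INSERT INTO sales_hist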

Unlike the compression dictionary built by a REORG operation, the dictionary created by way of ADC is stored in the table at the end of the existing data. The table's pre-existing records remain uncompressed until an offline table reorganization operation is performed or until the pre-existing records are updated (in which case, each record modified is compressed when the changes are saved). New records are compressed as they're added. (One goal of ADC is to build a compression dictionary that will yield a decent compression ratio without leaving a large amount of uncompressed data in the table.) Figure 5 shows what a table enabled for compression during creation would look like before, during, and after a compression dictionary is built by ADC.

Figure 5. How ADC works
Image shows flow of creation of table, data inserted, dictionary built, then table contains table-level compression dictionary and finally table and page-level compressed data

When compression is enabled for a table that is already populated (by setting its COMPRESS attribute to YES), a compression dictionary isn't created automatically. Rather, the next time a table growth action occurs, ADC will be triggered, and a small number of records at the beginning of the table will be used to construct a compression dictionary for the entire table. Once the dictionary is created, data added to the table by subsequent insert, import, load, and redistribution operations will be compressed; pre-existing data will remain uncompressed.

As you can see, the automatic creation of a compression dictionary is controlled, in part, by a table's compression attribute. To prevent ADC behavior, don't enable a table for compression until you're ready to manually build a compression dictionary and compress the data. On the other hand, if you elect to take advantage of ADC, remember that the compression ratio for the dictionary produced may not be as good as that of a dictionary created by an offline table reorganization. Also, because the table remains online while the compression dictionary is built, the transaction that causes ADC to be initiated will experience a slight negative impact on performance when the threshold is crossed and ADC is triggered.

It is important to note that dictionaries built with ADC might, over time, yield lower storage space savings than those built using a classic table reorganization. Also, over time, the table-level dictionary for a table whose data is frequently updated might no longer contain the most efficient patterns for the changing data, leading to a degradation of the compression ratio. In such cases, regular classic table reorganizations may be required to maintain consistently high storage space savings.


Identifying candidate tables for row compression

If you are not already using row compression, you might want to examine your database to determine which tables might benefit from having their data compressed. Because data compression helps you initially save storage on existing uncompressed tables and optimize storage growth in the future, you can find your storage pain points by examining existing tables that contain a significant amount of data or tables for which you anticipate substantial growth over time.

Naturally, the largest tables are obvious candidates for compression, but don't overlook smaller tables. If you have hundreds or thousands of small tables, you may benefit from the aggregate effect of compression. Of course, "large" and "small" are relative terms; your database design determines whether tables of a million or several million rows are considered large or small.

The following query uses the SQL administration function ADMIN_GET_TAB_INFO() to return an ordered list of all table names and table data object sizes for a particular schema. To display the current compression settings and row compression modes in use, the result set is joined with the SYSCAT.TABLES view so that the COMPRESSION and ROWCOMPMODE columns can be returned.

Listing 3. Query to look for candidate tables for compression
  SELECT SUBSTR(T.TABSCHEMA, 1, 10) AS TABSCHEMA,
         SUBSTR(T.TABNAME, 1, 10) AS TABNAME,
         SUM(TI.DATA_OBJECT_P_SIZE)/1024/1024 AS STORAGESIZE_GB,
         T.COMPRESSION AS COMPRESSION,
         T.ROWCOMPMODE AS ROWCOMPMODE
  FROM TABLE (SYSPROC.ADMIN_GET_TAB_INFO('DB2INST1', '')) TI
         JOIN SYSCAT.TABLES T ON T.TABSCHEMA = TI.TABSCHEMA AND
                                 T.TABNAME = TI.TABNAME
  GROUP BY T.TABSCHEMA, T.TABNAME, T.COMPRESSION, T.ROWCOMPMODE
  ORDER BY STORAGESIZE_GB DESC

This query identifies the top space consumers in your current database and gives you a list of candidate tables you should start working with. In many practical scenarios, the bulk of the storage space of a database is occupied by relatively few tables.

After you determine candidate tables based on their storage consumption, consider the typical SQL activity against the data in those tables:

  • Tables that are read only are excellent candidates for row compression. Tables with a read/write ratio of 70 percent or more reads and 30 percent or less writes are good candidates for row compression.
  • Tables that experience only a limited number of updates are likely to be good candidates for row compression.
  • Tables that undergo heavy updates might not be as good candidates for row compression.
  • Large tables accessed mostly through table scans rather than index scans are good candidates for compression. This generally includes large fact tables in data warehouses, where a significant number of queries perform heavy aggregations. The I/O savings and the increased buffer pool utilization that result from applying compression are likely to improve query performance for such tables.

Whether it is better to use classic or adaptive row compression depends less on the data access patterns than on the actual compression savings that you can achieve by using either type of compression. You make the decision about which row compression mode to choose later in the process.

Row compression performs best in I/O or memory-bound environments where the workload is not bottlenecked on the CPU. Extra CPU cycles are required to perform row compression and expansion of data rows whenever they are accessed or modified. This overhead can be offset by efficiencies that are gained in doing fewer I/O operations. Row compression works very well with decision-support workloads comprising complex analytic queries that perform massive aggregations where row access is mostly sequential and less random.


Estimating storage savings from using row compression

After you have created a list of candidate tables for compression, based on storage consumption and data access characteristics, the next step is to determine the storage space savings you can expect to achieve if those tables are indeed compressed. You can estimate the space savings for any table before you enable it for row compression. Furthermore, you can estimate storage space savings even if you do not have a license for the DB2 Storage Optimization Feature.

As with the compression functionality in general, the mechanisms for compression estimation have evolved over time, with the primary goal being to provide faster and simpler ways to perform this task. The mechanism of choice depends on which version of the DB2 software you are using. All the tools and functions available in older releases are also available in DB2 10.1. However, you might find that the improved counterparts in DB2 10.1 are easier to use or faster to respond with the data you are looking for.

Estimating row compression savings in DB2 10.1

In DB2 10.1, the preferred way to estimate compression ratios is by using the ADMIN_GET_TAB_COMPRESS_INFO() administrative function. You can use this to estimate compression savings for a particular table, to estimate compression savings for all tables in a particular schema, or to estimate compression savings for all tables in a database. In addition, this function calculates current compression savings, along with savings projections, for classic and adaptive row compression.

The syntax for the ADMIN_GET_TAB_COMPRESS_INFO() table function is ADMIN_GET_TAB_COMPRESS_INFO (TableSchema, TableName) where:

  • TableSchema identifies by name the schema where the table that compression information is to be obtained for resides. (If this parameter contains an empty or NULL value, all schemas are assumed.)
  • TableName identifies by name the table that compression information is to be obtained for. (If this parameter contains an empty or NULL value, all table names are assumed.)

If you do not specify a table name, this function computes estimates for all tables in a particular schema; if you specify neither a table name nor a schema, it computes estimates for all tables in the database. (If your schema or database contains hundreds or thousands of tables, the processing time might be substantial; in that case, try to restrict your queries to compute estimates only for those tables for which you are considering compression.)
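For example, to compute an estimate for just the EMPLOYEE table in the DB2INST1 schema (the same table used in later examples), you could run a query like this:

  SELECT OBJECT_TYPE, ROWCOMPMODE,
         PCTPAGESSAVED_CURRENT,
         PCTPAGESSAVED_STATIC,
         PCTPAGESSAVED_ADAPTIVE
    FROM TABLE(SYSPROC.ADMIN_GET_TAB_COMPRESS_INFO('DB2INST1', 'EMPLOYEE'))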

The structure of the table that is returned by the ADMIN_GET_TAB_COMPRESS_INFO() function can be seen in Table 1.

Table 1. Table returned by the ADMIN_GET_TAB_COMPRESS_INFO() function

  • TABSCHEMA (VARCHAR(128)): Schema name
  • TABNAME (VARCHAR(128)): Table name
  • DBPARTITIONNUM (SMALLINT): Database partition number
  • OBJECT_TYPE (VARCHAR(4)): The type of object for which compression information is being reported, which can be one of the following:
    • 'XML' = Compression information is being reported for XML data
    • 'DATA' = Compression information is being reported for relational data
  • ROWCOMPMODE (CHAR(1)): The current row compression mode for the object, which can be one of the following:
    • 'S' = Classic row compression is enabled
    • 'A' = Adaptive row compression is enabled
    • Blank = Row compression is not enabled
  • PCTPAGESSAVED_CURRENT (SMALLINT): Current percentage of pages saved from row compression
  • AVGROWSIZE_CURRENT (SMALLINT): Current average record length
  • PCTPAGESSAVED_STATIC (SMALLINT): Estimated percentage of pages saved from classic row compression
  • AVGROWSIZE_STATIC (SMALLINT): Estimated average record length from classic row compression
  • PCTPAGESSAVED_ADAPTIVE (SMALLINT): Estimated percentage of pages saved from adaptive row compression
  • AVGROWSIZE_ADAPTIVE (SMALLINT): Estimated average record length from adaptive row compression

Thus, if you wanted to get an estimate of how much storage space will be saved if the data in every table in a schema named DB2INST1 is compressed, you could do so by executing a query that looks like Listing 4.

Listing 4. Query to estimate storage savings
  SELECT SUBSTR(TABNAME,1,10) AS TABNAME,
                PCTPAGESSAVED_CURRENT,
                PCTPAGESSAVED_STATIC,
                PCTPAGESSAVED_ADAPTIVE
    FROM TABLE(SYSPROC.ADMIN_GET_TAB_COMPRESS_INFO('DB2INST1', ''))

Using the result of this query, you can easily determine whether to enable row compression for a particular table and which row compression mode you should use. For example, suppose the previous query returned a result set that looked like Listing 5.

Listing 5. Sample results
  TABNAME    PCTPAGESSAVED_CURRENT PCTPAGESSAVED_STATIC PCTPAGESSAVED_ADAPTIVE
  ---------- --------------------- -------------------- ----------------------
  ACCTCR                         0                   68                     72
  BKPF                           0                   83                     90
  BSIS                           0                   82                     90
  CDCLS                          0                   11                     17
  CDHDR                          0                   70                     73
  COSP                           0                   87                     91

    6 record(s) selected.

Of the six tables for which compression estimates were computed, five show very good compression potential. The CDCLS table is the only table that may not benefit from compression. And in every case, it appears that adaptive row compression will yield better results than classic row compression.

Calculating the differences between the values of the PCTPAGESSAVED_STATIC and PCTPAGESSAVED_ADAPTIVE columns helps you determine the best choice for the row compression mode. However, be aware that the scale of the PCTPAGESSAVED values is not linear. For example, consider the ACCTCR and COSP tables shown in the previous example. For the ACCTCR table, you can expect 68-percent space savings with classic row compression and 72-percent space savings with adaptive row compression. For the COSP table, the estimated savings are 87 percent with classic row compression and 91 percent with adaptive row compression. Although the absolute difference between both types is 4 percent for either table, the relative savings that adaptive row compression can achieve over classic row compression are different. Assume that each of the tables is 100 GB in size when compression is not enabled. For the ACCTCR table, the estimated sizes are 32 GB with classic row compression and 28 GB with adaptive row compression, which constitutes a difference of approximately 12.5 percent. However, the estimated sizes for the COSP table are 13 GB with classic row compression and 9 GB with adaptive row compression, which is a difference of approximately 30 percent.

In the previous example, the COSP table is a good candidate for adaptive row compression, but the ACCTCR table might not be a good candidate. Also, the estimated values indicate that adaptive compression would reduce the storage size for table CDHDR by only another 10 percent over classic row compression. Consequently, you might not want to enable adaptive compression for the CDHDR and ACCTCR tables unless you expect the data characteristics to change significantly or you expect lots of new data to be inserted. For the rest of the tables, adaptive compression might be the better choice because of the significant additional storage space savings of up to 45 percent.
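If you want to compute this relative difference directly rather than by hand, a query along the following lines (a sketch, reusing the DB2INST1 schema from the earlier examples) returns the additional space that adaptive row compression is estimated to save, expressed as a percentage of each table's classic-compressed size:

  SELECT SUBSTR(TABNAME, 1, 10) AS TABNAME,
         PCTPAGESSAVED_STATIC,
         PCTPAGESSAVED_ADAPTIVE,
         DEC((PCTPAGESSAVED_ADAPTIVE - PCTPAGESSAVED_STATIC) * 100.0 /
             (100 - PCTPAGESSAVED_STATIC), 5, 1) AS ADDITIONAL_SAVINGS_PCT
    FROM TABLE(SYSPROC.ADMIN_GET_TAB_COMPRESS_INFO('DB2INST1', ''))
   WHERE PCTPAGESSAVED_STATIC < 100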

Estimating row compression savings in DB2 9.7 and 9.5

In DB2 9.7 and 9.5, administrative functions that are very similar to the ADMIN_GET_TAB_COMPRESS_INFO() function that comes with DB2 10.1 are available, although they have slightly different names and signatures. In DB2 9.7, the function is called ADMIN_GET_TAB_COMPRESS_INFO_V97(), and in DB2 9.5, its name is ADMIN_GET_TAB_COMPRESS_INFO(). Unlike the DB2 10.1 function, which takes two input parameters, the DB2 9.5 and 9.7 functions take three. The third parameter is the execution mode, and for compression estimation, you pass the string 'ESTIMATE' as the value for this parameter. The structure of the table returned by the DB2 9.5 and 9.7 functions also differs significantly from the structure of the table produced by the DB2 10.1 function; it can be seen in Table 2.

Table 2. Table returned by the DB2 9.5 and 9.7 functions

  • TABSCHEMA (VARCHAR(128)): Schema name
  • TABNAME (VARCHAR(128)): Table name
  • DBPARTITIONNUM (SMALLINT): Database partition number
  • DATA_PARTITION_ID (INTEGER): Data partition number
  • COMPRESS_ATTR (CHAR(1)): The state of the COMPRESS attribute on the table, which can be one of the following:
    • 'Y' = Row compression is set to YES
    • 'N' = Row compression is set to NO
  • DICT_BUILDER (VARCHAR(30)): Code path taken to build the compression dictionary, which can be one of the following:
    • 'NOT BUILT' = No dictionary available
    • 'INSPECT' = INSPECT ROWCOMPESTIMATE
    • 'LOAD' = LOAD INSERT/REPLACE
    • 'REDISTRIBUTE' = REDISTRIBUTE
    • 'REORG' = REORG RESETDICTIONARY
    • 'TABLE GROWTH' = INSERT, IMPORT (which uses INSERT), or size-increasing updates that cause DB2 to put the updated record on a new page
  • DICT_BUILD_TIMESTAMP (TIMESTAMP): Date and time the compression dictionary was built; if no dictionary is available, the timestamp is NULL
  • COMPRESS_DICT_SIZE (BIGINT): Size of the compression dictionary, measured in bytes
  • EXPAND_DICT_SIZE (BIGINT): Size of the expansion dictionary, measured in bytes
  • ROWS_SAMPLED (INTEGER): Number of records that contributed to building the dictionary; migrated tables with compression dictionaries return NULL in this column
  • PAGES_SAVED_PERCENT (SMALLINT): Percentage of pages saved from compression. This information is based on the record data in the sample buffer only. Migrated tables with compression dictionaries return NULL in this column.
  • BYTES_SAVED_PERCENT (SMALLINT): Percentage of bytes saved from compression. This information is based on the record data in the sample buffer only. Migrated tables with compression dictionaries return NULL in this column.
  • AVG_COMPRESS_REC_LENGTH (SMALLINT): Average compressed record length of the records that contributed to building the dictionary. Migrated tables with compression dictionaries return NULL in this column.

Furthermore, the processing time for the administrative functions might be significantly longer than in DB2 10.1 because a full table scan is performed for each table to compute the compression estimates.

Estimating row compression savings in DB2 9

DB2 9, the first DB2 release to support row compression, does not provide any of the administrative functions introduced in later releases. However, if you are using this release, you can still determine which tables will benefit the most from row compression by using the Inspect utility. Invoked by executing the INSPECT command with the ROWCOMPESTIMATE option specified, the Inspect utility will examine each row in a particular table, build a compression dictionary from the data found, and use this dictionary to estimate how much space will be saved if the data in the table is compressed.

For example, to estimate how much storage space will be saved if the data in a table named EMPLOYEE is compressed, you would execute an INSPECT command like this:

  INSPECT ROWCOMPESTIMATE TABLE NAME employee RESULTS employee.rowcompestimate.out

Running this with the RESULTS option specified will cause the Inspect utility to save the output produced in a file that resides in the diagnostic data directory. However, before you can examine the information stored in this file, you must format it using db2inspf: db2inspf employee.rowcompestimate.out employee.rowcompestimate.txt.

The formatted file produced will contain information about the estimated compression savings that, when viewed, will look something like Listing 6.

Listing 6. Estimating row compression savings
  DATABASE: TEST
  VERSION : SQL10010
  2011-11-07-18.39.23.555355

  Action: ROWCOMPESTIMATE TABLE
  Schema name: DB2INST1
  Table name: EMPLOYEE
  Tablespace ID: 4  Object ID: 8
  Result file name: employee.rowcompestimate.out

      Table phase start [...]
  
        Data phase start. Object: 8  Tablespace: 4
        Row compression estimate results:
        Percentage of pages saved from compression: 68
        Percentage of bytes saved from compression: 68
        Compression dictionary size: 32640 bytes.
        Expansion dictionary size: 32768 bytes.
        Data phase end.
      Table phase end.
  Processing has completed. 2011-11-07-18.40.08.349345

If a table is enabled for deep compression (i.e., the COMPRESS attribute is set to YES) before the INSPECT command is executed, the compression dictionary that is built and used to estimate space savings will be written to the table, at the end of the existing data — provided a compression dictionary doesn't already exist. (Otherwise, the compression dictionary created will be destroyed.) Figure 6 shows a table before and after the Inspect utility was used to estimate storage savings if it was enabled for deep compression before the estimate was acquired.

Figure 6. How data in a table is altered when a table enabled for deep compression is evaluated by the Inspect utility
Diagram shows once inspect utility is run, table-level compression dictionary is added

Regardless of the method used to create it, once a compression dictionary has been constructed and written to a table, new records added to that table will automatically be compressed. If the compression dictionary was created by the Inspect utility, pre-existing records in the table will remain uncompressed until an offline table reorganization operation is performed or the pre-existing records are updated (in which case each record modified will be compressed).

In later DB2 releases, you should not use the Inspect utility to build a table-level dictionary but should rely on ADC instead.


Index compression

You can apply compression to indices if you are using DB2 9.7 or later and have a license for the DB2 Storage Optimization Feature. Compressing indices results in fewer leaf pages, which helps reduce the depth of the index tree, in many cases. This allows for fewer index page accesses to find a specific key and, as in the case of row compression, allows for a higher utilization of buffer pools and requires fewer physical I/O operations. In many practical cases, index compression can significantly improve query performance.

You can enable index compression for each index individually. When you create an index, the index inherits its compression setting from the underlying table. That is, all indices you create on tables enabled for classic or adaptive row compression are also compressed. Therefore, you may find it beneficial to create tables with compression enabled or alter tables to enable row compression before you create any indices on them.
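You can also set the attribute explicitly with the COMPRESS clause on the CREATE INDEX and ALTER INDEX statements. The sketch below uses hypothetical index and column names on the EMPLOYEE table; after you alter an existing index, an index reorganization is required to convert it to the compressed format:

  CREATE INDEX empl_ix1 ON employee (workdept, job) COMPRESS YES

  ALTER INDEX empl_ix2 COMPRESS YES
  REORG INDEXES ALL FOR TABLE employee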

Index compression uses a combination of three techniques to reduce the amount of data stored on disk. Space savings are achieved by dynamically adjusting the number of index keys, rather than by reserving space for the highest possible number of keys that can be stored in a page. The index entries, each of which consists of an index key and a RID list, are compressed by eliminating redundant prefixes from the keys in a single page. For duplicate keys, the associated RIDs in each list are compressed by applying a delta encoding.

Like adaptive row compression and unlike classic row compression, index compression is fully automatic. Optimal compression savings are maintained over time, and there is no need to monitor and potentially reorganize indexes to improve compression ratios.


Identifying candidate indexes for index compression

The best candidates for compression are indices with long keys (for example, multicolumn indexes or indexes on character data) and indices that contain many duplicate keys (for example, ones based on a single or a limited number of low-cardinality columns). In addition, indices that involve variable-length columns (for example, VARCHAR columns) might benefit from the better space utilization that dynamic index page space management offers.

Multicolumn indexes generally compress better if the leading key columns have low to medium cardinality, and the higher-cardinality or unique columns are toward the end of the key column list. Depending on the characteristics of the queries you expect your indices to support, column reordering might be possible without sacrificing query performance. It may also be worthwhile to check the order and cardinalities for the columns of your larger indices, especially in data warehousing scenarios. In these cases, the majority of queries aggregate values over a larger number of qualifying rows, and query characteristics are generally more predictable than in OLTP workloads.

Examples of indices that might benefit less from compression are those on a single unique numeric column or unique multicolumn ones that have the highest-cardinality column as the leading index column.
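As a hypothetical illustration of these guidelines (table, column, and index names are invented), the first index below places low-cardinality columns at the front of the key and is likely to compress well, whereas the second leads with the unique column and is likely to compress less well:

  -- Low-cardinality leading columns: good compression candidate
  CREATE INDEX sales_ix1 ON sales (region, product_group, transaction_id) COMPRESS YES

  -- Unique column leads the key: less likely to benefit from compression
  CREATE INDEX sales_ix2 ON sales (transaction_id, region, product_group) COMPRESS YES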


Estimating storage space savings for index compression

When deciding on which indices to compress, estimation of storage space savings plays a bigger role than it does for tables. The computational overhead associated with index compression is less than for row compression, and storage space savings generally correspond more directly to overall performance improvements. If you can achieve significant space savings for any of your indices in an OLTP workload, it is a good idea to compress them.

Similar to the ADMIN_GET_TAB_COMPRESS_INFO() function, ADMIN_GET_INDEX_COMPRESS_INFO() can be used to estimate storage space savings for index compression. This can be used to estimate compression savings for a single index, for all indices on a particular table, for all indices in a particular schema, or for all indices in a database; the execution mode is controlled by the first three input parameters. The syntax for the ADMIN_GET_INDEX_COMPRESS_INFO() table function is:

ADMIN_GET_INDEX_COMPRESS_INFO (ObjectType, ObjectSchema, ObjectName, Member, DataPartitionID)

where:

  • ObjectType identifies the type of object that compression savings information is to be obtained for. This parameter must be assigned one of the following case-sensitive values:
    • 'T', NULL, or an empty string to indicate a table
    • 'I' for an index
  • ObjectSchema identifies by name the schema where the object resides.
  • ObjectName identifies by name the object that compression information is to be obtained for.
  • Member identifies a database member number. (When a value is provided for this parameter, information is only returned for indices that reside on the specified database member.)
  • DataPartitionID identifies the data partition ID. (When a value is provided for this parameter, information is only returned for index partitions that have been defined on the data partition specified.)

If you want to estimate space savings for a single index or for all indices in a particular schema, specify an object type of 'I'. To estimate space savings for all indices on a particular table or on all tables in a schema or database, specify an object type of 'T', NULL, or an empty string. Depending on the value of the object type parameter, you can use the object name and object schema parameters to identify the table or index to analyze. Alternatively, you can leave the values of the object name and object schema parameters as NULL or empty strings to estimate compression savings for a range of indices.
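For example, to estimate the savings for a single index (a hypothetical index named EMPL_IX1 in the DB2INST1 schema), specify an object type of 'I' and identify the index by its schema and name:

  SELECT SUBSTR(INDNAME, 1, 20) AS INDNAME,
         COMPRESS_ATTR,
         PCT_PAGES_SAVED
    FROM TABLE(SYSPROC.ADMIN_GET_INDEX_COMPRESS_INFO('I', 'DB2INST1', 'EMPL_IX1',
                                                     NULL, NULL))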

The structure of the table returned by the ADMIN_GET_INDEX_COMPRESS_INFO() function is shown in Table 3.

Table 3. Table returned by the ADMIN_GET_INDEX_COMPRESS_INFO() function

  • INDSCHEMA (VARCHAR(128)): Index schema name
  • INDNAME (VARCHAR(128)): Index name
  • TABSCHEMA (VARCHAR(128)): Table schema name
  • TABNAME (VARCHAR(128)): Table name
  • DBPARTITIONNUM (SMALLINT): Database partition number
  • IID (INTEGER): Index identifier
  • DATAPARTITIONID (INTEGER): Data partition ID
  • COMPRESS_ATTR (CHAR(1)): The state of the COMPRESSION attribute on the index, which can be one of the following:
    • 'Y' = Index compression is set to YES
    • 'N' = Index compression is set to NO
  • INDEX_COMPRESSED (CHAR(1)): The physical index format, which can be one of the following:
    • 'Y' = Index is in compressed format
    • 'N' = Index is in uncompressed format
    If the physical index format does not match the compression attribute, an index reorganization is needed to convert the index to the defined format. If the table or index is in error at the time this function is executed, this value is NULL.
  • PCT_PAGES_SAVED (SMALLINT): Percentage of pages saved from compression. If the index is not physically compressed (INDEX_COMPRESSED is 'N'), this value represents the estimated percentage of leaf pages saved, as if the index were actually compressed. If the index is physically compressed (INDEX_COMPRESSED is 'Y'), this value reports the PCTPAGESSAVED value from the system catalog view (SYSCAT.INDEXES or SYSCAT.INDEXPARTITIONS).
  • NUM_LEAF_PAGES_SAVED (SMALLINT): The number of leaf pages saved. If the index is not physically compressed (INDEX_COMPRESSED is 'N'), this value represents the estimated number of leaf pages saved, as if the index were actually compressed. If the index is physically compressed (INDEX_COMPRESSED is 'Y'), this value reports the calculated number of leaf pages saved, based on the PCTPAGESSAVED and NLEAF values from the system catalog view (SYSCAT.INDEXES or SYSCAT.INDEXPARTITIONS). If PCTPAGESSAVED or NLEAF contains an invalid value (-1), this value is set to -1 as well.

Thus, if you wanted to get an estimate of how much storage space will be saved if the data in every index on a table named EMPLOYEE (in a schema named DB2INST1) is compressed, you could do so by executing a query that looks like Listing 7.

Listing 7. Example query to obtain index compression
  SELECT SUBSTR(INDNAME, 1, 20) AS INDNAME,
         COMPRESS_ATTR,
         PCT_PAGES_SAVED
    FROM TABLE(SYSPROC.ADMIN_GET_INDEX_COMPRESS_INFO('T', 'DB2INST1', 'EMPLOYEE', 
                                                     NULL, NULL))

Using the result of this query, you can easily determine whether to enable compression for a particular index. For example, if the previous query returned a result set that looked like Listing 8:

Listing 8. Query results for index compression
INDNAME              COMPRESS_ATTR PCT_PAGES_SAVED
-------------------- ------------- ---------------
EMPLOYEE~0           N                          46
EMPLOYEE~1           N                          57
EMPLOYEE~2           N                          71
EMPLOYEE~3           N                          71
EMPLOYEE~4           N                          45
EMPLOYEE~5           N                          71
EMPLOYEE~6           N                          70
EMPLOYEE~BUT         N                          71

  8 record(s) selected.

it would be easy to see that most of the indexes on the EMPLOYEE table compress very well, with storage space savings of as much as 71 percent.

If you use the ADMIN_GET_INDEX_COMPRESS_INFO() function to analyze indices that are already compressed, the compression savings returned reflect the actual savings achieved, rather than estimates. Alternatively, you can obtain information about the current savings realized by index compression by examining the PCTPAGESSAVED column of the SYSCAT.INDEXES catalog view. (The RUNSTATS command maintains the value of this column.)
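For example, the following query (a sketch; run the RUNSTATS command first so the statistics are current) reports the compression attribute and the actual percentage of pages saved for the indexes on the EMPLOYEE table:

  SELECT SUBSTR(INDNAME, 1, 20) AS INDNAME,
         COMPRESSION,
         PCTPAGESSAVED
    FROM SYSCAT.INDEXES
   WHERE TABSCHEMA = 'DB2INST1' AND TABNAME = 'EMPLOYEE'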


Compression of temporary tables

Starting with DB2 9.7, if you have a license for the DB2 Storage Optimization Feature, you can apply compression to temporary tables. Unlike with row and index compression, you do not have to enable temporary tables for compression. Compression is done automatically and applies to user-defined and system temporary tables. (User-defined global temporary tables come in two variants: created global temporary tables (CGTTs) and declared global temporary tables (DGTTs). System temporary tables are used by some utilities and maintenance operations, such as table reorganization and data redistribution. During query processing, the DB2 database manager may also create and use temporary tables for operations that must accumulate intermediate results, such as sorts, hash joins, or table queues.)

The mechanism used to compress temporary tables is the same as that used for classic row compression with ADC, although the runtime behavior is slightly different from permanent tables. Most temporary tables, especially small ones, do not cause any physical I/O. Thus, the threshold at which the compression dictionary is built is 100 MB instead of 2 MB. The higher threshold ensures that small temporary tables that typically remain fully buffered are not compressed, but that larger temporary tables that might spill to disk contain compressed data. In addition to avoiding physical I/O, compression for temporary tables ensures that large temporary tables use buffer pools more efficiently, which helps to further avoid physical disk I/O.


Adoption strategies

After you determine which set of tables and indices to compress, the next step is to enable these tables and indices for compression. Changing the compression settings for existing tables that are already populated helps slow their growth rate, and any data that is inserted from that point on benefits from the new compression settings. If your goal is to reduce the footprint of your existing data, perhaps because you want to reduce the physical space that is utilized by your database, you must apply compression to that data.

You can use the following strategies to help apply compression to large amounts of existing data, free up disk space currently consumed by that data, and release this space to the file system. You will find this information most useful when you first implement row or index compression or when you are upgrading to DB2 10.1 from earlier releases of DB2.

Applying compression to existing data

The most straightforward way to apply compression to existing data is by performing a classic table reorganization. If you changed the settings for row or value compression, execute the REORG TABLE command with the RESETDICTIONARY option specified. If some of your tables have XML columns, specify the LONGLOBDATA clause as well to compress the non-inlined XML documents. If you changed the compression settings only for indexes, the REORG INDEXES ALL command is sufficient to compress the index data; a full reorganization of the underlying table's data is not necessary. You might consider performing an index reorganization rather than a full reorganization of all table data if, for example, you are upgrading from DB2 9.5 to DB2 10.1 and already have row compression enabled.
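For instance, for a hypothetical table named SALES that contains an XML column and has several indexes, the offline reorganization commands might look like this:

  REORG TABLE sales LONGLOBDATA RESETDICTIONARY

  REORG INDEXES ALL FOR TABLE sales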

If you cannot take your tables offline, consider using the ADMIN_MOVE_TABLE() stored procedure, available since DB2 9.7. You can use this procedure to move data stored in an active table into a new data table object with the same name, which may or may not reside in the same storage location. The ADMIN_MOVE_TABLE() procedure provides a way to effectively reorganize a table in the same way that a classic table reorganization would, but without requiring any downtime — the data remains online and accessible for the duration of the move.

Table-level compression dictionaries that are built (or rebuilt) by the ADMIN_MOVE_TABLE() procedure are of the same quality as those constructed during a classic table reorganization. To build or rebuild a compression dictionary, the ADMIN_MOVE_TABLE() procedure performs the following steps:

  1. Collects a Bernoulli sample of all the rows in the table.
  2. Builds a new table-level compression dictionary from that sample.
  3. Inserts the dictionary into the target table before starting the copy phase, so rows are compressed when inserted into the target table during the copy phase.

The following example shows how you initiate an online table move operation in the most straightforward way.

  CALL SYSPROC.ADMIN_MOVE_TABLE('DB2INST1','ACCTCR',
                           '','','','','','','','','MOVE')

You can use the ADMIN_MOVE_TABLE_UTIL() procedure to exercise fine-grained control over the parameters and behavior of an online table move operation, such as the size of the sample used for dictionary building. However, even if you invoke ADMIN_MOVE_TABLE() in the standard 'MOVE' execution mode, the compression dictionary that is built is optimal.
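
If you want to influence how the dictionary is built, a minimal sketch looks like the following; the parameter key name DEEPCOMPRESSION_SAMPLE (and its unit, kilobytes) is an assumption that you should verify against the documentation for your DB2 release:

  -- 1. Register the move operation (INIT phase)
  CALL SYSPROC.ADMIN_MOVE_TABLE('DB2INST1','ACCTCR',
                           '','','','','','','','','INIT');

  -- 2. Override the size of the sample used for dictionary building
  --    (assumed key name and unit; check your documentation)
  CALL SYSPROC.ADMIN_MOVE_TABLE_UTIL('DB2INST1','ACCTCR',
                           'UPSERT','DEEPCOMPRESSION_SAMPLE','30720');

  -- 3. Complete the remaining phases of the move
  CALL SYSPROC.ADMIN_MOVE_TABLE('DB2INST1','ACCTCR',
                           '','','','','','','','','COPY');
  CALL SYSPROC.ADMIN_MOVE_TABLE('DB2INST1','ACCTCR',
                           '','','','','','','','','REPLAY');
  CALL SYSPROC.ADMIN_MOVE_TABLE('DB2INST1','ACCTCR',
                           '','','','','','','','','SWAP');
  -- Depending on the options in effect, a final CLEANUP phase may also be required.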

Unlike for classic table reorganization, the default behavior for the ADMIN_MOVE_TABLE() procedure is to rebuild the dictionary. Because only a small random sample of the table data is used, rebuilding the dictionary typically does not cause an excessive performance overhead compared to not rebuilding the dictionary.

You might find the ADMIN_MOVE_TABLE() procedure particularly useful when you upgrade your database from an earlier DB2 release. You can use the procedure to move tables between table spaces and change other parameters, such as column data types, at the same time.

Managing disk space consumption

Applying compression to existing tables reduces the number of pages allocated for those tables. However, applying compression does not immediately affect the disk space consumption of DMS or automatic storage table spaces. Those table spaces merely have a larger number of unused extents available to them.

When you first implement row and index compression, it is a recommended practice to reevaluate the utilization of your table space containers and the projected growth rate of your data. With compression enabled, table space utilization can be significantly reduced — it might be a long time before your database grows enough to consume the space initially freed up. Consequently, you might want to reduce table space container sizes and make more disk space available to other consumers.

To effectively reduce the on-disk footprint of a database, you must reduce the size of the table space containers used by the database. The key factor that determines how much you can reduce container sizes is the high-water mark. (The high-water mark is the highest allocated page in a table space; it applies only to DMS table spaces, a category that also includes automatic storage table spaces.)

The high-water mark can increase when you create tables, extend existing tables, or perform a classic table reorganization that does not use a temporary table space. You can reduce the high-water mark by dropping or truncating the table that owns the extent at the high-water mark. However, dropping, truncating, or reorganizing other objects does not affect the high-water mark, and merely results in creating more unused extents below the high-water mark. Those unused extents below the high-water mark can be reused only when you create new objects or when objects grow. Only unused extents above the high-water mark can be reclaimed and returned to the operating system. Consequently, the key to reducing the on-disk footprint is to lower the high-water mark and truncate the table space containers. How you perform this task is dependent upon what kind of table spaces you are using, which DB2 release you used to create these table spaces, and which DB2 release you are using. (Techniques for lowering the high-water mark and freeing up space are presented in the Returning unused space to the file system section of this article.)

Verifying row compression and avoiding common pitfalls

As mentioned, you can obtain the current compression settings for tables and indices by querying the system catalog. In the SYSCAT.TABLES view, the COMPRESSION and ROWCOMPMODE columns contain information about the current compression settings for each table. The COMPRESSION column in the SYSCAT.INDEXES view provides information about the compression settings for each index. The compression statistics discussed below (such as PCTPAGESSAVED) are populated by the RUNSTATS command, so make sure your table statistics are current before you evaluate them.

Other columns in the SYSCAT.TABLES view provide deeper insight into how row compression performs; a sample query follows this list:

  • The most important metric is found in the PCTPAGESSAVED column. This indicates how much space you are currently saving on a table by means of row compression.
  • From the PCTROWSCOMPRESSED column, you can determine how many rows in the table are compressed.
  • The AVGCOMPRESSEDROWSIZE column provides the average size of all the compressed rows. In most practical scenarios, the percentage of compressed rows should be near 100 percent; in that case, the average size of compressed rows is virtually identical to the average row size that is maintained in the AVGROWSIZE column.
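
For example, a query along the following lines (using DB2INST1.ACCTCR as a hypothetical table) returns the settings and statistics described above; keep in mind that the statistic columns are only as current as the last RUNSTATS:

  SELECT SUBSTR(TABNAME, 1, 20) AS TABNAME,
         COMPRESSION, ROWCOMPMODE,
         PCTPAGESSAVED, PCTROWSCOMPRESSED,
         AVGROWSIZE, AVGCOMPRESSEDROWSIZE
    FROM SYSCAT.TABLES
   WHERE TABSCHEMA = 'DB2INST1'
     AND TABNAME = 'ACCTCR'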

If the percentage of compressed rows in a table is significantly lower than 100 percent, this situation might be due to one of a few common issues:

  • Your table might be in a regular table space, and compression might have caused the average row size to become very short. For a table in a regular table space, the effective minimum row size is limited by the fact that each page can hold at most 255 rows. If your average row size is near or below that limit (the page size divided by 255), consider converting the regular table space to a large table space. To do so, execute the ALTER TABLESPACE statement with the CONVERT TO LARGE option specified. Or use the ADMIN_MOVE_TABLE() procedure to move your table to a large table space.
  • If you enable row compression on an existing table that already contains data, the table-level dictionary is built automatically by means of ADC when the table grows. Existing data remains in its uncompressed form until you perform a classic table reorganization or use the ADMIN_MOVE_TABLE() procedure.
  • When you create a table with classic row compression enabled and start populating the table, the table-level dictionary is built automatically when the table size reaches the ADC threshold of 2 MB. However, the rows in the very beginning of the table remain uncompressed. If your table is only a few MB in size, this initial chunk of uncompressed rows can account for a significant fraction of the total rows that are stored in the table.

If none of the previous situations apply, determine whether the table contains random data. For example, a table might have a large CHAR FOR BIT DATA or inlined LOB column that contains binary data compressed in some form by the application layer. In such cases, row compression might not be able to compress the data any further and should not be enabled on that table.
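
Before enabling row compression on such a table, you can also estimate what compression is likely to achieve. In DB2 10.1, the ADMIN_GET_TAB_COMPRESS_INFO() table function reports estimated savings; the query below is a sketch, and the exact column names should be verified against the documentation for your release:

  SELECT SUBSTR(TABNAME, 1, 20) AS TABNAME,
         PCTPAGESSAVED_CURRENT,
         PCTPAGESSAVED_STATIC,
         PCTPAGESSAVED_ADAPTIVE
    FROM TABLE(SYSPROC.ADMIN_GET_TAB_COMPRESS_INFO('DB2INST1', 'ACCTCR'))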

Pay special attention when you apply row compression to MDC tables. Cell size and data density are important considerations for the physical design of MDC tables and play an important role in choosing appropriate dimensions. In some cases, clustering dimensions are chosen so that cells use a very small number of pages. Applying row compression reduces the size of each record and the average size of the cells. If row compression causes the average cell size to be near or below the size of a single block, you might not see storage space savings reach their full potential. This is because space allocation in MDC tables is done blockwise. Partially used blocks consume a full extent on disk but can contain a significant portion of free space. If you know that you will be using row compression, remember to factor in the compression ratio when you do the calculations for the cell size projection. If you decide to apply row compression on existing MDC tables, revalidate the cell size calculations. You might find it necessary to coarsify some of the dimensions or reduce the extent size of your table spaces.
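
As a hypothetical illustration: suppose a dimension cell currently averages 10 pages of data and the block (extent) size is 8 pages, so most cells occupy two blocks. If row compression reduces the data by 60 percent, the average cell shrinks to about 4 pages, yet it still allocates a full 8-page block. Roughly half of every block is then unusable free space, and the realized storage savings fall well short of the 60 percent that the compression ratio alone would suggest.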

Verify that index compression is enabled for all the indices you considered for compression. Pay special attention to system-generated indices, such as the indexes that are implicitly created for primary key and unique constraints. If you create a table without enabling row compression and subsequently use the ALTER TABLE statement to turn row compression on, existing indices that were created before the table was enabled for compression remain uncompressed; you must explicitly enable index compression for those indices by executing the ALTER INDEX statement. For indices that are enabled for compression, the PCTPAGESSAVED column provides information about the storage space savings. The query and statements that follow illustrate this check.
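
A quick way to spot such indices is to compare each compressed table with the compression settings of its indices. The following sketch uses hypothetical object names; the catalog values tested ('S', 'A', and 'N') are the documented settings for SYSCAT.TABLES.ROWCOMPMODE and SYSCAT.INDEXES.COMPRESSION:

  -- Find compressed tables that still have uncompressed indices
  SELECT SUBSTR(I.TABNAME, 1, 20) AS TABNAME,
         SUBSTR(I.INDNAME, 1, 30) AS INDNAME
    FROM SYSCAT.INDEXES I
    JOIN SYSCAT.TABLES T
      ON T.TABSCHEMA = I.TABSCHEMA AND T.TABNAME = I.TABNAME
   WHERE T.ROWCOMPMODE IN ('S', 'A')
     AND I.COMPRESSION = 'N';

  -- Enable compression for such an index, then reorganize the indexes
  ALTER INDEX db2inst1.acctcr_pk COMPRESS YES;
  REORG INDEXES ALL FOR TABLE db2inst1.acctcr;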


Returning unused space to the file system

After reducing the amount of storage space needed to house a table's data with deep compression, you may wish to resize the table space in which the table resides so that any reclaimed storage space can be used elsewhere. If SMS table spaces are used, any reclaimed storage space is returned to the file system automatically as part of the table reorganization process. However, that's not the case when DMS table spaces are used. Instead, DMS table spaces (and automatic storage table spaces, which are, by default, DMS table spaces) can be resized by executing an appropriate form of the ALTER TABLESPACE SQL statement. For example, to reduce by 200 MB the amount of storage space allocated for each container being used by a table space named TBSP1, you would execute an ALTER TABLESPACE statement like this one: ALTER TABLESPACE tbsp1 REDUCE (ALL CONTAINERS 200 M).

You could also reclaim space by dropping one or more storage containers as opposed to resizing all of them. However, the amount of storage space that can actually be returned to the file system for redistribution is dependent upon the location of the high-water mark of the table space being resized.

How the high-water mark affects space reallocation

Whenever a row of data is stored in a table, DB2 assigns that row a unique record identifier, known as a RID. Prior to DB2 9, RIDs consisted of a 3-byte page number and a 1-byte slot number. The slot number is an index into a slot directory, which in turn contains the offset into the data page where the row's data physically resides; the page number identifies the data page itself. DB2 9 and later support both the original 4-byte RIDs and larger 6-byte RIDs, which consist of a 4-byte page number and a 2-byte slot number (tables that reside in a DMS LARGE table space use 6-byte RIDs). As a result, a data page can now contain more than 255 rows (the old limit with 4-byte RIDs), and a single table partition can grow to 2 TB in size when a 4K page is used (16 TB when a 32K page is used). In DB2 9 and later, 6-byte RIDs are used by default for any new automatic storage and DMS table spaces that you create.

DB2 indices use table space-relative RIDs to reference base table records. (An index is an ordered set of pointers that refer to rows in a base table.) This means that every key in an index points to slot X in page Y in a table space, not to slot X in page Y of a base table. As a result, DB2 can't freely move extents around in a table space. If it did, RIDs stored in indices would point to the wrong data, and every index that referenced data in the table space would have to be rebuilt any time a move occurred.

Using table space-relative RIDs improves performance because the extra step of figuring out where a particular page of a table resides in a table space, which would have to be done by scanning the table's extent map pages, is eliminated. Because DB2 can't alter the placement of extents in a table space, only unused storage space that comes after the high-water mark can be freed.

How you perform the offline table reorganization operation required to compress a table's data can have a significant effect on a table space's high-water mark. If you use a temporary table space to reorganize and compress a table's data, the resulting version of the table is constructed in the temporary table space, its data pages are copied over the original table object in the original table space, and the storage space used by the table is truncated. Figure 7 shows how a table space containing four tables (named TABLE_1, TABLE_2, TABLE_3, and TABLE_4) might look before and after all four tables in it have been reorganized and compressed using a temporary table space.

Figure 7. How a table space's high-water mark is affected by table reorganization operations that use a temporary table space

Once the data in all four tables has been compressed, more free space is available; however, that free space is scattered throughout the table space and the high-water mark hasn't been lowered by any significant amount.

If a temporary table space isn't used to reorganize and compress a table's data, the new version of the table is created in the same table space as the original table, and the original table is eventually removed. Because of this behavior, the order in which tables are reorganized and compressed, together with how those tables are currently laid out in the table space (and unfortunately, there is no easy way to determine this), has a huge impact on the final table layout within the table space and on the placement of the table space's high-water mark. Figure 8 shows how our original table space containing four tables (TABLE_1, TABLE_2, TABLE_3, and TABLE_4) might look before and after all four tables in it have been reorganized and compressed (without using a temporary table space) in the following order: TABLE_1, TABLE_2, TABLE_3, and TABLE_4.

Figure 8. How a table space's high-water mark is affected by table reorganization operations that do not use a temporary table space

In this case, we end up with a high-water mark higher than it was when we began the reorganization and compression operations. However, if we were to reorganize and compress table TABLE_1 again, the table would be moved to the beginning of the free space in the table space and the high-water mark would be lowered significantly. Figure 9 shows how the table space would look after a second reorganization and compression of TABLE_1 is performed.

Figure 9. How a table space's high-water mark is lowered by performing a second reorganization/compression operation on table TABLE_1

Note that the examples in Figures 7, 8, and 9 are based on the assumption that TABLE_1, TABLE_2, TABLE_3, and TABLE_4 were created and populated one at a time and that the table space TBSP1 has been fully compacted. In reality, this example represents the worst-case scenario. When more than one table exists in a single table space, it is more common to have extents from different tables interleaved throughout the table space. Assessing the degree of extent fragmentation becomes more complex, but the fragmentation is probably less drastic than depicted in these illustrations. Therefore, the actual impact on the high-water mark will most likely be significantly less severe than shown.

When you start adding things like index objects and large objects (LOBs) to the same table space, things get even more confusing, and it's difficult to predict exactly what will happen. You can use the command db2dart /DHWM to show the extent layout for all tables in a table space and the command db2dart /LHWM to interpret this layout and generate a set of steps you can follow to lower the high-water mark. However, the only way to guarantee that things are laid out as compactly as possible is to rebuild the entire table space from scratch by unloading and reloading its tables. Then, and only then, will you be able to return all of the storage space reclaimed by compression to the operating system for redistribution.
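
From an operating system prompt, an invocation might look like the following; the /TSI option supplies the table space ID, and the exact option syntax should be confirmed with db2dart's own help output because db2dart prompts interactively for any input you omit:

  # Dump the extent layout for table space ID 4 of database TDB2
  db2dart TDB2 /DHWM /TSI 4

  # Generate suggested steps for lowering that table space's high-water mark
  db2dart TDB2 /LHWM /TSI 4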

Reclaimable storage in DB2 9.7 and later

Non-temporary DMS and automatic storage table spaces you create in DB2 9.7 or later support reclaimable storage. For those table spaces, the database manager can change the physical location of extents in the containers. This mechanism, called extent movement, can be used to consolidate unused space in table space containers to the physical ends of the containers. After consolidation, the high-water mark can then be lowered, and the table space containers can subsequently be shrunk to the smallest possible size. Figure 10 shows how a table space might look before and after extent movement is used to reclaim unused storage in an automatic storage table space.

Figure 10. How an automatic storage table space's high-water mark is lowered via extent movement

With reclaimable storage, you can use the ALTER TABLESPACE statement to lower the high-water mark and shrink table space containers. The process of lowering the high-water mark involves extent movement and can take some time. After the high-water mark is lowered, containers can be reduced in size by returning unused extents above the high-water mark to the file system.

To determine how much the size of a DMS or automatic storage table space can be reduced, execute the MON_GET_TABLESPACE() table function, as shown below.

Listing 9. Query to estimate reduction of DMS or automatic storage space
  SELECT SUBSTR(TBSP_NAME, 1, 15) AS TBSP_NAME,
                TBSP_FREE_PAGES,
                TBSP_PAGE_SIZE * TBSP_FREE_PAGES / 1024 / 1024
                  AS TBSP_RECLAIMABLE_MB
    FROM TABLE(MON_GET_TABLESPACE('TBSP_USR_D_1', NULL))

The following sample result set indicates that the table space TBSP_USR_D_1 can be reduced by about 23 million pages, which amounts to roughly 365,000 MB (about 357 GB) of on-disk footprint.

Listing 10. Query results
  TBSP_NAME       TBSP_FREE_PAGES      TBSP_RECLAIMABLE_MB 
  --------------- -------------------- --------------------
  TBSP_USR_D_1                23369760               365150

    1 record(s) selected.

You can easily shrink automatic storage table spaces to their most compact format by using the ALTER TABLESPACE statement with the REDUCE MAX clause: ALTER TABLESPACE tbsp_usr_d_1 REDUCE MAX.

For DMS table spaces, lowering the high-water mark and reducing containers is a two-step process, as shown below:

  ALTER TABLESPACE ts_dat_d LOWER HIGH WATER MARK;
  ALTER TABLESPACE ts_dat_d REDUCE (ALL CONTAINERS 100 G);

When you upgrade your database from an older DB2 release to DB2 9.7 or later, consider taking the following steps to get the benefits of reclaimable storage:

  1. Create new table spaces for all your automatic storage or DMS table spaces.
  2. Enable row and index compression on the tables.
  3. Use ADMIN_MOVE_TABLE() to move your tables from the old to the new table spaces.

You should enable row and index compression before moving the tables, so they get compressed as part of the move process.
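
A sketch of these steps for a single table follows; all object names are hypothetical, and the new table space is created with default settings, which in an automatic storage database yields a large DMS table space with reclaimable storage:

  -- 1. Create a new table space that supports reclaimable storage
  CREATE TABLESPACE TBSP_USR_D_2;

  -- 2. Enable compression on the table before moving it
  ALTER TABLE db2inst1.acctcr COMPRESS YES ADAPTIVE;

  -- 3. Move the table online into the new table space; the data is
  --    compressed as it is copied
  CALL SYSPROC.ADMIN_MOVE_TABLE('DB2INST1','ACCTCR',
                           'TBSP_USR_D_2','TBSP_USR_D_2','TBSP_USR_D_2',
                           '','','','','','MOVE');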


Compression and the Load utility

Another significant change introduced in DB2 9.5 is how load operations behave when performed against tables enabled for deep compression. In DB2 9, if a compression dictionary had been created for a table, the Load utility would use that dictionary to compress data as it was being loaded. However, if no compression dictionary was present, the Load utility would not build one as part of the load operation. Beginning with DB2 9.5, the Load utility can construct a compression dictionary provided the table being loaded has been enabled for compression and that a LOAD REPLACE operation is performed. Such an operation is initiated by executing the LOAD command with either the REPLACE KEEPDICTIONARY or the REPLACE RESETDICTIONARY option specified. (A LOAD INSERT operation can also result in the creation of a compression dictionary if the table being loaded has been configured for compression and the amount of data coming in triggers ADC.)

If the LOAD command is executed with the REPLACE KEEPDICTIONARY or the REPLACE RESETDICTIONARY option specified, and a compression dictionary does not exist, a new dictionary will be created. With the KEEPDICTIONARY option, the amount of data that is required to build the compression dictionary is subject to the policies of ADC, so some of the data will be stored in the table uncompressed. Once the dictionary has been created, the remaining data that is loaded will be compressed using the new compression dictionary. With the RESETDICTIONARY option, on the other hand, the amount of data required to build the dictionary is not subject to the policies of ADC, and a compression dictionary can be built after loading just one row.

If the LOAD command is executed with the REPLACE KEEPDICTIONARY or the REPLACE RESETDICTIONARY option specified, and a compression dictionary already exists, the existing dictionary will either be recreated (RESETDICTIONARY) or left as it is (KEEPDICTIONARY) and data in the table will be compressed using the existing or new dictionary.

To create a new compression dictionary for a compression-enabled table named EMPLOYEE while performing a load operation, you would execute a command similar to this: LOAD FROM datafile.del OF DEL REPLACE RESETDICTIONARY INTO employee.

When this command is executed, assuming no compression dictionary exists for the EMPLOYEE table, the first few records found in the file DATAFILE.DEL will be loaded into the EMPLOYEE table uncompressed. As soon as enough data has been loaded to build a dictionary, the Load utility constructs one from that data, and the remaining records are compressed and written to the table directly behind the compression dictionary as they are loaded.


Compression of backup images and log archives

Since DB2 Universal Database 8, you have been able to compress your backup images. Starting with DB2 10.1, you can also apply compression to your log archives. These functions are available regardless of whether you have a license for the DB2 Storage Optimization Feature. (The default algorithm that is used to compress backup images and archived logs is similar to the one used by the UNIX® compress(1) utility.)

Backup compression

You can enable backup compression independently for each backup image. When performing the backup operation, specify the COMPRESS clause, as in the following example: BACKUP DATABASE TDB2 TO /vol_aux/backups/ COMPRESS.

If you use row and index compression and you follow the guidelines to reduce the container sizes for your table spaces, the overall size of your database might be significantly reduced. Consequently, your backup images are automatically smaller than they would be if row and index compression were disabled. At the same time, the extra space savings you can achieve by applying compression on backup images might be greatly reduced. This particularly applies when you have many tables that use adaptive row compression. However, compression on backup images can also compress metadata, LOBs, the catalog tables, and other database objects that cannot be compressed by any other means.

When deciding whether to compress your backup images, the most important indicators are the CPU and disk utilization rates while the backup is in progress. Backup compression can cause a significant overhead in CPU consumption. You should use backup compression only if the backup process is bottlenecked on I/O. This can happen, for example, if you store your backup images on a volume that is not striped across different physical disks or if you back up to network-attached or non-local storage media. In those cases, you increase CPU utilization throughout the backup process but the savings in I/O overhead might effectively shorten the backup window.

If you store your backups on Tivoli® Storage Manager (TSM) software, you should use the compression and de-duplication functionality built into TSM. The de-duplication logic can lead to significantly better overall storage space savings if you keep a series of multiple recent backup images.

If only some of your tables use row and index compression, it is a good practice to separate the tables and indices into different table spaces, based on their compression settings. Also, consider taking backups at the table space level rather than at the database level. In this case, you can balance the compression settings and compress only the backups for those table spaces that contain tables and indices that are not compressed.
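
For example, if the tables that are not row-compressed reside in a dedicated table space (here the hypothetical TBSP_UNCOMP_DAT), you might compress only that backup image; keep in mind that table space-level backups require the database to use archive logging:

  -- Compress the backup of the table space that holds uncompressed tables
  BACKUP DATABASE TDB2 TABLESPACE (TBSP_UNCOMP_DAT)
     TO /vol_aux/backups/ COMPRESS;

  -- Back up table spaces that already use row and index compression
  -- without backup compression
  BACKUP DATABASE TDB2 TABLESPACE (TBSP_COMP_DAT)
     TO /vol_aux/backups/;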

Archive log compression

If you configured your database for log archiving, you can enable compression for those log archives through database configuration parameters. You can enable or disable log archive compression independently for the primary or secondary archiving method and the corresponding archive locations. Log archive compression requires the database manager to handle the archiving process, which is to say that you must set your archiving method to DISK, TSM, or VENDOR. The following example shows how to set up a primary, disk-based log archive with compression enabled:

  UPDATE DB CFG FOR TDB2 USING LOGARCHMETH1 DISK:/vol_aux/archive/tdb2;
  
  UPDATE DB CFG FOR TDB2 USING LOGARCHCOMPR1 ON;

For the secondary archiving method, you can enable compression in a similar fashion.
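
For example, a TSM-based secondary archive with compression enabled might be configured as follows (using TSM as the second archiving method is an assumption for illustration):

  UPDATE DB CFG FOR TDB2 USING LOGARCHMETH2 TSM;
  
  UPDATE DB CFG FOR TDB2 USING LOGARCHCOMPR2 ON;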

Log archive compression, after you enable it, is fully automatic. The database manager automatically compresses log extents as they are moved from the active log path to the archive location. Upon retrieval of log files, which can happen during ROLLBACK and ROLLFORWARD operations, the database manager automatically expands compressed log files as they are moved from the archive into the active or overflow log path. If, during recovery, the database manager encounters a compressed log extent in the active or overflow log path, it automatically expands that extent. This can happen, for example, if you manually retrieve log extents from the archive location into the active or overflow log path.

When deciding whether to enable compression on your log archives, considerations similar to those for backup compression apply. Check the I/O bandwidth available for your archive location and see whether this is the bottleneck during the process of archiving.

Consider enabling compression if your log archives are on non-local volumes or volumes not striped across a number of disks. In addition, consider enabling compression if archiving your logs to an on-disk staging location for subsequent transfer onto tape through the db2tapemgr utility. In this case, compression helps reduce the time that it takes to transfer archived logs to tape.

If archiving to TSM, consider using its built-in compression facilities. The behavior for log archives is similar to that of the compression facilities in DB2 10.1 because the same algorithm is used. Unlike backup compression, there is generally no additional benefit from de-duplication.

When you use row and index compression, some of the data written to the transaction log is already in compressed format. Therefore, the amount of data being logged is less than would be necessary if you performed the same operations on uncompressed tables. However, even if you are widely using row and index compression, there is generally enough metadata and other uncompressed items in the transaction log that you can achieve significant storage space savings on almost any log archive.


Acknowledgement

The authors would like to thank Bill Minor for his contributions to this article.

