Deciding whether to compress data
Consider the following factors before you decide whether to compress data:
- Data row size
- DB2® compresses data one record at a time. (The prefix of the record is not compressed.)
As row lengths become shorter, compression yields diminishing returns
because 8 bytes of overhead are required to store each record in a
data page. On the other hand, when row lengths are very long, compression
of the data portion of the row might yield little or no reduction
in data set size because DB2 rows
cannot span data pages. In the case of very long rows, using a larger
page size can enhance the benefits of compression, especially if the
data is accessed primarily in a sequential mode.
If compressing the record produces a result that is no shorter than the original, DB2 does not compress the record.
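The break-even arithmetic for short rows can be sketched as follows. This is a hypothetical illustration, not a DB2 interface; the only figure taken from the text is the 8 bytes of per-record overhead, and the compression ratio is an assumption:

```python
def relative_savings(row_len: int, ratio: float, overhead: int = 8) -> float:
    """Fraction of page space saved for one record.

    ratio is the assumed compressed/original size of the data portion;
    overhead is the fixed 8 bytes each record costs in a data page.
    """
    stored_before = row_len + overhead
    stored_after = row_len * ratio + overhead
    return 1 - stored_after / stored_before

# The same 50% data compression yields noticeably less benefit for a
# short row, because the fixed per-record overhead dominates.
short_row = relative_savings(20, 0.5)    # about 0.36
long_row = relative_savings(200, 0.5)    # about 0.48
```

The diminishing returns for short rows fall directly out of the fixed overhead term in the denominator.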
- Table space size
- Compression can work very well for large table spaces. With small table spaces, the size of the compression dictionary (64 KB) can offset the space savings that compression provides.
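A rough back-of-the-envelope comparison makes the point. The 64 KB dictionary size is from the text; the compression ratio is an assumption:

```python
DICTIONARY_BYTES = 64 * 1024  # one compression dictionary per table space or partition

def net_savings(table_space_bytes: int, ratio: float) -> int:
    """Bytes saved after subtracting the dictionary, assuming the data
    shrinks to `ratio` of its original size."""
    return int(table_space_bytes * (1 - ratio)) - DICTIONARY_BYTES

small = net_savings(128 * 1024, 0.7)        # negative: dictionary outweighs savings
large = net_savings(1024 * 1024 * 1024, 0.7)  # 1 GB: savings dwarf the dictionary
```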
- Processing costs
- Decompressing a row of data costs significantly less than compressing that same row. The access path that DB2 chooses also affects the processor cost of data compression. In general, the relative overhead of compression is higher for table space scans and lower for index access.
- I/O costs
- When rows are accessed sequentially, fewer I/Os might be required
to access data that is stored in a compressed table space. However,
the reduced I/O resource consumption is traded for extra processor
cost for decoding the data.
- If random I/O is necessary to access the data, the number of I/Os does not decrease significantly, unless the associated buffer pool is larger than the table and the other applications require little concurrent buffer pool usage.
- Data patterns
- The frequency of patterns in the data determines the compression savings. Data with many repeated strings (such as state and city names or numbers with sequences of zeros) results in good compression savings. Some types of data compress better than others: data that contains hexadecimal characters or strings that occur with high frequency compresses quite well, while data that contains random byte frequencies might not compress at all. For example, textual and decimal data tends to compress well because certain byte strings occur frequently.
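The effect of data patterns can be demonstrated with any generic dictionary-based compressor; here zlib stands in as an illustration only, not as DB2's actual compression algorithm:

```python
import os
import zlib

# Pattern-heavy data (state/city-like strings with runs of zeros)
# versus the same number of random bytes.
repetitive = b"CA SACRAMENTO 00000;NY ALBANY 00000;" * 500
random_bytes = os.urandom(len(repetitive))

compressed_rep = zlib.compress(repetitive)
compressed_rnd = zlib.compress(random_bytes)

# The patterned data shrinks dramatically; the random data barely
# changes size (incompressible input can even grow slightly).
print(len(repetitive), len(compressed_rep), len(compressed_rnd))
```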
- Table space design
- Each table space or partition that contains compressed data has
a compression dictionary. The compression dictionary
is built when you populate the table space with data.
The dictionary contains a fixed number of entries, usually 4096, and resides with the data. The dictionary content is based on the data at the time it was built, and does not change unless the dictionary is rebuilt or recovered, or compression is disabled with ALTER TABLESPACE.
If you use the REORG utility to build the compression dictionary, DB2 uses a sampling technique to build the dictionary. This technique uses the first n rows from the table space and then continues to sample rows for the remainder of the UNLOAD phase. The value of n is determined by how much your data can be compressed. In most cases, this sampling technique produces a better dictionary and might produce better results for table spaces that contain tables with dissimilar kinds of data.
Otherwise, DB2 uses only the first n rows added to the table space to build the contents of the dictionary.
If you have a table space that contains more than one table, and the data used to build the dictionary comes from only one or a few of those tables, the data compression might not be optimal for the remaining tables. Therefore, put a table that you want to compress into a table space by itself, or into a table space that only contains tables with similar kinds of data.
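The REORG sampling described above might be sketched roughly like this. This is a hypothetical illustration: the sampling interval is an assumption, and DB2 actually adjusts the value of n based on how compressible the data is:

```python
import itertools
from typing import Iterable, List

def sample_for_dictionary(rows: Iterable[bytes], n: int, step: int = 100) -> List[bytes]:
    """Take the first n rows, then continue to sample every `step`-th row
    for the remainder of the scan (a stand-in for the UNLOAD phase)."""
    it = iter(rows)
    sample = list(itertools.islice(it, n))          # first n rows
    sample.extend(row for i, row in enumerate(it)   # keep sampling afterward
                  if i % step == 0)
    return sample
```

Because the sample keeps drawing rows throughout the scan, a table space that holds dissimilar tables contributes data from all of them, which is why this technique tends to produce a better dictionary than using only the first n rows.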
- Existing exit routines
- An exit routine is executed before data is compressed or after it is decompressed, so you can use DB2 data compression with your existing exit routines. However, do not use DB2 data compression in conjunction with DSN8HUFF, a sample edit routine that is provided with DB2 and compresses data by using the Huffman algorithm. DSN8HUFF adds little additional compression at the cost of significant extra CPU processing.
- Logging effects
- If a data row is compressed, all data that is logged because of
SQL changes to that data is compressed. Thus, you can expect less
logging for insertions and deletions; the amount of logging for updates
varies. Applications that are sensitive to log-related resources can
experience some benefit with compressed data.
External routines that read the DB2 log cannot interpret compressed data without access to the compression dictionary that was in effect when the data was compressed. However, using IFCID 306, you can cause DB2 to read log records of compressed data in decompressed format. You can retrieve those decompressed records by using the IFI function READS.
- Distributed data
- DB2 decompresses data before transmitting it to VTAM®.