Introducing DB2 9, Part 1: Data compression in DB2 9

Reduce storage requirements, improve I/O efficiency, and provide quicker access to data from disk with the advanced data compression in IBM® DB2® 9 "Venom" technology. This article takes a look at how it works, explains the benefits, and shows sample results of space savings. For more information on the new features of DB2 9, read other articles in the series.

Share:

Rav Ahuja (rsahuja@ca.ibm.com), DB2 Program Manager, IBM

Rav Ahuja is a worldwide DB2 program manager based at the IBM Toronto Lab. He has been working with DB2 for Linux, UNIX, and Windows since version 1 and has held various roles in DB2 development, technical support, marketing, and product strategy. He works with customers and partners around the globe helping them build and benefit from DB2 and services-based solutions. Rav is a frequent contributor to DB2 papers, articles and books. He holds a Computer Engineering degree from McGill University and MBA from University of Western Ontario.



24 May 2006

Also available in Russian

Introduction

Disk storage systems can often be the most expensive components of a database solution. For large warehouses or databases with huge volumes of data, the cost of the storage subsystem can easily exceed the combined cost of the hardware server and the data server software. Therefore, even a small reduction in the storage subsystem can result in substantial cost savings for the entire database solution.

This article introduces DB2 9 (formerly codenamed "DB2 Viper") Venom technology that compresses row data to reduce storage requirements, improve I/O efficiency, and provide quicker data access from the disk.

Data row compression - how it works

"Venom" technology in DB2 9 uses a dictionary based algorithm for compressing data records. That is, DB2 9 can compress rows in database tables by scanning tables for repetitive, duplicate data and building dictionaries that assign short, numeric keys to those repetitive entries. Text data tends to compress well because of recurring strings as well as data with lots of repeating characters, or leading or trailing blanks.

DB2 examines entire rows for repeating entries or patterns, and not just particular fields or parts of rows. Take for example the following two rows:

Table 1. Example rows
NameDeptSalaryCityStateZipCode
Fred50010000PlanoTX24355
John50020000PlanoTX24355

In this example, not only the repeating values (of "500") in the Dept column are compressed, but the repeating pattern (of "Plano, TX, 24355") that spans the City, State, and ZipCode columns is also compressed as a single value. Figure 1 compares how DB2 would store the row normally and in compressed formats:

Figure 1. Comparison of uncompressed and compressed data storage
Comparison of uncompressed and compressed data storage

The dictionary for compression/decompression lookup is stored in hidden objects in the database, occupies little space, and is cached in memory for quick access. Even for extremely large tables, the compression dictionary is typically in the order of 100 KB. There can be instances when certain data sets do not compress well or data size conditions that don't result in much compression. DB2 has intelligent algorithms to determine such scenarios and does not perform compression when it does not yield any disk-space-saving benefits.

The data row compression feature in DB2 for Linux®, UNIX®, and Windows® is similar to the compression capabilities available on DB2 for z/OS®; however, it differs from the page-level compression techniques offered by other database vendors, where a compression dictionary is built for each page in the database. By building a compression dictionary at the table rather than page level, patterns across the entire table are analyzed, generally resulting in improved disk savings with DB2.

Enabling compression

Data row compression in DB2 can be turned on when tables are created using the COMPRESS YES option. It can also be enabled for an existing table using the ALTER TABLE command. For example:

CREATE TABLE Sales COMPRESS YES

Or,

ALTER TABLE Sales COMPRESS YES

The compression takes effect only once the table dictionary is built, which is usually during the table REORG phase.

Or,

REORG TABLE Sales

When compressing a large table, it may be useful to populate the table with a small set of "representative" or sample data first. The process of building the compression dictionary can be fairly quick using a small data set. And if the set is a good representative sample, the compression will work well even on new data that is added to the table, without DB2 having to analyze the new data. If, however, the type of data stored in the table evolves over time, the dictionary can be kept up to date using REORG.

Estimating savings

DB2 provides the INSPECT tool in order to help you determine the compression ratio estimate for a particular table or data set. This tool collects a sample of the table data, and builds a compression dictionary from it. This dictionary is then used to test compression against the records contained in the sample. From this test, compression savings are estimated.

INSPECT ROWCOMPESTIMATE TABLE NAME Sales … 
		RESULTS KEEP <filename>
	
db2inspf <filename> <outfile>

The output of the INSPECT tool needs to be formatted using a DB2 utility to see the results. The files are found in the db2dump directory. Sample output is illustrated below.

  DATABASE: TEST  
  VERSION : SQL09010 
  2005-12-01-13.12.21.232959

  Action: ROWCOMPESTIMATE TABLE
  Schema name: RSAHUJA 
  Table name: Sales 
  ... 
      Percentage of bytes saved from compression: 66 
      Compression dictionary size: 2176 bytes. 
  ...

Benefits of data row compression

"Venom" compression technology is capable of storage cost savings by up to 50% or more by reducing the amount of disk space (and disk sub-system peripherals) required for storing data. The size of database logs can also be reduced since DB2 compresses user data within log records.

This technology can also improve performance in some scenarios, despite compression/decompression entailing CPU overhead. Accessing data from the disk is the slowest database operation. By storing compressed data on disk, fewer I/O operations need to be performed on the disk to retrieve or store the same amount of data. Therefore, for disk I/O-bound workloads (for instance, when the system is waiting/idling for data to be accessed from the disk), the query processing time can be noticeably improved.

Furthermore, DB2 keeps the data compressed on both disk and memory (DB2 buffer pools), thereby reducing the amount of memory consumed, and freeing it up for other database or system operations. This can further improve database performance for queries and other operations.

Sample results

The amount of space savings using the data row compression feature in DB2 can vary depending on the data. Customers using beta versions of DB2 9 have reported savings in excess of 50% and up to 80% for certain large database installations. For one customer data set, a 179.9GB table using 32KB pages was reduced to only 42.5GB: a savings of 76.4%!

The following chart illustrates a few other examples of space savings with the DB2 data row compression feature on different tables:

Figure 2. Examples of space savings with DB2 data row compression
Examples of space savings with DB2 data row compression

According to Sean McCown of InfoWorld:

"The new compression method translates into direct storage savings at an average ratio of 45 to 75 percent, depending on the data type. To test this, I created a 40GB table (approximately 500 million rows) with mixed numeric and text data, exported it to a text file, and then imported it into a compressed table format. The size of the compressed table was about 17.75GB -- a savings of 56 percent."

Refer to the Resources section to link to the complete article.

Other forms of compression in DB2

Besides data row compression introduced as part of "Venom" technology, DB2 provides additional mechanisms for reducing storage requirements even further. These include:

NULL and default value compression

Available in DB2 for Linux, UNIX, and Windows since Version 8, with this type of compression no-disk storage is consumed for NULL values, zero length data in variable length columns and system default values.

Database backup compression

Available since Fixpak 4 of DB2 for Linux, UNIX, and Windows V8, this compression feature results in smaller backup images that not only reduce backup storage requirements, but also make it easier to move backups between systems.

XML tag substitution

The verbose nature of XML implies that XML fragments and documents typically consume a lot of disk space. DB2 9 stores XML data in a parsed hierarchical format, replacing tag names (for example, employee) with integer values (such as 4). Repeating occurrences of the same tags are assigned the same numeric values. Storing text -rich tags using integer values not only reduces space consumption but also assists with higher performance when querying data. Moreover, the XML tag parsing (replacing with integer values), like data row compression, is done under the covers and is completely transparent to users and applications.

Multidimensional clustering

A form of index compression has been available in DB2 for Linux, UNIX, and Windows since V8.1. Significant index space savings can be achieved through block indexes, where one key (or index entry) per thousands of records is used (rather than one key per record with traditional indexes).

Conclusion

Data-row compression in DB2 9 is a remarkable new feature that can result in substantial disk space and storage cost savings. Download DB2 9 and try out this feature for yourself to see how much you can save.

Resources

Learn

Get products and technologies

  • Download DB2 9 to try out the features described in this article.
  • Now you can use DB2 for free. Download DB2 Express-C, a no-charge version of DB2 Express Edition for the community that offers the same core data features as DB2 Express Edtion and provides a solid base to build and deploy applications.
  • Build your next development project with IBM trial software, available for download directly from developerWorks.

Discuss

Comments

developerWorks: Sign in

Required fields are indicated with an asterisk (*).


Need an IBM ID?
Forgot your IBM ID?


Forgot your password?
Change your password

By clicking Submit, you agree to the developerWorks terms of use.

 


The first time you sign into developerWorks, a profile is created for you. Information in your profile (your name, country/region, and company name) is displayed to the public and will accompany any content you post, unless you opt to hide your company name. You may update your IBM account at any time.

All information submitted is secure.

Choose your display name



The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerWorks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

Required fields are indicated with an asterisk (*).

(Must be between 3 – 31 characters.)

By clicking Submit, you agree to the developerWorks terms of use.

 


All information submitted is secure.

Dig deeper into Information management on developerWorks


static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=Information Management
ArticleID=124068
ArticleTitle=Introducing DB2 9, Part 1: Data compression in DB2 9
publish-date=05242006