Best practices for managing data growth in an operational database (Updated September 2019)
The best practice paper Best practices for managing data growth in an operational database provides a wealth of recommendations to help you design and manage a database environment for efficient data growth, including tips on how to choose the right distribution key for a partitioned database.
Database partitioning helps you adapt to data growth by providing a way to expand the capacity of the system and scale for performance. A distribution key is a column (or group of columns) that is used to determine the database partition in which a particular row of data is stored. The following guidelines will help you choose a distribution key.
- Choose the distribution key from those columns having the highest cardinality. Unique keys are good candidates. Columns with uneven data distribution or columns with a small number of distinct values might result in a skew, where query processing involves more work on a subset of database partitions and less work on others.
- Choose the distribution key from columns with simple data types, such as integer or fixed length character; this will improve hashing performance.
- Choose the distribution key to be a subset of join columns to facilitate join collocation. A common practice is to use the primary key of the largest frequently joined dimension table and the corresponding foreign key on the fact table.
- Avoid choosing a distribution key with columns that are updated frequently.
- In an online transaction processing (OLTP) environment, ensure that all columns in the distribution key participate in transactions through equality predicates. This ensures that an OLTP transaction is processed within a single database partition and avoids the communication overhead inherent with multiple database partitions.
- Include columns that often participate in a GROUP BY clause in the distribution key.
- Unique index key columns must include all of the distribution key columns if the key constraint is defined as enforced. The Db2 database manager creates unique indexes to support enforced unique constraints and primary key constraints, and these system-defined unique indexes have the requirement that the constraint key columns must be a superset of the distribution key.