Database partitioning helps you to adapt to data growth by providing a way to expand the capacity of the system and scale for performance. A distribution key is a column (or group of columns) that is used to determine the database partition in which a particular row of data is stored. The following guidelines will help you to choose a distribution key.
- Choose the distribution key from those columns having the highest cardinality. Unique keys are good candidates. Columns with uneven data distribution or columns with a small number of distinct values might result in skew, where query processing involves more work on a subset of database partitions and less work on others.
- Choose the distribution key from columns with simple data types, such as integer or fixed length character; this will improve hashing performance.
- Choose the distribution key to be a subset of join columns to facilitate join collocation.
- Avoid choosing a distribution key with columns that are updated frequently.
- In an online transaction processing (OLTP) environment, ensure that all columns in the distribution key participate in transactions through equality predicates. This ensures that an OLTP transaction is processed within a single database partition and avoids the communication overhead inherent with multiple database partitions.
- Include columns that often participate in a GROUP BY clause in the distribution key.
- Unique index key columns must include all of the distribution key columns. The DB2 database manager creates unique indexes to support unique constraints and primary key constraints, and these system defined unique indexes have the same requirement that the constraint key columns must be a superset of the distribution key.
If you have any comments or questions for the authors of this best practice paper, feel free to log a comment on the paper's summary page and we will respond. You need to login with your IBM ID to be able to enter comments. Registering your Id is free and easy at developerWorks.