Choosing a hash distribution key for a table in an MPP database

Choosing an effective hash distribution key requires examining the table definition and any queries that are issued against the table.

Before you begin

To choose an effective hash distribution key, you need to collect information about the table and how the table is used:

  • Table definition - primary key, unique key, generated columns, column data types
  • Table data - number of distinct values, distribution of values
  • Queries that are issued against the table - joins, equality predicates

About this task

In an MPP database, the system resources across all of the nodes in the cluster (resources such as memory, storage, and processors) are divided into logical units called data slices. When you use hash distribution, the database manager distributes data in the rows of the table across the data slices by applying a hashing algorithm to the values in the columns of the distribution key.

An effective hash distribution key can improve query performance in two different ways:

  • Spreading table data evenly across all of the data slices

    An even distribution yields several benefits:
    • Maximizes parallelization of query processing
    • Makes efficient use of available storage space
  • Collocating rows that are commonly fetched together from different tables

    When rows that are commonly fetched together are located on the same data slice, the amount of time it takes to fetch those rows is reduced.

In general, prioritize even distribution over row collocation.

For smaller tables, the distribution key has less of an impact on query performance. For larger tables, the distribution key significantly affects query performance.

Procedure

Consider all of the columns of the table as candidates for the distribution key, and exclude columns or choose columns according to the following four criteria:
  1. Table definition
    • If the table has an enforced primary key, the columns in the distribution key must be a subset of the columns in the primary key.
    • If the table has an enforced unique key, the columns in the distribution key must be a subset of the columns in the unique key.
    • Exclude from consideration any generated column that is not supported as a distribution key:
      • ROW CHANGE TIMESTAMP
      • ROW BEGIN
      • ROW END
      • TRANSACTION START ID
    • Exclude from consideration any column that has a data type that is not supported in a distribution key:
      • Long (examples: long varchar, long vargraphic)
      • Large object (examples: BLOB, CLOB, DBCLOB)
      • XML
      • Distinct type based on long, large object, or XML data types
      • Structured type
  2. Table data
    • Choose columns that contain (or that you anticipate are likely to contain) many distinct and evenly distributed values.
    • Exclude from consideration columns that contain many NULL values.
  3. Queries
    • To collocate rows that are commonly fetched together from different tables, consider these factors for both tables:
      • If the table is included in joins, choose the columns that are used in the join predicates of the queries.
      • If the table is included in queries that have equality tests with columns of other tables, choose columns that are used in the equality tests.
  4. General
    • Choose as few columns as possible because that increases the possibility of collocating rows that are commonly fetched together from different tables.

Example

Consider a table, EMPLOYEE, with the following columns:

CREATE TABLE EMPLOYEE                    
(
  ID INT NOT NULL,
  NAME VARCHAR(250) NOT NULL,
  RESUME CLOB,
  HIRE_DATE DATE,
  IS_MANAGER CHAR(1),
  OFFICE_NUMBER INT
)
ORGANIZE BY ROW

Additional information:

  • The value of the IS_MANAGER column can be only 'Y' or 'N'.
  • Managers need to regularly issue queries against this table and another table, called SALES, to generate reports of the total sales by each employee:
    
    ... WHERE SALES.SALES_PERSON_ID = EMPLOYEE.ID ...
    
  • The company is only 1 year old and 50% of employees were hired last July.

You might use the following logic to choose a hash distribution key for the EMPLOYEE table:

Table 1. Logic for choosing a hash distribution key for the EMPLOYEE table. This table discusses how to apply four criteria for choosing a hash distribution key for the EMPLOYEE table.
Criteria Discussion Columns
Table definition
  • The RESUME column cannot be a part of the distribution key because its data type is not supported.

Remaining candidates:

  • ID
  • NAME
  • HIRE_DATE
  • IS_MANAGER
  • OFFICE_NUMBER
Table data
  • Because 50% of employees were hired last July, the values in HIRE_DATE are not evenly distributed across many distinct values.
  • The value in the IS_MANAGER column can be only 'Y' or 'N', which is not many distinct values.

Remaining candidates:

  • ID
  • NAME
  • OFFICE_NUMBER
Queries
  • Because the queries that are issued against this table compare the values in the ID column with values from a column in another table, the ID column should be part of the distribution key.

Recommended:

  • ID

Remaining candidates:

  • NAME
  • OFFICE_NUMBER
General
  • Because the general guidance is to include as few columns as possible, and because no information here indicates that either the OFFICE_NUMBER column or the NAME column is needed in the distribution key, these two columns are removed from consideration.

Distribution key: ( ID )