Choosing a hash distribution key for a table in an MPP database
Choosing an effective hash distribution key requires examining the table definition and any queries that are issued against the table.
Before you begin
To choose an effective hash distribution key, you need to collect information about the table and how the table is used:
- Table definition - primary key, unique key, generated columns, column data types
- Table data - number of distinct values, distribution of values
- Queries that are issued against the table - joins, equality predicates
About this task
In an MPP database, the system resources across all of the nodes in the cluster (resources such as memory, storage, and processors) are divided into logical units called data slices. When you use hash distribution, the database manager distributes data in the rows of the table across the data slices by applying a hashing algorithm to the values in the columns of the distribution key.
An effective hash distribution key can improve query performance in two different ways:
-
Spreading table data evenly across all of the data slices
An even distribution yields several benefits:- Maximizes parallelization of query processing
- Makes efficient use of available storage space
-
Collocating rows that are commonly fetched together from different tables
When rows that are commonly fetched together are located on the same data slice, the amount of time it takes to fetch those rows is reduced.
In general, prioritize even distribution over row collocation.
For smaller tables, the distribution key has less of an impact on query performance. For larger tables, the distribution key significantly affects query performance.
Procedure
Example
CREATE TABLE EMPLOYEE
(
ID INT NOT NULL,
NAME VARCHAR(250) NOT NULL,
RESUME CLOB,
HIRE_DATE DATE,
IS_MANAGER CHAR(1),
OFFICE_NUMBER INT
)
ORGANIZE BY ROW
Additional information:
- The value of the IS_MANAGER column can be only 'Y' or 'N'.
- Managers need to regularly issue queries against this table and another table, called SALES, to generate reports of the total sales by each employee:
... WHERE SALES.SALES_PERSON_ID = EMPLOYEE.ID ...
- The company is only 1 year old and 50% of employees were hired last July.
You might use the following logic to choose a hash distribution key for the EMPLOYEE table:
Criteria | Discussion | Columns |
---|---|---|
Table definition |
|
Remaining candidates:
|
Table data |
|
Remaining candidates:
|
Queries |
|
Recommended:
Remaining candidates:
|
General |
|
Distribution key: ( ID ) |