Distribution keys

You can choose between two methods to distribute table rows to the worker nodes of your accelerator: random distribution and distribution using a distribution key.

Random distribution
A circular or revolving distribution of the rows in an accelerator-shadow table. Each data slice, which is the portion of data that is processed by a single worker node, is assigned an equal number of table rows starting with the first slice, then moving on to the next, and so on. After table rows have been assigned to the last unit in the chain, the next chunk of rows is distributed to the first again. This process continues until all table rows have been distributed.
Use of a distribution key (hash partitioning)
You can specify up to four columns that make up a distribution key. The accelerator uses a hash function on the key columns to determine the data slice or worker node that receives a table row. Rows with equal hash values are distributed to the same worker node.

By default, random distribution is used to distribute table rows to worker nodes for query processing. That is, all tables are evenly distributed among the existing worker nodes.

Impact of the distribution key on the query performance

Parallel processing works best if the table rows to be processed are evenly distributed across the worker nodes. If the rows of a table that is referenced by a query are unevenly distributed, the worker nodes with fewer rows have to wait for the node with the biggest number of rows to finish processing. However, if your queries involve joins of big tables (100 million to billions of rows), and the rows of these accelerator-shadow tables are scattered across the worker nodes even though they are evenly distributed in terms of quantity, you can easily spent much more time on data redistributions and broadcasts than you actually gain by the advantage of an even distribution.

Impact of distribution key on join performance

When two tables are joined, the query runs fastest if the matched rows of both accelerator-shadow tables reside on the same worker node (colocated join). You can instigate colocated joins by selecting the join columns as the distribution key for both tables.

However, if an accelerator-shadow table F1 is joined with two or more other tables D1, ... Dn using different join keys, you can only locate one of the pairs (F1-D1, F1-D2, ..., F1-Dn) on the same node because you can only select one of the join columns in table F1 as the distribution key for F1. In this case, it is best to colocate the biggest tables.

Best practices for selecting distribution keys

Follow these guidelines when using distribution keys:
  • For fact tables and the largest dimension tables, specify a distribution key. For all other tables, start with the default (random distribution).
  • Prefer single-column distribution keys to multiple-column keys.
  • Select a column as the distribution key that distributes the table rows as evenly as possible across the worker nodes, that is, select primary keys or columns with a unique index. Do not specify columns that only have a small number of distinct values. For example, Boolean values like male and female would place the table rows on two of the worker nodes only.
  • When joining two large tables, try to colocate the tables by specifying the join columns as the distribution key for both tables. Note that to instigate a colocated join, the data types of the join columns in both tables must match.
  • In general, do not specify a date, time, or timestamp column as the distribution key. Such keys would also distribute the table rows evenly across the worker nodes, but many queries restrict the number of result rows by a range predicate on the date, time, or timestamp column. For example, a query retrieving revenue figures from last month only would be processed on a single worker node if a column MONTH was selected as the distribution key because rows with equal values in this column would all be assigned to the same data slice.
  • Although adverse to the goal of faster query acceleration, a distribution key on the primary key or unique constraint of a table helps if you must update tables continually. Therefore, decide what is more important to you: query performance or the latency of incremental updates. The accelerator automatically picks a distribution key when you enable incremental updates for a table unless you have defined a distribution key by yourself.