Performance, thy name is Netezza. One of the key factors behind this reputation is intelligent data distribution, made possible by Netezza's AMPP (Asymmetric Massively Parallel Processing) architecture. Efficient data distribution, in turn, is determined by a good distribution key.
First of all, why do we need efficient data distribution? When the system creates records, it assigns them to a logical data slice based on their distribution key value. When you create a table and load data into the system, the rows of the table are distributed among all the data slices. If some data slices hold more rows of a table than others, those data slices and the S-Blades that manage them have to work harder and longer, and need more resources and time to complete their jobs. These data slices and their S-Blades become a performance bottleneck for your queries.
The Netezza system supports two distribution methods: hash and random (round robin). The distribution key comes into the picture in the hash distribution method. For every table, the distribution key of each row is hashed to a numerical value, and all rows with the same hash value are stored on the same data slice.
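As a quick sketch of the two methods, both are chosen at table creation time (the table and column names here are illustrative, not from the examples below):

```sql
-- Hash distribution: rows with the same customer_id hash value
-- always land on the same data slice
CREATE TABLE orders_hash (
    order_id    INTEGER,
    customer_id INTEGER,
    amount      NUMERIC(10,2)
)
DISTRIBUTE ON (customer_id);

-- Random (round-robin) distribution: rows are spread evenly
-- across data slices regardless of their content
CREATE TABLE orders_rand (
    order_id    INTEGER,
    customer_id INTEGER,
    amount      NUMERIC(10,2)
)
DISTRIBUTE ON RANDOM;
```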
Now consider the example below of a table for the students of a class:
CREATE TABLE sample (idnum INTEGER, gender CHAR(1), student_name VARCHAR(50)) DISTRIBUTE ON (gender);
Here the gender column can take only two values, Male (M) or Female (F). The system maps rows with the same value to the same hash value and, in turn, to the same data slice. So we end up distributing all rows to only two data slices, resulting in data skew.
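One way to avoid this skew, sketched below, is to distribute on a high-cardinality column such as idnum instead (the table name is illustrative):

```sql
-- idnum is (near-)unique, so hashing it spreads the rows
-- across all data slices instead of just two
CREATE TABLE sample_fixed (
    idnum        INTEGER,
    gender       CHAR(1),
    student_name VARCHAR(50)
)
DISTRIBUTE ON (idnum);
```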
Now consider another example: a table for the new joinees of a company with their joining date.
CREATE TABLE sample1 (idnum INTEGER, joiningdate DATE, name VARCHAR(50)) DISTRIBUTE ON (joiningdate);
Say 60% of the new joinees joined in the month of June. With the date column as the distribution key, rows may well be distributed evenly across all S-Blades. However, most analysis (queries) is performed on a date range. In this example, all of the records to be processed for a given date range (June) are located on a single data slice, or a few, resulting in processing skew on those data slices.
The response time of a query is bounded by the completion time of the slowest data slice. In both of the cases above, even though some data slices finish processing their data early, the ones suffering data or processing skew stretch out the response time, degrading performance.
So now the question is: how do I choose a good distribution key? Here are some tips:
- A column with as many distinct values as possible, which ensures a better hash distribution
- A set of columns which you frequently use to retrieve rows from the table
- As few columns as possible, to optimize the generality of the selection
- Columns used in equality searches. For example, if two tables are distributed on the columns they are joined on with equality predicates, the join is performed locally on each data slice, resulting in major performance gains
- Never use Boolean keys
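To make these tips concrete, here is a hypothetical sketch (table and column names are illustrative): a table frequently looked up by both store and product could use a two-column key, while still keeping the key as small as possible:

```sql
-- Two-column distribution key: rows hash on the combined values of
-- store_id and product_id; both columns are high-cardinality and
-- used in equality lookups, and the key is kept small
CREATE TABLE daily_sales (
    store_id   INTEGER,
    product_id INTEGER,
    sale_date  DATE,
    quantity   INTEGER
)
DISTRIBUTE ON (store_id, product_id);
```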
You might ask: with so many conditions, why do I need to distribute based on a distribution key at all? Why not use the RANDOM distribution method? It effortlessly spreads data evenly across data slices using a round-robin algorithm, and hurray, my objective is achieved... Well, behold! We have a problem. What happens when two tables are joined on a common key but their data is scattered across many different data slices? Can you imagine the response time when the query performing the join has to fetch data from hundreds of data slices? This is the catch in the random distribution method, and it is where an efficient distribution key under hash distribution gains the upper hand.
Collocated joins are the next front-runner for increased performance. For tables that have a relationship and are commonly joined (the dimension tables and fact table of a star schema, for example), always use the join keys as distribution keys. This results in collocated joins, where the data being joined from both tables resides on the same data slice, reducing data fetch cycles and leading to higher performance.
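A minimal sketch of a collocated star-schema join, assuming hypothetical fact and dimension tables that share the join key as their distribution key:

```sql
-- Dimension and fact are both distributed on the join key,
-- so matching rows land on the same data slice
CREATE TABLE customer_dim (
    customer_id   INTEGER,
    customer_name VARCHAR(100)
)
DISTRIBUTE ON (customer_id);

CREATE TABLE sales_fact (
    sale_id     INTEGER,
    customer_id INTEGER,
    amount      NUMERIC(12,2)
)
DISTRIBUTE ON (customer_id);

-- This join can be processed locally on each data slice,
-- with no data movement between S-Blades
SELECT d.customer_name, SUM(f.amount)
FROM sales_fact f
JOIN customer_dim d ON f.customer_id = d.customer_id
GROUP BY d.customer_name;
```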
In some cases it may not be possible to distribute both tables on the relationship key. If your database design is such that only one of the tables can be distributed by the join key, you still don't need to worry: Netezza will redistribute the needed columns of the other table to the appropriate data slices to enable a collocated join. This process, known as a single redistribute, has a performance cost associated with it, but is still far better than random distribution.
Two more hash-distribution techniques that boost join performance are the double redistribute (used when neither table can be distributed on the join key) and the broadcast (a small table is broadcast to every data slice when the joined tables have different distribution keys). Both again rely on your distribution key being just the right one. Remember: if your foundation is not right, your building cannot stand up tight.
Here are some good practices which one should never forget when choosing the right distribution key:
- Never combine columns into a multi-column distribution key simply in order to generate your own unique key
- Never add (fabricate) columns to your table and generate unique data values (i.e. a homegrown serial algorithm) in an attempt to generate your own unique distribution key
- Do not distribute on floating-point or Boolean data types
- Ensure that distribution key columns for joined tables are of the same data type to avoid redistribution.
- Run the command (SELECT datasliceid, COUNT(*) AS "Rows" FROM table_name GROUP BY datasliceid ORDER BY "Rows";) to keep a check on data skew, so that you can be sure your distribution key is efficient
- Do not DISTRIBUTE ON RANDOM just because it is the easy choice