
Comments (36)

1 AbhisekBasuMullick commented Permalink

Hi David,

I'm coming from an Ab Initio background, which has a similar MPP (shared-nothing) architecture.
What happens if the table was created with distribution key order_id and after a month the business wants to perform an aggregation on the same table, but now the group-by key is region_code?
In a pure shared-nothing architecture, the data within the table has to be redistributed/re-partitioned across the SPUs, in temporary space or in memory, based on region_code. The same should happen if the table is created with random distribution.
Please clarify this.

2 Rushang commented Permalink

Hi David,

Thanks for the post... it is very helpful. I have a question, though:
What if we have a large table, say a fact table, and we distribute it using two or three columns? Does it help in the following cases?
Joining on only one of the three distribution columns?
Joining on two of the three distribution columns?
Joining on all three columns?
Even if our table is distributed on three columns and we use only one of them in the join, does it help in that case too, or does it help only when we use all three columns in our join?

3 DavidBirmingham commented Permalink

The distribution is not an index; it is a hash. So if you use three keys for a distribution, it will hash the values of all three keys and distribute on that hash. So you can see that every join on the table must use all three keys in order to achieve co-location. The answer may be functionally accurate but not co-located. In every case, the keys chosen for distribution MUST be used in the join in order to achieve co-location. This is true whether there is one key or several.
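A sketch of what this means in DDL (the table and column names here are hypothetical, not from the article):

```sql
-- Hypothetical three-key distribution: Netezza hashes the combined
-- values of all three columns to choose a dataslice.
CREATE TABLE orders
( order_id    BIGINT
, cust_id     BIGINT
, region_code INTEGER
, amount      NUMERIC(12,2)
) DISTRIBUTE ON (order_id, cust_id, region_code);

-- Co-located join: all three distribution keys appear in the ON clause.
SELECT o.order_id, l.line_amt
FROM   orders o
JOIN   order_lines l
  ON   o.order_id    = l.order_id
 AND   o.cust_id     = l.cust_id
 AND   o.region_code = l.region_code;

-- A join on order_id alone is still functionally correct, but the rows
-- are not co-located, so the machine must redistribute before joining.
```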

It is best by policy (honestly!) to use just one distribution key (or none) and only use more than this if there is a stronger justification (such as too much skew), and you would make that decision for physical, not functional, reasons. If one key on the table gives good distribution, there is no justifiable reason to add more columns to it. There's no value in it, and it increases the chance that a key will be ignored and co-location lost.

4 AbhishekSakhuja commented Permalink

Hi David,

Thank you so much for such a nice article. I do have a question on distribution keys. For example, I have three columns which make my data unique in a table with 10 million rows. The three columns are month_id (which includes 4 months), panelist_id and brand_id, but most of the time I use only month_id and panelist_id to join with another table, whereas including brand_id would be ideal. Does that mean I need to distribute only on (month_id, panelist_id)? If I do so, it is possible that my table's distribution will be highly skewed, as the brand_id column carries the major chunk of the data, and without this column my table's data will have lots of discrepancies. So how should I distribute? Do I need to care only about the joins, or is it more about an equal distribution of the table across disk slices?

5 DavidBirmingham commented Permalink

Firstly, don't ever use a month, date or anything associated with time as a distribution key. It is best to avoid it even as a secondary key. This is to avoid process skew: if your query is for a single date and you are distributed on date, only one dataslice will participate. If you are distributed on something else, all dataslices will participate.
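A sketch of the process-skew problem (hypothetical tables):

```sql
-- Distributed on a date: every row for a given day lands on the same
-- dataslice, so a single-day query engages only one dataslice.
CREATE TABLE tx_by_date
( tx_date DATE
, amt     NUMERIC(12,2)
) DISTRIBUTE ON (tx_date);

SELECT SUM(amt)
FROM   tx_by_date
WHERE  tx_date = '2013-06-01';  -- one dataslice does all the work

-- Distributed on a non-temporal key, the same query spreads the work
-- across every dataslice.
CREATE TABLE tx_by_acct
( acct_id BIGINT
, tx_date DATE
, amt     NUMERIC(12,2)
) DISTRIBUTE ON (acct_id);
```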

As for your month_id, panelist_id and brand_id, the question is: what are your other tables distributed on? If they are distributed on panelist_id, then use that. If on brand_id, then use that. A primary goal is to get as many tables distributed on the same key as possible, with the understanding that you must use the key in the join. If the key does not appear in the join, you won't get the benefit.
The distribution key does not have to be unique. The objective is to (a) spread out the data as evenly as possible and (b) co-locate records by sharing a common distribution key across tables.
For example, at one site they joined order headers to orders, where the orders table was distributed on order_id. The order header did not have an order_id, but the order had an order-header id, so in this case it was best to distribute on the order-header id. In another case, a fact table had a variety of dimensions on it, but the customer dimension was the largest and most active. By distributing the fact on the customer_id, the majority of queries enjoy high performance.
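In DDL terms, the fact-table case might look like this (the names are illustrative, not from the original site):

```sql
-- Fact distributed on its largest, most active dimension key, with the
-- dimension sharing the same key so joins on customer_id are co-located.
CREATE TABLE sales_fact
( customer_id BIGINT
, product_id  BIGINT
, store_id    INTEGER
, sale_amt    NUMERIC(12,2)
) DISTRIBUTE ON (customer_id);

CREATE TABLE customer_dim
( customer_id BIGINT
, cust_name   VARCHAR(100)
) DISTRIBUTE ON (customer_id);

-- This join runs dataslice-local, with no redistribution or broadcast.
SELECT d.cust_name, SUM(f.sale_amt)
FROM   sales_fact f
JOIN   customer_dim d ON f.customer_id = d.customer_id
GROUP BY d.cust_name;
```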

6 AbhishekSakhuja commented Permalink

Hi David,

Thank you so much; this was very informative. Just to be clear on distribution, I have two more questions: 1. When most of the time I am joining my tables on panelist_id and month_id, don't you think I should distribute on these as a composite key? 2. Will hash distribution on this composite key be something like (month_id || panelist_id, concatenated)? If yes, then there is a high probability that a single month will also be distributed over all the disk slices instead of one, because it has the concatenated panelist_id with it, and therefore skew at the month level will be less?

7 AbhishekSakhuja commented Permalink

Hi David,
Sorry, but one more question. Can we do the distribution on numeric as well as character columns (composite distribution, for example panelist_id and group_name)? I know that won't help zone maps much, because they use only numeric summaries to extract the data from a particular SPU, but per our joining requirements, character columns are also required to join the tables.

8 AbhishekSakhuja commented Permalink

Hi David,
Came up with another question. Suppose I have two tables, one with a very large data set (10 million records) and one with a very small data set (10 rows). Do you think distributing on the joining columns would be appropriate? Such a distribution will definitely skew the data slices for the large data set.

9 DavidBirmingham commented Permalink

The most efficient data type in Netezza is the bigint. This is because joining on this type is a binary subtraction, usually on a CPU register, and can be 100 times more powerful than other types, especially a varchar.

If the varchar cannot be converted to an integer (it's a real alphanumeric), I would suggest you use hash8 (from the SQL toolkit) to convert it to a bigint. The hash8 is deterministic (e.g. "DAVID" will always hash to exactly the same value each time it's called), no matter how many billions of unique keys we have.
"Just because" you are joining on two keys does not require both of them to be in the distribution. If panelist_id gives you good distribution, adding the month to it will neither make the distribution better nor speed up the join. What it will do, however, is constrain all of your queries: once you use more than one key, you will always have to join with both keys in order to get co-location.
I have only seen one case (in 8 years!) where multiple keys were absolutely necessary because of skew. If we had to do this again, I would have told them to concat those keys into a hash8 bigint and use the single key as the join key. Having multiple keys in every join operation is messy and tedious.
The hash8 gets us out of requiring a cross-reference table (to align natural keys with permanent values), and it can be used on any table in any context and still get the same, consistent results. And it's a bigint, so it will be a strong performer.
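Assuming the hash8 function from the IBM Netezza SQL Extensions toolkit is installed (its schema location and exact signature vary by site and toolkit version), the idea might be sketched like this, with hypothetical table and column names:

```sql
-- hash8 returns a deterministic BIGINT for a character input; qualify
-- it with the toolkit's schema if it is not on your search path.
SELECT hash8('DAVID');

-- Collapse a composite natural key into a single BIGINT distribution
-- and join key (casts and the '|' separator are illustrative; the
-- separator keeps e.g. 1||23 distinct from 12||3).
CREATE TABLE panel_fact AS
SELECT hash8(CAST(month_id    AS VARCHAR(8))  || '|' ||
             CAST(panelist_id AS VARCHAR(20))) AS mpid
     , t.*
FROM   panel_src t
DISTRIBUTE ON (mpid);
```

Every table sharing this composite key would derive mpid the same way, so a single-column join on mpid stays co-located.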

10 DavidBirmingham commented Permalink

Distribution for co-located joining is for large tables, not small ones. For your reference tables and lookups (e.g. thousands or tens of thousands of rows), make them random unless you can absolutely share the same distribution key.

This will cause the machine to always broadcast the smaller table to the larger and never get confused about it. If you have a ten-row table with a distribution identical to the larger table's, use the distribution, but you won't be able to tell the difference. If this ten-row table grows to ten million or ten billion rows, the distribution key will make a difference.
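The lookup-table case is a one-liner of DDL (hypothetical table):

```sql
-- Small reference table: random distribution, and let the machine
-- broadcast it to the dataslices at join time.
CREATE TABLE country_lookup
( country_code CHAR(2)
, country_name VARCHAR(64)
) DISTRIBUTE ON RANDOM;
```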

11 AbhishekSakhuja commented Permalink

I am hung up on one thing: even though I have an identical distribution (column: ABC_ID) in Table A (10 million rows) and Table B (10 rows), the data slices for Table A will definitely skew, whereas Table B will have perfect distribution. In that situation I can understand that processing will work perfectly, because only 10 SPUs will be in play, but what if I have another table C (a large data set) joined with Table A on month_id and panelist_id? Then do you think this could be the best distribution: Table A (month_id, panelist_id), Table B (random) and Table C (month_id, panelist_id)?

12 AbhishekSakhuja commented Permalink

Adding to this, how would you treat the distribution on Table A when joining it with Table C (where Table C is an internal DB table and we are unaware of its distribution keys) on month_id and panelist_id? Will distributing Table A help us in this case?

13 DavidBirmingham commented Permalink

When the table is small and the distribution is different, the machine will "broadcast" the table to the other CPUs by making it available in the memory of the SPU, which holds 8 CPUs at a time. This is more efficient than the Mustang, which would broadcast at the CPU level.

A ten-row table is irrelevant. Stop worrying about it :). Having 10 dataslices in play for one record each is a non-issue. Having 10 in play for millions of records is a big issue.
Don't use the month_id at all, because this could process-skew your query (skew is more than physical). Only use more than one key in a distribution if the single key does not give you good distribution. Extra keys do not help performance.
All tables are "internal" DB tables. Temp tables, intermediate tables and base tables all reside in, and are distributed on, the MPP. What did you have in mind about a table not being an internal table?

14 AbhishekSakhuja commented Permalink

Oh, when I said distribution on (month_id, panelist_id), I meant that I have concatenated the two columns into a new column "mpid" and am distributing on (mpid), because if I distribute only on panelist_id the data may not distribute evenly and co-location wouldn't be faster. The reason I am focusing on this point is that there is a high probability that data may duplicate across months, and therefore I felt that if we cannot distribute using (month_id, panelist_id) then I should distribute using a concatenated column (mpid). Previously, my perception of a composite key was that if I distribute using (month_id, panelist_id), the Netezza hash function will first concatenate the two columns and then distribute, but I think that is not the case, right?

By internal table, I mean that the table has been created by a third person and I am unaware of its distribution.

15 DavidBirmingham commented Permalink

Well, when you say "may not distribute evenly", does this mean you don't already know? Finding out is pretty easy:

create table mytable_test as select panelist_id from mytable distribute on (panelist_id);
And then check the distribution. One way to do this quickly is with an "eyeball check":
select count(*), datasliceid from mytable_test group by datasliceid order by datasliceid;
Now check the counts and see if they are "off" dramatically on any of the dataslices.
Choosing a distribution should not be based on guesswork.
Netezza will concat the two columns and then hash them. You don't need to do this separately. Keep in mind that if either of the fields is null, then the whole result is null and it will hash null.
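A guard for the null case might look like this (table and sentinel values are hypothetical):

```sql
-- If either key can be null, substitute a sentinel before distributing;
-- otherwise, per the note above, the null rows all hash together and
-- pile onto one dataslice.
CREATE TABLE panel_dist AS
SELECT COALESCE(month_id, -1)    AS month_key
     , COALESCE(panelist_id, -1) AS panelist_key
     , brand_id
FROM   panel_src
DISTRIBUTE ON (month_key, panelist_key);
```

Any table joining to this one would need the same COALESCE convention so the sentinel values still co-locate.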
You can know the distribution of any table by accessing the system view _V_DISTKEYS or something similar. Either way, you should not build a solution without knowing the distribution of the tables involved. With large-scale data solutions, such things are not optional.
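For example, a query along these lines lists a table's distribution columns; the exact system view name and columns can vary by NPS release, so treat this as a sketch:

```sql
-- Inspect the distribution key(s) of a table via the system views.
SELECT tablename, distseqno, distattname
FROM   _v_table_dist_map
WHERE  tablename = 'MYTABLE'
ORDER BY distseqno;
```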
