IBM PureData-Netezza Developer Network (NDN)
By Deepashri Krishnaraja
Imagine that you have embarked on a journey knowing only the destination you have to reach, with no clue of the directions to get there. At this point, your savior is the GPS, which gives you updated directions that help you reach the destination by the shortest possible route. What GPS does for directions is what statistics do for the Netezza optimizer. Confused?
Let me explain. The Netezza optimizer, which is the primary reason for Netezza's enhanced performance, internally creates an execution plan for any query you wish to execute. The main input to this optimizer, along with the actual query, is statistics. If you have the right, up-to-date statistics, the optimizer will create the most optimal query execution plan, resulting in improved performance. Connecting this to our example above: statistics are like your GPS, which gives you the right directions (the optimal query plan), resulting in high performance (the destination).
In this context many questions come to mind. What are statistics? What types of statistics exist, and who generates them? What do statistics contain? When and how are they generated? Let me answer these one by one.
In the literal sense, database statistics are dynamic metadata that assist the query optimizer in making better decisions. User-generated statistics are the most accurate form of statistics, obtained by explicitly executing the nzsql GENERATE STATISTICS command or through the NzAdmin GUI Database view. GENERATE STATISTICS reads every row in the table to determine the total number of rows plus each column's maximum and minimum values, dispersion values, duplicate values, null values and unique values. It provides the most accurate and highest-quality statistics, enabling the optimizer to choose the best execution plan from all possible plans. Statistics can be generated on all the columns of all the tables in the database (GENERATE STATISTICS), on all the columns of a specific table (GENERATE STATISTICS ON <table>) or on a set of columns of a specific table (GENERATE STATISTICS ON <table> (<col>, <col>, ...)). The statistics are stored in the system catalog tables.
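As a quick illustration, the three forms above can be issued from nzsql like this (the table and column names are just placeholders):

```sql
-- All columns of all tables in the current database
GENERATE STATISTICS;

-- All columns of one table
GENERATE STATISTICS ON orders;

-- Only selected columns of one table
GENERATE STATISTICS ON orders (order_id, order_date);
```

The narrower the column list, the cheaper the command, so the third form is handy for very wide tables where only the join and restriction columns matter.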
Some dos and don'ts you should keep in mind:
Along with the user, the Netezza database has its own ways of generating and maintaining database and dispersion statistics when certain database operations are performed. Of course, these automatic statistics are not as accurate as the ones generated by the GENERATE STATISTICS command. However, when you perform database operations such as CREATE TABLE AS, INSERT and UPDATE operations, GROOM TABLE/nzreclaim, or TRUNCATE TABLE, you can be a little relieved even if you have forgotten to issue the GENERATE STATISTICS command, as the database has automatically generated statistics for you. Don't be surprised to see the output of GENERATE STATISTICS even if you have not run it; Netezza automatically runs it for small tables with few rows.
Well, the Netezza optimizer cannot be left behind. The optimizer creates Just In Time (JIT) statistics on the fly during planning, using sample-scan functionality and zone map information. Based on these estimates, JIT statistics improve selectivity estimation for tables with data skew or complex column restrictions, and also avoid broadcasting large tables that were wrongly estimated to be small.
What are the other conditions under which the optimizer automatically collects JIT statistics?
– Tables that contain more than five million records
– Queries that contain at least one column restriction
– Restrictions that do not contain sub-query or sub-plan expressions
– Tables that participate in a join or have an associated materialized view
You might be wondering: with JIT collecting statistics so intensively, why do you need to run GENERATE STATISTICS? JIT statistics do not compute dispersion values, which are essential for computing join selectivity, and they do not run on system tables, external tables, or virtual tables. GENERATE STATISTICS, on the other hand, runs on any kind of table, including the usual database user tables. Always remember: if you don't have updated directions to your destination, the chances are higher that you make wrong choices. Similarly, to gain the highest performance you need the best query plan, and that is only possible with the most accurate user-generated statistics.
Here are some tips on when you should generate statistics to ensure high performance:
At the end of the day, use the right statistics and enjoy the Netezza performance boost!
Performance, thy name is Netezza. One of the key factors which contribute to this aspect is the intelligent data distribution owing to the AMPP architecture of Netezza. This efficient data distribution is determined by a good distribution key.
First of all, why do we need efficient data distribution? When the system creates records, it assigns them to a logical data slice based on their distribution key value. When you create a table and then load data into the system, the rows of the table are distributed among all the data slices. If some data slices hold more rows of a table than others, those data slices and the S-Blades that manage them have to work harder and longer, and need more resources and time to complete their jobs. These data slices and the S-Blades that manage them become a performance bottleneck for your queries.
The Netezza system supports two distribution methods: hash and random (round robin). The distribution key comes into the picture in the hash distribution method. For each row, the distribution key value is hashed to a numeric value, and all rows with the same hash value are placed on the same data slice.
Now consider the example below of a table for the students of a class:
CREATE TABLE sample (idnum INTEGER, gender CHAR(1), student_name VARCHAR(50)) DISTRIBUTE ON (gender);
Here the gender column can have only two values, Male (M) or Female (F). The system maps rows with the same value to the same hash value and, in turn, to the same data slice. So we end up distributing all rows to only two data slices, resulting in data skew.
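A minimal fix, assuming idnum is unique (or nearly unique) per student, is to distribute on the high-cardinality column instead; the table name below is illustrative:

```sql
-- idnum has (near-)unique values, so its hash spreads rows
-- evenly across all data slices instead of just two.
CREATE TABLE sample_fixed (idnum INTEGER, gender CHAR(1), student_name VARCHAR(50))
DISTRIBUTE ON (idnum);
```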
Now consider another example of a table for the new-joinees of a company with their joining date
CREATE TABLE sample1 (idnum INTEGER, joiningdate DATE, name VARCHAR(50)) DISTRIBUTE ON (joiningdate);
Say 60% of the new-joinees joined in the month of June. With joiningdate as the distribution key, rows may well be distributed evenly across all data slices. However, most analysis (queries) is performed on a date range. In this example, all of the records to be processed for a given date range (June) are located on a single data slice or a few data slices, resulting in processing skew on those slices.
The response time of a query is governed by the completion time of the slowest data slice. In both the cases above, even though some data slices finish processing their data early, the ones with data or processing skew increase the overall response time, degrading performance.
So now the question is: how do I choose a good distribution key? Here are some tips:
You might ask: with so many conditions, why do I need distribution based on a distribution key at all? Why not use the RANDOM distribution method? It distributes data evenly across data slices using a round-robin algorithm, and the even-distribution objective is easily achieved. Well, behold! We have a problem. What happens when two tables are joined on a common key, with their data spread across many different data slices? Can you imagine the response time when the join query has to fetch matching data from hundreds of data slices? This is the catch in the random distribution method, and it is where an efficient distribution key in the hash distribution method gains the upper hand.
Collocated joins are the next front-runners for increased performance. For tables that have a relationship and are commonly joined (for example, the dimension tables and fact table of a star schema), always use the join keys as distribution keys. This results in collocated joins, where the data being joined from both tables resides on the same data slice, reducing data fetch cycles and leading to higher performance.
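As an illustrative sketch of collocating a fact table with a dimension table (all table and column names here are hypothetical):

```sql
-- Both tables are distributed on the join key customer_id, so
-- matching rows hash to the same data slice.
CREATE TABLE customer_dim (customer_id INTEGER, customer_name VARCHAR(100))
DISTRIBUTE ON (customer_id);

CREATE TABLE sales_fact (sale_id BIGINT, customer_id INTEGER, amount NUMERIC(12,2))
DISTRIBUTE ON (customer_id);

-- This join can now run slice-locally (collocated), with no
-- redistribution or broadcast needed:
SELECT d.customer_name, SUM(f.amount)
FROM sales_fact f
JOIN customer_dim d ON f.customer_id = d.customer_id
GROUP BY d.customer_name;
```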
In some cases it may not be possible to distribute both tables on the relationship key. If your database design is such that only one of the tables can be distributed on the join key, you still don't need to worry: Netezza redistributes the needed columns of the other table to the appropriate data slices to enable a collocated join. This process, called a single redistribute, has a performance cost associated with it, but it is still far better than random distribution.
Two more hash-distribution join strategies that boost join performance are the double redistribute (when neither table can be distributed on the join key) and the broadcast (a small table is broadcast to all data slices when the joined tables have different distribution keys). Both again rely on your distribution key being just the right one. Remember: if your foundation is not right, your building cannot stand up tight.
Here are some good practices one should never forget when choosing the right distribution key:
Many times we come across situations in which we want to accomplish a task for which there is already a built-in function or procedure. Most database vendors provide numerous such functions and procedures. But there are also occasions when none of the existing functions meets the goal one desires. In such cases the user ends up writing a function. In this article, I try to highlight how you can write your own C++ function that can be invoked from Netezza SQL.
A user can create custom functions, aggregates, and shared libraries using the user-defined extensions feature. The custom functions can be run on the Netezza Performance Server (NPS) systems and used to perform specific types of analysis for business reporting and data queries. These objects leverage the massively parallel processing (MPP) environment to accelerate analysis of data. User-defined functions enable data processing directly on the system, leading to elimination or reduction of data movement to other systems for analysis, which ensures performance gains.
In other words, user-defined extensions (UDX) enhance Netezza functionality by enabling a user to write C++ modules that can be invoked from SQL queries.
The different types of UDX supported by Netezza are:
– User-defined functions (UDFs)
– User-defined aggregates (UDAs)
– User-defined shared libraries
A typical UDX development cycle includes:
– Writing the C++ source file that implements the UDX
– Compiling it with the nzudxcompile utility
– Registering the resulting object with the NPS system
– Invoking the function from SQL
Let’s take a look at how a UDF is created and used. A UDF takes zero or more input parameters and returns one output value. It is called once for every row and can be used in all SQL queries where built-in functions are supported.
How to create the C++ file for UDF?
The file must include the header file udxinc.h in addition to any standard header files. The UDX classes and functions for API version 2 are defined in a namespace called nz::udx_ver2.
using namespace nz::udx_ver2;
Create a class derived from the Udf base class. The code will look something like this:
virtual nz::udx_ver2::ReturnValue evaluate()
{
    // Function logic
}
The instantiate method creates the object dynamically.
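Putting these pieces together, a minimal scalar UDF might look like the sketch below. The class name CProduct is made up for this example, and the argument-access and return helpers (isArgNull, int32Arg, NZ_UDX_RETURN_INT32) follow the pattern of IBM's published version-2 examples; verify the exact API against the documentation for your NPS release.

```cpp
#include "udxinc.h"   // Netezza UDX header, shipped with the NPS software kit

using namespace nz::udx_ver2;

// A scalar UDF that multiplies its two integer arguments.
class CProduct : public Udf
{
public:
    // Called by the system to create the UDF object dynamically.
    static Udf* instantiate()
    {
        return new CProduct;
    }

    virtual ReturnValue evaluate()
    {
        // Treat a NULL input as a NULL result.
        if (isArgNull(0) || isArgNull(1))
            NZ_UDX_RETURN_NULL();

        int32 a = int32Arg(0);  // first argument
        int32 b = int32Arg(1);  // second argument
        NZ_UDX_RETURN_INT32(a * b);
    }
};
```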
This C++ file then has to be compiled using the nzudxcompile utility.
The following object files are created:
Once the UDF is compiled, it needs to be registered with the NPS system. The same utility can be used.
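For example, compilation and registration can often be done in one step. The option names below follow the pattern of IBM's published examples and may differ between releases, so check nzudxcompile -h on your system:

```shell
# Compile product.cpp for both host and SPU, and register the
# result in database DEV as the SQL function product(int4, int4).
nzudxcompile product.cpp \
    --db dev --user admin --pw password \
    --fenced \
    --class CProduct \
    --sig "product(int4, int4)" \
    --return int4
```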
Alternatively, the UDF can be registered using SQL. For example:
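A registration statement might look like the following sketch. The function name, class name, object path and the exact clause set are illustrative; consult the CREATE FUNCTION reference for your NPS release before using it:

```sql
CREATE OR REPLACE FUNCTION product(INT4, INT4) RETURNS INT4
LANGUAGE CPP PARAMETER STYLE NPSGENERIC FENCED
RETURNS NULL ON NULL INPUT
CLASS NAME 'CProduct'
OBJECT '/home/nz/udx/product.o';
```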
These registered functions can then be invoked like any built-in function in a SQL statement.
In its simplest terms, a zone map is a persistent table maintained by the Netezza system that contains information about the data in a user-created table. A single row in the zone map table corresponds to an extent (a 3 MB unit of storage) of the user table. By default, Netezza records information about the integral and temporal columns of the user's table in its corresponding zone map. For each qualified column, Netezza maintains the minimum and maximum values of that column within each extent.
Other than integral and temporal types, Netezza also maintains character-column information (the first 8 bytes) in zone maps if that column appears in the ORDER BY clause of a materialized view. If your table has hundreds of columns, remember that Netezza maintains zone map information only for the first 200 qualified columns (this limit is configurable).
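Zone maps pay off on restrictive predicates over such columns. A quick sketch of the kind of query that benefits (the table and column names are hypothetical):

```sql
-- Because the zone map records the min/max order_date per extent,
-- the scan can skip every extent whose range falls entirely
-- outside January 2013, reading only a fraction of the table.
SELECT COUNT(*)
FROM orders
WHERE order_date BETWEEN '2013-01-01' AND '2013-01-31';
```

This is why loading data in rough order of a commonly restricted column keeps the per-extent min/max ranges narrow and makes zone maps most effective.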
The debug information is written into the following virtual tables, which you can query (thanks to Kapil Maheshwari for guiding me to the right tables):
For older systems, this information can be accessed through the following virtual tables: