Imagine that you have embarked on a journey and all that you know is the destination that you have to reach with no clue of the directions to reach it..At this point, your savior is the GPS that will give you the updated directions which will aid you in reaching the destination in the shortest possible route.. What GPS does to directions is what statistics will do to the Netezza optimizer…Confused??
Let me explain…The Netezza Optimizer which is the primary reason for the enhanced performance of Netezza internally creates an execution plan for any query you wish to execute. The main input to this Optimizer along with the actual query is the statistics. If you have the right updated statistics your optimizer will create the most optimal query execution plan resulting in improved performance. Now connecting our example above, statistics is like your GPS which gives you the right directions (optimal query plan) resulting in high performance (destination).
In this context many questions come to one’s mind. What are statistics? Types of statistics and who generates them? What do statistics contain? When and how are the statistics generated? Let me answer them one by one…
In the literary sense, database statistics are basically dynamic metadata which assist the query optimizer in making better decisions. User generated statistics are the most efficient form of statistics which is obtained by explicitly executing the nzsql GENERATE STATISTICS command or through NzAdmin GUI Database View . GENERATE STATISTICS reads every row in the table to determine the total number of rows in the table plus each table column’s maximum and minimum values,dispersion values, duplicate values, null values and unique values. It provides the most accurate and highest quality statistics enabling the Optimizer to choose the best execution plan from all possible plans. Statistics can be generated on all the columns for all the tables of the database(GENERATE STATISTICS),on all the columns for a specific table(GENERATE STATISTICS ON <table>) or on a set of columns for a specific table(GENERATE STATISTICS ON <table> (<col>, <col>,…).Stats are updated in system catalog tables.
Some can and cannot's you should keep in mind :
You must be the owner of the database or table or have the GenStats permission to execute GENERATE STATISTICS.
You cannot run the GENERATE STATISTICS command inside a transaction block (BEGIN/COMMIT pair).
Don't try to fetch zone-map information online during Generate Statistics as zone maps are disabled at that time.
Along with the user, Netezza database has its own ways of generating and maintaining database and dispersion statistics when certain database operations are performed. Of course, these automatic statistics are not as accurate as the ones generated by GENERATE STATISTICS command, however when you perform some database operations like CREATE TABLE AS,INSERT and UPDATE Operations,GROOM TABLE /nzreclaim, TRUNCATE TABLE you can be a little relieved even if you have forgotten to give the GENERATE STATISTICS command as the database has automatically generated it for you. Don’t be surprised to see the output of GENERATE STATISTICS even if you have not run it, as Netezza has the capability of running this for small tables with fewer rows automatically.
Well, the Netezza optimizer cannot be left behind. The optimizer creates Just In Time (JIT) statistics on the fly during planning using sample scan functionality and zone map information. Based on estimations , JIT statistics improve selectivity performance for tables with data skew or complex column restrictions,and also avoid broadcast of large tables that were estimated to be small.
What are the other conditions when the Optimizer automatically runs JIT statistics?
– Tables that contain more than five million records
– Queries that contain at least one column restriction
– Restrictions that do not contain sub-query or sub-plan expressions
– Tables that participate in a join or have an associated materialized view
You might be wondering, that with JIT collecting statistics so intensively why do you need to run GENERATE STATISTICS?... JIT Stats do not compute dispersion values which are utmost necessary to compute JOIN selectivity and also do not run on system tables, external tables, or virtual tables. GENERATE STATISTICS on the other hand, runs on any kind of table along with the usual database user table. Always remember, if you don't have the updated directions to your destination chances are higher that you make the wrong choices..similarly to gain the highest performance you need the best query plan and that is only possible with the most accurate user generated statistics.
Here are some tips on when you should generate statistics which ensure you high performance :
Tables change significantly (more than 5-10%) due to loads and INSERT, UPDATE, and DELETE commands.
Tables used in complicated joins that span three or more tables
Columns that are used in a JOIN condition, WHERE clause, SORT clause, GROUP BY/HAVING clause as updating statistics is more important for these columns.
Temporary tables, that contain a large number of rows and are used in joins
At the end of the day, use the right statistics and enjoy the Netezza performance boost!!!