IBM Support

Understanding the GENERATE STATISTICS statement

Question & Answer


Question

How is the GENERATE STATISTICS statement different after release 4.6?

Answer

There has been a change in the way the Netezza system collects the statistics it uses with the cost-based optimizer, the tool that calculates the most efficient plan for a query. Prior to 4.6, the GENERATE STATISTICS statement would generate full statistics on the target object, regardless of size. As of Netezza 4.6, the system will review the table's size and determine which type of statistics need to be collected--full, express, or dispersion. The system gathers full statistics on small tables and express statistics for medium tables. The JIT (just in time) dispersion feature is used on very large tables. Determining which size category a table belongs to is described below.

Understanding Statistic Gathering Techniques Based on Table Size


Large Tables

Release 4.6 introduced the JIT dispersion feature for large tables. JIT dispersion allows the optimizer to collect real-time information about column cardinality using sample scans rather than dispersion values calculated at some fixed interval. These sample scans (dispersion statistics) are far more accurate, as they take into account any restrictions applied to the table.

JIT dispersion applies only to large tables, those exceeding a user-configurable size. If a table is larger than the value configured with the Postgres setting JIT_DISP_MIN_ROWS, JIT dispersion is enacted. The default value for the setting is 500 million rows. You can review the current setting on your system by issuing the following command:


$ nzsql -c "show jit_disp_min_rows"
NOTICE: JIT_DISP_MIN_ROWS is 5e+08
SHOW VARIABLE



Because dispersion is now only calculated on-demand, there is no longer a need to calculate the dispersion of the columns in these very large tables when generating statistics.

Medium and Small Tables

Full statistics are now only collected for small tables. The system collects full statistics on a table whose rows number fewer than 10 times the total number of SPUs or less than the Postgres settingSAMPLE_STATS_MIN_ROWS (which defaults to 0). To modify this, you can set it at the session level:

set sample_stats_min_Rows = 1000000;

If you would prefer to make this a system-wide setting, add the following line to/nz/data/postgresql.conf:


#
sample_stats_min_Rows=1000000

#



The system collects express statistics for anything in between a small and a large table, a medium-sized table.

Understanding Statistic Types

New functionality aside, there are differences between full, dispersion, and express statistics, both in the way they are collected and the effect on the system. For example, express statistics collection is about four times faster than full. This is due primarily to the manner in which the cardinality of columns gets calculated. Dispersion statistics are not as accurate as full statistics, but are close approximations. 

The table below summarizes the GENERATE STATISTICS behavior for each statistics type. 



Attribute of Statistics
Full
Express
Dispersion
All rows are read
True 
True  
True 
Each column's minimum value calculated 
True  
True 
True 
Each column's maximum value calculated
True  
True  
True 
The number of NULL values are calculated
True 
True  
True 
Zone map information is refreshed
True  
True 
True  
The catalog's statistics are updated 
True  
True  
True  
Number of columns processed in a single pass
10 
30
Unbounded
How is dispersion calculated?True count of distinct values 
Count of distinct hashed values 
Not calculated at all 

[{"Product":{"code":"SSULQD","label":"IBM PureData System"},"Business Unit":{"code":"BU053","label":"Cloud & Data Platform"},"Component":null,"Platform":[{"code":"PF025","label":"Platform Independent"}],"Version":"1.0.0","Edition":"","Line of Business":{"code":"LOB10","label":"Data and AI"}}]

Historical Number

NZ974537

Document Information

Modified date:
17 October 2019

UID

swg21571597