Question & Answer
Question
How is the GENERATE STATISTICS statement different after release 4.6?
Answer
There has been a change in the way the Netezza system collects the statistics it uses with the cost-based optimizer, the tool that calculates the most efficient plan for a query. Prior to 4.6, the GENERATE STATISTICS statement would generate full statistics on the target object, regardless of size. As of Netezza 4.6, the system will review the table's size and determine which type of statistics need to be collected--full, express, or dispersion. The system gathers full statistics on small tables and express statistics for medium tables. The JIT (just in time) dispersion feature is used on very large tables. Determining which size category a table belongs to is described below.
Understanding Statistic Gathering Techniques Based on Table Size
Large Tables
Release 4.6 introduced the JIT dispersion feature for large tables. JIT dispersion allows the optimizer to collect real-time information about column cardinality using sample scans rather than dispersion values calculated at some fixed interval. These sample scans (dispersion statistics) are far more accurate, as they take into account any restrictions applied to the table.
JIT dispersion applies only to large tables, those exceeding a user-configurable size. If a table is larger than the value configured with the Postgres setting JIT_DISP_MIN_ROWS, JIT dispersion is enacted. The default value for the setting is 500 million rows. You can review the current setting on your system by issuing the following command:
$ nzsql -c "show jit_disp_min_rows"
NOTICE: JIT_DISP_MIN_ROWS is 5e+08
SHOW VARIABLE
Because dispersion is now only calculated on-demand, there is no longer a need to calculate the dispersion of the columns in these very large tables when generating statistics.
Medium and Small Tables
Full statistics are now only collected for small tables. The system collects full statistics on a table whose rows number fewer than 10 times the total number of SPUs or less than the Postgres settingSAMPLE_STATS_MIN_ROWS (which defaults to 0). To modify this, you can set it at the session level:
set sample_stats_min_Rows = 1000000;
If you would prefer to make this a system-wide setting, add the following line to/nz/data/postgresql.conf:
#
sample_stats_min_Rows=1000000
#
The system collects express statistics for anything in between a small and a large table, a medium-sized table.
Understanding Statistic Types
New functionality aside, there are differences between full, dispersion, and express statistics, both in the way they are collected and the effect on the system. For example, express statistics collection is about four times faster than full. This is due primarily to the manner in which the cardinality of columns gets calculated. Dispersion statistics are not as accurate as full statistics, but are close approximations.
The table below summarizes the GENERATE STATISTICS behavior for each statistics type.
Attribute of Statistics | Full | Express | Dispersion |
All rows are read | True | True | True |
Each column's minimum value calculated | True | True | True |
Each column's maximum value calculated | True | True | True |
The number of NULL values are calculated | True | True | True |
Zone map information is refreshed | True | True | True |
The catalog's statistics are updated | True | True | True |
Number of columns processed in a single pass | 10 | 30 | Unbounded |
How is dispersion calculated? | True count of distinct values | Count of distinct hashed values | Not calculated at all |
Historical Number
NZ974537
Was this topic helpful?
Document Information
Modified date:
17 October 2019
UID
swg21571597