Understanding the GENERATE STATISTICS statement

Question & Answer

Question

How is the GENERATE STATISTICS statement different after release 4.6?

Answer

There has been a change in the way the Netezza system collects the statistics it uses with the cost-based optimizer, the tool that calculates the most efficient plan for a query. Prior to 4.6, the GENERATE STATISTICS statement would generate full statistics on the target object, regardless of size. As of Netezza 4.6, the system will review the table's size and determine which type of statistics need to be collected--full, express, or dispersion. The system gathers full statistics on small tables and express statistics for medium tables. The JIT (just in time) dispersion feature is used on very large tables. Determining which size category a table belongs to is described below.

Understanding Statistic Gathering Techniques Based on Table Size

Large Tables

Release 4.6 introduced the JIT dispersion feature for large tables. JIT dispersion allows the optimizer to collect real-time information about column cardinality using sample scans rather than dispersion values calculated at some fixed interval. These sample scans (dispersion statistics) are far more accurate, as they take into account any restrictions applied to the table.

JIT dispersion applies only to large tables, those exceeding a user-configurable size. If a table is larger than the value configured with the Postgres setting JIT_DISP_MIN_ROWS, JIT dispersion is enacted. The default value for the setting is 500 million rows. You can review the current setting on your system by issuing the following command:

$ nzsql -c "show jit_disp_min_rows"

NOTICE: JIT_DISP_MIN_ROWS is 5e+08

SHOW VARIABLE

Because dispersion is now only calculated on-demand, there is no longer a need to calculate the dispersion of the columns in these very large tables when generating statistics.

Medium and Small Tables

Full statistics are now only collected for small tables. The system collects full statistics on a table whose rows number fewer than 10 times the total number of SPUs or less than the Postgres settingSAMPLE_STATS_MIN_ROWS (which defaults to 0). To modify this, you can set it at the session level:

set sample_stats_min_Rows = 1000000;

If you would prefer to make this a system-wide setting, add the following line to/nz/data/postgresql.conf:

#

sample_stats_min_Rows=1000000

#

The system collects express statistics for anything in between a small and a large table, a medium-sized table.

Understanding Statistic Types

New functionality aside, there are differences between full, dispersion, and express statistics, both in the way they are collected and the effect on the system. For example, express statistics collection is about four times faster than full. This is due primarily to the manner in which the cardinality of columns gets calculated. Dispersion statistics are not as accurate as full statistics, but are close approximations.

The table below summarizes the GENERATE STATISTICS behavior for each statistics type.

Attribute of Statistics	Full	Express	Dispersion
All rows are read	True	True	True
Each column's minimum value calculated	True	True	True
Each column's maximum value calculated	True	True	True
The number of NULL values are calculated	True	True	True
Zone map information is refreshed	True	True	True
The catalog's statistics are updated	True	True	True
Number of columns processed in a single pass	10	30	Unbounded
How is dispersion calculated?	True count of distinct values	Count of distinct hashed values	Not calculated at all

[{"Product":{"code":"SSULQD","label":"IBM PureData System"},"Business Unit":{"code":"BU053","label":"Cloud & Data Platform"},"Component":null,"Platform":[{"code":"PF025","label":"Platform Independent"}],"Version":"1.0.0","Edition":"","Line of Business":{"code":"LOB10","label":"Data and AI"}}]

Historical Number

NZ974537

Was this topic helpful?

Document Information

More support for:
IBM PureData System

Software version:
1.0.0

Document number:
463793

Modified date:
17 October 2019

UID

swg21571597

IBM Support

Tips