ANALYZE DATALAKE TABLE statement
Use the ANALYZE TABLE statement to gather statistics for tables. These statistics are used by the optimizer to determine the most optimal access plans to efficiently process your queries. The more statistics that you collect on your tables, the better decisions the optimizer can make to provide the best possible access plans. When you run ANALYZE TABLE on a table to collect these statistics, the query generally runs faster.
- Table level:
- You can gather statistics about table level characteristics, such as the number of records.
- Column level:
- You can gather statistics about your columns, such as the number of distinct values. You can also gather statistics for column groups, which is useful if columns have a relationship.
Authorization
- CONTROL privilege on the table.
The creator of the table automatically has this privilege. For all other users, the CONTROL privilege must be granted to the user. Granting the CONTROL privilege through a group or a role is not sufficient. ACCESSCTRL or SECADM authority is required to grant the CONTROL privilege.
Syntax
Description
-
table-name
- The name of the table that you want to analyze.
- COMPUTE STATISTICS
- Gathers statistics. You can include the following options:
- analyze-col
-
- INCREMENTAL
- Only partitions that do not have updated statistics are scanned.
- FULL
- For a partitioned table, this value results in a full scan of the table to collect statistics. On a non-partitioned table, the scan is always a FULL scan.
- NOSCAN
-
When you specify the optional parameter NOSCAN, there is some performance improvement in the ANALYZE TABLE statement because ANALYZE TABLE does not scan files. By using NOSCAN, ANALYZE TABLE gathers only the following statistics:
- Number of files
- Table size in bytes
- Number of partitions
- FOR COLUMNS
- Table statistics and column-level statistics are gathered for the columns that you specify. You
must specify at least one column, or one column group as a parameter in FOR COLUMNS, or ANALYZE
TABLE returns a syntax error.
- colobj
- You can specify columns, column groups, or both. Separate each column group or column name with
a comma. Enclose column groups within parentheses. For a column group, only the number of distinct
values is gathered on the grouping of columns. You can intermix individual columns and column
groups. For example,
...FOR COLUMNS (c1,c3),c2,(c4,c5),c7,c8...;
- FOR ALL COLUMNS
- Table statistics and column-level statistics are gathered for all columns of the table. This
option is used when ANALYZE TABLE is automatically triggered.
- colgroup
An optional list of column groups only can be included between the COLUMNS and TABLESAMPLE (optional) clauses. Any individual columns that you specify trigger a syntax error.
Enclose column groups within parentheses. For a column group, only the number of distinct values is gathered on the grouping of columns.
- TABLESAMPLE SYSTEM | BERNOULLI (numeric-literal)
-
- SYSTEM
-
The term splits means the division of work that Db2® generates to compute data in parallel, which can vary according to file type (such as text file or parquet), table type (DATALAKE), and configuration settings. These sample statistics are then used to extrapolate the statistics of the entire table. The numeric-literal is the target percentage of splits to scan during ANALYZE TABLE. Therefore, if the value of the numeric-literal is 10, it might mean that 10% of the splits are sampled. For example, if the table has data that resides on 10 splits, 1 split is used in the sample. However, if the table is small enough and resides on 2 splits, then
TABLESAMPLE SYSTEM (10)
scans 1 split, which is about 50% of the table. ANALYZE TABLE makes adjustments for small tables so that statistics are extrapolated correctly. A default sample size of 10% is used when ANALYZE TABLE is automatically triggered in Db2. This value speeds up the performance of the ANALYZE TABLE statement with little impact on query performance. - BERNOULLI
-
This parameter is supported only for statistical views. You use this parameter to collect statistics on a sample of the rows from the statistical view rather than all of the rows. The numeric-literal is the target percentage of rows to analyze.
Statistical views can be very large because they can potentially describe join operations between multiple large tables. When you use this parameter, you can significantly reduce the time it takes to run ANALYZE TABLE for a statistical view.
Bernoulli sampling considers each row individually, including the row with probability P/100 (where P is the value of the numeric-literal ) and excluding it with probability 1-P/100. Therefore, if the value of the numeric-literal is the value 10, which represents a 10% sample, each row is included with a probability of 0.1, and is excluded with a probability of 0.9.
Usage notes
- Results are written to the Db2 stats catalogs.
- You must include at least one column name in the ANALYZE TABLE statement if you specify FOR COLUMNS.
-
When you run the ANALYZE TABLE statement against a table on a set of columns, and then later run ANALYZE TABLE on a second set of columns, the statistics that are gathered from the first ANALYZE TABLE statement are merged with the statistics that are gathered from the second ANALYZE TABLE statement. Therefore, if you decide that you need to run ANALYZE TABLE on additional columns after the first ANALYZE TABLE statement is run, be sure to run ANALYZE TABLE on the second set of columns only. This practice speeds up the time that it takes ANALYZE TABLE to complete.
You can modify a property in Db2 that disables or re-enables cumulative statistics. For instructions on turning cumulative statistics on or off, see Enabling or disabling cumulative statistics collection.
- When a table increases in size significantly, the statistics that were previously gathered become out of date. As a result, ANALYZE TABLE must be run for optimal query performance. automatically checks to see whether changed significantly, and if so, ANALYZE TABLE is run automatically.
- Although the memory footprint of the ANALYZE TABLE statement has been reduced considerably, here
are some ways to reduce the footprint even further:
- If you do not require distribution statistics for columns, turn them off by setting the
following properties to zero (0):
- biginsights.stats.hist.num
- The number of histogram
buckets.
CALL SYSHADOOP.SET_DATALAKE_CONFIG('BIGSQL', 'biginsights.stats.hist.num', '0');
- biginsights.stats.mfv.num
- The number of most frequent
values.
CALL SYSHADOOP.SET_DATALAKE_CONFIG('BIGSQL', 'biginsights.stats.mfv.num', '0');
- Because automatic ANALYZE TABLE statements are run against all columns in a table, in cases
where the table has a lot of columns, the storage and memory requirements increase when distribution
and MFV statistics are collected. By default, if a table has over 50 columns, any ANALYZE TABLE
statement that specifies the FOR ALL COLUMNS clause does not collect distribution and MFV
statistics. If you want to increase or decrease this value, toggle it by setting the following
property:
CALL SYSHADOOP.SET_DATALAKE_CONFIG('BIGSQL', 'biginsights.stats.wide.table.min.columns', '50');
- If you do not require distribution statistics for columns, turn them off by setting the
following properties to zero (0):
- It is a good idea to gather statistics on all tables that are used in your queries. Collect column-level statistics for all columns that your predicates reference, including join predicates. Collect column group statistics for all sets of related columns, such as (country, city), that your predicates reference.
- You can query the SYSCAT.TABLES catalog view to determine the total number of partitions, the number of files, and the total size (in kilobytes) of a table, if you have run the ANALYZE TABLE statement.
- Using the TABLESAMPLE SYSTEM clause might cause more data to be scanned than was requested if the number of splits is small. For example, if the table has only one split, the entire table is analyzed. If 10% sampling is requested and the number of splits is 2, approximately 50% (one split) of the table data is scanned.
- When FOR ALL COLUMNS is specified on a partitioned table with more than one partitioning column, and no column groups are included, statistics are automatically collected on a column group that contains all of the partitioning columns.
- Concurrent analyze operations on the same table are not supported. If you start an analyze operation while another one is in progress on the same table, the second operation is canceled and a warning message is returned.
Examples
- Example 1: Analyze a non-partitioned table.
This statement gathers statistics for Table2, along with column statistics for c3 and c4 and column grouping statistics for (c1,c2). When you run a query on Table2 after you use the ANALYZE TABLE statement, the query generally runs faster.ANALYZE TABLE myschema.Table2 COMPUTE STATISTICS FOR COLUMNS (c1,c2),c3,c4;
- Example 2: Analyze a table and specific columns and then use SYSSTAT.COLUMNS to view the statistics.
- Gather statistics on table
MRK_PROD_SURVEY_TARG_FACT:
Select from the SYSSTAT.COLUMNS catalog table to display the statistics:ANALYZE TABLE gosalesdw.mrk_prod_survey_targ_fact COMPUTE STATISTICS FOR COLUMNS month_key,product_key,product_survey_key, product_topic_target;
Output with part of the statistics is shown in the following example:SELECT CAST(COLNAME AS VARCHAR(20)) AS "COL_NAME", COLCARD, CAST(HIGH2KEY AS VARCHAR(100)) AS "HIGH2KEY", CAST(LOW2KEY AS VARCHAR(100)) AS "LOW2KEY", NUMNULLS, AVGCOLLEN FROM SYSSTAT.COLUMNS WHERE TABNAME = 'MRK_PROD_SURVEY_TARG_FACT' AND (COLNAME='PRODUCT_KEY' OR COLNAME='PRODUCT_TOPIC_TARGET') ORDER BY COLNAME;
+----------------------+---------+----------+---------+----------+-----------+ | COL_NAME | COLCARD | HIGH2KEY | LOW2KEY | NUMNULLS | AVGCOLLEN | +----------------------+---------+----------+---------+----------+-----------+ | PRODUCT_KEY | 90 | 30132 | 30001 | 0 | 4 | | PRODUCT_TOPIC_TARGET | 8 | 1.0 | 0.495 | 0 | 8 | +----------------------+---------+----------+---------+----------+-----------+
- Example 3: Use the TABLESAMPLE SYSTEM parameter to collect statistics on your DATALAKE table.
-
ANALYZE TABLE myschema.Table2 COMPUTE STATISTICS FOR COLUMNS (c1,c2),c3,c4 TABLESAMPLE SYSTEM (10);
- Example 4: Use the cumulative statistics feature of the ANALYZE TABLE statement to collect statistics on an additional set of columns.
-
ANALYZE TABLE myschema.Table2 COMPUTE STATISTICS FOR COLUMNS (c1,c2),c3,c4; ANALYZE TABLE myschema.Table2 COMPUTE STATISTICS FOR COLUMNS c5,c6;
- Example 5: Collect table statistics with no columns specified.
-
ANALYZE TABLE myTable COMPUTE STATISTICS TABLESAMPLE BERNOULLI (5); ANALYZE TABLE myTable COMPUTE STATISTICS TABLESAMPLE SYSTEM (10);
- Example 6: Run ANALYZE TABLE on some column groups of myTable, specifying TABLESAMPLE SYSTEM.
-
ANALYZE TABLE myTable COMPUTE STATISTICS FOR COLUMNS (i,j),(k,l,m) TABLESAMPLE SYSTEM (10);