ANALYZE command
Use the ANALYZE command to gather statistics for Hadoop and HBase tables. These statistics are used by the optimizer to determine the most optimal access plans to efficiently process your queries. The more statistics that you collect on your tables, the better decisions the optimizer can make to provide the best possible access plans. When you run ANALYZE on a table to collect these statistics, the query generally runs faster.
- Table level:
- You can gather statistics about table level characteristics, such as the number of records.
- Column level:
- You can gather statistics about your columns, such as the number of distinct values. You can also gather statistics for column groups, which is useful if columns have a relationship.
Authorization
- CONTROL privilege on the table.
The creator of the table automatically has this privilege. For all other users, the CONTROL privilege must be granted to the user. Granting the CONTROL privilege through a group or a role is not sufficient. ACCESSCTRL or SECADM authority is required to grant the CONTROL privilege.
- DATAACCESS authority on the database.
The creator of the database and users with DBADM authority automatically have this authority. A user with SECADM authority can grant the DATAACCESS authority to others.
Syntax
Description
- table-name
- The name of the table that you want to analyze. You can specify any Db2® Big SQL table (including DB2® regular tables) or views.
- COMPUTE STATISTICS
- Gathers statistics. You can include the following options:
- analyze-col
-
- INCREMENTAL
- Only partitions that do not have updated statistics are scanned. The option is ignored for HBase tables, or if the table is not partitioned.
- FULL
- For a partitioned table, this value results in a full scan of the table to collect statistics. On a non-partitioned table, the scan is always a FULL scan.
- NOSCAN
-
When you specify the optional parameter NOSCAN, there is some performance improvement in the ANALYZE command because ANALYZE does not scan files. By using NOSCAN, ANALYZE gathers only the following statistics:
- Number of files
- Table size in bytes
- Number of partitions
- PARTIALSCAN
- The PARTIALSCAN option is valid for tables that use the RCFile format only. This option is not valid for HBase tables. Only the block header information of the file is accessed to get the file size in bytes, and the number of files.
- COPYHIVE
- No statistics are gathered on the table. The statistics that the Hive metastore has on the table and its columns is copied to the Db2 Big SQL metastore.
- FOR COLUMNS
- Table statistics and column-level statistics are gathered for the columns that you specify. You
must specify at least one column, or one column group as a parameter in FOR COLUMNS, or
ANALYZE returns a syntax error.
- colobj
- You can specify columns, column groups, or both. Separate each column group or column name with
a comma. Enclose column groups within parentheses. For a column group, only the number of distinct
values is gathered on the grouping of columns. You can intermix individual columns and column
groups. For example,
...FOR COLUMNS (c1,c3),c2,(c4,c5),c7,c8...;
- FOR ALL COLUMNS
- Table statistics and column-level statistics are gathered for all columns of the table. This
option is used when ANALYZE is automatically triggered by
Db2 Big SQL.
- colgroup
An optional list of column groups only can be included between the COLUMNS and TABLESAMPLE (optional) clauses. Any individual columns that you specify trigger a syntax error.
Enclose column groups within parentheses. For a column group, only the number of distinct values is gathered on the grouping of columns.
- TABLESAMPLE SYSTEM | BERNOULLI (numeric-literal)
-
- SYSTEM
-
This parameter is supported for all Db2 Big SQL table types, including views. You use this parameter to collect statistics on a sample of HDFS splits. The term splits means the division of work that Db2 Big SQL generates to compute data in parallel, which can vary according to file type (such as text file or parquet), table type (HADOOP or HBASE), and configuration settings. These sample statistics are then used to extrapolate the statistics of the entire table. The numeric-literal is the target percentage of splits to scan during ANALYZE. Therefore, if the value of the numeric-literal is 10, it might mean that 10% of the splits are sampled. For example, if the table has data that resides on 10 splits, 1 split is used in the sample. However, if the table is small enough and resides on 2 splits, then
TABLESAMPLE SYSTEM (10)
scans 1 split, which is about 50% of the table. ANALYZE makes adjustments for small tables so that statistics are extrapolated correctly. A default sample size of 10% is used when ANALYZE is automatically triggered in Db2 Big SQL. This value speeds up the performance of the ANALYZE command with little impact on query performance. - BERNOULLI
-
This parameter is supported only for statistical views. You use this parameter to collect statistics on a sample of the rows from the statistical view rather than all of the rows. The numeric-literal is the target percentage of rows to analyze.
Statistical views can be very large because they can potentially describe join operations between multiple large tables. When you use this parameter, you can significantly reduce the time it takes to run ANALYZE for a statistical view.
Bernoulli sampling considers each row individually, including the row with probability P/100 (where P is the value of the numeric-literal ) and excluding it with probability 1-P/100. Therefore, if the value of the numeric-literal is the value 10, which represents a 10% sample, each row is included with a probability of 0.1, and is excluded with a probability of 0.9.
Usage notes
- Results are written to the Db2 stats catalogs and to HDFS.
- You must include at least one column name in the ANALYZE command if you specify FOR COLUMNS.
- When you run the ANALYZE command against a table on a set of columns,
and then later run ANALYZE on a second set of columns, the statistics that are
gathered from the first ANALYZE command are merged with the statistics that are
gathered from the second ANALYZE command. Therefore, if you decide that you need
to run ANALYZE on additional columns after the first ANALYZE
command is run, be sure to run ANALYZE on the second set of columns only. This
practice speeds up the time that it takes ANALYZE to complete.
You can modify a property in Db2 Big SQL that disables or re-enables cumulative statistics. For instructions on turning cumulative statistics on or off, see Enabling or disabling cumulative statistics collection.
- Because gathering statistics is crucial for good query performance,
ANALYZE is triggered after a successful LOAD HADOOP statement. The
ANALYZE command runs by default on all the columns in the table using a 10% table
sample. The following is an example of an ANALYZE command that runs after a
successful load operation:
ANALYZE TABLE schema.table COMPUTE STATISTICS FOR ALL COLUMNS TABLESAMPLE SYSTEM(10);
You can modify a property in Db2 Big SQL that disables or enables an automatic analyze operation after a LOAD HADOOP statement. For instructions on enabling or disabling ANALYZE after a load operation, see Configuring automatic ANALYZE after LOAD HADOOP.
- Because gathering statistics is crucial for good query performance, ANALYZE
is automatically triggered after the HCAT_SYNC_OBJECTS stored procedure is called to ingest data
into Db2 Big SQL from Hive. The ANALYZE
command runs on all the columns in the table by default using a 10% table sample. The following is
an example of an ANALYZE command that is run after a successful HCAT_SYNC_OBJECTS
call:
ANALYZE TABLE schema.table COMPUTE STATISTICS FOR ALL COLUMNS TABLESAMPLE SYSTEM(10);
You can modify a property in Db2 Big SQL that disables or enables an automatic analyze operation after an HCAT_SYNC_OBJECTS call. For instructions on enabling or disabling ANALYZE after HCAT_SYNC_OBJECTS, see Configuring automatic ANALYZE after HCAT_SYNC_OBJECTS.
- When a Db2 Big SQL table increases in size significantly, the statistics that were previously gathered become out of date. As a result, ANALYZE must be run for optimal query performance. Db2 Big SQL automatically checks to see whether Hadoop or HBase tables changed significantly, and if so, ANALYZE is run automatically.
- Although the memory footprint of the ANALYZE command has been reduced
considerably, here are some ways to reduce the footprint even further:
- If you do not require distribution statistics for columns, turn them off by setting the
following properties to zero (0):
- biginsights.stats.hist.num
- The number of histogram
buckets.
set hadoop property biginsights.stats.hist.num=0;
- biginsights.stats.mfv.num
- The number of most frequent
values.
set hadoop property biginsights.stats.mfv.num=0;
- Because automatic ANALYZE commands are run against all columns in a table, in
cases where the table has a lot of columns, the storage and memory requirements increase when
distribution and MFV statistics are collected. By default, if a table has over 50 columns, any
ANALYZE command that specifies the FOR ALL COLUMNS clause does not collect
distribution and MFV statistics. If you want to increase or decrease this value, toggle it by
setting the following
property:
SET HADOOP PROPERTY biginsights.stats.wide.table.min.columns=50;
- If you do not require distribution statistics for columns, turn them off by setting the
following properties to zero (0):
- The NOSCAN option enables ANALYZE to complete much faster. However, this can impact performance.
- It is a good idea to gather statistics on all Db2 Big SQL tables that are used in your queries. Collect column-level statistics for all columns that your predicates reference, including join predicates. Collect column group statistics for all sets of related columns, such as (country, city), that your predicates reference.
- You can query the SYSCAT.TABLES catalog view to determine the total number of partitions, the number of files, and the total size (in kilobytes) of an Hadoop table, if you have run the ANALYZE command.
- Do not include the ARRAY, ROW, or STRUCT data type in the FOR COLUMNS clause. You can analyze a table that contains these data types, but no statistics are returned for those columns.
- Using the TABLESAMPLE SYSTEM clause might cause more data to be scanned than was requested if the number of splits is small. For example, if the table has only one split, the entire table is analyzed. If 10% sampling is requested and the number of splits is 2, approximately 50% (one split) of the table data is scanned.
- When FOR ALL COLUMNS is specified on a partitioned table with more than one partitioning column, and no column groups are included, statistics are automatically collected on a column group that contains all of the partitioning columns.
- Concurrent analyze operations on the same table are not supported. If you start an analyze operation while another one is in progress on the same table, the second operation is canceled and a warning message is returned.
Examples
- Example 1: Analyze a non-partitioned table.
This statement gathers statistics for Table2, along with column statistics for c3 and c4 and column grouping statistics for (c1,c2). When you run a query on Table2 after you use the ANALYZE command, the query generally runs faster.ANALYZE TABLE myschema.Table2 COMPUTE STATISTICS FOR COLUMNS (c1,c2),c3,c4;
- Example 2: Analyze a table and specific columns and then use SYSSTAT.COLUMNS to view the statistics.
- Gather statistics on table
MRK_PROD_SURVEY_TARG_FACT:
Select from the SYSSTAT.COLUMNS catalog table to display the statistics:ANALYZE TABLE gosalesdw.mrk_prod_survey_targ_fact COMPUTE STATISTICS FOR COLUMNS month_key,product_key,product_survey_key, product_topic_target;
Output with part of the statistics is shown in the following example:SELECT CAST(COLNAME AS VARCHAR(20)) AS "COL_NAME", COLCARD, CAST(HIGH2KEY AS VARCHAR(100)) AS "HIGH2KEY", CAST(LOW2KEY AS VARCHAR(100)) AS "LOW2KEY", NUMNULLS, AVGCOLLEN FROM SYSSTAT.COLUMNS WHERE TABNAME = 'MRK_PROD_SURVEY_TARG_FACT' AND (COLNAME='PRODUCT_KEY' OR COLNAME='PRODUCT_TOPIC_TARGET') ORDER BY COLNAME;
+----------------------+---------+----------+---------+----------+-----------+ | COL_NAME | COLCARD | HIGH2KEY | LOW2KEY | NUMNULLS | AVGCOLLEN | +----------------------+---------+----------+---------+----------+-----------+ | PRODUCT_KEY | 90 | 30132 | 30001 | 0 | 4 | | PRODUCT_TOPIC_TARGET | 8 | 1.0 | 0.495 | 0 | 8 | +----------------------+---------+----------+---------+----------+-----------+
- Example 3: Analyze an HBase table.
-
CREATE HBASE TABLE HBTable ( c1 INT, c2 INT, c3 INT, c4 VARCHAR(20), c5 VARCHAR(40), c6 VARCHAR(90) ) COLUMN MAPPING ( KEY MAPPED BY (c1,c2,c3), cf1:cq1 MAPPED BY (c4,c5) ENCODING DELIMITED FIELDS TERMINATED BY '\b', cf1:cq2 MAPPED BY (c6) ENCODING BINARY ) DEFAULT ENCODING BINARY;
ANALYZE TABLE HBTable COMPUTE STATISTICS FOR COLUMNS (c1,c2,c3,c4,c5,c6);
- Example 4: Use the TABLESAMPLE SYSTEM parameter to collect statistics on your Hadoop or HBase table.
-
ANALYZE TABLE myschema.Table2 COMPUTE STATISTICS FOR COLUMNS (c1,c2),c3,c4 TABLESAMPLE SYSTEM (10);
- Example 5: Use the cumulative statistics feature of the ANALYZE command to collect statistics on an additional set of columns.
-
ANALYZE TABLE myschema.Table2 COMPUTE STATISTICS FOR COLUMNS (c1,c2),c3,c4; ANALYZE TABLE myschema.Table2 COMPUTE STATISTICS FOR COLUMNS c5,c6;
- Example 6: Collect table statistics with no columns specified.
-
ANALYZE TABLE myTable COMPUTE STATISTICS TABLESAMPLE BERNOULLI (5); ANALYZE TABLE myTable COMPUTE STATISTICS TABLESAMPLE SYSTEM (10);
- Example 7: Run ANALYZE on some column groups of myTable, specifying TABLESAMPLE SYSTEM.
-
ANALYZE TABLE myTable COMPUTE STATISTICS FOR COLUMNS (i,j),(k,l,m) TABLESAMPLE SYSTEM (10);