ANALYZE command

Use the ANALYZE command to gather statistics for Hadoop and HBase tables. These statistics are used by the optimizer to determine the most optimal access plans to efficiently process your queries. The more statistics that you collect on your tables, the better decisions the optimizer can make to provide the best possible access plans. When you run ANALYZE on a table to collect these statistics, the query generally runs faster.

Important: Do not use the RUNSTATS command against Hadoop or HBase tables, because doing so deletes any previously collected statistics, which has an adverse impact on query performance.

There are two levels of statistics that you can collect:

Table level:: You can gather statistics about table level characteristics, such as the number of records.
Column level:: You can gather statistics about your columns, such as the number of distinct values. You can also gather statistics for column groups, which is useful if columns have a relationship.

Authorization

The authorization ID of the command must hold at least one of the following privileges or authorities:

CONTROL privilege on the table.
The creator of the table automatically has this privilege. For all other users, the CONTROL privilege must be granted to the user. Granting the CONTROL privilege through a group or a role is not sufficient. ACCESSCTRL or SECADM authority is required to grant the CONTROL privilege.
DATAACCESS authority on the database.
The creator of the database and users with DBADM authority automatically have this authority. A user with SECADM authority can grant the DATAACCESS authority to others.

Syntax

analyze-col

colgroup

colobj

table-sampling

Description

table-name

The name of the table that you want to analyze. You can specify any Db2® Big SQL table (including DB2® regular tables) or views.

COMPUTE STATISTICS

Gathers statistics. You can include the following options:

analyze-col

INCREMENTAL: Only partitions that do not have updated statistics are scanned. The option is ignored for HBase tables, or if the table is not partitioned.
FULL: For a partitioned table, this value results in a full scan of the table to collect statistics. On a non-partitioned table, the scan is always a FULL scan.

NOSCAN

When you specify the optional parameter NOSCAN, there is some performance improvement in the ANALYZE command because ANALYZE does not scan files. By using NOSCAN, ANALYZE gathers only the following statistics:

Number of files
Table size in bytes
Number of partitions

PARTIALSCAN

The PARTIALSCAN option is valid for tables that use the RCFile format only. This option is not valid for HBase tables. Only the block header information of the file is accessed to get the file size in bytes, and the number of files.

COPYHIVE

No statistics are gathered on the table. The statistics that the Hive metastore has on the table and its columns is copied to the Db2 Big SQL metastore.

FOR COLUMNS

Table statistics and column-level statistics are gathered for the columns that you specify. You must specify at least one column, or one column group as a parameter in FOR COLUMNS, or ANALYZE returns a syntax error.

colobj

You can specify columns, column groups, or both. Separate each column group or column name with a comma. Enclose column groups within parentheses. For a column group, only the number of distinct values is gathered on the grouping of columns. You can intermix individual columns and column groups. For example,

...FOR COLUMNS (c1,c3),c2,(c4,c5),c7,c8...;

FOR ALL COLUMNS

Table statistics and column-level statistics are gathered for all columns of the table. This option is used when ANALYZE is automatically triggered by Db2 Big SQL.

colgroup: An optional list of column groups only can be included between the COLUMNS and TABLESAMPLE (optional) clauses. Any individual columns that you specify trigger a syntax error.
Enclose column groups within parentheses. For a column group, only the number of distinct values is gathered on the grouping of columns.

TABLESAMPLE SYSTEM | BERNOULLI (numeric-literal)

SYSTEM

This parameter is supported for all Db2 Big SQL table types, including views. You use this parameter to collect statistics on a sample of HDFS splits. The term splits means the division of work that Db2 Big SQL generates to compute data in parallel, which can vary according to file type (such as text file or parquet), table type (HADOOP or HBASE), and configuration settings. These sample statistics are then used to extrapolate the statistics of the entire table. The numeric-literal is the target percentage of splits to scan during ANALYZE. Therefore, if the value of the numeric-literal is 10, it might mean that 10% of the splits are sampled. For example, if the table has data that resides on 10 splits, 1 split is used in the sample. However, if the table is small enough and resides on 2 splits, then TABLESAMPLE SYSTEM (10) scans 1 split, which is about 50% of the table. ANALYZE makes adjustments for small tables so that statistics are extrapolated correctly. A default sample size of 10% is used when ANALYZE is automatically triggered in Db2 Big SQL. This value speeds up the performance of the ANALYZE command with little impact on query performance.

BERNOULLI

This parameter is supported only for statistical views. You use this parameter to collect statistics on a sample of the rows from the statistical view rather than all of the rows. The numeric-literal is the target percentage of rows to analyze.

Statistical views can be very large because they can potentially describe join operations between multiple large tables. When you use this parameter, you can significantly reduce the time it takes to run ANALYZE for a statistical view.

Bernoulli sampling considers each row individually, including the row with probability P/100 (where P is the value of the numeric-literal ) and excluding it with probability 1-P/100. Therefore, if the value of the numeric-literal is the value 10, which represents a 10% sample, each row is included with a probability of 0.1, and is excluded with a probability of 0.9.

Usage notes

Results are written to the Db2 stats catalogs and to HDFS.
You must include at least one column name in the ANALYZE command if you specify FOR COLUMNS.
When you run the ANALYZE command against a table on a set of columns, and then later run ANALYZE on a second set of columns, the statistics that are gathered from the first ANALYZE command are merged with the statistics that are gathered from the second ANALYZE command. Therefore, if you decide that you need to run ANALYZE on additional columns after the first ANALYZE command is run, be sure to run ANALYZE on the second set of columns only. This practice speeds up the time that it takes ANALYZE to complete.
You can modify a property in Db2 Big SQL that disables or re-enables cumulative statistics. For instructions on turning cumulative statistics on or off, see Enabling or disabling cumulative statistics collection.
Because gathering statistics is crucial for good query performance, ANALYZE is triggered after a successful LOAD HADOOP statement. The ANALYZE command runs by default on all the columns in the table using a 10% table sample. The following is an example of an ANALYZE command that runs after a successful load operation:
```
ANALYZE TABLE schema.table
  COMPUTE STATISTICS FOR ALL COLUMNS
  TABLESAMPLE SYSTEM(10);
```
You can modify a property in Db2 Big SQL that disables or enables an automatic analyze operation after a LOAD HADOOP statement. For instructions on enabling or disabling ANALYZE after a load operation, see Configuring automatic ANALYZE after LOAD HADOOP.
Because gathering statistics is crucial for good query performance, ANALYZE is automatically triggered after the HCAT_SYNC_OBJECTS stored procedure is called to ingest data into Db2 Big SQL from Hive. The ANALYZE command runs on all the columns in the table by default using a 10% table sample. The following is an example of an ANALYZE command that is run after a successful HCAT_SYNC_OBJECTS call:
```
ANALYZE TABLE schema.table
  COMPUTE STATISTICS FOR ALL COLUMNS
  TABLESAMPLE SYSTEM(10);
```
You can modify a property in Db2 Big SQL that disables or enables an automatic analyze operation after an HCAT_SYNC_OBJECTS call. For instructions on enabling or disabling ANALYZE after HCAT_SYNC_OBJECTS, see Configuring automatic ANALYZE after HCAT_SYNC_OBJECTS.
When a Db2 Big SQL table increases in size significantly, the statistics that were previously gathered become out of date. As a result, ANALYZE must be run for optimal query performance. Db2 Big SQL automatically checks to see whether Hadoop or HBase tables changed significantly, and if so, ANALYZE is run automatically.
Although the memory footprint of the ANALYZE command has been reduced considerably, here are some ways to reduce the footprint even further:
- If you do not require distribution statistics for columns, turn them off by setting the following properties to zero (0):
  biginsights.stats.hist.num
  The number of histogram buckets.
```
set hadoop property biginsights.stats.hist.num=0;
```
  biginsights.stats.mfv.num
  The number of most frequent values.
```
set hadoop property biginsights.stats.mfv.num=0;
```
  If you set these properties to 0, the ANALYZE command can use less memory and less storage space. With the properties set to 0, histogram and MFV statistics are not collected. However, basic statistics like min, max, cardinality, and number of distinct values (NDV) are still collected.
- Because automatic ANALYZE commands are run against all columns in a table, in cases where the table has a lot of columns, the storage and memory requirements increase when distribution and MFV statistics are collected. By default, if a table has over 50 columns, any ANALYZE command that specifies the FOR ALL COLUMNS clause does not collect distribution and MFV statistics. If you want to increase or decrease this value, toggle it by setting the following property:
```
SET HADOOP PROPERTY biginsights.stats.wide.table.min.columns=50;
```
The NOSCAN option enables ANALYZE to complete much faster. However, this can impact performance.
It is a good idea to gather statistics on all Db2 Big SQL tables that are used in your queries. Collect column-level statistics for all columns that your predicates reference, including join predicates. Collect column group statistics for all sets of related columns, such as (country, city), that your predicates reference.
You can query the SYSCAT.TABLES catalog view to determine the total number of partitions, the number of files, and the total size (in kilobytes) of an Hadoop table, if you have run the ANALYZE command.
Do not include the ARRAY, ROW, or STRUCT data type in the FOR COLUMNS clause. You can analyze a table that contains these data types, but no statistics are returned for those columns.
Using the TABLESAMPLE SYSTEM clause might cause more data to be scanned than was requested if the number of splits is small. For example, if the table has only one split, the entire table is analyzed. If 10% sampling is requested and the number of splits is 2, approximately 50% (one split) of the table data is scanned.
When FOR ALL COLUMNS is specified on a partitioned table with more than one partitioning column, and no column groups are included, statistics are automatically collected on a column group that contains all of the partitioning columns.
Concurrent analyze operations on the same table are not supported. If you start an analyze operation while another one is in progress on the same table, the second operation is canceled and a warning message is returned.

Examples

Example 1: Analyze a non-partitioned table.

ANALYZE TABLE myschema.Table2 
  COMPUTE STATISTICS FOR COLUMNS (c1,c2),c3,c4;

This statement gathers statistics for Table2, along with column statistics for c3 and c4 and column grouping statistics for (c1,c2). When you run a query on Table2 after you use the ANALYZE command, the query generally runs faster.

Example 2: Analyze a table and specific columns and then use SYSSTAT.COLUMNS to view the statistics.

Gather statistics on table MRK_PROD_SURVEY_TARG_FACT:

ANALYZE TABLE gosalesdw.mrk_prod_survey_targ_fact
  COMPUTE STATISTICS FOR COLUMNS month_key,product_key,product_survey_key,
    product_topic_target;

Select from the SYSSTAT.COLUMNS catalog table to display the statistics:

SELECT 
    CAST(COLNAME AS VARCHAR(20)) AS "COL_NAME", COLCARD, 
    CAST(HIGH2KEY AS VARCHAR(100)) AS "HIGH2KEY", 
    CAST(LOW2KEY AS VARCHAR(100)) AS "LOW2KEY", NUMNULLS, AVGCOLLEN 
  FROM SYSSTAT.COLUMNS
  WHERE
    TABNAME = 'MRK_PROD_SURVEY_TARG_FACT' AND
   (COLNAME='PRODUCT_KEY' OR COLNAME='PRODUCT_TOPIC_TARGET')
  ORDER BY COLNAME;

Output with part of the statistics is shown in the following example:


+----------------------+---------+----------+---------+----------+-----------+
| COL_NAME             | COLCARD | HIGH2KEY | LOW2KEY | NUMNULLS | AVGCOLLEN |
+----------------------+---------+----------+---------+----------+-----------+
| PRODUCT_KEY          |      90 | 30132    | 30001   |        0 |         4 |
| PRODUCT_TOPIC_TARGET |       8 | 1.0      | 0.495   |        0 |         8 |
+----------------------+---------+----------+---------+----------+-----------+

Example 3: Analyze an HBase table.

CREATE HBASE TABLE HBTable (
  c1 INT,
  c2 INT,
  c3 INT,
  c4 VARCHAR(20),
  c5 VARCHAR(40),
  c6 VARCHAR(90)
)
  COLUMN MAPPING (
    KEY MAPPED BY (c1,c2,c3),
    cf1:cq1 MAPPED BY (c4,c5)
      ENCODING DELIMITED
      FIELDS TERMINATED BY '\b',
    cf1:cq2 MAPPED BY (c6)
      ENCODING BINARY
  )
DEFAULT ENCODING BINARY;

ANALYZE TABLE HBTable
  COMPUTE STATISTICS FOR COLUMNS (c1,c2,c3,c4,c5,c6);

This statement gathers statistics for HBTable, along with column statistics for all of the columns in the table. The key is c1, c2, and c3. Columns c4, c5 are part of column family cf1:cq1. Column c6 is part of column family cf1:cq2.

Example 4: Use the TABLESAMPLE SYSTEM parameter to collect statistics on your Hadoop or HBase table.

ANALYZE TABLE myschema.Table2
  COMPUTE STATISTICS FOR COLUMNS (c1,c2),c3,c4
    TABLESAMPLE SYSTEM (10);

Example 5: Use the cumulative statistics feature of the ANALYZE command to collect statistics on an additional set of columns.

ANALYZE TABLE myschema.Table2
  COMPUTE STATISTICS FOR COLUMNS (c1,c2),c3,c4;
ANALYZE TABLE myschema.Table2
  COMPUTE STATISTICS FOR COLUMNS c5,c6;

Example 6: Collect table statistics with no columns specified.

ANALYZE TABLE myTable
  COMPUTE STATISTICS TABLESAMPLE BERNOULLI (5);
ANALYZE TABLE myTable
  COMPUTE STATISTICS TABLESAMPLE SYSTEM (10);

Example 7: Run ANALYZE on some column groups of myTable, specifying TABLESAMPLE SYSTEM.

ANALYZE TABLE myTable
  COMPUTE STATISTICS FOR COLUMNS (i,j),(k,l,m)
    TABLESAMPLE SYSTEM (10);