Big SQL Analyze Exposed: Analyze Properties

There are a number of Analyze properties that can be modified in order to influence the behavior of the ANALYZE command. See Enabling or disabling cumulative statistics for an example of how to set properties in Big SQL. In this article, we will outline what the Analyze properties are, their range of values, and their behavior in BigInsights Big SQL v4.2.

Since v4.2

Auto-Analyze Properties

See Auto-Analyze Customization.

biginsights.stats.cumulative

Whether or not to preserve previously collected statistics of columns and/or column groups that have not been specified in a new Analyze execution against a table.

- Default: true

Pros and Cons

See Big SQL Analyze in Stages: Cumulative Statistics.

biginsights.stats.wide.table.min.columns

Specifies the threshold of number of columns for which a Hadoop table is considered a wide table. When the ALL COLUMNS clause is used on a Hadoop table that has more columns than this number, Histogram and MFV statistics will not be collected.

- Default/Min/Max: 50 / 0 / 32767

Pros

Prevents the statistics catalog from being filled with Histogram and MFV values, due to Hadoop tables having lots of columns.

Cons

The column(s) of a wide table that could benefit from Histogram and/or MFV values will not have those statistics collected during Analyze when the ALL COLUMNS clause is used.

Since v4.1.0.2

biginsights.stats.use.v2

Whether or not to use Analyze v2 for statistics collection.

- Default: true

Pros and Cons

See 10 Issues Resolved for Analyze in Big SQL v4.2.

bigsql.java.io.tp.size

⚠ WARNING:
This property can give a significant performance boost to Analyze. However, it also affects the Java Reader, potentially negatively. Queries executed against ORC, RCFile, custom formats, and/or HBase utilize the Java Reader. Tread carefully…

Controls the number of threads used to process the Analyze command on each worker node. Unless otherwise directed, it is recommended to only use this at the session level. For optimal performance, the number set should not exceed the number of CPU cores on a worker node.

The ratio of this number to the number of CPU cores allocated to Big SQL is approximately the percentage of Big SQL CPU resources that will be utilized by the Analyze command. For instance, suppose the total number of cores is 40, and this number is set to 8. If Big SQL is configured to use 50% of resources, then it would have roughly 20 cores. Analyze would therefore utilize approximately 8/20 = 40% of the CPU resources allocated to Big SQL.

- Default/Min/Max: 8 / 1 / 50

Pros

Increasing this number can result in significantly faster Analyze execution. Decreasing this number results in less resource (Memory and CPU) usage by Analyze.

Cons

Increasing this number increases the system resources (Memory and CPU) utilized by Analyze/Java Reader linearly. Decreasing this number reduces Analyze performance.

Since v3.0.0.2 IFIX

biginsights.stats.use.homedir

Whether the Analyze synopsis files (._biginsights_stats/) are stored in the table’s directory or the user’s home directory.

- Default: false

Pros

When false, the synopsis files are stored in the table’s directory (in HDFS). All users executing Analyze will be able to make use of these files for Incremental Analyze. Can be set to true if storage in the table’s directory is undesired, or security settings prevent writing to the table’s directory.

Cons

If set to true, the synopsis files will be stored in the executing user’s home directory in HDFS. This prevents one user from making use of the synopsis files generated by another user for Incremental Analyze.

Since v3.0

biginsights.stats.auto.alter

Whether or not to automatically alter STRING or VARCHAR(32672) columns to VARCHAR columns of an appropriate size for the data in the column. This alter only occurs if bigsql.string.size=32672.

- Default: true

Pros

When true, STRING and VARCHAR(32672) columns are altered to VARCHAR(x).

Cons

If loading a table in multiple stages, the column(s) could be set to an incorrect size if Auto-Analyze executes against the table before all stages of the Load complete.

biginsights.stats.hist.num

Controls the number of histogram values that are collected for each and every column specified (or implied by the ALL COLUMNS clause) . Histogram statistics collection can be disabled by setting this value to 0 or 1.

- Default/Min/Max: 100 / 0 / 32767

Pros

A larger number of histogram values allows for greater filtering of rows in queries, which can improve query performance. The reverse is true for a low number of histogram values.

Cons

A larger number of histogram values may result in slower performance of the Analyze command. Also, more statistics catalog storage is required. Disabling can have adverse effects on queries.

biginsights.stats.mfv.num

Controls the number of most frequent values (MFVs) that are collected for each and every column specified (or implied by the ALL COLUMNS clause) . MFV statistics collection can be disabled by setting this value to 0.

- Default/Min/Max: 100 / 0 / 32767

Pros

A larger number of MFVs allows for greater filtering of rows in queries, which can improve query performance. The reverse is true for a low number of MFVs.

Cons

A larger number of MFVs may result in slower performance of the Analyze command. Also, more statistics catalog storage is required. Disabling can have adverse effects on queries.

biginsights.stats.ndv.error

Determines the percentage of error that is allowed in the number of distinct values (NDV) estimation.

- Default/Min/Max: 5.0 / 0 / 100.0

Pros

A larger number can improve Analyze performance.

Cons

A larger number will likely result in less accurate NDV estimations, which can severely impact query performance.

Additional Information

Visit the Knowledge Center for more details on the ANALYZE command.
To get started, download the IBM Open Platform with Apache Hadoop.
Visit the Knowledge Center for installation and configuration information.

Tips

Big SQL Analyze Exposed: Analyze Properties - Hadoop Dev

Technical Blog Post

Abstract

Body

Since v4.2

Auto-Analyze Properties

biginsights.stats.cumulative

Pros and Cons

biginsights.stats.wide.table.min.columns

Pros

Cons

Since v4.1.0.2

biginsights.stats.use.v2

Pros and Cons

bigsql.java.io.tp.size

Pros

Cons

Since v3.0.0.2 IFIX

biginsights.stats.use.homedir

Pros

Cons

Since v3.0

biginsights.stats.auto.alter

Pros

Cons

biginsights.stats.hist.num

Pros

Cons

biginsights.stats.mfv.num

Pros

Cons

biginsights.stats.ndv.error

Pros

Cons

Additional Information

UID

Share your feedback

Need support?