There are a number of Analyze properties that can be modified in order to influence the behavior of the ANALYZE command. See Enabling or disabling cumulative statistics for an example of how to set properties in Big SQL. In this article, we will outline what the Analyze properties are, their range of values, and their behavior in BigInsights Big SQL v4.2.
Since v4.2
Auto-Analyze Properties
See Auto-Analyze Customization.
biginsights.stats.cumulative
Whether or not to preserve previously collected statistics of columns and/or column groups that have not been specified in a new Analyze execution against a table.
- Default: true
Pros and Cons
See Big SQL Analyze in Stages: Cumulative Statistics.
biginsights.stats.wide.table.min.columns
Specifies the threshold of number of columns for which a Hadoop table is considered a wide table. When the ALL COLUMNS clause is used on a Hadoop table that has more columns than this number, Histogram and MFV statistics will not be collected.
- Default/Min/Max: 50 / 0 / 32767
Pros
Prevents the statistics catalog from being filled with Histogram and MFV values, due to Hadoop tables having lots of columns.
Cons
The column(s) of a wide table that could benefit from Histogram and/or MFV values will not have those statistics collected during Analyze when the ALL COLUMNS clause is used.
Since v4.1.0.2
biginsights.stats.use.v2
Whether or not to use Analyze v2 for statistics collection.
- Default: true
Pros and Cons
See 10 Issues Resolved for Analyze in Big SQL v4.2.
bigsql.java.io.tp.size
⚠ WARNING:
This property can give a significant performance boost to Analyze. However, it also affects the Java Reader, potentially negatively. Queries executed against ORC, RCFile, custom formats, and/or HBase utilize the Java Reader. Tread carefully…
Controls the number of threads used to process the Analyze command on each worker node. Unless otherwise directed, it is recommended to only use this at the session level. For optimal performance, the number set should not exceed the number of CPU cores on a worker node.
The ratio of this number to the number of CPU cores allocated to Big SQL is approximately the percentage of Big SQL CPU resources that will be utilized by the Analyze command. For instance, suppose the total number of cores is 40, and this number is set to 8. If Big SQL is configured to use 50% of resources, then it would have roughly 20 cores. Analyze would therefore utilize approximately 8/20 = 40% of the CPU resources allocated to Big SQL.
- Default/Min/Max: 8 / 1 / 50
Pros
Increasing this number can result in significantly faster Analyze execution. Decreasing this number results in less resource (Memory and CPU) usage by Analyze.
Cons
Increasing this number increases the system resources (Memory and CPU) utilized by Analyze/Java Reader linearly. Decreasing this number reduces Analyze performance.
Since v3.0.0.2 IFIX
biginsights.stats.use.homedir
Whether the Analyze synopsis files (._biginsights_stats/) are stored in the table’s directory or the user’s home directory.
- Default: false
Pros
When false, the synopsis files are stored in the table’s directory (in HDFS). All users executing Analyze will be able to make use of these files for Incremental Analyze. Can be set to true if storage in the table’s directory is undesired, or security settings prevent writing to the table’s directory.
Cons
If set to true, the synopsis files will be stored in the executing user’s home directory in HDFS. This prevents one user from making use of the synopsis files generated by another user for Incremental Analyze.
Since v3.0
biginsights.stats.auto.alter
Whether or not to automatically alter STRING or VARCHAR(32672) columns to VARCHAR columns of an appropriate size for the data in the column. This alter only occurs if bigsql.string.size=32672.
- Default: true
Pros
When true, STRING and VARCHAR(32672) columns are altered to VARCHAR(x).
Cons
If loading a table in multiple stages, the column(s) could be set to an incorrect size if Auto-Analyze executes against the table before all stages of the Load complete.
biginsights.stats.hist.num
Controls the number of histogram values that are collected for each and every column specified (or implied by the ALL COLUMNS clause) . Histogram statistics collection can be disabled by setting this value to 0 or 1.
- Default/Min/Max: 100 / 0 / 32767
Pros
A larger number of histogram values allows for greater filtering of rows in queries, which can improve query performance. The reverse is true for a low number of histogram values.
Cons
A larger number of histogram values may result in slower performance of the Analyze command. Also, more statistics catalog storage is required. Disabling can have adverse effects on queries.
biginsights.stats.mfv.num
Controls the number of most frequent values (MFVs) that are collected for each and every column specified (or implied by the ALL COLUMNS clause) . MFV statistics collection can be disabled by setting this value to 0.
- Default/Min/Max: 100 / 0 / 32767
Pros
A larger number of MFVs allows for greater filtering of rows in queries, which can improve query performance. The reverse is true for a low number of MFVs.
Cons
A larger number of MFVs may result in slower performance of the Analyze command. Also, more statistics catalog storage is required. Disabling can have adverse effects on queries.
biginsights.stats.ndv.error
Determines the percentage of error that is allowed in the number of distinct values (NDV) estimation.
- Default/Min/Max: 5.0 / 0 / 100.0
Pros
A larger number can improve Analyze performance.
Cons
A larger number will likely result in less accurate NDV estimations, which can severely impact query performance.
Additional Information
Visit the Knowledge Center for more details on the ANALYZE command.
To get started, download the IBM Open Platform with Apache Hadoop.
Visit the Knowledge Center for installation and configuration information.