Analyze Troubleshooting tips – what to do and look for when things go wrong.
Analyze v2 in Big SQL v4.2 has made tremendous strides in terms of performance, reliability, and capability. Some of these capabilities include:
- ability to run multiple concurrent Analyze jobs
- ability to Analyze hundreds of columns at a time
- ability to preserve statistics when Analyzing new columns only
- ability to Analyze only a sample of a table
- ability to automatically Analyze tables
Analyze v2 achieved some of these gains by shedding its reliance on Hive, MapReduce and YARN, making it more versatile. However, with added capability and versatility comes added complexity, so problems can still arise. Here’s how to troubleshoot Analyze when things go wrong.
Is the Cluster Healthy?
If Analyze is performing slower than normal/expected, or errors suddenly start occurring, this is the first thing you’ll want to confirm. Analyze performance is directly related to the number of active Big SQL worker nodes in the cluster – more is better. Analyze will still work if one or a few Big SQL worker nodes are down (albeit with reduced performance). However, Analyze – and most other Big SQL operations – will not work if the head node is down, so verify that it is up.
Along with this, confirm that the following services are healthy: HDFS, Hive, and HBase (when Analyzing HBase tables). Oozie and ZooKeeper are indirectly required by Big SQL/Hadoop (there may be others), so check on those too. Finally, ensure there are no hardware issues, such as failed disks or network problems.
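As a quick HDFS health check, the DataNode summary printed by `hdfs dfsadmin -report` can be scanned for dead nodes. The sketch below parses a captured sample of that report (the exact wording can vary between Hadoop versions, so treat the pattern as an assumption to adjust):

```shell
# Count dead DataNodes from `hdfs dfsadmin -report` output.
# A captured sample is used here; on a live cluster, substitute:
#   report=$(hdfs dfsadmin -report)
report="Live datanodes (3):
Dead datanodes (1):"

# Extract the number inside "Dead datanodes (N):"
dead=$(printf '%s\n' "$report" | sed -n 's/^Dead datanodes (\([0-9]*\)).*/\1/p')
echo "dead datanodes: ${dead:-0}"
```

A non-zero count here is a strong hint that slow or failing Analyze jobs are a cluster problem, not an Analyze problem.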
Does the User Have Authorization and Permission?
Users need appropriate authorization to execute Analyze. You should also confirm that the executing user has the appropriate permissions on the directories and files under the table’s directory in HDFS.
In some setups, security measures such as Kerberos will revoke write permissions on the table’s directory. In such instances, Analyze may not work with the default settings. To work around this, set the biginsights.stats.use.homedir property to true. With this setting in place, Analyze will write to the executing user’s home directory in HDFS, instead of the table’s directory.
Do SELECTs Work?
The vast majority of Analyze’s work is done in the Java Reader, and this work is a lot like a simple Select query. To quickly check if the problem is indeed with Analyze, do the following:
1. ANALYZE TABLE tab1 COMPUTE STATISTICS FOR ALL COLUMNS; ← failing statement
2. SET DFS_EXTERNAL_INPUT_LIBRARY = 'JAVA'; ← use the Java Reader for queries
3. SELECT * FROM tab1 FETCH FIRST 2 ROWS ONLY; ← does this succeed?
4. SELECT * FROM tab1 WHERE nonpartcol=missing-value; ← does this succeed?
5. SET DFS_EXTERNAL_INPUT_LIBRARY = NULL; ← restore the default Reader
If #3 or #4 fails, the problem is likely not directly due to Analyze. For #4, specify any non-partitioning column (nonpartcol) of the table (tab1) and supply a value that does not exist (missing-value) in that column’s data. The intention here is to scan all the data without returning any results. Note: using a partitioning column in #4 will trigger partition elimination, so the data would not actually be scanned.
What Does the Log Say?
Ring-ding-ding-ding … no, it should never say that. But in the error message of the failed Analyze execution, there should be something like: Log entry identifier: "[BSL-0-493b55a1e]".
- BSL means Big SQL Log (bigsql.log); similarly, SCL stands for Scheduler Log (bigsql-sched.log)
- -0- means node 0, the head node; -1- is worker node 1, -2- is worker node 2, etc.
- 493b55a1e is the unique hexadecimal identifier for this error
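The decoding above is mechanical, so it can be scripted. This sketch splits a log entry identifier into its three parts using plain POSIX parameter expansion:

```shell
# Decode a Big SQL log entry identifier such as "BSL-0-493b55a1e".
id="BSL-0-493b55a1e"

prefix=${id%%-*}     # BSL or SCL
rest=${id#*-}
node=${rest%%-*}     # 0 = head node, 1 = worker node 1, ...
err=${rest#*-}       # unique hexadecimal identifier

case "$prefix" in
  BSL) logfile="bigsql.log" ;;
  SCL) logfile="bigsql-sched.log" ;;
  *)   logfile="unknown" ;;
esac

echo "search $logfile on node $node for $err"
```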
Log files can be found in $BIGSQL_DIST_VAR/logs/ of the indicated node. See Logs and their Locations for more information. Alternatively, log entries can be retrieved by issuing the following query:
SELECT * FROM TABLE(syshadoop.log_entry('BSL-0-493b55a1e'))
The log entry is usually an ERROR or sometimes WARN message, with a stack trace resembling:
...
at com.ibm.biginsights.biga.udf.stats.AnalyzeTool.analyze(AnalyzeTool.java:##)
at com.ibm.biginsights.biga.udf.BIGSQL_DDL.performAnalyze(BIGSQL_DDL.java:##)
at com.ibm.biginsights.biga.udf.BIGSQL_DDL.doAnalyzeStatement(BIGSQL_DDL.java:##)
at com.ibm.biginsights.biga.udf.BIGSQL_DDL.processDDL(BIGSQL_DDL.java:##)
Worker Node Indicator
Sometimes the BSL message shows node 0, but the error actually took place on a worker node. To determine if this is true, look for the following in the stack trace of the error message (unfortunately, this won’t tell you which worker node):
...
at com.ibm.db2.jcc.am.ip.execute(Unknown Source)
at com.ibm.biginsights.biga.udf.stats.AnalyzeCatalog.exec(AnalyzeCatalog.java:##)
at com.ibm.biginsights.biga.udf.stats.AnalyzeDB2.doAnalyze(AnalyzeDB2.java:##)
OR .AnalyzeDB2.doPartitionAnalyze(AnalyzeDB2.java:##)
Rollover
Depending on the workload and logging level, log files may roll over. Therefore, the message you are looking for may instead be in bigsql.log.1 or another rolled copy. If/when contacting IBM support for help, you’ll likely see faster resolution if the above (and below) information is provided, as it can be used to quickly identify what and where the issue is.
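When logs have rolled over, a quick way to find the right file is to grep the log entry identifier across every rolled copy. The sketch below builds a throwaway demo directory so it runs anywhere; on a real node you would point grep at $BIGSQL_DIST_VAR/logs/ instead:

```shell
# Find which rolled copy of bigsql.log contains a given log entry id.
# A temp directory with fake logs stands in for $BIGSQL_DIST_VAR/logs/.
logdir=$(mktemp -d)
touch "$logdir/bigsql.log" "$logdir/bigsql.log.1"
echo "2017-01-01 ERROR [BSL-0-493b55a1e] analyze failed" > "$logdir/bigsql.log.2"

# -l prints only the names of files that contain a match
match=$(grep -l 'BSL-0-493b55a1e' "$logdir"/bigsql.log*)
echo "$match"
```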
Out-of-Memory
On rare occasions, the error may indicate an out-of-memory condition. In such cases, it is helpful to locate the corresponding heapdump.*.phd file generated on the affected node(s), which IBM support can use to help diagnose the issue.
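Heapdump files can be gathered with a simple find. The sketch below demonstrates the pattern against a temp directory with a fabricated dump file; on a real node, search the Big SQL working and log directories of the affected node instead:

```shell
# Locate IBM JVM heapdump files (heapdump.*.phd) to send to IBM support.
# A temp directory with a fake dump stands in for the node's directories.
dir=$(mktemp -d)
touch "$dir/heapdump.20170101.120000.12345.0001.phd"

found=$(find "$dir" -name 'heapdump.*.phd')
echo "$found"
```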
Debugging is your Friend
Sometimes the error message indicates an issue that is data dependent. However, it is oftentimes impractical to determine what data in/about the table is causing the issue. If you wish to be ultra-helpful (and I know you do, because you’re such an awesome person), then DEBUG logging can usually help deduce the issue. With debug logging, a significant amount of Analyze information is written out to the log file, which can help tremendously in identifying issues. To turn on debugging for Analyze on a particular node, append the following to the node’s $BIGSQL_HOME/conf/log4j.properties file:
- log4j.logger.com.ibm.biginsights.biga=DEBUG
- log4j.logger.com.ibm.biginsights.bigsql=DEBUG
Save the file, restart Big SQL and re-execute the failing Analyze statement. The debugging information will then be written out to the bigsql.log file on the node(s). It is generally a good idea to perform debug logging on the head node, in addition to any worker nodes that are indicated as problematic. Don’t forget to back up the original log4j.properties file(s), and restore them when done.
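The backup-and-append steps above can be scripted. This sketch operates on a temporary copy so it is safe to run anywhere; on a real node, point conf at the node’s $BIGSQL_HOME/conf/log4j.properties:

```shell
# Enable DEBUG logging for Analyze in log4j.properties, keeping a backup.
conf=$(mktemp)           # stand-in for $BIGSQL_HOME/conf/log4j.properties
echo "log4j.rootLogger=WARN" > "$conf"

cp "$conf" "$conf.bak"   # backup to restore when debugging is done

cat >> "$conf" <<'EOF'
log4j.logger.com.ibm.biginsights.biga=DEBUG
log4j.logger.com.ibm.biginsights.bigsql=DEBUG
EOF
```

Restoring afterwards is just `cp "$conf.bak" "$conf"` followed by another Big SQL restart.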
Debugging Without Restarting Big SQL
The logging level can also be changed without having to restart Big SQL, by executing the following against Big SQL:
- /*_!log('com.ibm.biginsights.biga', 'DEBUG')!_*/ create schema foo;
- /*_!log('com.ibm.biginsights.bigsql', 'DEBUG')!_*/ create schema foo;
Don’t worry, create schema foo won’t actually execute. This changes the logging level system-wide, so don’t forget to undo the change by setting the above loggers back to the default logging level (WARN).
These are the most common things that need to be checked when investigating issues with Analyze. Of course, it doesn’t hurt to also verify that the correct syntax and property settings are being used.
Additional Information
Visit the Knowledge Center for more details on the ANALYZE command.
To get started, download the IBM Open Platform with Apache Hadoop.
Visit the Knowledge Center for installation and configuration information.