How to diagnose and correct LOAD HADOOP problems

The Db2® Big SQL LOAD HADOOP statement is a powerful tool for loading data from relational databases or from delimited files in external or local file systems. This guide to diagnosing and correcting problems that you might encounter while using the LOAD statement can help you further expand your use of LOAD.

For detailed information about this statement, see LOAD HADOOP statement.

Finding error information in the bigsql.log file

When the LOAD statement fails, the SQL message that is returned can contain a Log Entry Identifier, such as [BSL-n-xxxxxxx]. This identifier marks the location in the bigsql.log file where more error information can be found. The n position in the identifier is the Db2 Big SQL node where the file exists. The default local file location is /var/ibm/bigsql/logs/bigsql.log.

You can also use the function SYSHADOOP.LOG_ENTRY to display a snippet of the bigsql.log file near the log entry ID. This function is helpful if you are using a remote JDBC client to connect to Db2 Big SQL. The following SELECT statement shows how to use the SYSHADOOP.LOG_ENTRY function to get the information from the bigsql.log file:
SELECT * FROM TABLE(SYSHADOOP.LOG_ENTRY('BSL-0-22b40517b'));

Finding error information in the logs of MapReduce jobs

When the LOAD HADOOP statement fails, the SQL message can contain a job identifier. This identifier is the identifier of the Hadoop MapReduce job that the LOAD statement runs, and you can use it to locate the Hadoop job on the MapReduce Job History web page. The following address is the default URL for the Job History web page:
http://<mapreduce-server>.com:19888/jobhistory
The mapreduce-server value is the node where the MapReduce Job History server is installed.
On the Job History web page, find and select the job identifier for the LOAD job. There are three places where you can see job logs:
Application Master
The MapReduce Job xxxx page contains a section called Application Master. In this section, select the logs link, and then select the Click Here link to see the entire file.

This log file might contain errors when starting, committing, or cleaning up the job. It can also have Out of Memory or other resource errors.

Map
The MapReduce Job xxxx page contains a section with column Task Type. Select the map link, which shows the map tasks that were run. Select each map task link to see the task attempts. Each task attempt contains a logs link. Open the log, and select the Click Here link to see the entire file.

These log files might have rejected record information, problems related to accessing the source data, or problems related to writing to the target table.

Reduce
The MapReduce Job xxxx page contains a section with column Task Type. Select the reduce link, which shows the reduce tasks that were run. Not all LOAD jobs have reduce tasks, so there might not be any tasks listed.

Select each reduce task link to see the task attempts. Each task attempt contains a logs link. Open the log, and select the Click Here link to see the entire file.

These log files might have error information about writing to the partitions of the target table.

Rejected records

LOAD might complete successfully but return SQLCODE 5108 to indicate that some of the rows were rejected. LOAD stops and fails with SQLCODE 5109 if the number of rejected records exceeds the maximum that is set with the max.rejected.records option.
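For example, the following statement is a minimal sketch of a file LOAD that tolerates up to 500 rejected records. The file path, table name, and property values are placeholders; see the LOAD HADOOP statement reference for the complete syntax:
LOAD HADOOP USING FILE URL '/tmp/sales_data.csv'
   WITH SOURCE PROPERTIES ('field.delimiter' = ',')
   INTO TABLE myschema.sales APPEND
   WITH LOAD PROPERTIES ('max.rejected.records' = '500');
If more than 500 records are rejected, the job fails with SQLCODE 5109. If 500 or fewer records are rejected, the LOAD completes and returns SQLCODE 5108.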

When a record is rejected, the reason is written to the map task attempt log. The following list describes some of the reasons that records might be rejected, along with corrective actions:
Mismatch in source and target data types
  • You might need to use a different data type for your target table.
  • For a database table source, you can CAST the source column to the intended target data type, as shown in the JDBC sketch after this list.
  • Consult the JDBC driver documentation for your database source to determine how its native data types map to JDBC types.
Null value for a column that does not allow null values
  • For a file source, you can use the rejected.records.dir option to write the rejected records to a file, as shown in the sketch after this list. You can then inspect the rejected records file, make the needed corrections, and retry the LOAD statement.
Overflow of decimal value
  • You might need to specify the target column with a different precision and scale to match the source data.
  • You can use the decimal.overflow option to round values to match the target precision and scale.
Multi-row source file
  • If the source file contains records that span multiple lines, then you must set the allow.multiline.record option to true.
Incorrect field delimiter, which can cause a record to not be parsed
  • Check the source file and verify that it contains the same field delimiter that you specified in the field.delimiter option.
  • If the field delimiter is a non-printable character, then you can specify it as an escaped character (for example, \t) or as an octal value (for example, \001).
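The following sketch applies two of these corrective actions to a file source: it specifies a non-printable field delimiter as an octal value and writes rejected records to a directory for inspection. The path and table names are placeholders, and the placement of each option as a source property or a load property should be verified in the LOAD HADOOP statement reference:
LOAD HADOOP USING FILE URL '/tmp/events.dat'
   WITH SOURCE PROPERTIES (
      'field.delimiter' = '\001',
      'allow.multiline.record' = 'true')
   INTO TABLE myschema.events APPEND
   WITH LOAD PROPERTIES ('rejected.records.dir' = '/tmp/load_rejects');
For a database table source, a CAST in the source query can correct a data type mismatch. In the following sketch, the connection URL, credentials, table, and column names are placeholders; the $CONDITIONS token in the WHERE clause lets the LOAD split the query across map tasks:
LOAD HADOOP USING JDBC CONNECTION URL 'jdbc:db2://dbserver.example.com:50000/SALESDB'
   WITH PARAMETERS ('user' = 'dbuser', 'password' = 'dbpasswd')
   FROM SQL QUERY 'SELECT ID, CAST(PRICE AS DECIMAL(10,2)) AS PRICE FROM ORDERS WHERE $CONDITIONS'
   SPLIT COLUMN ID
   INTO TABLE myschema.orders APPEND;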

Source file data corrections

There are several LOAD options that can help you load data from a file with the following characteristics:
  • Column names on the top line.
  • Too many fields.
  • Fields in the wrong order.
  • Multi-line records.
  • Non-standard date or timestamp formats.
  • Decimal values that overflow the target precision or scale.
  • Placeholder values for NULL.
Use the following options to avoid some LOAD errors; a sketch that combines several of them follows the list:
  • date.time.format - specify non-standard date or timestamp formats.
  • replace.with.null and replace.string.with.null - specify placeholder strings that are interpreted as NULL values.
  • skip.lines.count - skip the top lines that do not have data.
  • allow.multiline.record - allow records to span multiple lines in the file.
  • decimal.overflow - round decimal values to fit the target precision and scale.
  • field.indexes - reorder the fields in the source file to match the target columns.
  • ignore.extra.fields - useful when the number of source fields does not match the number of target columns.
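The following sketch combines several of these options to load a delimited file that has a header line, the string N/A as a placeholder for NULL, fields in a different order than the target columns, and a non-standard timestamp format. All names, formats, and property values are illustrative only; see the LOAD HADOOP statement reference for the valid values of each option:
LOAD HADOOP USING FILE URL '/tmp/quarterly_report.csv'
   WITH SOURCE PROPERTIES (
      'skip.lines.count' = '1',
      'date.time.format' = 'yyyy-MM-dd HH:mm:ss',
      'replace.string.with.null' = 'N/A',
      'field.indexes' = '2,1,3',
      'ignore.extra.fields' = 'true',
      'decimal.overflow' = 'round')
   INTO TABLE myschema.quarterly_report APPEND;
With these properties, the first line of the file is skipped, source fields 2, 1, and 3 are mapped to the first three target columns, the string N/A is loaded as NULL, and decimal values are rounded to fit the target precision and scale.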