Troubleshooting import and export of Hadoop formats
Read how to solve problems when importing and exporting Hadoop-specific formats.
Known limitations
- Import to Hadoop formats is implemented as a Hive statement which inserts text data transferred from Db2® Warehouse into a Hive table that is stored in one of the Hadoop formats. The same applies to export in the opposite direction. Therefore, data transfer strongly depends on Hive and on the implementation of formats on a particular Hadoop distribution. This means that any limitations and defects that exist on a particular Hadoop distribution also affect data movement.
- The enable.doAs=true setting in Hive is required in order to import to Hadoop formats. With this setting, operations in Hive are executed as the connected user, rather than as the user who started HiveServer (see the configuration sketch after this list).
- Nonprinting control characters cannot be processed when using Hadoop formats. If your data contains these characters, do not import to Hadoop formats.
- Do not use Avro format when your data contains NUMERIC fields in Db2 Warehouse, which are then mapped to DECIMAL on Hadoop. Converted data might be inconsistent.
- Moving a table containing the timestamp data type that is stored in ORC format might lead to data inconsistencies. This problem depends on the JDBC driver and Hive version. Always double-check that the data is consistent after movement; you can use checksum calculation for that purpose.
- IBM BigInsights limitations:
- On BigInsights 3, importing to Avro and Parquet is not supported and results in the following error:
com.ibm.fq.common.exception.hive.HiveException: Unable to execute Hive sql. Error while processing statement: FAILED: SemanticException Unrecognized file format in STORED AS clause: avr
- On BigInsights 3, do not import tables to ORC format if they contain bigint and int8 data types.
- On BigInsights 4.0, using Parquet is not supported and results in one of the following errors:
Incompatible Long vector column and primitive category VOID
Unimplemented vector assigner for writable type class org.apache.hadoop.hive.serde2.io.HiveDecimalWritable
- On BigInsights 4.0, importing timestamp data type to Avro format is not supported.
- Import to BigSQL is not supported. You can import a table to Hive and then synchronize it manually to BigSQL. However, querying files imported in Avro format from BigSQL is not supported even after manual synchronization.
- Cloudera limitations:
- Import to Hadoop formats is not supported for Cloudera versions prior to 5.3.
- Import of tables with newline characters within rows is not possible for Parquet. The error symptom is extra rows with unaligned data on the record that contains the newline character. You can import such data using other formats.
- Import of timestamp data type to Avro format is not supported on Cloudera.
- Impala does not support ORC format.
- Querying imported Avro tables in Impala is not supported on Cloudera versions prior to 5.5.
- Hortonworks limitations:
- On Hortonworks 2.2, it is not possible to specify the compression type in fq.output.compressed for Avro. You can leave the parameter empty, and the default Hadoop compression is then used.
- On Hortonworks 2.2, import to Parquet is not supported.
- When using Hortonworks 2.3, the add jar command for a Hive JDBC connection might result in NullPointerException.
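The doAs requirement mentioned at the beginning of this list corresponds to HiveServer2 impersonation. The following is a minimal sketch of the setting in hive-site.xml, assuming your distribution exposes it as hive.server2.enable.doAs:
<property>
  <name>hive.server2.enable.doAs</name>
  <value>true</value>
</property>
With this value, HiveServer2 executes statements as the connected user rather than as the user who started HiveServer, which is what import to Hadoop formats requires.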
Common errors and solutions
- Data movement fails with the following error:
ERROR com.ibm.fq.ConnectionValidator - Unable to execute Hive sql. Error while processing statement: Cannot modify avro.output.codec at runtime. It is not in list of params that are allowed to be modified at runtime
Add or modify a custom Hive parameter hive.security.authorization.sqlstd.confwhitelist.append with the value avro.output.codec, as in the following example:
<property>
  <name>hive.security.authorization.sqlstd.confwhitelist.append</name>
  <value>avro.output.codec</value>
</property>
If multiple entries are needed, for instance when both Parquet and Avro compression are used, specify the parameter as follows:
<property>
  <name>hive.security.authorization.sqlstd.confwhitelist.append</name>
  <value>parquet\.compression|avro\.output\.codec</value>
</property>
- Export of a Hive table in Parquet fails if you use a WHERE clause on a FLOAT column. This is a Hive issue; it is described at https://issues.apache.org/jira/browse/HIVE-13114
Symptoms of failed import to Hadoop formats
When the import operation to Hadoop formats is interrupted for any reason, two tables instead of one are created in Hive/HDFS:
- Destination table <table_name>_npscopy
  - For AVRO format, any query might throw the following error:
    java.io.IOException: org.apache.hadoop.hive.serde2.avro.AvroSerdeException: Neither avro.schema.literal nor avro.schema.url specified, can't determine table schema
  - For other formats, the table behaves like an empty table.
- Interim table <table_name>_<process_id>_<milliseconds_from_1970.01.01>_npscopy, for example: table1_3480_1487843174319_npscopy
  - The user is not able to query this table. Querying it results in the following error:
    Error: java.io.IOException: cannot find class com.ibm.nz.codec.NzCopyInputFormat
If two tables are created after the import operation, and you see any of these error messages when querying tables, it means that the import operation failed at some point.
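One quick way to check whether such leftover tables exist is to list the Hive tables whose names contain the npscopy suffix. This is a generic Hive command, not part of the data movement tooling, and the pattern below is only an illustration:
-- list candidate leftover tables from an interrupted import
SHOW TABLES LIKE '*npscopy*';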
Custom settings for Hive
You can define custom settings that are applied before data is inserted into Hive by setting the fq.data.format.setting property in the configuration XML file. It allows you to define your own properties for compression, or any additional Hive settings related to the destination format. As a value, provide a list of comma-separated settings (without the SET command):
<property>
  <name>fq.data.format.setting</name>
  <value>hive.exec.compress.output=true,avro.output.codec=snappy</value>
</property>
This example is equivalent to executing the following statements in Hive before the insert:
SET hive.exec.compress.output=true;
SET avro.output.codec=snappy;
Verifying intermediate text data files
If the results of an import to Hadoop formats are incorrect, you can analyze the problem with the help of the fq.debug.keep.tmptable parameter. When it is set to true, the intermediate text files that are used for the insert into the target table in Hadoop format are not deleted after the import is finished. This parameter can only be used with fq.data.directmovement=false, which is the default. After importing with fq.debug.keep.tmptable=true, the text files are kept in the <target_table_dir>/tmptable subdirectory, and a Hive text table based on this directory, called <target_table>NPSCOPY, is also available.
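For example, to keep the intermediate files for analysis, the parameter can be added to the configuration XML file in the same way as the other properties shown above; a minimal sketch:
<property>
  <name>fq.debug.keep.tmptable</name>
  <value>true</value>
</property>
Remember to remove the setting after troubleshooting so that intermediate text files do not accumulate in HDFS.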