Troubleshooting import and export of Hadoop formats

Read how to solve problems when importing and exporting Hadoop-specific formats.

Known limitations

The following is a list of limitations and possible errors when using Fast Data Movement with Hadoop formats. These limitations were found when the data movement feature was tested on the listed environments at the time of the release. Note that these issues might be fixed on the provider's side at any time.
  • Import to Hadoop formats is implemented as a Hive statement which inserts text data transferred from Db2® Warehouse into a Hive table that is stored in one of the Hadoop formats. The same applies to export in the opposite direction. Therefore, data transfer strongly depends on Hive and on the implementation of formats on a particular Hadoop distribution. This means that any limitations and defects that exist on a particular Hadoop distribution also affect data movement.
  • The enable.doAs=true setting in Hive is required to import to Hadoop formats. With this setting, operations in Hive are executed as the connected user rather than as the user who started HiveServer. See the configuration example after this list.
  • Nonprinting control characters cannot be processed when using Hadoop formats. If your data contains these characters, do not import to Hadoop formats.
  • Do not use Avro format when your data contains NUMERIC fields in Db2 Warehouse, which are then mapped to DECIMAL on Hadoop. Converted data might be inconsistent.
  • Moving a table that contains the timestamp data type and is stored in ORC format might lead to data inconsistencies. This problem depends on the JDBC driver and the Hive version. Always double-check that the data is consistent after movement. You can use checksum calculation for that purpose.
  • IBM BigInsights limitations:
    • On BigInsights 3, importing to Avro and Parquet is not supported and it results in the following error:
      com.ibm.fq.common.exception.hive.HiveException: Unable to execute Hive sql. 
      Error while processing statement: FAILED: SemanticException Unrecognized file 
      format in STORED AS clause: avr
    • On BigInsights 3, do not import tables to ORC format if they contain bigint and int8 data types.
    • On BigInsights 4.0, using Parquet is not supported and it results in one of the following errors:
      Incompatible Long vector column and primitive category VOID
      Unimplemented vector assigner for writable type class 
      org.apache.hadoop.hive.serde2.io.HiveDecimalWritable
    • On BigInsights 4.0, importing timestamp data type to Avro format is not supported.
    • Import to BigSQL is not supported. You can import a table to Hive and then synchronize it manually to BigSQL. However, querying files imported in Avro format from BigSQL is not supported even after manual synchronization.
  • Cloudera limitations:
    • Import to Hadoop formats is not supported for Cloudera versions prior to 5.3.
    • Import of tables with new line characters within rows is not possible for Parquet. The symptom is extra rows with misaligned data at the record that contains the new line character. You can import such data by using other Hadoop formats.
    • Import of timestamp data type to Avro format is not supported on Cloudera.
    • Impala does not support ORC format.
    • Querying imported Avro tables in Impala is not supported on Cloudera versions prior to 5.5.
  • Hortonworks limitations:
    • On Hortonworks 2.2, it is not possible to specify the compression type in fq.output.compressed for Avro. You can leave the parameter empty, in which case the default Hadoop compression is used.
    • On Hortonworks 2.2, import to Parquet is not supported.
    • On Hortonworks 2.3, the add jar command over a Hive JDBC connection might result in a NullPointerException.
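
The impersonation setting mentioned above is typically exposed in hive-site.xml as hive.server2.enable.doAs; the exact property name and the place where you change it (for example, a cluster management console) can differ between distributions, so treat the following snippet as a sketch only:
<property>
    <name>hive.server2.enable.doAs</name>
    <value>true</value>
</property>
You typically need to restart HiveServer2 for the change to take effect.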

Common errors and solutions

  • Data movement fails with the following error:
    ERROR com.ibm.fq.ConnectionValidator - Unable to execute Hive sql. Error while processing
          statement: Cannot modify avro.output.codec at runtime. It is not in list of params that are
          allowed to be modified at runtime
    Add or modify the custom Hive parameter hive.security.authorization.sqlstd.confwhitelist.append with the value avro.output.codec, as in the following example (a quick way to verify the setting is shown after this list):
    <property>
    <name>hive.security.authorization.sqlstd.confwhitelist.append</name>
    <value>avro.output.codec</value>
    </property>
    If multiple entries are needed, for example when both Parquet and Avro compression are used, specify the parameter as follows:
    <property>
    <name>hive.security.authorization.sqlstd.confwhitelist.append</name>
    <value>parquet\.compression|avro\.output\.codec</value>
    </property>
  • Export of a Hive table in Parquet format fails if you use a WHERE clause on a FLOAT column. This is a Hive issue, described in HIVE-13114: https://issues.apache.org/jira/browse/HIVE-13114
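
To confirm that the whitelist change described in the first item is active, you can inspect the parameter from a Hive session. The following is a minimal check, assuming a beeline or Hive CLI connection to the affected HiveServer2 instance:
-- SET with no value prints the current value of the parameter
SET hive.security.authorization.sqlstd.confwhitelist.append;
-- After the change, setting the codec at runtime should no longer fail
SET avro.output.codec=snappy;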

Symptoms of failed import to Hadoop formats

When the import operation to Hadoop formats is interrupted for any reason, two tables instead of one are created in Hive/HDFS:

Destination table <table_name>_npscopy
  • For the Avro format, any query might throw the following error:
    java.io.IOException: org.apache.hadoop.hive.serde2.avro.AvroSerdeException: 
    Neither avro.schema.literal nor avro.schema.url specified, 
    can't determine table schema
  • For other formats, the table behaves like an empty table.
Interim table <table_name>_<process_id>_<milliseconds_from_1970.01.01>_npscopy
for example: table1_3480_1487843174319_npscopy
  • You cannot query this table.
  • Any query fails with the following error: java.io.IOException: cannot find class
        com.ibm.nz.codec.NzCopyInputFormat

If two tables exist after the import operation and you see any of these error messages when querying them, it means that the import operation failed at some point.
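
Before you retry the import, you can remove the leftover tables in Hive. The following cleanup is only a sketch that assumes the failed import can simply be repeated after the leftovers are dropped; it uses the example names from above, so substitute your own table names and confirm first that the destination table does not contain data that you still need:
-- List leftover tables from failed imports
SHOW TABLES LIKE '*npscopy*';
-- Drop the destination and interim tables before repeating the import
DROP TABLE IF EXISTS table1_npscopy;
DROP TABLE IF EXISTS table1_3480_1487843174319_npscopy;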

Custom settings for Hive

You can define custom settings that are applied before data is inserted into Hive by setting the fq.data.format.setting property in the configuration XML file. This allows you to define your own properties for compression, or any additional Hive settings related to the destination format. As the value, provide a comma-separated list of settings (without the SET command):
<property>
    <name>fq.data.format.setting</name>
    <value>hive.exec.compress.output=true,avro.output.codec=snappy</value>
</property>
As a result, the following commands are sent to Hive:
SET hive.exec.compress.output=true;
SET avro.output.codec=snappy;
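
For example, if you want Snappy-compressed Parquet output instead, the same mechanism can carry the parquet.compression setting (provided it is on the Hive whitelist, as described in the Common errors and solutions section). The values below are only an illustration:
<property>
    <name>fq.data.format.setting</name>
    <value>hive.exec.compress.output=true,parquet.compression=SNAPPY</value>
</property>
In this case, the following commands are sent to Hive:
SET hive.exec.compress.output=true;
SET parquet.compression=SNAPPY;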

Verifying intermediate text data files

If the results of import to Hadoop formats are incorrect, you can analyze the problem with the help of the fq.debug.keep.tmptable parameter. When it is set to true, the intermediate text files that are used to insert data into the target table in Hadoop format are not deleted after the import finishes. This parameter can be used only with fq.data.directmovement=false, which is the default. After importing with fq.debug.keep.tmptable=true, the text files are kept in the <target_table_dir>/tmptable subdirectory, and a Hive text table based on this directory, called <target_table>NPSCOPY, is also available.
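
For example, to keep the intermediate files for a problematic import, you might add the parameter to the configuration XML file as follows. The parameter name is the one described above; the queries afterwards are only a sketch of how the intermediate text table could be compared with the target table:
<property>
    <name>fq.debug.keep.tmptable</name>
    <value>true</value>
</property>
After the import, you can list the kept files with hdfs dfs -ls on the <target_table_dir>/tmptable directory and compare row counts in Hive:
SELECT COUNT(*) FROM <target_table>NPSCOPY;
SELECT COUNT(*) FROM <target_table>;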