Data movement known issues and limitations

Refer to this topic for information about known issues and limitations that affect data movement.

Import of some data types is not supported

The following data types are not supported in this release of IBM® Fast Data Movement:
  • BINARY
  • VARBINARY
  • DBCLOB
  • CLOB
  • BLOB
  • XML

IBM Fast Data Movement does not support complex data types that occur in Hive

The following complex data types are not supported:
  • ARRAY
  • MAP
  • STRUCT
  • UNIONTYPE
  • BINARY

Columns of the GRAPHIC data type can be imported to Hadoop, but exporting them back to Db2® Warehouse fails with the following error:
ERROR com.ibm.nz.fq.pda.ExternalTableExecutorThread  - nzexception The length, precision, or scale attribute for column, distinct type, structured type, array type, attribute of
structured type, routine, cast target type, type mapping, or global variable "GRAPHIC_DT" is not valid.. SQLCODE=-604, SQLSTATE=42611, DRIVER=4.23.42
com.ibm.db2.jcc.am.SqlSyntaxErrorException: The length, precision, or scale attribute for column, distinct type, structured type, array type, attribute of 
structured type, routine, cast target type, type mapping, or global variable "GRAPHIC_DT" is not valid.. SQLCODE=-604, SQLSTATE=42611, DRIVER=4.23.42

Cannot import tables with the same table name to the same HDFS directory

This problem might occur when you import tables that have the same name from a source database to Hadoop. The HDFS directories are not case-sensitive, so you cannot import tables with the same table name to the same HDFS directory, even if the table names differ only in letter case or the tables exist in different databases. In both cases, data on HDFS is overwritten.

To avoid overwriting data for such tables, change the HDFS directory by using the fq.output.path parameter in the import XML configuration file. Do this even when the tables are in different databases or their names differ only in letter case. The following example explains the workaround in detail:
  1. You have two tables with the same name but different letter case in the same database on Db2 Warehouse, for example tables test and TEST.
  2. In the XML configuration file, set the property for the HDFS directory:
    <property> 
    	<name>fq.output.path</name> 
    	<value>/dbbackup/backup1</value> 
    </property>
  3. Run the import for table test.
  4. Now, in the XML configuration file, set the property for the HDFS directory:
    <property> 
    	<name>fq.output.path</name> 
    	<value>/dbbackup/backup2</value> 
    </property>
  5. Run the import for table TEST.
  6. If you follow this procedure, data is not overwritten and the tables are stored in different locations on HDFS.

Limitations on VARCHAR data types

By default, the parameter fq.hive.usevarchar is set to false, that is, the VARCHAR data type is changed to STRING on Hive. This default is caused by a number of limitations when converting to VARCHAR on Hive, for example:
  • On Cloudera, the escape character ('\') is not processed correctly.
  • Export in binary mode does not process characters with codes 128-255 in VARCHAR columns.
  • In BigInsights 3, there is a problem with synchronizing Hive table metadata with BigSQL: Error Column "COL_CHAR", type "varchar(1)" is not supported
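
If you want Hive to keep the VARCHAR data type despite these limitations, you can set the parameter explicitly. The following is a minimal sketch, assuming that fq.hive.usevarchar is defined in the same import XML configuration file as the other fq.* properties shown in this topic:
    <property>
    	<!-- keep VARCHAR on Hive instead of converting it to STRING -->
    	<name>fq.hive.usevarchar</name>
    	<value>true</value>
    </property>
With this setting, VARCHAR columns remain VARCHAR on Hive, so the limitations listed above can apply to the imported data.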

Handling a field delimiter and end of line character in data movement

Hive does not support newline and end-of-line characters within data. Records that include these characters are split, so the data is handled incorrectly. When you query any data format other than binary (for example, text, Avro, or Parquet), newline and end-of-line characters cannot be handled correctly.

Limitation on TIMESTAMP data type

The date range in BigSQL starts at 1400-01-01. When you import a table to BigSQL with a TIMESTAMP column that contains dates earlier than year 1400, BigSQL shows a NULL value for these dates.

Restrictions for timestamps at import to Hadoop:
  • a timestamp with the time part 24:00:00 might not be interpreted correctly by Hive
  • timestamps with a fractional seconds precision greater than 9 are not interpreted correctly by Hive

If two-stage processing (with Hive-level conversion) is used in the cases above, data might become corrupted even after it is exported back to Db2.

Other potential problems with timestamps processed by Hive (an illustrative SQL sketch follows this list):
  • timestamps in the distant past, such as 0001-01-01, might become shifted
  • timestamps that fall within a daylight saving time (DST) shift in the Hadoop time zone might be shifted by 1 hour
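
The following Db2 SQL sketch illustrates values that hit the restrictions listed above. The table and column names are hypothetical and are used only for illustration:
    -- TS_HIGH_PRECISION uses 12 fractional digits, which exceeds the 9-digit
    -- (nanosecond) precision that Hive timestamps support.
    -- TS_MIDNIGHT stores the 24:00:00 time part that Hive might misinterpret.
    CREATE TABLE TS_DEMO (
        TS_HIGH_PRECISION TIMESTAMP(12),
        TS_MIDNIGHT       TIMESTAMP
    );
    INSERT INTO TS_DEMO VALUES
        ('2017-06-15-10.30.00.123456789012', '2017-06-15-24.00.00');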

Limitation on DATE and TIMESTAMP containing BC (Before Christ)

Moving data with DATE or TIMESTAMP columns that contain BC (Before Christ) values, for example '1400-01-13 BC' or '1400-01-13 08:13:33 BC', might fail with an error because this date format is not supported.

Errors when exporting a Hive table with name or column name starting with underscore

Exporting a table whose name or column name starts with _ (underscore) is not supported. As a workaround, you can create a view that is based on this table and whose name and column names do not start with an underscore.
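
The following HiveQL sketch shows the workaround. The table, column, and view names are hypothetical; backticks are required in Hive for identifiers that start with an underscore:
    -- The view exposes the same data under names without a leading underscore,
    -- so the view can be used for the export instead of the original table.
    CREATE VIEW export_view AS
    SELECT `_id` AS id, payload
    FROM `_source_table`;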

IBM Fast Data Movement does not support spaces in table names when importing a table from Db2 Warehouse

When you try to import a table whose name contains a space, IBM Fast Data Movement imports the data but fails when creating the table in Hive.

Hive displays very low and very high values as infinity

Hive might display some very low and very high numeric values as infinity, for example:
  • On Db2 Warehouse:
    Smallest|-1.79769313486232E+308|-3.402823E+38|-3.402823E+38|-3.402823E+38|-1.79769313486232E+308|-1.79769313486232E+308
    Largest negative|-2.2250738585072E-308|-1.175494E-38|-1.175494E-38|-1.175494E-38|-2.2250738585072E-308|-2.2250738585072E-308
    Smallest positive|2.2250738585072E-308|1.175494E-38|1.175494E-38|1.175494E-38|2.2250738585072E-308|2.2250738585072E-308
    Largest|1.79769313486232E+308|3.402823E+38|3.402823E+38|3.402823E+38|1.79769313486232E+308|1.79769313486232E+308
  • On Hive:
    
    Smallest                -Infinity       -3.402823E38    -3.402823E38    -3.402823E38    -Infinity       -Infinity
    Largest negative        -2.2250738585072E-308   -1.175494E-38   -1.175494E-38   -1.175494E-38   -2.2250738585072E-308   -2.2250738585072E-308
    Smallest positive       2.2250738585072E-308    1.175494E-38    1.175494E-38    1.175494E-38    2.2250738585072E-308    2.2250738585072E-308
    Largest Infinity        3.402823E38     3.402823E38     3.402823E38     Infinity        Infinity

Infinity values in FLOAT column cannot be exported from Hive

When you try to export a Hive table that contains a -Infinity or +Infinity value in a FLOAT column, an error is thrown.

Hive displays 0 in a numeric column as NULL

Some versions of Hive (observed with 1.2.1) incorrectly display 0 in numeric columns of type DECIMAL(n,n): instead of 0, NULL is displayed. This problem affects data imported to Hadoop because Hive does not display it correctly. Additionally, in the case of a two-stage import to Hadoop (used for Hadoop formats such as Parquet, and for partitioned or subpartitioned tables), zeroes are converted to NULLs while they are processed by Hive. As a result, data exported back to Db2 Warehouse is also affected.
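
In a DECIMAL(n,n) column, the precision equals the scale, so the column holds only 0 and values whose absolute value is less than 1. The following HiveQL sketch, with a hypothetical table name, shows the kind of column that is affected:
    -- DECIMAL(5,5) has precision equal to scale, so it stores only fractional
    -- values such as 0.12345, and 0. On affected Hive versions (for example 1.2.1),
    -- a row that stores 0 in this column might be displayed as NULL.
    CREATE TABLE decimal_demo (frac DECIMAL(5,5));
    SELECT frac FROM decimal_demo;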

Direct transfer limitations

If the following conditions are met on a Hadoop cluster:
  1. IBM Fluid Query is installed.
  2. Hive is configured for querying NZBAK (hive.aux.jars.path and HIVE_AUX_JARS_PATH in the Hive service configuration),

then data transfer in direct mode with the setting fq.data.directmovement=true is not supported. Possible errors in this case:

  • ERROR: com.ibm.nz.fq.common.exception.hive.HiveException: Unable to execute Hive sql. Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.tez.TezTask.
  • ERROR: com.ibm.nz.fq.common.exception.hive.HiveException: Unable to execute Hive sql. Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
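
If your cluster matches these conditions, direct mode cannot be used for the transfer. The following is a minimal sketch of how the parameter might be disabled, assuming that fq.data.directmovement is set in the same XML configuration file as the other fq.* properties shown in this topic:
    <property>
    	<!-- run the transfer without direct mode on clusters with this Hive configuration -->
    	<name>fq.data.directmovement</name>
    	<value>false</value>
    </property>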

Delay in execution during export operations on Hortonworks

During the export operation on Hortonworks, there is a substantial time difference between the job submit time and the job start time. The problem is related to YARN's Capacity Scheduler and can be observed on clusters with low resources. Possible solution:
  1. Ensure that the cluster can run at least two containers in parallel: verify the memory allocation for YARN containers and the container minimum and maximum sizes.
  2. Modify the following YARN Capacity Scheduler parameters (a configuration sketch follows this list):
    yarn.scheduler.capacity.default.minimum-user-limit-percent=20
    yarn.scheduler.capacity.maximum-am-resource-percent=0.8
    yarn.scheduler.capacity.root.default.user-limit-factor=2
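
The following sketch shows how these values might be expressed as configuration properties, assuming that they are applied through the Capacity Scheduler configuration (for example, capacity-scheduler.xml, which is typically managed through Ambari on Hortonworks):
    <property>
    	<name>yarn.scheduler.capacity.default.minimum-user-limit-percent</name>
    	<value>20</value>
    </property>
    <property>
    	<name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
    	<value>0.8</value>
    </property>
    <property>
    	<name>yarn.scheduler.capacity.root.default.user-limit-factor</name>
    	<value>2</value>
    </property>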