Data movement algorithm

Depending on the type of data that you transfer, one of two processes is used for data movement; the details of each process differ between import and export.

  • Data that is in text format and is not partitioned or clustered can be moved in both directions by using a simple algorithm, where the data is moved directly to the target directory. Note that this is the only algorithm that is supported by Big SQL for data movement.
  • Data that is stored in a Hadoop-specific format, or in Hive partitioned or clustered tables, is transferred in both directions by using a two-stage algorithm that relies on the Hive engine. Part of this process is an INSERT SELECT operation that is executed on Hive, as sketched below.
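
The following HiveQL sketch illustrates the idea behind the two-stage algorithm, using the import direction as the example. It is a minimal sketch only: the table names, columns, and delimiter are illustrative assumptions, not the naming scheme that the tool actually uses.

    -- Assumed target: a Parquet table SALES, partitioned by SALE_DATE,
    -- that already exists on Hive
    SET hive.exec.dynamic.partition=true;
    SET hive.exec.dynamic.partition.mode=nonstrict;

    -- Stage 1: the transferred text data is exposed as a temporary table
    CREATE TABLE sales_tmp_text (id INT, amount DECIMAL(10,2), sale_date STRING)
        ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
        STORED AS TEXTFILE;

    -- Stage 2: Hive converts the text data into the target format and partitions
    INSERT INTO TABLE sales PARTITION (sale_date)
        SELECT id, amount, sale_date FROM sales_tmp_text;

    -- The intermediate text copy is then removed
    DROP TABLE sales_tmp_text;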

Data import

  • The simple algorithm is used for data in text format and for tables that are not partitioned or clustered: the data files are transferred directly to HDFS and stored there in their original format.
  • The two-stage algorithm is used for Hadoop-specific formats and for Hive partitioned or clustered tables:
    1. First, data is sent in text format to HDFS, where a temporary table is created.
    2. Then, the data is converted from text to the target Hadoop format or partitioned/clustered table, and the intermediate text data is deleted.

    Note that you can set the fq.debug.keep.tmptable parameter to true to keep the temporary text table for further analysis, for example, when troubleshooting. For more information, see Verifying intermediate text data files.

    You can also write data directly in the selected Hadoop format or to partitioned tables by setting the fq.data.directmovement property to true, but this setting might degrade performance. For more information, see Enforcing direct data movement. A sample of both properties is shown after this list.
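
For reference, this is how the two properties that are mentioned above might look in the data movement properties file, assuming the usual key=value properties format (the file name and location depend on your installation):

    # Keep the intermediate text table after import, for example for troubleshooting
    fq.debug.keep.tmptable=true

    # Write data directly in the target Hadoop format or partitioned table,
    # skipping the intermediate text stage (might degrade performance)
    fq.data.directmovement=true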

Data export

  • The simple process is used for data in text format only: data that was previously imported to Hadoop by using the simple algorithm, or text files that are stored in the HDFS directory that is provided in the fq.input.path parameter. The files are loaded directly into Db2® Warehouse.
  • The two-stage process is used for data that was previously imported by using the two-stage process, and for Hive tables that are provided in the fq.hive.tablename parameter, unless those tables were imported from NPS by using the simple algorithm and their metadata is present in the HDFS directory. The two-stage process includes the following steps:
    1. Data is converted to text format by using the Hive engine: a temporary text Hive table is created and populated from the source Hive table, as sketched after this list.
    2. The text data is loaded into Db2 Warehouse.
    You can also use direct transfer; in that case, the temporary table is based on the original data format. For more information, see Enforcing direct data movement.
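
The following HiveQL sketch shows what step 1 of the export might look like, assuming the source table SALES is the one named in the fq.hive.tablename parameter; the temporary table name and delimiter are illustrative assumptions. Step 2, loading the text files into Db2 Warehouse, is performed by the data movement tool itself, not by Hive.

    -- Step 1: convert the source table to an intermediate text table
    CREATE TABLE sales_tmp_text
        ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
        STORED AS TEXTFILE
        AS SELECT * FROM sales;

    -- Step 2 happens outside Hive: the text files behind sales_tmp_text
    -- are loaded into Db2 Warehouse.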