Importing data to partitioned and clustered Hive tables
You can import tables to Hadoop taking advantage of the Hive partitioning and clustering
features.
Partitioning
Using Hive, you can organize tables into partitions. It is a way of dividing a table into related parts based on the values of partitioned columns such as date, city, and department. Using partition, it is easy to query a portion of the data.
Clustering
You can also sub-divide tables into the so-called buckets, to provide extra structure to the data that may be used for more efficient querying. Clustering works based on the value of hash function of some column or columns of a table.Clustered tables
decrease time of execution of queries with join
clause in Hive tables. You provide
one or more columns, and a number of buckets for clustering. Buckets are files that contain data,
they are created in HDFS.
Limitations
The following restrictions and limitations apply to using Hive partitioning and clustering with
IBM® Fast Data Movement:
- Partitioning is dependent on your Hadoop cluster resources. With big volume of data and high number of partitions the import might fail on insert into destination table (the 2nd stage of import algorithm, as described in Data movement algorithm). To troubleshoot such problems you can import data to non-partitioned table using IBM Fast Data Movement and then manually INSERT SELECT from this table to the destination partitioned table. This approach helps to resolve Hadoop cluster configuration or resource issues.
- You can only use remote mode of import and export.
- The use of BigSQL is not supported. You must specify Hive2 as a service for
fq.sql.type
property in the configuration file. - While querying works faster, the import operation to clustered or partitioned tables might take longer.
- You can only import one table at a time.
- When using partitioning, you cannot specify all columns that exist in a table that you want to import. This is a limitation on Hive side. When you specify all columns in a table for partitioning, the error message is displayed.
Usage
To import a table from Db2® Warehouse to Hadoop you use the
configuration XML file for remote mode. In the file, configure the following parameters:
- For
partitioning:
The<property> <name>fq.hive.partitioned.by</name> <value></value> <description>Column(s) in a table that will be used for partitioning</description> </property>
fq.hive.partitioned.by
property determines by which column(s) the imported table will be partitioned. You can provide multiple columns divided by comma. By default the field is empty, that is, partitioning is disabled.Note: You cannot specify all columns that exist in a table for partitioning. You must exclude at least one column. This is a limitation imposed by Hive. - For
clustering:
There are two ways in which you can set clustering:<property> <name>fq.hive.clustered.by</name> <value>id</value> <description>Column(s) in a table that will be used for clustering. To enable default clustering, keep this property empty and set fq.hive.cluster.buckets.</description> </property> <property> <name>fq.hive.clustered.buckets</name> <value>252</value> <description>Integer value which determines the number of buckets that will be created during clustering. This property cannot be empty when using clustering.</description> </property>
- You can use default clustering that is, use the same value that was set for the table on Db2 Warehouse with DISTRIBUTE ON parameter. To use default clustering, leave fq.hive.clustered.by empty and only set a number of buckets in fq.hive.clustered.buckets.
- You can cluster by specific columns of your choice. To create such explicit distribution key, provide one or more column names in fq.hive.clustered.by. Also, set the number of buckets in fq.hive.clustered.buckets.
Exporting a clustered or partitioned table to Db2 Warehouse does not require any additional settings.