Partitioned reads

The Hive Connector stage can be configured to run on multiple processing nodes and read data from the data source.

Usage

To enable Partitioned reads, set the Enable partitioned reads to Yes. When the stage is configured for partitioned reads, it runs the statement that is generated or that is specified in the Select statement property on each processing node. The placeholder in the Select statement ensures that each of the processing nodes retrieves a distinct partition of records from the data source.

The following placeholder can be used in the select statement when Generate SQL is set to No to specify distinct partitions of records to retrieve in individual processing nodes:
  • [[part-value]] – replaced at runtime with the actual partition values for the partition key that is specified in the select statement.
The query must be specified in the following format: SELECT col1, col2 from table_name where partition_key=[[part-value]]

At runtime the placeholder [[part-value]] is replaced with the actual partition values that are present for the specified partition_key and the select statements are prepared to run on individual processing nodes.

Note:
The user-defined SQL statement using the above partition query with placeholder works under the following condition:
  • Partition table should be followed by FROM when multiple tables are used and partition column should be used with the same name as listed in the table without any aliases. While using complex queries, disabling the partitioning is recommended.

Example

Assume that the stage is configured to read records from the data source table HTABLE1, which has the following columns: C1 INTEGER, C2 VARCHAR(10) and partitioned by PC1 INTEGER . Also, assume that the stage is configured to run in parallel on four processing nodes and perform partitioned reads, with Generate SQL set to No.

The Select statement property is set to the following value:
select c1, c2 from htable1 where PC1=[[part-value]]

The connector runs the following statements on each of the processing nodes:
Table 1. Example
Node Statement
1 Select C1,C2 from HTABLE1 where PC1=4
2 Select C1,C2 from HTABLE1 where PC1=1
3 Select C1,C2 from HTABLE1 where PC1=2
4 Select C1,C2 from HTABLE1 where PC1=3
If the stage was configured with Generate SQL as Yes, the statements mentioned above would be generated automatically and run on each of the processing nodes. Assume that the source table contains the following rows:
Table 2. Example
C1 C2 PC1
1 One 1
2 Two 1
3 Three 1
4 Four 2
5 Five 2
6 Six 3
7 Seven 3
8 Eight 3
9 Nine 4
10 Ten 4
The following rows are read by the first processing node (node index 0):
Table 3. Example
C1 C2 PC1
9 Nine 4
10 Ten 4
The following rows are read by the second processing node (node index 1):
Table 4. Example
C1 C2 PC1
1 One 1
2 Two 1
3 Three 1
The following rows are read by the third processing node (node index 2):
Table 5. Example
C1 C2 PC1
4 Four 2
5 Five 2
The following rows are read by the fourth processing node (node index 3):
Table 6. Example
C1 C2 PC1
6 Six 3
7 Seven 3
8 Eight 3