If the Hive connector stage is configured to run on multiple processing nodes, each of
the processing nodes reads data from the data source concurrently with other processing nodes. The
partitions of records from all the processing nodes are combined to produce the complete record data
set for the output link of the stage.
Add the Hive Connector stage to a parallel job.
When a Hive connector stage is configured to perform partitioned reads, each of the
processing nodes of the stage reads a portion of data from the data source and the records retrieved
by all the processing nodes are combined to produce the result set for the output link. The
connector runs a slightly modified SELECT statement on each node. The combined set of rows from all
of the queries is the same set of rows that would be returned if the unmodified SELECT statement
were run once on one node.
-
On the job design canvas, double-click the Hive Connector stage, and then click the
Stage tab.
-
On the Advanced page, set Execution mode to
Parallel or Default (Parallel), and then click the
Output tab.
-
Define the SELECT statement that the connector uses at run time by using one of the following methods:
- Set Generate SQL at runtime to No, and then
specify the SELECT statement in the Select statement property.
- Set Read select statement from file to Yes,
specify the name of the file in the Select statement property, and include the
SELECT statement in that file.
- Set Generate SQL at runtime to Yes and then
specify the name of the table in the Table name property. Use the syntax
table_name or owner.table_name, where owner is the owner
of the table.
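For example, with Generate SQL at runtime set to No, the Select statement property might contain a statement such as the following, where the table and column names are illustrative only:

```sql
SELECT customer_id, order_total FROM sales.orders
```

With Generate SQL at runtime set to Yes, the equivalent Table name value would be sales.orders, where sales is the owner of the table.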
-
Set Enable partitioned reads to Yes.
- Set the Partitioned read methods property to the partitioning
method that you want to use. The default partitioning method is Hive
partition.
- Specify the input values that the partitioned read method uses:
- If you chose the Minimum and Maximum range partitioned read method, in
the Partitioned reads method > Table name property,
specify the name of the table that the partitioned read method uses to define the subsets of data
that each node reads from the source table. In the Partitioned reads method >
Column name property, specify the name of the column that the partitioned
read method uses to define the subsets of data that each node reads from the source. The column must
be a TINYINT, SMALLINT, INTEGER, or BIGINT data type. If Generate SQL at
runtime is set to No, use the [[range-part-expr]]
placeholder in the SELECT statement to control which portions of data are fetched on the individual
processing nodes. You can set the Partitioned reads method >
Generate partitioning SQL property to Yes and the
necessary clause will be added automatically.
- If you chose the Modulus partitioned read method, in the
Partitioned reads method > Column name property,
specify the name of the column that the partitioned read method uses to define the subsets of data
that each node reads from the source. The column must be a TINYINT, SMALLINT, INTEGER, or BIGINT data
type. If Generate SQL at runtime is set to No, use
the [[mod-part-expr]] placeholder in the SELECT statement to control which
portions of data are fetched on the individual processing nodes. You can set the
Partitioned reads method > Generate partitioning SQL
property to Yes and the necessary clause will be added automatically.
- If you chose the Hive partition partitioned read method and
Generate SQL at runtime is set to No, use the
[[part-value]] placeholder in the SELECT statement to control which portions of
data are fetched on the individual processing nodes. You can set the Partitioned reads
method > Generate partitioning SQL property to
Yes and the necessary clause will be added automatically.
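As an illustration, the following statements show where each placeholder might appear in a manually specified SELECT statement, assuming that the placeholder stands in for the partitioning condition in a WHERE clause. The table and column names are hypothetical; at run time, the connector replaces the placeholder on each processing node with the condition for that node's subset of data:

```sql
-- Minimum and maximum range method: [[range-part-expr]] resolves to a
-- range condition on the specified numeric column
SELECT customer_id, order_total FROM sales.orders WHERE [[range-part-expr]]

-- Modulus method: [[mod-part-expr]] resolves to a modulus condition on
-- the specified numeric column
SELECT customer_id, order_total FROM sales.orders WHERE [[mod-part-expr]]

-- Hive partition method: [[part-value]] resolves to a condition on the
-- Hive partition values
SELECT customer_id, order_total FROM sales.orders WHERE [[part-value]]
```

Alternatively, you can set the Generate partitioning SQL property to Yes and let the connector add the appropriate clause for you.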
- Click OK, and then save the job.