Reading partitioned data

If the Hive Connector stage is configured to run on multiple processing nodes, each processing node reads data from the data source concurrently with the other processing nodes. The partitions of records from all the processing nodes are combined to produce the complete record data set for the output link of the stage.

Before you begin

Add the Hive Connector stage to a parallel job.

About this task

When the Hive Connector stage is configured to perform partitioned reads, each of the processing nodes of the stage reads a portion of data from the data source, and the records retrieved by all the processing nodes are combined to produce the result set for the output link. The connector runs a slightly modified SELECT statement on each node. The combined set of rows from all of the queries is the same set of rows that would be returned if the unmodified SELECT statement were run once on one node.
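As a hedged illustration of this modification, suppose the stage runs on four processing nodes and uses the Modulus method on a column named order_id (the table and column names here are examples, not taken from the product, and the generated SQL can differ). Each node might run a statement similar to the following:

```sql
-- Statement specified on the stage
SELECT order_id, amount FROM orders;

-- Statement as it might run on processing node 2 of 4: the connector
-- appends a modulus predicate so that each node fetches a disjoint
-- subset of rows (illustrative only)
SELECT order_id, amount FROM orders WHERE pmod(order_id, 4) = 2;
```

The union of the four result sets matches the result of running the unmodified statement once.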

Procedure

  1. On the job design canvas, double-click the Hive Connector stage, and then click the Stage tab.
  2. On the Advanced page, set Execution mode to Parallel or Default (Parallel), and then click the Output tab.
  3. Define the SELECT statement that the connector uses at run time:
    • Set Generate SQL at runtime to No, and then specify the SELECT statement in the Select statement property.
    • Set Read select statement from file to Yes, specify the name of the file in the Select statement property, and include the SELECT statement in that file.
    • Set Generate SQL at runtime to Yes and then specify the name of the table in the Table name property. Use the syntax table_name or owner.table_name, where owner is the owner of the table.
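For example, if Generate SQL at runtime is set to No, the Select statement property might contain a statement such as the following (the table and column names are illustrative):

```sql
SELECT customer_id, order_date, amount
FROM sales.orders
WHERE order_date >= '2020-01-01'
```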
  4. Set Enable partitioned reads to Yes.
  5. Set the Partitioned reads method property to the partitioning method that you want to use. The default partitioning method is Hive partition.
  6. Specify the input values that the partitioned read method uses:
    1. If you chose the Minimum and Maximum range partitioned read method, in the Partitioned reads method > Table name property, specify the name of the table that the partitioned read method uses to define the subsets of data that each node reads from the source table. In the Partitioned reads method > Column name property, specify the name of the column that the partitioned read method uses to define the subsets of data that each node reads from the source. The column must be of a TINYINT, SMALLINT, INTEGER, or BIGINT data type. If the Generate SQL at runtime property is set to No, use the [[range-part-expr]] placeholder in the SELECT statement to control which portions of data are fetched on the individual processing nodes. Alternatively, set the Partitioned reads method > Generate partitioning SQL property to Yes, and the necessary clause is added automatically.
    2. If you chose the Modulus partitioned read method, in the Partitioned reads method > Column name property, specify the name of the column that the partitioned read method uses to define the subsets of data that each node reads from the source. The column must be of a TINYINT, SMALLINT, INTEGER, or BIGINT data type. If the Generate SQL at runtime property is set to No, use the [[mod-part-expr]] placeholder in the SELECT statement to control which portions of data are fetched on the individual processing nodes. Alternatively, set the Partitioned reads method > Generate partitioning SQL property to Yes, and the necessary clause is added automatically.
    3. If you chose the Hive partition partitioned read method and the Generate SQL at runtime property is set to No, use the [[part-value]] placeholder in the SELECT statement to control which portions of data are fetched on the individual processing nodes. Alternatively, set the Partitioned reads method > Generate partitioning SQL property to Yes, and the necessary clause is added automatically.
  7. Click OK, and then save the job.
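The placeholders that are described in step 6 mark the position where the connector inserts its partitioning clause when Generate SQL at runtime is set to No. The following statements are sketches only; the table and column names are examples, and the exact expansion of each placeholder is determined by the connector at run time:

```sql
-- Minimum and Maximum range method: [[range-part-expr]] is replaced
-- with a range predicate on the partitioning column for each node
SELECT order_id, amount FROM orders WHERE [[range-part-expr]]

-- Modulus method: [[mod-part-expr]] is replaced with a modulus
-- predicate on the partitioning column for each node
SELECT order_id, amount FROM orders WHERE [[mod-part-expr]]

-- Hive partition method: [[part-value]] is replaced with the value
-- of the Hive partition assigned to each node (assumed form)
SELECT order_id, amount FROM orders WHERE country = [[part-value]]
```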