Hive
The Hive origin reads data from a Hive table. Hive is a data warehouse system built on top of the Hadoop Distributed File System (HDFS). Hive stores data as files in tables on HDFS.
By default, the origin reads from Hive using connection information stored in Hive configuration files on the Transformer machine. Alternatively, the origin can connect to an external Hive Metastore that you specify by URI.
When you configure the origin, you indicate whether the origin runs in incremental mode or full query mode. You define the query to use, the offset column, and, optionally, an initial offset. When needed, you specify the URIs of the external Hive Metastore that stores the configuration information.
You can configure the origin to load data only once and cache the data for reuse throughout the pipeline run. Or, you can configure the origin to cache each batch of data so the data can be passed to multiple downstream batches efficiently. You can also configure the origin to skip tracking offsets.
Partitioning
Spark runs a Transformer pipeline just as it runs any other application, splitting the data into partitions and performing operations on the partitions in parallel. When the pipeline starts processing a new batch, Spark determines how to split pipeline data into initial partitions based on the origins in the pipeline.
With a Hive origin, Spark determines partitioning based on the partitioning configured within the Hive source table. If no partitioning is configured, the origin reads all available data within a single partition.
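For example, partitioning configured within the source table means partition columns defined in the table DDL, as in the following sketch of a hypothetical invoice table partitioned by country. The table and column names are illustrative only, not part of the origin's configuration:
CREATE TABLE invoice (id INT, amount DOUBLE)
PARTITIONED BY (country STRING)
STORED AS PARQUET;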
Spark uses partitions created by the origin throughout the pipeline unless a processor causes Spark to shuffle the data. To change the partitioning in the pipeline, use the Repartition processor.
Reading a Delta Lake Managed Table
You can use the Hive origin to read a Delta Lake managed table in streaming execution mode, or in batch execution mode with offset tracking. For all other cases, such as reading an unmanaged table, or reading in batch execution mode without offset tracking, use the Delta Lake origin.
Incremental and Full Query Mode
The Hive origin can run in full query mode or in incremental mode. By default, the origin runs in full query mode.
When the origin runs in full query mode with no initial offset defined, the origin processes all available data each time the pipeline runs. If you configure an initial offset, the origin starts the read with the initial offset each time the pipeline runs.
When the origin runs in incremental mode, the first pipeline run is the same as full query mode: the origin starts the read with the initial offset, if defined. Otherwise, it reads all available data. When the pipeline stops, the origin stores the offset where it stopped processing. For subsequent pipeline runs, the origin starts the read with the last-saved offset.
SQL Query
The SQL query defines the data returned from Hive. You can use any valid SQL query, but the query guidelines depend on whether you configure the origin to run in incremental mode or full query mode.
Incremental Mode Guidelines
When you define the SQL query for incremental mode, the Hive origin requires a WHERE clause and an ORDER BY clause in the query.
- WHERE clause: In the WHERE clause, include the offset column and the offset value. Use the $offset variable to represent the offset value.
- ORDER BY clause: In the ORDER BY clause, include the offset column as the first column to avoid returning duplicate data.
Note: Using a column that is not a primary key or indexed column in the ORDER BY clause can slow performance.
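For example, the following incremental-mode query reads from a hypothetical invoice table, using an id column as both the offset column and the first ORDER BY column. The table and column names are illustrative only:
SELECT * FROM invoice WHERE id > $offset ORDER BY id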
Full Mode Guidelines
When you define the SQL query for full mode, you can optionally include the WHERE and ORDER BY clauses using the same guidelines as for incremental mode. However, using these clauses to read from large tables can cause performance issues.
For example, the following full-mode query simply reads all data in the invoice table:
SELECT * FROM invoice
Additional Hive Configuration Properties
When needed, you can pass additional Hive configuration properties to Hive as additional Spark configuration properties. To do so, define Extra Spark Configuration properties on the Cluster tab of the pipeline properties panel.
Hive configuration properties defined in the pipeline override those defined in Hive configuration files.
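For example, to lengthen the Hive metastore client timeout, you might add an extra Spark configuration property similar to the following. The property name is a standard Hive setting and the value is illustrative; depending on your cluster, Hadoop and Hive properties may need to be passed with the spark.hadoop. prefix, so treat this as a sketch rather than a required setting:
hive.metastore.client.socket.timeout=600s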
Configuring a Hive Origin
Configure a Hive origin to read from a Hive table.