Hive Query
The Hive Query executor connects to Hive or Impala and performs one or more user-defined Hive or Impala queries each time it receives an event record. For information about supported versions, see Supported Systems and Versions in the Data Collector documentation.
Use the Hive Query executor as part of an event stream to perform event-driven queries in Hive or Impala. You can use the executor in any logical way, such as running Hive or Impala queries after the Hive Metadata destination updates the Hive metastore, or after the Hadoop FS or MapR FS destination closes files.
For example, you can use the Hive Query executor to perform the Invalidate Metadata query for Impala as part of the Drift Synchronization Solution for Hive or to configure table properties for newly-created tables.
When using the Hive Query executor with Impala, you can use the default driver included with Data Collector, or you can install an Impala JDBC driver.
When you configure the Hive Query executor, you configure JDBC connection information to Hive, and optionally add additional HDFS configuration properties to use. You can also use a connection to configure the executor. You specify the queries that you want to run and indicate whether to run the remaining queries after a query fails.
You can also configure the executor to generate events for another event stream. For more information about dataflow triggers and the event framework, see Dataflow Triggers Overview.
For a solution that describes how to use the Hive Query executor, see Automating Impala Metadata Updates for Drift Synchronization for Hive.
Related Event Generating Stages
Use the Hive Query executor in an event stream. The Hive Query executor is designed to run a set of Hive or Impala queries after receiving an event record. You can use the Hive Query executor with any event-generating stage where the logic suits your needs.
When implementing the Drift Synchronization Solution for Hive with Impala, use the executor to run the Invalidate Metadata query after the Hive Metastore destination changes table structures and after the Hadoop FS destination writes files to Hive.
Installing the Impala Driver
You can use the Apache Hive JDBC driver included with Data Collector to perform Impala queries. However, some distributions recommend using a native Impala JDBC driver.
To use the included Apache Hive JDBC driver included with Data Collector, you do not need to perform any additional steps.
- Download the native Impala JDBC driver for the Hive distribution that you use.
- Install the driver as an external library for the stage library used by the Hive
Query executor.
For example, say the executor is configured to use the Cloudera CDP 7.1.8 stage library. You install the driver into the Cloudera CDP 7.1.8 stage library,
streamsets-datacollector-cdp_7_1_8-lib
, which includes the executor.
For information about installing additional drivers, see Install External Libraries in the Data Collector documentation.
If you have trouble determining the URL format to use when configuring the executor, check out this informative blog post.
Hive and Impala Queries
You can use the Hive Query executor to execute a set of Hive or Impala queries each time the executor receives an event record.
The Hive Query executor waits for each query to complete before continuing with the next query for the same event record. It also waits for all queries to complete before starting the queries for the next event record. Depending on the speed of the pipeline and the complexity of the queries, the wait for query completion can slow pipeline performance.
When possible, avoid using Hive Query executor to run long-running queries. Also, when running multiple queries for an event record, you can configure the executor to skip the remaining queries if a query fails. By default, the executor continues to run the rest of the queries.
You can use the fields and attributes in an event record in queries. For example, for event records generated when Hive Metastore creates or updates a table, you can use the table name in the event record to perform additional tasks.
For a list of field names and descriptions in an event record, see the "Event Records" documentation for the event generating stage.
Impala Queries for the Drift Synchronization Solution for Hive
The Drift Synchronization Solution for Hive enables a pipeline to automatically create and update Hive tables and to write files to the tables.
When implementing the Drift Synchronization Solution for Hive with Impala, you can use the Hive Query executor to submit an invalidate metadata query each time you need to update the Impala metadata cache. For a detailed example, see Automating Impala Metadata Updates for Drift Synchronization for Hive.
- Processing event records from the Hive Metastore destination
- The Hive Metastore destination generates an event record each time it
changes a table and places the table name in the "table" record header
attribute. Use the following query to update the Impala metadata cache:
invalidate metadata ${record:attribute('/table')}
- Processing event records from the Hadoop FS destination
- The Hadoop FS destination generates an event record each time it closes a
file. It places the file path in a "filepath" field in the event record. If
you use a separate Hive Query executor for each destination, use the
following query to update the Impala
cache:
invalidate metadata `${file:pathElement(record:value('/filepath'), -3)}`. `${file:pathElement(record:value('/filepath'), -2)}`
Event Generation
The Hive Query executor can generate events that you can use in an event stream.
The executor submits queries to Hive each time it receives an event record. When you enable event generation, the executor generates events each time it determines if the submitted query has completed.
- With the Email executor to send a custom email
after receiving an event.
For an example, see Sending Email During Pipeline Processing.
- With a destination to store event information.
For an example, see Preserving an Audit Trail of Events.
For more information about dataflow triggers and the event framework, see Dataflow Triggers Overview.
Event Records
Record Header Attribute | Description |
---|---|
sdc.event.type | Event type. Uses one of the following event types:
|
sdc.event.version | Integer that indicates the version of the event record type. |
sdc.event.creation_timestamp | Epoch timestamp when the stage created the event. |
- Successful query
-
The executor generates a successful query event record when Hive successfully completes the query.
A successful query event record has the sdc.event.type record header attribute set to successful-query and includes the following field:Event Field Name Description query The query that Hive ran successfully. - Failed query
-
The executor generates a failed query event record when Hive fails to run the query.
A failed query event record has the sdc.event.type record header attribute set to failed-query and can include the following fields:Event Field Name Description query The query that Hive failed to run. unexecuted-queries Any additional queries that did not execute. The event record includes this field only when you configure the executor to skip running subsequent queries when a query fails.
Configuring a Hive Query Executor
Configure a Hive Query executor to execute a query on Hive or Impala when the executor receives an event record.