Databricks Query
The Databricks Query executor runs one or more Spark SQL queries on Databricks each time it receives an event record. Use the executor as part of an event stream in the pipeline. For information about supported versions, see Supported Systems and Versions in the Data Collector documentation.
For example, you might use the Databricks Query executor to run a Spark SQL query that executes the VACUUM command to remove leftover files when the pipeline generates a pipeline stop event.
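A stop-event cleanup query along these lines is one possible sketch; the table name and retention window below are illustrative placeholders, not values from this document:

```sql
-- Remove data files that are no longer referenced by the Delta table.
-- Table name and retention period are placeholders.
VACUUM events_delta RETAIN 168 HOURS
```

Whether a given retention window is allowed depends on the table's configured retention settings; see the Databricks documentation for VACUUM for details.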
The Databricks Query executor uses a JDBC connection to connect to the Databricks cluster. When you configure the executor, you specify the JDBC URL and credentials to use to connect to the Databricks cluster, and then you define the Spark SQL queries to run.
When needed, you also define the connection information that the executor uses to connect to the storage location in Amazon S3 or Azure Data Lake Storage Gen2.
You can also configure the executor to generate events for another event stream. For more information about dataflow triggers and the event framework, see Dataflow Triggers Overview.
Before you use the Databricks Query executor, you must complete a prerequisite task.
Prerequisite
Before configuring the Databricks Query executor, prepare your Databricks cluster.
In Databricks, configure and start your Databricks cluster, generate a personal access token, and locate the JDBC URL used to access the cluster.
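The JDBC URL that Databricks displays for a cluster generally follows the shape sketched below; the host name, HTTP path, and token are placeholders, and the exact format can vary by Databricks JDBC driver version, so always use the URL shown on your cluster's JDBC/ODBC tab:

```
jdbc:spark://<server-hostname>:443/default;transportMode=http;ssl=1;httpPath=<http-path>;AuthMech=3;UID=token;PWD=<personal-access-token>
```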
Spark SQL Queries
You define one or more Spark SQL queries that the Databricks Query executor runs in Databricks each time it receives an event record.
The Databricks Query executor waits for each query to complete before continuing with the next query for the same event record. It also waits for all queries to complete before starting the queries for the next event record. Depending on the speed of the pipeline and the complexity of the queries, the wait for query completion can slow pipeline performance.
When running multiple queries for an event record, if a query fails, the executor skips the remaining queries for that record.
You can include the following elements in each query that you define for the executor:
- Spark SQL
- Use Spark SQL syntax in a query, such as:
COPY INTO <table identifier> FROM <location> FILEFORMAT = <format type>
You can use any Spark SQL syntax that is valid for Databricks, as described in the Databricks documentation.
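As a concrete sketch of the COPY INTO syntax above, the following query copies Parquet files from a cloud storage path into a table; the table name and source path are hypothetical:

```sql
-- Illustrative only: table name and source path are placeholders.
COPY INTO sales_bronze
FROM 's3a://example-bucket/landing/sales/'
FILEFORMAT = PARQUET
```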
- IBM StreamSets expression language functions
- You can include a subset of the functions provided with the StreamSets expression language in a query. For example, when you define the location to copy the data from, you might use the record:value() function to define the location as follows:
FROM 's3a://${record:value('/bucket')}/${record:value('/objectKey')}'
When entering the query in the executor, press Ctrl + Space Bar to view the list of valid functions that you can use.
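Putting the pieces together, a query might combine Spark SQL with expression language functions that read values from the incoming event record. The field names /bucket and /objectKey come from the example above; the table name is a placeholder:

```sql
-- The ${record:value(...)} expressions are resolved by Data Collector
-- before the query is submitted to Databricks.
COPY INTO sales_bronze
FROM 's3a://${record:value('/bucket')}/${record:value('/objectKey')}'
FILEFORMAT = PARQUET
```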
Storage Connection
The Databricks Query executor can connect to a storage location before running Spark SQL queries.
- None
- When using the executor to run a Spark SQL query that does not require connecting to a storage location, set the storage location to None.
- Amazon S3
- When using the executor to run a Spark SQL query that requires connecting to an Amazon S3 storage location, you can specify the Amazon S3 connection information in the executor properties. Any connection information specified in the executor properties takes precedence over the connection information configured in the Databricks cluster.
- ADLS Gen2
- When using the executor to run a Spark SQL query that requires connecting to an Azure Data Lake Storage Gen2 storage location, you can specify the Azure connection information in the executor properties. Any connection information specified in the executor properties takes precedence over the connection information configured in the Databricks cluster.
ADLS Gen2 Authentication Information
When you configure the executor to connect to an ADLS Gen2 storage location, you select the authentication method that the executor uses to connect to Azure Data Lake Storage Gen2.
Select one of the following authentication methods:
- OAuth 2.0
- Connections made with OAuth 2.0 authentication require the following information:
  - Application ID - Application ID for the Azure Active Directory Data Collector application. Also known as the client ID. For information on accessing the application ID from the Azure portal, see the Azure documentation.
  - Auth Token Endpoint - OAuth 2.0 token endpoint for the Azure Active Directory v1.0 application for Data Collector. For example: https://login.microsoftonline.com/<uuid>/oauth2/token.
  - Application Key - Authentication key or client secret for the Azure Active Directory application. Also known as the client secret. For information on accessing the application key from the Azure portal, see the Azure documentation.
- Shared Key
- Connections made with Shared Key authentication require the following information:
  - Account FQDN - Fully qualified domain name of the Azure Data Lake Storage Gen2 account. For example: <account-name>.dfs.core.windows.net.
  - Account Shared Key - Shared access key that Azure generated for the storage account. For more information on accessing the shared access key from the Azure portal, see the Azure documentation.
Event Generation
The Databricks Query executor can generate events that you can use in an event stream. When you enable event generation, the executor generates an event for each successful or failed query.
You can use the generated events in any logical way, for example:
- With the Email executor to send a custom email
after receiving an event.
For an example, see Sending Email During Pipeline Processing.
- With a destination to store event information.
For an example, see Preserving an Audit Trail of Events.
For more information about dataflow triggers and the event framework, see Dataflow Triggers Overview.
Event Records
Event records generated by the Databricks Query executor have the following event-related record header attributes:
Record Header Attribute | Description |
---|---|
sdc.event.type | Event type. Uses one of the following event types: successful-query, failed-query. |
sdc.event.version | Integer that indicates the version of the event record type. |
sdc.event.creation_timestamp | Epoch timestamp when the stage created the event. |
- Successful query
- The executor generates a successful-query event record after successfully completing a query. Successful-query event records have the sdc.event.type record header attribute set to successful-query and include the following fields:

Event Field Name | Description |
---|---|
query | Query completed. |
query-result | Number of rows affected by the query. Included if the Include Query Result Count in Events property is selected. |

- Failed query
- The executor generates a failed-query event record after failing to complete a query. Failed-query event records have the sdc.event.type record header attribute set to failed-query and include the following field:

Event Field Name | Description |
---|---|
query | Query attempted. |
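For illustration only, a successful-query event record might carry field values like the following sketch; the query text and count are hypothetical, and the exact representation of query-result may differ:

```
{
  "query": "VACUUM events_delta RETAIN 168 HOURS",
  "query-result": 42
}
```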
Configuring a Databricks Query Executor
Configure a Databricks Query executor to run one or more Spark SQL queries on Databricks each time the executor receives an event record.