Databricks
You can run Transformer pipelines using Spark deployed on a Databricks cluster. Transformer supports several Databricks versions. For a complete list, see Cluster Compatibility Matrix.
To run a pipeline on a Databricks cluster, configure the pipeline to use Databricks as the cluster manager type on the Cluster tab of pipeline properties.
Transformer uses the Databricks REST API to perform tasks on Databricks clusters, such as submitting an ephemeral Databricks job to run the pipeline. Databricks retains details about ephemeral jobs for 60 days. When necessary, access job details while they are available.
When you configure a pipeline to run on a Databricks cluster, you can specify an existing interactive cluster to use or you can have Transformer provision a job cluster to run the pipeline.
In pipelines that use an existing interactive cluster, you must specify any extra Spark configuration properties in Databricks. This requires you to restart the cluster. For details about specifying Spark configuration properties, see the Databricks documentation.
For both interactive and provisioned clusters, you define the staging directory within the Databricks File System (DBFS) to store the Transformer libraries and resources needed to run the pipeline. You also specify the URL and credentials used to connect to your Databricks account. When you start a pipeline, Transformer uses these credentials to launch the Spark application.
The following image displays a pipeline configured to run on Spark deployed to an existing Databricks cluster on Microsoft Azure:
Spark Properties for Google Stages
To use Google stages in pipelines running on a Databricks cluster, you must configure specific Spark properties.
In pipelines that use existing clusters, you must configure the Spark properties in Databricks. For details, see the Databricks documentation. In pipelines that provision clusters, you can configure the properties in the Extra Spark Configuration property of the pipeline.
Spark Property | Description |
---|---|
spark.hadoop.google.cloud.auth.service.account.enable | Flag that indicates whether to enable the Google Cloud authentication service. Set to true. |
spark.hadoop.fs.gs.auth.service.account.email | Client email address. |
spark.hadoop.fs.gs.project.id | Project ID. |
spark.hadoop.fs.gs.auth.service.account.private.key | Private key. |
spark.hadoop.fs.gs.auth.service.account.private.key.id | Private key ID. |
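As a rough illustration, the properties in the table can be assembled from a Google Cloud service account key file. The sketch below assumes the key is a JSON document with the fields `client_email`, `project_id`, `private_key`, and `private_key_id`. The helper name `google_spark_properties` and all sample values are hypothetical.

```python
import json


def google_spark_properties(key_json: str) -> dict:
    """Map fields of a service account key to the Spark properties in the table."""
    key = json.loads(key_json)
    return {
        "spark.hadoop.google.cloud.auth.service.account.enable": "true",
        "spark.hadoop.fs.gs.auth.service.account.email": key["client_email"],
        "spark.hadoop.fs.gs.project.id": key["project_id"],
        "spark.hadoop.fs.gs.auth.service.account.private.key": key["private_key"],
        "spark.hadoop.fs.gs.auth.service.account.private.key.id": key["private_key_id"],
    }


# Placeholder key with made-up values, for illustration only.
sample_key = json.dumps({
    "client_email": "svc@example-project.iam.gserviceaccount.com",
    "project_id": "example-project",
    "private_key": "<private-key>",
    "private_key_id": "<private-key-id>",
})

# Print each property name with its value.
for name, value in google_spark_properties(sample_key).items():
    print(f"{name} {value}")
```

For a provisioned cluster, the resulting name and value pairs would go in the Extra Spark Configuration property. For an existing cluster, they would be set in Databricks.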
Existing Cluster
You can configure a pipeline to run on an existing Databricks interactive cluster.
When a Databricks cluster runs a Transformer pipeline, Transformer libraries are installed on the cluster so they can be reused. Pipelines from different versions of Transformer cannot run on the same Databricks cluster.
For example, say you have a cluster that previously ran pipelines built on Transformer 5.8.0. When you build new pipelines using Transformer 6.1.0, the new pipelines cannot run on that cluster.
In this situation, you can run the pipeline on a different existing cluster or configure the pipeline to provision a cluster. If the existing cluster no longer runs pipelines from the older Transformer version, you can uninstall the older Transformer libraries from the cluster and use the cluster to run pipelines from the newer Transformer version.
To run a pipeline on an existing Databricks cluster, clear the Provision a New Cluster property on the Cluster tab, then specify the ID of the cluster to use. You must configure any extra Spark configuration properties in Databricks. This requires you to restart the cluster. For details about specifying Spark configuration properties, see the Databricks documentation.
Uninstalling Transformer Libraries
To enable a cluster to run pipelines from a different version of Transformer, uninstall the existing Transformer libraries from the cluster. Perform this task when you no longer want to run pipelines from the other version of Transformer.
The following details are provided for your convenience. If the Databricks workflow changes, please check the Databricks documentation for updated steps.
Provisioned Cluster
You can configure a pipeline to run on a provisioned cluster. When provisioning a cluster, Transformer creates a new job cluster on the initial run of a pipeline.
You can provision a cluster that uses an instance pool. You can configure the cluster to execute cluster-scoped init scripts before processing data. You can optionally have Transformer terminate the cluster after the pipeline stops.
To provision a cluster for the pipeline, use the Provision a New Cluster property on the Cluster tab of the pipeline properties. Then, define the cluster configuration to use.
Cluster-Scoped Init Scripts
When you provision a Databricks cluster, you can specify cluster-scoped init scripts to execute before processing data. You might use init scripts to perform tasks such as installing a driver on the cluster or creating directories and setting permissions for them.
- Unity Catalog from Pipeline (6.1 and later) - Unity Catalog init script defined in the pipeline. When provisioning the cluster, Transformer temporarily stores the script in the specified Unity Catalog staging directory and removes it after the pipeline run. Note: Using this option requires enabling the Use Unity Catalog property.
- Unity Catalog from Location (6.1 and later) - Unity Catalog init script stored on Unity Catalog. Note: Using this option requires enabling the Use Unity Catalog property.
- S3 from Location - Amazon S3 init script stored on AWS. Use only when provisioning a Databricks cluster on AWS.
- ABFSS from Location - Azure init script stored on Azure Blob File System (ABFS). Use only when provisioning a Databricks cluster on Azure. Note: To use this option, you must provide an access key to access the init script.
When you specify more than one init script, place them in the order that you want them to run. If a script fails to run, Transformer cancels the cluster provisioning and stops the job.
You can use any valid Databricks cluster-scoped init script. For more information about Databricks cluster-based init scripts, see the Databricks documentation.
Configure cluster-scoped init script properties on the Cluster tab of the pipeline properties. After you select the Provision a New Cluster property, you can configure the init script properties.
Access Keys for ABFSS Init Scripts
To use Azure cluster-scoped init scripts stored on Azure Blob File System, you must provide an ADLS Gen2 access key for the storage account where the scripts are located. When using init scripts stored in different storage accounts, provide an access key for each storage account.
- On the Cluster tab of the pipeline properties, in the Extra Spark Configuration property, add the following property:
  spark.hadoop.fs.azure.account.key.<storage-account-name>.dfs.core.windows.net
  <storage-account-name> is the name of the Azure Data Lake Storage Gen2 storage account where the script is located.
- Set the value of the property to the access key to the Azure Data Lake Storage Gen2 storage account.
For steps on finding the access key for your storage account, see Get an Azure ADLS Access Key in the Azure Databricks documentation.
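When init scripts live in several storage accounts, one property is needed per account. The following is a minimal sketch of generating those entries. The helper name `abfss_key_properties`, the storage account names, and the key values are hypothetical placeholders.

```python
def abfss_key_properties(access_keys: dict) -> dict:
    """Build one Extra Spark Configuration entry per storage account."""
    return {
        f"spark.hadoop.fs.azure.account.key.{account}.dfs.core.windows.net": key
        for account, key in access_keys.items()
    }


# Hypothetical storage account names and placeholder access keys.
props = abfss_key_properties({
    "scriptsaccount1": "<access-key-1>",
    "scriptsaccount2": "<access-key-2>",
})

# Print the property names to add to Extra Spark Configuration.
for name in props:
    print(name)
```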
Cluster Configuration
When provisioning a cluster for a pipeline, Databricks creates a new Databricks job cluster upon the initial run of a pipeline. You define the Databricks cluster properties to use in the Cluster Configuration pipeline property. Transformer uses Databricks default values for all Databricks cluster properties that are not defined in the Cluster Configuration pipeline property.
When needed, you can override the Databricks default values by defining additional cluster properties in the Cluster Configuration pipeline property. For example, to provision a cluster that uses an instance pool, you can add and define the instance_pool_id property in the Cluster Configuration property.
When defining cluster configuration properties, use the property names and values as expected by Databricks. The Cluster Configuration property defines cluster properties in JSON format.
Databricks Cluster Property | Description |
---|---|
num_workers | Number of worker nodes in the cluster. |
spark_version | Databricks Runtime and Apache Spark version. |
node_type_id | Type of worker node. |
For information about other Databricks cluster properties, see the Databricks documentation.
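As an illustration of the JSON format, the sketch below builds a Cluster Configuration value from the three properties listed in the table. The values are placeholders rather than recommendations. Substitute a runtime version and node type that are valid in your Databricks workspace.

```python
import json

# Placeholder values only; use names and values that Databricks expects.
cluster_configuration = {
    "num_workers": 2,
    "spark_version": "<databricks-runtime-version>",
    "node_type_id": "<node-type-id>",
}

# Serialize to the JSON text that would go in the Cluster Configuration property.
cluster_configuration_json = json.dumps(cluster_configuration, indent=2)
print(cluster_configuration_json)
```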
Using an Instance Pool
When you configure the pipeline to provision a new Databricks cluster, you can have the provisioned cluster use an existing instance pool.
To have the provisioned cluster use an instance pool, include the Databricks instance_pool_id property in the Cluster Configuration pipeline property, and set it to the instance pool ID that you want to use.
For example, the following set of properties provisions a cluster to run the pipeline that uses the specified instance pool, then terminates the cluster after the pipeline stops:
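The sketch below is not the original example. It only illustrates how the instance_pool_id property might be added to a Cluster Configuration value. The helper name `with_instance_pool` and the placeholder values are hypothetical. The option that terminates the cluster after the pipeline stops is set elsewhere in the pipeline properties and is not shown here.

```python
import json


def with_instance_pool(cluster_configuration: dict, pool_id: str) -> dict:
    """Return a copy of the configuration that references an instance pool."""
    updated = dict(cluster_configuration)
    updated["instance_pool_id"] = pool_id
    return updated


# Placeholder values; replace with values valid in your workspace.
base = {"num_workers": 2, "spark_version": "<databricks-runtime-version>"}
pooled = with_instance_pool(base, "<instance-pool-id>")

# JSON text for the Cluster Configuration property.
print(json.dumps(pooled, indent=2))
```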
Locating Properties in Databricks
To locate the valid cluster configuration property names and values, launch your Databricks workspace and view the properties used to create a job cluster.
Staging Directory
To run pipelines on a Databricks cluster, Transformer must store files in a staging directory.
When you configure a pipeline, you define the staging directory to use. By default, pipelines store files on Databricks File System (DBFS), and the default staging directory is /streamsets.
With Transformer 6.1 and later, you can alternatively configure pipelines to store files in Unity Catalog. When storing files on Unity Catalog, specify a staging directory with the following root directory: /Volumes. For more information about using Unity Catalog, see Staging Data on Unity Catalog.
When a pipeline runs on an existing interactive cluster, configure pipelines to use the same staging directory so that each job created within Databricks can reuse the common files stored in the directory. Pipelines that run on different clusters can use the same staging directory as long as the pipelines are started by the same Transformer instance. Pipelines that are started by different instances of Transformer must use different staging directories. Different Transformer instances cannot send pipelines to the same cluster.
When a pipeline runs on a provisioned job cluster, using the same staging directory for pipelines is best practice, but not required.
- Files that can be reused across pipelines
- Transformer stores files that can be reused across pipelines, including Transformer libraries and external resources such as JDBC drivers, in the following location:
- Files specific to each pipeline
- Transformer stores files specific to each pipeline, such as the pipeline JSON file and resource files used by the pipeline, in the following directory:
Staging Data on Unity Catalog
You can configure a Databricks pipeline to use a staging directory on Unity Catalog instead of the default, DBFS. This functionality is available with Transformer 6.1 and later.
Transformer handles external resources differently for pipelines that use a Unity Catalog staging directory. For more information, see External Resources.
When you use a Unity Catalog staging directory, you can also configure the pipeline to use a Unity Catalog init script when provisioning a cluster.
- In the pipeline, on the Cluster tab, select Use Unity Catalog.
- Change the default value for the Staging Directory to an appropriate location, starting with /Volumes.
For guidelines on when to use Unity Catalog instead of DBFS, see the Databricks documentation.
External Resources
External resources are external libraries or files, such as a driver or other details, that you want a pipeline to use. External resources can include runtime resource files.
When a pipeline runs, Transformer uploads external resource files to a specified location. By default, after the pipeline completes, Transformer removes the files from the directory. You can configure Transformer to cache external resource files for reuse.
Transformer external resource handling differs depending on the staging location and other configurations:
- Databricks File System (DBFS)
- When you use a staging directory on DBFS, by default, Transformer shares the external resources with both the Spark driver and Spark executors without additional configuration.
- Unity Catalog (6.1 and later)
- When you use a staging directory on Unity Catalog, Transformer shares staged external resources with only the Spark driver. As with DBFS staging directories, Transformer uploads the external resources to the following location, by default: /<staging_directory>/staging/<pipelineId>/<runId>/externalResources
Caching External Resource Files
You can configure Transformer to cache external resource files for reuse. External resource files can include runtime resource files.
By default, Transformer uploads external resource files to the following location for each pipeline run: /<staging_directory>/staging/<pipelineId>/<runId>/externalResources. After a pipeline run completes, Transformer removes the files from the directory.
If your pipelines use only a few external resource files, the default behavior may be appropriate. If your pipelines use large numbers of external resource files, then uploading and removing them for each pipeline run can be time-consuming.
When needed, you can configure Transformer to cache external resource files so they can be reused by multiple pipelines and across multiple pipeline runs.
When caching is enabled, the first time that you run a pipeline with a large number of external resource files, the pipeline will take longer to initialize as it uploads those files to the directory.
- Databricks File System (DBFS)
- When a pipeline uses a DBFS staging directory, Transformer can cache the pipeline external resource files for reuse.
  To enable Transformer to cache external resource files for Databricks pipelines, uncomment the transformer.databricks.external.resources.cache property in the Transformer configuration properties of the deployment, and set the property to true.
  When enabled, Transformer caches external resources in the following location: /<stagingDirectory>/<engineId>/externalResources
- Unity Catalog (6.1 and later)
- By default, a pipeline using a Unity Catalog staging directory can only share external resources with Spark drivers. Transformer ignores the transformer.databricks.external.resources.cache property and does not cache external resources.
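To make the difference between the two locations concrete, the following sketch builds the per-run path and the cached path from their parts. The function names and the sample pipeline, run, and engine IDs are hypothetical. Only the path layouts come from this documentation.

```python
from posixpath import join


def per_run_resources_dir(staging_directory: str, pipeline_id: str, run_id: str) -> str:
    """Default: /<staging_directory>/staging/<pipelineId>/<runId>/externalResources"""
    root = staging_directory.strip("/")
    return join("/", root, "staging", pipeline_id, run_id, "externalResources")


def cached_resources_dir(staging_directory: str, engine_id: str) -> str:
    """With caching on DBFS: /<stagingDirectory>/<engineId>/externalResources"""
    root = staging_directory.strip("/")
    return join("/", root, engine_id, "externalResources")


# Made-up IDs with the default DBFS staging directory.
print(per_run_resources_dir("/streamsets", "pipeline-1", "run-1"))
print(cached_resources_dir("/streamsets", "engine-1"))
```

The per-run path changes with every run, which is why files are uploaded and removed each time. The cached path depends only on the staging directory and the engine, so its contents can be reused across pipelines and runs.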
Limiting Staging Directory Access
You can configure Transformer to temporarily lock the Databricks workspace to limit access to staging directories in the workspace by other Transformer engines. In most cases, limiting access to the Databricks workspace is not necessary.
Transformer accesses the staging directory defined in a pipeline each time a pipeline starts. Databricks can generate timeout errors when different Transformer engines try to access staging directories in the same Databricks workspace at the same time, and when those pipelines require uploading a large number of external resource files. Errors can also occur when the Databricks workspace is otherwise heavily loaded.
You might prevent this by staggering the start times of pipelines with large numbers of external resource files to upload, or by caching external resource files so Transformer does not need to upload the files with each pipeline run.
However, if Databricks timeout errors persist, you can address the issue by configuring Transformer to lock the Databricks workspace when a pipeline starts. Transformer releases the lock after submitting the Spark job for the pipeline.
Retrying Pipelines
The following Transformer configuration properties control how Transformer retries a Databricks pipeline that fails to start:
- transformer.databricks.run.max.retries - Defines how many times Transformer retries a Databricks pipeline after it fails to start. Default is 2.
- transformer.databricks.run.retry.interval - Defines the number of milliseconds to wait between retries. Default is 10,000, which is 10 seconds.
When needed, you can configure these properties in the Transformer configuration properties of the deployment.
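The interaction of the two properties can be pictured with a small simulation. This is not Transformer's implementation. It is a sketch that assumes the retry count excludes the initial attempt, so the defaults allow up to three start attempts. The function name `start_with_retries` and the `start` callable are hypothetical.

```python
import time
from typing import Callable


def start_with_retries(
    start: Callable[[], bool],
    max_retries: int = 2,             # transformer.databricks.run.max.retries
    retry_interval_ms: int = 10_000,  # transformer.databricks.run.retry.interval
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call start() until it succeeds and return the number of attempts made.

    Raises RuntimeError if the initial attempt and every retry fail.
    """
    attempts = 0
    for attempt in range(1 + max_retries):
        attempts += 1
        if start():
            return attempts
        if attempt < max_retries:
            # Wait between attempts, converting milliseconds to seconds.
            sleep(retry_interval_ms / 1000)
    raise RuntimeError(f"pipeline failed to start after {attempts} attempts")


# Simulate two failed starts followed by a success, recording waits instead of sleeping.
outcomes = iter([False, False, True])
waits = []
attempts = start_with_retries(lambda: next(outcomes), sleep=waits.append)
print(attempts, waits)
```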
Accessing Databricks Job Details
When you run a Databricks pipeline, Transformer submits an ephemeral job to the Databricks cluster. An ephemeral job is one that runs only once and does not count towards the Databricks job limit. However, job details do not display in the Databricks job menu.
Databricks retains details for ephemeral jobs for 60 days. Use one of the following methods to access details about a Databricks job:
- After the job completes, on the History tab of the job, click View Summary for the job run. Use the Databricks Job URL link that displays in the Job Metrics Summary.
- Use the jobs/runs/get Databricks API to check the run state of the workloads.
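A call to jobs/runs/get might be sketched as follows. The example assumes version 2.1 of the Databricks Jobs REST API, bearer-token authentication, and a response containing a `state` object with `life_cycle_state` and `result_state` fields. The workspace URL, token, and run ID are placeholders, and the helper names are hypothetical. Check the Databricks API documentation for the version available in your workspace.

```python
import json
import urllib.parse
import urllib.request


def build_run_request(workspace_url: str, token: str, run_id: int) -> urllib.request.Request:
    """Build a GET request for the jobs/runs/get endpoint without sending it."""
    query = urllib.parse.urlencode({"run_id": run_id})
    url = f"{workspace_url.rstrip('/')}/api/2.1/jobs/runs/get?{query}"
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})


def run_state(response_body: str) -> tuple:
    """Extract (life_cycle_state, result_state) from a jobs/runs/get response body."""
    state = json.loads(response_body).get("state", {})
    return state.get("life_cycle_state"), state.get("result_state")


# Hypothetical workspace URL, token, and run ID.
request = build_run_request("https://example.cloud.databricks.com", "<token>", 12345)
print(request.full_url)

# To send the request and read the state:
#   body = urllib.request.urlopen(request).read().decode()
#   print(run_state(body))
```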