Databricks
The Databricks target writes data to one or more Delta Lake tables on Databricks. For information about supported versions, see Supported systems and versions.
Use the Databricks target for the following use cases:
- Bulk load new data into Delta Lake tables
- Build a flow that bulk loads new data into Delta Lake tables on Databricks. When processing new data, the target uses the COPY command to load data into Delta Lake tables. For a detailed solution of how to design this flow, see Bulk loading data into a Delta Lake table.
- Merge changed data into Delta Lake tables
- Build a flow that reads change data capture (CDC) data from a database and replicates the changes to Delta Lake tables on Databricks. When processing CDC data, the target uses the MERGE command to load data into Delta Lake tables. For a detailed solution of how to design this flow, see Merging changed data into a Delta Lake table.
The Databricks target first stages the flow data in text files in Amazon S3, Azure Data Lake Storage Gen2, or Google Cloud Storage. Then, the target sends the COPY or MERGE command to Databricks to process the staged files.
The Databricks target uses a JDBC URL to connect to the Databricks cluster. When you configure the target, you specify the JDBC URL and credentials to use to connect to the cluster. You can also use a connection to configure the target. You define the information that the target uses to connect to the staging location.
You specify the tables in Delta Lake to write the data to. The target writes data from record fields to table columns based on matching names. You can configure the target to compensate for data drift by creating new columns in existing database tables when new fields appear in records or by creating new database tables. When the target performs data drift and creates new tables, you can configure it to partition the tables based on specified fields in the record.
You can configure the root field for the row, and any first-level fields that you want to exclude from the record. You can also configure the target to replace missing fields or fields with invalid data types with user-defined default values and to replace newline characters in string fields with a specified character. You can specify the quoting mode, define quote and escape characters, and configure the target to trim spaces.
The Databricks target can use CRUD operations defined in the
sdc.operation.type record header attribute to write data. For information about Data Collector change data processing and a list of
CDC-enabled sources, see Processing changed data.
Before you use the Databricks target, you must complete a prerequisite task.
Prerequisite
Before configuring the Databricks target, prepare your Databricks cluster.
In Databricks, configure and start your Databricks cluster, generate a personal access token, and locate the JDBC URL used to access the cluster.
- When using Amazon S3 as the staging location, see this Databricks article.
- When using Azure Data Lake Storage Gen2 as the staging location, see this Azure Databricks article.
- When using Google Cloud Storage as the staging location, see this Databricks article.
Load methods
The Databricks target can load data to Delta Lake tables using the following methods:
- COPY command for new data
- The COPY command, the default load method, performs a bulk synchronous load to Delta Lake, treating all records as INSERTS. Use this method to write new data to Delta Lake tables.
- MERGE command for CDC data
- Instead of treating all records as INSERT, the MERGE command inserts, updates, upserts, and deletes changed data to Delta Lake tables as appropriate. Use this method to write change data capture (CDC) data to Delta Lake tables using CRUD operations.
Use the recommended guidelines to optimize for performance and cost-effectiveness.
CRUD operation processing
When you configure the Databricks target to use the MERGE command to load CDC data, the target can insert, update, upsert, or delete data.
sdc.operation.type record header attribute. The target
performs operations based on the following numeric values:- 1 for INSERT
- 2 for DELETE
- 3 for UPDATE
- 4 for UPSERT
If
your flow has a CRUD-enabled source that processes changed data, the
target simply reads the operation type from the sdc.operation.type
header attribute that the source generates. If your flow has a non-CDC
source, you can use the Expression Evaluator processor or a scripting processor to
define the record header attribute. For more information about Data Collector changed data processing and a list of
CDC-enabled sources, see Processing changed data.
Primary key location
When you configure the Databricks target to use the MERGE command to load CDC data, you must provide primary key information.
You can configure the target to access primary key information in the following ways:
- Use JDBC record header attribute
- The target uses primary key columns stored in the
jdbc.primaryKeySpecificationrecord header attribute. - Use specified columns for each table
- The target uses the primary key columns that you specify in the Table Key Columns properties.
Specifying tables
You can use the Databricks target to write to one or more tables. The target writes data from record fields to the table columns based on matching names.
- Single table on Databricks File System (DBFS)
- To write data to a single table on DBFS, enter the name of the database and table using the
following format:
<database_name>.<table_name> - Single external table
- To write data to a single external table, enter the name of the catalog, schema, and table using
the following format:
<catalog_name>.<schema_name>.<table_name> - Multiple tables
- To write data to multiple tables, specify a field in the record that defines the database and tables to write to.
Use the Table Name property on the Databricks tab to specify the tables to write to.
Enabling data drift handling
The Databricks target can automatically compensate for changes in column or table requirements, also known as data drift.
- Create new columns
- The target can create new columns in Delta Lake tables when new fields appear in
records. For example, if a record suddenly includes a new
Address2field, the target creates a newAddress2column in the target table. - Create new tables
- When data drift handling is enabled, you can also configure the target to create
new tables as needed. For example, say the target writes data to tables based on the
region name in the
Regionfield. When a newSW-3region shows up in a record, the target creates a newSW-3table in Delta Lake and writes the record to the new table.
Partitioning tables
When you enable data drift handling and automatic table creation, you can configure the Databricks target to partition the tables and write to those partitions.
When partitioning tables, you specify the name of the table to partition and the name of the record field to become the partition column.
When appropriate, you can configure the target to use the same partition column for all created tables. To do this, use an asterisk ( * ) for the Table property. Then, the target uses the specified field as a partition column for all tables created by the target.
For example, the following configuration uses the country field as a
partition in all tables, and the region field for the
customers table:

Performance optimization
Use the following tips to optimize for performance and cost-effectiveness when using the Databricks target:
- Increase the batch size
- The maximum batch size is determined by the source in the flow and typically has a default value of 1,000 records. To take advantage of the Databricks loading abilities when writing to Delta Lake tables using the COPY or MERGE commands, increase the maximum batch size in the flow source to 20,000-50,000 records. Be sure to increase the Data Collector java heap size, as needed.
- Use multiple threads
- When writing to Delta Lake tables using the COPY command, you can use multiple threads to improve performance by including a multithreaded source in the flow. When Data Collector resources allow, using multiple threads enables processing multiple batches of data concurrently. As with increasing the batch size, when using multiple threads, you should make sure that the Data Collector is sized appropriately.
- Enable additional connections to Databricks
- When writing to multiple Delta Lake tables using the COPY or MERGE commands, increase the number of connections that the Databricks target makes to Databricks. Each additional connection allows the target to write to an additional table, concurrently.
Staging location
The Databricks target first stages the flow data in text files in a specified location. Then, the target sends the COPY or MERGE command to Databricks to process the staged files.
- Amazon S3
- After selecting Amazon S3 as the staging location, specify the existing S3 bucket to stage the files to. You also specify the credentials that the target uses to connect to Amazon S3.
- ADLS Gen2
- After selecting ADLS Gen2 as the staging location, specify the name of the existing Azure account and storage container to stage the files to. You then configure the target to use the appropriate authentication method to connect to Azure Data Lake Storage Gen2.
- Google Cloud Storage
- After selecting Google Cloud Storage as the staging location, specify the project ID and an existing Google Cloud Storage bucket to use. You also define credential provider details.
For all staging locations, you specify the stage file name prefix and whether the target removes a staged file after its data is written to Delta Lake tables.
Amazon S3 credentials
When you configure the target to connect to an Amazon S3 staging location, the target must pass credentials to Amazon Web Services.
- Instance profile
- When Data Collector runs on an Amazon EC2 instance that has an associated instance profile, Data Collector uses the instance profile credentials to automatically authenticate with AWS.
- AWS access key pair
-
When Data Collector does not run on an Amazon EC2 instance or when the EC2 instance doesn’t have an instance profile, you can connect with an AWS access key pair.
To connect with an AWS access key pair, on the Staging tab, set the Authentication Method property to AWS Keys. Then define the access key ID and secret access key.Tip: To secure sensitive information such as access key pairs, you can use runtime resources or credential stores.
ADLS Gen2 authentication information
When you configure the target to connect to an ADLS Gen2 staging location, you select the authentication method that the target uses to connect to Azure Data Lake Storage Gen2.
- OAuth 2.0
- Connections made with OAuth 2.0 authentication require the following
information:
- Application ID - Application ID for the Azure Active Directory Data Collector
application. Also known as the client ID.
For information on accessing the application ID from the Azure portal, see the Azure documentation.
- Application Key - Authentication key or client secret
for the Azure Active Directory application. Also known as the
client secret.
For information on accessing the application key from the Azure portal, see the Azure documentation.
- Auth Token Endpoint - OAuth 2.0 token endpoint for
the Azure Active Directory v1.0 application for Data Collector. For example:
https://login.microsoftonline.com/<uuid>/oauth2/token.
- Application ID - Application ID for the Azure Active Directory Data Collector
application. Also known as the client ID.
- Shared Key
- Connections made with Shared Key authentication require the following
information:
- Account Shared Key - Shared access key that Azure
generated for the storage account.
For more information on accessing the shared access key from the Azure portal, see the Azure documentation.
- Account Shared Key - Shared access key that Azure
generated for the storage account.
Google Cloud credentials
Before staging data on Google Cloud Storage, the Databricks target must pass credentials to Google Cloud.
- Google Cloud default credentials
- Credentials in a file
- Credentials in a stage property
For details on how to configure each option, see Security in Google Cloud stages.
Row generation
When writing a record to a table, the Databricks target includes all
record fields in the resulting row, by default. The target uses the root field,
/, as the basis for the resulting row.
You can configure the Row Field property on the Data tab to specify a map or list-map field in the record as the basis for the row. The resulting record includes only the data from the specified map or list-map field and excludes all other record data. Use this functionality when the data that you want to write to Delta Lake tables exists in a single map or list-map field within the record.
When you want to use the root field, but do not want to include all fields in the resulting row, configure the target to ignore all specified first-level fields.
/, as the basis for the resulting row. A record contains the following
fields:{
"name": "Jane Smith",
"id": "557",
"address": {
"street": "101 3rd St",
"city": "Huntsville",
"state": "NC",
"zipcode": "27023"
}
}The target treats the address map field as a field with an invalid
data type, processing the field as an error record by default. You can configure the
target to ignore the field and process the remaining record data, as described in
Missing fields and fields with invalid types.
Missing fields and fields with invalid types
By default, the target treats records with missing fields or with invalid data types in fields as error records.
You can configure the target to ignore missing fields or fields with invalid data types, replacing the data in the field with an empty value.
The default for each data type is \N, which represents an empty
value in Delta Lake. You can specify a different default value to use for each data type on the Data
Defaults tab. For example, you might define the default value for a missing String field or a String
field with an invalid data type as none or not_applicable.
To configure the target to ignore missing fields and fields with invalid data types, select the Ignore Missing Fields and the Ignore Fields with Invalid Types properties on the Data tab.
Databricks data types
The Databricks target converts Data Collector data types into Databricks data types before writing data to Delta Lake tables.
When you configure the target to compensate for data drift, you can also configure the target to create all new columns as String. However, by default, the target converts record data to the appropriate data type.
The target does not support nested Data Collector data types: List, List-Map, and Map. By default, the target treats fields with invalid data types as an error record. You can configure the target to ignore fields with invalid data types, as described in Missing fields and fields with invalid types.
The target converts the following Data Collector data types into these Databricks data types:
| Data Collector Data Type | Databricks Data Type |
|---|---|
| Boolean | Boolean |
| Byte | Tinyint |
| Byte_Array | Binary |
| Char | String |
| Date | Date |
| Datetime | Timestamp |
| Decimal | Decimal |
| Double | Double |
| Float | Float |
| Integer | Int |
| Long | Bigint |
| Short | Smallint |
| String | String |
| Time | Timestamp |
| Zoned_Datetime | Date |
Configuring a Databricks target
About this task
Configure a Databricks target to write data to one or more Delta Lake tables on Databricks. Before you configure the target, be sure to complete the prerequisite task.