Databricks

The Databricks target writes data to one or more Delta Lake tables on Databricks. For information about supported versions, see Supported systems and versions.

Use the Databricks target for the following use cases:

Bulk load new data into Delta Lake tables: Build a flow that bulk loads new data into Delta Lake tables on Databricks. When processing new data, the target uses the COPY command to load data into Delta Lake tables. For a detailed solution of how to design this flow, see Bulk loading data into a Delta Lake table.
Merge changed data into Delta Lake tables: Build a flow that reads change data capture (CDC) data from a database and replicates the changes to Delta Lake tables on Databricks. When processing CDC data, the target uses the MERGE command to load data into Delta Lake tables. For a detailed solution of how to design this flow, see Merging changed data into a Delta Lake table.

The Databricks target first stages the flow data in text files in Amazon S3, Azure Data Lake Storage Gen2, or Google Cloud Storage. Then, the target sends the COPY or MERGE command to Databricks to process the staged files.

The Databricks target uses a JDBC URL to connect to the Databricks cluster. When you configure the target, you specify the JDBC URL and credentials to use to connect to the cluster. You can also use a connection to configure the target. You define the information that the target uses to connect to the staging location.

You specify the tables in Delta Lake to write the data to. The target writes data from record fields to table columns based on matching names. You can configure the target to compensate for data drift by creating new columns in existing database tables when new fields appear in records or by creating new database tables. When the target performs data drift and creates new tables, you can configure it to partition the tables based on specified fields in the record.

You can configure the root field for the row, and any first-level fields that you want to exclude from the record. You can also configure the target to replace missing fields or fields with invalid data types with user-defined default values and to replace newline characters in string fields with a specified character. You can specify the quoting mode, define quote and escape characters, and configure the target to trim spaces.

The Databricks target can use CRUD operations defined in the sdc.operation.type record header attribute to write data. For information about Data Collector change data processing and a list of CDC-enabled sources, see Processing changed data.

Before you use the Databricks target, you must complete a prerequisite task.

Prerequisite

Before configuring the Databricks target, prepare your Databricks cluster.

In Databricks, configure and start your Databricks cluster, generate a personal access token, and locate the JDBC URL used to access the cluster.

For detailed prerequisite steps, see one of the following Databricks articles depending on your staging location:

When using Amazon S3 as the staging location, see this Databricks article.
When using Azure Data Lake Storage Gen2 as the staging location, see this Azure Databricks article.
When using Google Cloud Storage as the staging location, see this Databricks article.

Load methods

The Databricks target can load data to Delta Lake tables using the following methods:

COPY command for new data: The COPY command, the default load method, performs a bulk synchronous load to Delta Lake, treating all records as INSERTS. Use this method to write new data to Delta Lake tables.; When using the COPY command, the target first stages the flow data in text files in the specified staging location, then copies the staged data to the target Delta Lake tables.; Since the COPY command is the default load method, you do not need to configure the target to use this command.; For more information about the COPY command, see the Databricks documentation.
MERGE command for CDC data: Instead of treating all records as INSERT, the MERGE command inserts, updates, upserts, and deletes changed data to Delta Lake tables as appropriate. Use this method to write change data capture (CDC) data to Delta Lake tables using CRUD operations.; When using the MERGE command, the target first stages the flow data in text files in the specified staging location. Then the target runs the COPY command to load the data to a temporary Delta Lake table, and then finally runs a MERGE command that uses the temporary table to merge the changed data into the target Delta Lake table.; Important: To maintain the original order of data, do not use multiple threads or cluster execution mode when processing CDC data.; To use the MERGE command to load CDC data, select the Merge CDC Data property on the Data tab of the target. Then, enter the columns in the Delta Lake table to use as key columns.; For more information about the MERGE command, see the Databricks documentation.

Use the recommended guidelines to optimize for performance and cost-effectiveness.

CRUD operation processing

When you configure the Databricks target to use the MERGE command to load CDC data, the target can insert, update, upsert, or delete data.

When writing CDC data, the target uses the CRUD operation specified in the sdc.operation.type record header attribute. The target performs operations based on the following numeric values:

1 for INSERT
2 for DELETE
3 for UPDATE
4 for UPSERT

If your flow has a CRUD-enabled source that processes changed data, the target simply reads the operation type from the sdc.operation.type header attribute that the source generates. If your flow has a non-CDC source, you can use the Expression Evaluator processor or a scripting processor to define the record header attribute. For more information about Data Collector changed data processing and a list of CDC-enabled sources, see Processing changed data.

Primary key location

When you configure the Databricks target to use the MERGE command to load CDC data, you must provide primary key information.

You can configure the target to access primary key information in the following ways:

Use JDBC record header attribute: The target uses primary key columns stored in the jdbc.primaryKeySpecification record header attribute.; Several sources, such as the Oracle CDC Client source and the PostgreSQL CDC Client source, populate the record header attribute automatically. You can also use a processor, such as the Expression Evaluator, to populate the header attribute within the flow.
Use specified columns for each table: The target uses the primary key columns that you specify in the Table Key Columns properties.; Note: When key columns are known, entering them manually is more efficient than the other option.

Specifying tables

You can use the Databricks target to write to one or more tables. The target writes data from record fields to the table columns based on matching names.

Specify the tables to use based on how many tables you want to write to and the table location:

Single table on Databricks File System (DBFS)

To write data to a single table on DBFS, enter the name of the database and table using the following format:

<database_name>.<table_name>

For example, enter sales.accounts to write data to the accounts table in the sales database on DBFS.

To write to a table in the default delta database, enter the table name only.

Single external table

To write data to a single external table, enter the name of the catalog, schema, and table using the following format:

<catalog_name>.<schema_name>.<table_name>

For example, enter catalog.sales.accounts to write data to the accounts table using the sales schema in the catalog catalog.

Multiple tables

To write data to multiple tables, specify a field in the record that defines the database and tables to write to.

When the field contains the database and table name as follows, <database_name>.<table_name>, then each table is written to the specified database and table. When the field contains just the table name, each table is written to the default delta database.

For example, say you have tables named after departments in your company, such as Operations, Sales, and Marketing. The records being processed have a dept field with matching values. You configure the target to write records to the various tables using the following expression: ${record:value('/dept')}.

If the dept field contains just the table name, each table is written to the default delta database. If the dept field contains the database and table name, such as Canada.Operations, then each table is written to the specified database.

Or, say that you want to replicate data from multiple tables in a SQL Server database. You use a JDBC Multitable Consumer source which captures the name of each source table in the jdbc.tables record header attribute. To write data to tables based on the source table names, enter the following expression in the Table Name property: ${record:attribute('jdbc.tables')}.

When writing data to multiple tables, you might also increase the number of connections that the target uses. For more information, see Performance Optimization.

Use the Table Name property on the Databricks tab to specify the tables to write to.

Enabling data drift handling

The Databricks target can automatically compensate for changes in column or table requirements, also known as data drift.

The target can handle data drift in the following ways:

Create new columns

The target can create new columns in Delta Lake tables when new fields appear in records. For example, if a record suddenly includes a new Address2 field, the target creates a new Address2 column in the target table.

By default, the target creates new columns based on the data in the new fields, such as creating an Int column for integer data. You can, however, configure the target to create all new columns as String.

To enable the automatic creation of new columns, select the Enable Data Drift property on the Databricks tab. To create all new columns as String, select the Create New Columns as String property.

Create new tables

When data drift handling is enabled, you can also configure the target to create new tables as needed. For example, say the target writes data to tables based on the region name in the Region field. When a new SW-3 region shows up in a record, the target creates a new SW-3 table in Delta Lake and writes the record to the new table.

To enable the creation of new tables, first enable data drift, and then select the Auto Create Table property on the Databricks tab.

When creating a new table, you can optionally enter the directory for the Delta table location, specified as a relative path on Databricks File System (DBFS) or a URI path to an external location. Enter the location in the Table Location Path property.

Note: Tables used by Unity Catalog must be saved to an external location.

The target adds the specified Table Name value as a subdirectory of the Table Location Path to create the final table location. Here are some examples of table names, table location paths, and the final table locations:

DBFS - If you enter /mnt/deltalake as the table location path and you enter sales.accounts as the table name, the final table location is /mnt/deltalake/sales.accounts.
External - If you enter abfss://azure_container@account.dfs.core.windows.net/azure_sales/ as the table location path and you enter catalog.sales.accounts as the table name, the final table location is abfss://azure_container@account.dfs.core.windows.net/azure_sales/catalog.sales.accounts.

When you specify a location, the target creates an unmanaged Delta table. When you do not specify a location, the target creates a managed Delta table. For more information, see the Delta Lake documentation.

Partitioning tables

When you enable data drift handling and automatic table creation, you can configure the Databricks target to partition the tables and write to those partitions.

When partitioning tables, you specify the name of the table to partition and the name of the record field to become the partition column.

When appropriate, you can configure the target to use the same partition column for all created tables. To do this, use an asterisk ( * ) for the Table property. Then, the target uses the specified field as a partition column for all tables created by the target.

For example, the following configuration uses the country field as a partition in all tables, and the region field for the customers table:

Partition columns properties

Performance optimization

Use the following tips to optimize for performance and cost-effectiveness when using the Databricks target:

Increase the batch size

The maximum batch size is determined by the source in the flow and typically has a default value of 1,000 records. To take advantage of the Databricks loading abilities when writing to Delta Lake tables using the COPY or MERGE commands, increase the maximum batch size in the flow source to 20,000-50,000 records. Be sure to increase the Data Collector java heap size, as needed.

Important: Increasing the batch size is strongly recommended. Using the default batch size can be slow and costly.

Use multiple threads

When writing to Delta Lake tables using the COPY command, you can use multiple threads to improve performance by including a multithreaded source in the flow. When Data Collector resources allow, using multiple threads enables processing multiple batches of data concurrently. As with increasing the batch size, when using multiple threads, you should make sure that the Data Collector is sized appropriately.

Note: Do not use multiple threads to write CDC data to Delta Lake tables with the MERGE command. When using multiple threads to process data, the original order of the data is not retained.

Enable additional connections to Databricks

When writing to multiple Delta Lake tables using the COPY or MERGE commands, increase the number of connections that the Databricks target makes to Databricks. Each additional connection allows the target to write to an additional table, concurrently.

For example, when writing to 10 tables with only one connection, the target can only write to one table at a time. With 5 connections, the target can write to 5 tables at a time. 10 connections enables writing to all 10 tables at the same time.

By default, the target uses one connection for standard single-threaded flows. In multithreaded flows, the target matches the number of threads used by the flow. That is, when a multithreaded source is configured to use up to 3 threads, then by default, the Databricks target uses 3 connections to write to Delta Lake tables, one for each thread.

Note that the number of connections is for the entire flow, not for each thread. So when using multiple threads to write to multiple tables, you can also improve performance by allocating additional connections. For example, when using 3 threads to write to 3 tables, you might increase the number of connections to 9 for maximum throughput.

Use the Connection Pool Size property to specify the maximum number of connections that the target can use.

Staging location

The Databricks target first stages the flow data in text files in a specified location. Then, the target sends the COPY or MERGE command to Databricks to process the staged files.

The target can stage data in one of the following locations:

Amazon S3: After selecting Amazon S3 as the staging location, specify the existing S3 bucket to stage the files to. You also specify the credentials that the target uses to connect to Amazon S3.
ADLS Gen2: After selecting ADLS Gen2 as the staging location, specify the name of the existing Azure account and storage container to stage the files to. You then configure the target to use the appropriate authentication method to connect to Azure Data Lake Storage Gen2.
Google Cloud Storage: After selecting Google Cloud Storage as the staging location, specify the project ID and an existing Google Cloud Storage bucket to use. You also define credential provider details.

For all staging locations, you specify the stage file name prefix and whether the target removes a staged file after its data is written to Delta Lake tables.

Amazon S3 credentials

When you configure the target to connect to an Amazon S3 staging location, the target must pass credentials to Amazon Web Services.

Use one of the following methods to pass AWS credentials:

Instance profile: When Data Collector runs on an Amazon EC2 instance that has an associated instance profile, Data Collector uses the instance profile credentials to automatically authenticate with AWS.; To use an instance profile, on the Staging tab, set the Authentication Method property to Instance Profile.; For more information about associating an instance profile with an EC2 instance, see the Amazon EC2 documentation.
AWS access key pair: When Data Collector does not run on an Amazon EC2 instance or when the EC2 instance doesn’t have an instance profile, you can connect with an AWS access key pair.

To connect with an AWS access key pair, on the Staging tab, set the Authentication Method property to AWS Keys. Then define the access key ID and secret access key.
Tip: To secure sensitive information such as access key pairs, you can use runtime resources or credential stores.

ADLS Gen2 authentication information

When you configure the target to connect to an ADLS Gen2 staging location, you select the authentication method that the target uses to connect to Azure Data Lake Storage Gen2.

Select one of the following authentication methods:

OAuth 2.0

Connections made with OAuth 2.0 authentication require the following information:

Application ID - Application ID for the Azure Active Directory Data Collector application. Also known as the client ID.
For information on accessing the application ID from the Azure portal, see the Azure documentation.
Application Key - Authentication key or client secret for the Azure Active Directory application. Also known as the client secret.
For information on accessing the application key from the Azure portal, see the Azure documentation.
Auth Token Endpoint - OAuth 2.0 token endpoint for the Azure Active Directory v1.0 application for Data Collector. For example: https://login.microsoftonline.com/<uuid>/oauth2/token.

Shared Key

Connections made with Shared Key authentication require the following information:

Account Shared Key - Shared access key that Azure generated for the storage account.
For more information on accessing the shared access key from the Azure portal, see the Azure documentation.

Google Cloud credentials

Before staging data on Google Cloud Storage, the Databricks target must pass credentials to Google Cloud.

You can provide credentials using one the following options:

Google Cloud default credentials
Credentials in a file
Credentials in a stage property

For details on how to configure each option, see Security in Google Cloud stages.

Row generation

When writing a record to a table, the Databricks target includes all record fields in the resulting row, by default. The target uses the root field, /, as the basis for the resulting row.

You can configure the Row Field property on the Data tab to specify a map or list-map field in the record as the basis for the row. The resulting record includes only the data from the specified map or list-map field and excludes all other record data. Use this functionality when the data that you want to write to Delta Lake tables exists in a single map or list-map field within the record.

When you want to use the root field, but do not want to include all fields in the resulting row, configure the target to ignore all specified first-level fields.

The Databricks target cannot convert map or list-map fields nested within the specified root field. The target treats these fields as having an invalid data type. For example, say that you configure the target to use the root field, /, as the basis for the resulting row. A record contains the following fields:

{
    "name": "Jane Smith",
    "id": "557",
    "address": {
        "street": "101 3rd St",
        "city": "Huntsville",
        "state": "NC",
        "zipcode": "27023"
     }
}

The target treats the address map field as a field with an invalid data type, processing the field as an error record by default. You can configure the target to ignore the field and process the remaining record data, as described in Missing fields and fields with invalid types.

Tip: To write all fields in a record including nested map or list-map fields, use a Field Flattener processor in the flow to flatten the entire record to produce a record with no nested fields.

Missing fields and fields with invalid types

By default, the target treats records with missing fields or with invalid data types in fields as error records.

You can configure the target to ignore missing fields or fields with invalid data types, replacing the data in the field with an empty value.

The default for each data type is \N, which represents an empty value in Delta Lake. You can specify a different default value to use for each data type on the Data Defaults tab. For example, you might define the default value for a missing String field or a String field with an invalid data type as none or not_applicable.

To configure the target to ignore missing fields and fields with invalid data types, select the Ignore Missing Fields and the Ignore Fields with Invalid Types properties on the Data tab.

Databricks data types

The Databricks target converts Data Collector data types into Databricks data types before writing data to Delta Lake tables.

When you configure the target to compensate for data drift, you can also configure the target to create all new columns as String. However, by default, the target converts record data to the appropriate data type.

The target does not support nested Data Collector data types: List, List-Map, and Map. By default, the target treats fields with invalid data types as an error record. You can configure the target to ignore fields with invalid data types, as described in Missing fields and fields with invalid types.

Tip: To process fields in a record including nested list, list-map or map fields, use a Field Flattener processor to flatten the entire record to produce a record with no nested fields.

The target converts the following Data Collector data types into these Databricks data types:

Data Collector Data Type	Databricks Data Type
Boolean	Boolean
Byte	Tinyint
Byte_Array	Binary
Char	String
Date	Date
Datetime	Timestamp
Decimal	Decimal
Double	Double
Float	Float
Integer	Int
Long	Bigint
Short	Smallint
String	String
Time	Timestamp
Zoned_Datetime	Date

Configuring a Databricks target

About this task

Configure a Databricks target to write data to one or more Delta Lake tables on Databricks. Before you configure the target, be sure to complete the prerequisite task.

Procedure

In the Properties panel, on the General tab, configure the following properties:

General Property	Description
Name	Stage name.
Description	Optional description.
Required Fields	Fields that must include data for the record to be passed into the stage. Tip: You might include fields that the stage uses. Records that do not include all required fields are processed based on the error handling configured for the flow.
Preconditions	Conditions that must evaluate to TRUE to allow a record to enter the stage for processing. Click Add to create additional preconditions. Records that do not meet all preconditions are processed based on the error handling configured for the stage.
On Record Error	Error record handling for the stage: Discard - Discards the record. Send to Error - Sends the record to the flow for error handling. Stop Flow - Stops the flow.

On the Databricks tab, configure the following properties:

Databricks Property	Description
Connection 7.4 and later	Connection that defines the information that is required to connect to an external system. To connect to an external system, you can select a connection that contains the details, or you can specify the details in local properties. When you select a connection, the flow canvas hides properties that are defined in the connection.
JDBC URL	JDBC URL used to connect to the Databricks cluster. Enter in the following format: `jdbc:databricks://<server_hostname>:443/default;transportMode=http` `:ssl=1;httpPath=sql/protocolv1/o/0/xxxx-xxxxxx-xxxxxxxx;AuthMech=3;` Tip: In Databricks, you can locate the JDBC URL for your cluster on the JDBC/ODBC tab in the cluster configuration details. As a best practice, remove the `PWD` parameter from the URL, and then enter the personal access token value in the Token property below.
Token	Personal access token used to connect to the Databricks cluster. Tip: To secure sensitive information such as access key pairs, you can use runtime resources or credential stores.
Fully Qualified Table Name	Table to write data to. Enter the table name in one of the following formats for the table name, depending on the table location: DBFS - `<database_name>.<table_name>` External - `<catalog_name>.<schema_name>.<table_name>` Note: The catalog and schema for external tables must be created before you can run the flow. When you specify a table name only, the target locates the table in the default `delta` database. To write data to a single table, enter the table name. To write data to multiple tables, enter an expression that evaluates to the field in the record that contains the table name. For example: `${record:value('/table')}` Or, to write data to tables based on the table name in the `jdbc.tables` record header attribute generated by a source that reads from a relational database, you can use the following expression: `${record:attribute('jdbc.tables')}`
Connection Pool Size	Maximum number of connections that the target uses to write data to Delta Lake tables. Default is 0, which ensures that the target uses the same number of connections as threads used by the flow. When writing to multiple tables, increasing this property can improve performance.
Enable Data Drift	Creates new columns in existing tables when new fields appear in records. When not enabled, the target treats records that have new fields as error records.
Auto Create Table	Automatically creates tables in Delta Lake if the tables specified in the target do not exist. When not enabled, the target treats a record attempting to write to a table that doesn't exist as an error record. Available when data drift is enabled.
Table Location Path	Directory for the Delta table location, specified as a path on Databricks File System (DBFS) or a URI path to an external location. Note: Tables used by Unity Catalog must be saved to an external location. The target adds the specified Table Name value as a subdirectory of the Table Location Path to create the final table location. Here are some examples of table names, table location paths, and the final table locations: DBFS - If you enter /mnt/deltalake as the table location path and you enter `sales.accounts` as the table name, the final table location is /mnt/deltalake/sales.accounts. External - If you enter `abfss://azure_container@account.dfs.core.windows.net/azure_sales/` as the table location path and you enter `catalog.sales.accounts` as the table name, the final table location is `abfss://azure_container@account.dfs.core.windows.net/azure_sales/catalog.sales.accounts`. When you specify a location, the target creates an unmanaged Delta table. When you do not specify a location, the target creates a managed Delta table. For more information, see the Delta Lake documentation. Available when data drift and automatic table creation are enabled.
Partition Table	Enables partitioning a table. Available when data drift and automatic table creation are enabled.
Partition Columns	Partition columns to use. Specify the following information: Table - Name of the table to create the partition column for. Use an asterisk ( * ) to represent all tables. Column - Name of the field in the record to use as a partition column. Available when data drift and automatic table creation are enabled.
Create New Columns as String	Creates all new columns as String. By default, the target infers the data type based on the type of data in the field. For example, if a new field contains an integer, the target creates a column with the IntegerType data type. Available when data drift is enabled.
Queries per Second	Maximum number of queries to run in a second across all partitions and tables. Use 0 for no limit. Default is 0.
Query Maximum Wait Time (secs)	Maximum number of seconds to wait for a query to finish before timing out. Use 0 for no limit. Default is 0.
Maximum Number of Retries	Maximum number of times to retry a query. Default is 1.
Retry Interval (secs)	Time between query retries in seconds. Default is 10.

On the Staging tab, configure the following properties:

Staging Property	Description
Staging Location	Staging location to connect to and copy or merge data from: Amazon S3 - Connects to an Amazon S3 staging location. ADLS Gen2 - Connects to an Azure Data Lake Storage Gen2 staging location. Google Cloud Storage - Connects to a Google Cloud Storage staging location.
Amazon S3 Connection 7.0 and later	Connection that defines the information that is required to connect to an external system. To connect to an external system, you can select a connection that contains the details, or you can specify the details in local properties. When you select a connection, the flow canvas hides properties that are defined in the connection.
Authentication Method	Amazon authentication method: AWS Keys Instance Profile - Uses the instance profile assigned to the EC2 instance where Data Collector runs to connect to Amazon S3. None Available when using the Amazon S3 staging location.
Access Key ID	AWS access key ID. Required when not using an instance profile. Available when using the Amazon S3 staging location.
Secret Access Key	AWS secret access key. Required when not using an instance profile. Available when using the Amazon S3 staging location. Tip: To secure sensitive information such as access key pairs, you can use runtime resources or credential stores.
Bucket	Name of the existing bucket to write the staged files to. Available when using the Amazon S3 or Google Cloud Storage staging location.
Azure Blob Storage Connection 7.0 and later	Connection that defines the information that is required to connect to an external system. To connect to an external system, you can select a connection that contains the details, or you can specify the details in local properties. When you select a connection, the flow canvas hides properties that are defined in the connection.
Account FQDN	Fully qualified domain name of the Azure storage account. The target can access Azure Data Lake Storage Gen2 or Blob storage accounts in either global Azure or Azure Government services. For example, valid domain names for global Azure services include `<account-name>.dfs.core.windows.net` or `<account-name>.blob.core.windows.net` Valid domain names for Azure Government services include `<account-name>.dfs.core.usgovcloudapi.net` or `<account-name>.blob.core.usgovcloudapi.net` Available when using the ADLS Gen2 staging location.
Storage Container / File System	Name of the existing Azure storage container or file system to stage the files to. Available when using the ADLS Gen2 staging location.
Azure Authentication Method	Authentication method used to connect to Azure: OAuth 2.0 Shared Key Available when using the ADLS Gen2 staging location.
Application ID	Application ID for the Azure Active Directory Data Collector application. Also known as the client ID. For information on accessing the application ID from the Azure portal, see the Azure documentation. Available when using the OAuth 2.0 authentication method for Azure.
Application Key	Authentication key or client secret for the Azure Active Directory application. Also known as the client secret. For information on accessing the application ID from the Azure portal, see the Azure documentation. Available when using the OAuth 2.0 authentication method for Azure.
Auth Token Endpoint	OAuth 2.0 token endpoint for the Azure Active Directory v1.0 application for Data Collector. For example: `https://login.microsoftonline.com/<uuid>/oauth2/token.` Available when using the OAuth 2.0 authentication method for Azure.
Account Shared Key	Shared access key that Azure generated for the storage account. For more information on accessing the shared access key from the Azure portal, see the Azure documentation. Available when using the Shared Key authentication method for Azure.
Google Cloud Storage Connection 7.0 and later	Connection that defines the information that is required to connect to an external system. To connect to an external system, you can select a connection that contains the details, or you can specify the details in local properties. When you select a connection, the flow canvas hides properties that are defined in the connection.
Project ID	Google Cloud project ID to use.
Credentials Provider	Provider for Google Cloud credentials: Default credentials provider - Uses Google Cloud default credentials. Service account credentials file (JSON) - Uses credentials stored in a JSON service account credentials file. Service account credentials (JSON) - Uses JSON-formatted credentials information from a service account credentials file.
Credentials File Path (JSON)	Path to the Google Cloud service account credentials file used to connect. The credentials file must be a JSON file. Enter a path relative to the Data Collector resources directory, `$SDC_RESOURCES`, or enter an absolute path. Available when Credentials Provider is set to Service Account Credentials File (JSON).
Credentials File Content (JSON)	Contents of a Google Cloud service account credentials JSON file used to connect. Enter JSON-formatted credential information in plain text, or use an expression to call the information from runtime resources or a credential store. Available when Credentials Provider is set to Service Account Credentials (JSON).
Purge Stage File After Ingesting	Purges a staged file after its data is successfully written to a Delta Lake table.
Stage Filestorag Prefix	Prefix to use for the name of the staged files. To create a unique name for each staged file, the target appends a unique number to the prefix. For example, if you define the prefix as `sdc`, the target might name one of the staged files `sdc-69021a22-1474-4926-856f-eb4589d14cca`. You can optionally use the forward slash (/) to define a path to the staged files in the Amazon S3 or Google Cloud Storage bucket or the Azure storage container or file system. For example, if staging to Amazon S3 and you define the bucket as `staging` and the stage file prefix as `/data/sdc`, the target writes the staged files to the following location: `staging/data/sdc-<unique number>`
Specify AWS Region	Enables specifying an AWS region to write to. Available when staging data on Amazon S3.
AWS Region	AWS region to use. Available when staging data on Amazon S3.

When using Amazon S3 as the staging location, on the Staging Advanced tab, configure the following properties:

Staging Advanced Property	Description
S3 Connection Timeout	Seconds to wait for a response before closing the connection.
S3 Socket Timeout	Seconds to wait for a response to a query.
S3 Max Error Retry	Maximum number of times to retry requests.
S3 Uploading Threads	Size of the thread pool for parallel uploads. Used when working with multiple partitions and processing large objects in multiple parts. When working with multiple partitions, setting this property up to the number of partitions being used to can improve performance. For more information about this and the following properties, see the Amazon S3 TransferManager documentation.
S3 Minimum Upload Part Size (MB)	Minimum part size in bytes for multipart uploads.
S3 Multipart Upload Threshold (MB)	Minimum batch size in bytes for multipart uploads.
S3 Proxy Enabled	Specifies whether to use a proxy to connect.
S3 Proxy Host	Proxy host.
S3 Proxy Port	Proxy port.
S3 Proxy Authentication Enabled	Indicates that proxy authentication is used.
S3 Proxy User	S3 proxy user.
S3 Proxy Password	S3 proxy password.
S3 Encryption	Option that Amazon S3 uses to manage the encryption keys: None SSE-S3 - Use Amazon S3-managed keys. SSE-KMS - Use Amazon Web Services KMS-managed keys. Default is None.
S3 Encryption KMS ID	Amazon resource name (ARN) of the AWS KMS encryption key. Use the following format: `<arn>:<aws>:<kms>:<region>:<acct ID>:<key>/<key ID>` Used for SSE-KMS encryption only.
S3 Encryption Context	Key-value pairs to use for the encryption context. Click Add to add key-value pairs. Used for SSE-KMS encryption only.

On the Data tab, configure the following properties:

Data Property	Description
Staging File Format	Format for staging files. Select one of the following options: Delimited Avro Parquet
Replace Newlines	Replaces newline characters in string fields with the specified replacement character. Available when using the Delimited staging file format.
Newline Replacement Character	Character to use to replace newline characters. Available when using the Delimited staging file format.
Trim Spaces	Trims leading and trailing spaces from field data. Available when using the Delimited staging file format.
Column Separator	Character to use as a column separator. Available when using the Delimited staging file format.
Quoting Mode	Mode for handling special characters in the data, such as the column separator and newline character: Quoted - Encloses data in each field with the specified quote character. The following example uses asterisks to enclose the data in a field: `string data, more string data` Escaped - Precedes a special character with the specified escape character. The following example uses a backtick to escape the comma column separator in a field: string data`, more string data Both - Performs both tasks described above. Available when using the Delimited staging file format.
Quote Character	Character to enclose field data. Available when Quoting Mode is set to Quoted or Both.
Escape Character	Character to precede special characters in field data. Available when Quoting Mode is set to Escaped or Both.
Row Field	Map or list-map field to use as the basis for the generated row. Default is `/`, which includes all record fields in the resulting row.
Null Value	Characters in the data that represent a null value. Default is `\N`.
Merge CDC Data	Uses the MERGE command to perform CRUD operations when writing to Delta Lake tables. Select to process CDC data. Important: To maintain the original order of data, do not use multiple threads or cluster execution mode when processing CDC data. For more information about the MERGE command, see Load methods. For information about optimizing flow performance, see Performance optimization.
Primary Key Location	Location of primary key details: JDBC Header - Use when primary key columns are defined in the `jdbc.primaryKeySpecification` record header attribute. Specify for Each Table - Use to list each target table and the primary key columns for the table. Note: When key columns are known, entering them manually is more efficient than the other option.
Table Key Columns	Key columns in each Delta Lake table used to evaluate the MERGE condition. Click the Add icon to add additional tables. Click the Add icon in the Key Columns field to add additional key columns for a table. Available when Primary Key Location is set to Specify for Each Table.
Ignore Missing Fields	Allows writing records with missing fields to tables. Uses the specified default value for the data type of the missing field. When not enabled, records with missing fields are treated as error records.
Ignore Fields with Invalid Types	Allows replacing fields that contain data of an invalid type with the specified default value for the data type. When not enabled, records with data of invalid types are treated as error records.
Data Storage Time Zone	Time zone to convert Date, Datetime, and Time fields to before storing them.
Column fields to ignore 7.0 and later	Fields to ignore from records or tables when writing to the target. To add fields, click Add value, click Edit, and then configure the following options: Ignore fields from - Choose whether to ignore fields from incoming records or from each table. Columns - Enter the column or field names to ignore. Names are case-sensitive. Table - When ignoring fields from tables, enter the table name. The name is case-sensitive.
Column Fields to Ignore 6.4 and earlier	List of fields to ignore when writing to the Delta Lake table. You can enter a comma-separated list of first level fields to ignore.

On the Data Defaults tab, configure the following properties:

Data Defaults Property	Description
Data defaults for missing or invalid fields 7.4 and later	Default values to use when a field is missing or contains invalid data. If you do not define a default value, the target writes missing or invalid fields as `\N`, which represents an empty value in Delta Lake. To define default values, click Add value, click Edit, and then select the data type and enter the default value.
<data type> Default 7.3 and earlier	Default value to use when replacing missing fields or fields with invalid data for the listed data type. Default is `\N`, which represents an empty value in Delta Lake.