Azure Synapse SQL
The Azure Synapse SQL destination loads data into one or more tables in Microsoft Azure Synapse. For information about supported versions, see Supported Systems and Versions in the Data Collector documentation.
To load data, the destination first stages the pipeline data in CSV files in a staging area, either Azure Blob Storage or Azure Data Lake Storage Gen2. Then, the destination loads the staged data into Azure Synapse.
When you configure the Azure Synapse SQL destination, you specify the instance type to connect to. You configure authentication information for the instance and for your Azure staging area. You can also use a connection to configure the destination.
You can use the Azure Synapse SQL destination to write new data or change data capture (CDC) data to Azure Synapse. When processing new data, the destination loads data to Azure Synapse using the COPY command. When processing CDC data, the destination uses the MERGE command. When using the COPY load method, you can also specify copy statement authentication details for bulk loading to Azure Synapse.
You specify the name of the schema and tables to load the data into. The destination writes data from record fields to table columns based on matching names.
You can configure the destination to compensate for data drift by creating new columns in existing database tables when new fields appear in records or by creating new database tables as needed. When creating new tables, you can specify a partition column and partition boundary values to use.
You can configure the root field for the row, and any first-level fields that you want to exclude from the record. You can specify characters to represent null values.
You can configure the destination to replace missing fields or fields with invalid data types with user-defined default values. You can also configure the destination to replace newline characters and trim leading and trailing spaces.
Before you use the Azure Synapse SQL destination, you must complete some prerequisite tasks.
Prerequisites
Prepare the Azure Synapse Instance
Before configuring the destination, prepare your Azure Synapse instance.
- If necessary, create a database in Azure Synapse.
- If necessary, create a schema within the database.
You can use the default schema named dbo or any other user-defined schema.
- If necessary, create tables within the schema.
If the tables defined in the destination do not exist, the destination can create new tables in the schema. You can configure the destination to load data into existing tables only, as described in Specifying Tables.
- Set up a user that can connect to Azure Synapse using SQL Login or Azure Active
Directory password authentication.
For more information about these authentication methods, see the Azure documentation.
- Grant the user the required permissions, based on whether you are loading new or
changed data:
- When using the COPY command to load new data, grant the following
permissions:
- INSERT
- ADMINISTER DATABASE BULK OPERATIONS
- When using the MERGE command to load changed data, grant the following
permissions:
- INSERT
- UPDATE
- DELETE
For either load method, if you want the destination to handle data drift automatically, grant the user the following additional permission:
- ALTER TABLE
For either load method, if you want the destination to create tables automatically while handling data drift, grant the user the following additional permission:
- CREATE TABLE
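As a sketch, the grants above might be issued with T-SQL similar to the following. The login name sdc_loader is hypothetical, and the exact grant scope (schema-level versus database-level) depends on your setup:

```sql
-- COPY load method (new data):
GRANT INSERT ON SCHEMA::dbo TO sdc_loader;
GRANT ADMINISTER DATABASE BULK OPERATIONS TO sdc_loader;

-- MERGE load method (CDC data) additionally needs:
GRANT UPDATE ON SCHEMA::dbo TO sdc_loader;
GRANT DELETE ON SCHEMA::dbo TO sdc_loader;

-- Data drift handling (either load method); in T-SQL, the ability to
-- alter tables in a schema is granted as ALTER on the schema:
GRANT ALTER ON SCHEMA::dbo TO sdc_loader;

-- Automatic table creation while handling data drift:
GRANT CREATE TABLE TO sdc_loader;
```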
Prepare the Azure Storage Staging Area
Before configuring the destination, prepare a staging area in Azure Blob Storage or Azure Data Lake Storage Gen2. The destination stages CSV files in the staging area before loading them to Azure Synapse.
- If necessary, create an Azure storage account in Azure Blob Storage or Azure Data
Lake Storage Gen2.
For information about creating an account, see the Azure documentation.
- If necessary, create a container to act as the staging area for the destination.
Azure Data Lake Storage Gen2 refers to storage as either a file system or container.
For information about creating storage, see the Azure documentation.
- If you plan to use Azure Active Directory with Service Principal authentication to
connect to the staging area, complete the following steps. If you plan to use
Storage Account Key authentication to connect to the staging area, you can skip
these steps:
- If necessary, create a new Azure Active Directory application for Data Collector.
If the storage account already has an existing Azure Active Directory application, you can use the existing application for Data Collector.
For information about creating a new application, see the Azure documentation.
- Grant the application the Storage Blob Data Contributor or Storage Blob Data
Owner role.
For information about configuring access control, see the Azure documentation.
Enable Access to the Container
The Azure Synapse SQL destination loads files from the staging area to Azure Synapse.
- Staging connection
- By default, the destination uses the authentication defined in the staging
connection to access the container and perform the load. The
access required for the connection depends on the authentication method that
you use:
- Azure Active Directory with Service Principal - The minimum required RBAC roles are Storage Blob Data Contributor, Storage Blob Data Owner, or Storage Blob Data Reader.
- Storage Account Key - No permissions are required.
- Copy statement connection
- When using the COPY load method and you enable the use of a copy
statement connection, the destination uses the authentication
defined in the copy statement connection to connect to the container and
issue the COPY command. The access required for the connection depends on
the authentication method that you use:
- Azure Active Directory User - The minimum required RBAC roles are Storage Blob Data Contributor or Storage Blob Data Owner for the storage account.
- Azure Active Directory with Service Principal - The minimum required RBAC roles are Storage Blob Data Contributor, Storage Blob Data Owner, or Storage Blob Data Reader.
- Managed Identity - The minimum required RBAC roles are Storage Blob Data Contributor or Storage Blob Data Owner for the AAD-registered SQL database server.
- Shared Access Signature (SAS) - The minimum required permissions are Read and List.
- Storage Account Key - No permissions are required.
Load Methods
The Azure Synapse SQL destination can load data to Azure Synapse using the following methods:
- COPY command for new data
- The COPY command, the default load method, performs a bulk synchronous load from the staging area to Azure Synapse, treating all records as INSERTS. Use this method to write new data to Azure Synapse tables.
- MERGE command for CDC data
- Like the COPY command, the MERGE command performs a bulk synchronous load from the staging area to Azure Synapse. But instead of treating all records as INSERT, it inserts, updates, upserts, and deletes records as appropriate. Use this method to write change data capture (CDC) data to Azure Synapse tables using CRUD operations.
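For illustration, the statements that the destination issues are similar in shape to the following T-SQL. The table, column, file, and staging names are hypothetical, and the actual statements the destination generates may differ:

```sql
-- COPY load method: bulk load a staged CSV file, treating all rows as inserts.
COPY INTO dbo.sales (id, amount)
FROM 'https://mystorage.blob.core.windows.net/my_files/staged/batch-01.csv'
WITH (FILE_TYPE = 'CSV', FIRSTROW = 2);

-- MERGE load method: apply a staged batch of CDC records by key,
-- inserting, updating, or deleting as appropriate.
MERGE INTO dbo.sales AS target
USING dbo.sales_staged AS source
    ON target.id = source.id
WHEN MATCHED AND source.op = 'DELETE' THEN
    DELETE
WHEN MATCHED THEN
    UPDATE SET target.amount = source.amount
WHEN NOT MATCHED THEN
    INSERT (id, amount) VALUES (source.id, source.amount);
```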
Connections and Authentication
When you configure the destination, you define the following connections:
- Azure Synapse - Determines how the destination connects to Azure Synapse.
- Staging - Determines how the destination connects to the staging area.
- Copy Statement - Optional authentication when using the COPY load method to bulk load data from the staging area to Azure Synapse. When not defined, the destination uses the authentication defined for the staging connection.
Azure Synapse Connection
The Azure Synapse SQL destination requires connection details to connect to your Azure Synapse instance.
On the Azure Synapse SQL tab, you specify the name of the Azure Synapse server and the database to connect to.
- SQL Login
- The destination connects to Azure Synapse using a user name and password created in the database.
- Azure Active Directory
- The destination connects to Azure Synapse using an Azure Active Directory user identity and password. The user identity and password are the credentials used to sign in to the Azure portal. Enter the full user identity, for example: user@mycompany.com.
For example, you might configure the destination to connect to the my_db database on the my-synapse server using the SQL Login authentication method.
Staging Connection
The Azure Synapse SQL destination requires connection details to connect to the storage account that stages pipeline data. You can use Azure Blob Storage or Azure Data Lake Storage Gen2 as the staging area.
On the Azure Staging tab, you specify the name of the storage account and the name of the container or file system to use.
- Azure Active Directory with Service Principal
- The destination connects to the staging area using the following
information:
- Application ID - Application ID for the Azure Active Directory Data Collector
application. Also known as the client ID.
For information on accessing the application ID from the Azure portal, see the Azure documentation.
- Tenant ID - Tenant ID for the Azure Active Directory Data Collector application.
- Application Key - Authentication key for the Azure Active Directory application. Also known as the client secret.
For information on accessing the application key from the Azure portal, see the Azure documentation.
- Storage Account Key
- The destination connects to the staging area using the following
information:
- Storage Account Key - Shared access key that Azure
generated for the storage account.
For more information on accessing the shared access key from the Azure portal, see the Azure documentation.
For example, you might configure the destination to connect to the my_files file system in the my_storage Azure Data Lake Storage Gen2 storage account using the Azure Active Directory with Service Principal authentication method.
Copy Statement Connection
When using the COPY load method, the Azure Synapse SQL destination creates a connection to issue a COPY command that reads from the staging area and bulk loads the data into Azure Synapse.
The destination uses the connection information specified for the Azure Synapse and staging connections for the copy statement connection. That is, the copy statement connection uses configuration details for the Azure Synapse server and database that you define for the Azure Synapse connection. The copy statement connection also uses configuration details for the staging storage account and container that you define for the staging connection.
By default, the copy statement connection also uses the staging authentication method and configuration details that you define for the staging connection. When needed, you can specify authentication details specifically for the copy statement connection, instead of reusing the staging authentication details.
On the Azure Staging tab, you can enable the use of copy statement authentication and specify the authentication method and details to use.
- Azure Active Directory User
- Can be used only when the Azure Synapse connection uses Azure Active Directory authentication.
- Azure Active Directory with Service Principal
- The destination connects from Azure Synapse to the staging area using the
following information:
- Application ID - Application ID for the Azure Active Directory Data Collector
application. Also known as the client ID.
For information on accessing the application ID from the Azure portal, see the Azure documentation.
- Tenant ID - Tenant ID for the Azure Active Directory Data Collector application.
- Application Key - Authentication key for the Azure Active Directory application. Also known as the client secret.
For information on accessing the application key from the Azure portal, see the Azure documentation.
- Managed Identity
- The destination connects from Azure Synapse to the staging area using a managed identity. You can use this authentication method when your storage account is attached to a VNet.
- Shared Access Signature (SAS)
- The destination connects from Azure Synapse to the staging area using a SAS token. The SAS token must be configured to allow all permissions and the HTTPS protocol.
You can create the SAS token using the Azure portal by selecting Shared Access Signature from Settings in the storage account menu. Or you can create the SAS token using the Azure CLI as described in the Azure documentation.
Copy and save the generated token so that you can use it to configure the destination.
The minimum required permissions are Read and List.
- Storage Account Key
- The destination connects from Azure Synapse to the staging area using the following information:
- Account Shared Key - Shared access key that Azure generated for the storage account.
For more information on accessing the shared access key from the Azure portal, see the Azure documentation.
For example, you might configure the destination to use the Managed Identity authentication method for copy statement authentication, with a staging connection that connects to the my_files file system in the my_storage Azure Data Lake Storage Gen2 storage account using Storage Account Key authentication. If you did not configure copy statement authentication, the destination would use the specified storage account key for copy statement authentication.
Specifying Tables
You can use the Azure Synapse SQL destination to write to one or more tables in a single existing schema.
- Single table
- To write data to a single table, enter the table name as follows:
<table_name>
- Multiple tables
- To write data to multiple tables, specify a field in the record that defines the tables.
Use the Table property on the Table Definition tab to specify the tables to write to.
Enabling Data Drift Handling
The Azure Synapse SQL destination can automatically compensate for changes in column or table requirements, also known as data drift.
- Create new columns
- The destination can create new columns in existing tables when new fields appear in records. For example, if a record suddenly includes a new Address2 field, the destination creates a new Address2 column in the target table.
- Create new tables
- When data drift handling is enabled, you can also configure the destination to create new tables as needed. For example, say the destination writes data to tables based on the region name in the Region field. When a new SW-3 region shows up in a record, the destination creates a new SW-3 table and writes the record to the new table.
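For example, a table that the destination creates for the new SW-3 region, with a partition column and partition boundary values configured, might resemble the following dedicated SQL pool DDL. The column names, distribution, and boundary values are illustrative:

```sql
CREATE TABLE dbo.[SW-3] (
    id        INT,
    amount    NUMERIC(38, 10),
    sale_date DATE
)
WITH (
    DISTRIBUTION = ROUND_ROBIN,
    -- Partition column and boundary values as configured in the destination.
    PARTITION (sale_date RANGE RIGHT FOR VALUES ('2023-01-01', '2024-01-01'))
);
```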
Row Generation
When loading a record to a table, the Azure Synapse SQL destination includes all record fields in the resulting row by default. The destination uses the root field, /, as the basis for the resulting row.
You can configure the Row Field property on the Data tab to specify a map or list-map field in the record as the basis for the row. The resulting record includes only the data from the specified map or list-map field and excludes all other record data. Use this functionality when the data that you want to load to Azure Synapse exists in a single map or list-map field within the record.
If you want to use the root field, but do not want to include all fields in the resulting row, you can configure the destination to ignore all specified first-level fields.
For example, suppose you configure the destination to use the root field, /, as the basis for the resulting row, and a record contains the following fields:
{
    "name": "Jane Smith",
    "id": "557",
    "address": {
        "street": "101 3rd St",
        "city": "Huntsville",
        "state": "NC",
        "zipcode": "27023"
    }
}
The destination treats the address
map field as a field with an invalid
data type, ignoring the field by default and processing the remaining record data. You
can configure the destination to treat fields with invalid data types as an error
record, as described in Missing Fields and Fields with Invalid Types.
Missing Fields and Fields with Invalid Types
By default, the destination ignores missing fields or fields with invalid data types, replacing the data in the field with a null value.
The default for each data type is an empty string, which represents a null value in Azure Synapse. You can specify a different default value to use for each data type on the Data tab. For example, you might define the default value for a missing Varchar field or a Varchar field with an invalid data type as none or not_applicable.
You can configure the destination to treat records with missing fields or with invalid data types in fields as error records. To do so, clear the Ignore Missing Fields and the Ignore Fields with Invalid Types properties on the Data tab.
Performance Optimization
Use the following tips to optimize for performance and cost-effectiveness when using the Azure Synapse SQL destination:
- Increase the batch size
- The maximum batch size is determined by the origin in the pipeline and typically has a default value of 1,000 records. To take advantage of the loading abilities that Azure Synapse provides, increase the maximum batch size in the pipeline origin to 20,000-50,000 records. Be sure to increase the Data Collector Java heap size as needed. For more information, see Java Heap Size in the Data Collector documentation.
- Use multiple threads
- You can use multiple threads to improve performance when you include a multithreaded origin in the pipeline. When Data Collector resources allow, using multiple threads enables processing multiple batches of data concurrently.
- Enable additional connections to Azure Synapse
- When loading data into multiple tables, increase the number of connections that the destination makes to Azure Synapse. Each additional connection allows the destination to load data into an additional table, concurrently.
Azure Synapse Data Types
The Azure Synapse SQL destination converts Data Collector data types into Azure Synapse data types before loading data into Azure.
When you configure the destination to compensate for data drift, you can also configure the destination to create all new columns as Varchar. However, by default, the destination converts record data to the appropriate data type.
Data Collector Data Type | Azure Synapse Data Type
---|---
Boolean | Bigint, Bit, Char, Decimal, Float, Int, Nchar, Numeric, Nvarchar, Real, Smallint, Tinyint, or Varchar. New columns are created as Bit.
Byte | Bigint, Char, Decimal, Float, Int, Nchar, Numeric, Nvarchar, Real, Smallint, Tinyint, or Varchar. New columns are created as Smallint.
Byte_Array | Binary, Char, Nchar, Nvarchar, Varbinary, or Varchar. New columns are created as Binary.
Date | Char, Date, Nchar, Nvarchar, or Varchar. New columns are created as Date.
Datetime | Char, Date, Datetime, Datetime2, Nchar, Nvarchar, Smalldatetime, Time, or Varchar. New columns are created as Datetime.
Decimal | Char, Decimal, Money, Nchar, Numeric, Nvarchar, Smallmoney, or Varchar. New columns are created as Numeric.
Double | Char, Decimal, Float, Money, Nchar, Numeric, Nvarchar, Smallmoney, or Varchar. New columns are created as Float.
Float | Char, Decimal, Float, Money, Nchar, Numeric, Nvarchar, Real, Smallmoney, or Varchar. Note: Converting to Decimal or Numeric can result in some precision loss. New columns are created as Real.
Integer | Bigint, Char, Decimal, Float, Int, Money, Nchar, Numeric, Nvarchar, Real, Smallmoney, or Varchar. New columns are created as Int.
Long | Bigint, Char, Decimal, Float, Money, Nchar, Numeric, Nvarchar, Smallmoney, or Varchar. New columns are created as Bigint.
Short | Bigint, Char, Decimal, Float, Int, Money, Nchar, Numeric, Nvarchar, Real, Smallint, Smallmoney, Tinyint, or Varchar. New columns are created as Smallint.
String | Char, Datetimeoffset, Nchar, Nvarchar, or Varchar. New columns are created as Varchar.
Time | Char, Nchar, Nvarchar, Time, or Varchar. New columns are created as Time.
Zoned_Datetime | Char, Nchar, Nvarchar, or Varchar.
The destination does not support the following Azure Synapse data types:
- Geography
- Geometry
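When data drift handling adds columns, the default mappings above translate to statements along these lines. The table and column names are hypothetical, and the Varchar length shown is illustrative:

```sql
-- New Boolean field -> Bit column (or Varchar, if the destination is
-- configured to create all new columns as Varchar).
ALTER TABLE dbo.sales ADD is_active BIT;

-- New Integer field -> Int column.
ALTER TABLE dbo.sales ADD quantity INT;

-- New String field -> Varchar column.
ALTER TABLE dbo.sales ADD note VARCHAR(8000);
```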
CRUD Operation Processing
The Azure Synapse SQL destination can insert, update, upsert, or delete data when you configure the destination to process CDC data. When processing CDC data, the destination uses the MERGE command to write data to Azure Synapse.
The destination uses the CRUD operation specified in the sdc.operation.type record header attribute. The destination performs operations based on the following numeric values:
- 1 for INSERT
- 2 for DELETE
- 3 for UPDATE
- 4 for UPSERT
If your pipeline has a CRUD-enabled origin that processes changed data, the destination simply reads the operation type from the sdc.operation.type header attribute that the origin generates. If your pipeline has a non-CDC origin, you can use the Expression Evaluator processor or a scripting processor to define the record header attribute. For more information about Data Collector changed data processing and a list of CDC-enabled origins, see Processing Changed Data.
Configuring an Azure Synapse SQL Destination
Configure an Azure Synapse SQL destination to load data into Azure Synapse. Be sure to complete the necessary prerequisites before you configure the destination.