Snowflake Bulk
The Snowflake Bulk origin reads all available data from one or more tables in a Snowflake database, then stops the pipeline. This design enables processing Snowflake data without incurring the costs of continuously connecting to Snowflake with streaming pipelines.
You can use the Snowflake Bulk origin to read from any Snowflake database hosted on Amazon S3, Google Cloud Storage, or Microsoft Azure, as well as from private Snowflake installations. For information about supported versions, see Supported Systems and Versions.
During the read process, the Snowflake Bulk origin sends a command to Snowflake to stage data as CSV files on either an internal Snowflake stage or a hosted external stage. Then the Snowflake Bulk origin downloads and processes those CSV files. The Snowflake Bulk origin can use multiple threads to process the files in parallel.
When you configure the Snowflake Bulk origin, you specify the Snowflake region, account, and connection information. You define the number of threads to use to connect to Snowflake and process data. You can specify an organization name to use and define additional Snowflake connection properties, as needed. You can also use a connection to configure the origin.
You specify the tables to read and optional WHERE clauses for each table. You configure staging location details and the data format of the stage files.
Prerequisites
Create a Snowflake Stage
Before using the Snowflake Bulk origin in a pipeline, you must create a Snowflake internal or external stage.
During the read process, the Snowflake Bulk origin sends a command to Snowflake to stage data as CSV files on either an internal Snowflake stage or a hosted external stage. Then the Snowflake Bulk origin downloads and processes those CSV files.
To use an external stage, create the external stage with the cloud service provider that hosts your Snowflake warehouse.
- Snowflake internal stage
- You can stage data in Snowflake internal user stages or named stages. Do not use internal table stages.
- Amazon S3 external stage
- To stage data in an Amazon S3 external stage, create a Snowflake external stage in a bucket in the same S3 region that hosts your Snowflake virtual warehouse. For example, if your Snowflake warehouse is in AWS US West, then create the Snowflake external stage in a bucket in the AWS US West region.
- Google Cloud Storage external stage
- To stage data in a Google Cloud Storage external stage, create a Snowflake storage integration in Google Cloud Storage. This is a multistep process described in the Snowflake documentation that ends with creating a Snowflake external stage. Be sure to complete all required steps.
- Microsoft Azure external stage
- To stage data in a Microsoft Azure external stage, complete the following tasks:
- Configure Snowflake authentication for the Microsoft Azure Blob Storage container that you want to use.
You can use a SAS token or an Azure account name and key for authentication. For information about configuring SAS token authentication, see the Snowflake documentation.
- Create a Snowflake external stage in the container.
When you create a Snowflake external stage, you specify a URL that defines the name and location for the stage. Include a trailing slash in the URL to ensure that Snowflake loads all staged data. You might also include a prefix in the stage name to indicate that the external stage is for Data Collector.
For example, the following URL creates an external stage named sdc-externalstage in azure://myaccount.blob.core.windows.net/mycontainer/load/ and loads all staged data to Snowflake: azure://myaccount.blob.core.windows.net/mycontainer/load/sdc-externalstage/
You can create an Azure stage using the Snowflake web interface or SQL. For more information, see Creating an Azure Stage in the Snowflake documentation. Sample statements for creating stages appear after this list.
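The following Snowflake SQL is a minimal sketch of creating stages for these scenarios. The stage, integration, bucket, and account names and the SAS token value are illustrative assumptions; adapt them to your environment and see the Snowflake documentation for the complete syntax and options.

```sql
-- Named internal stage (stage name is illustrative)
CREATE STAGE sdc_internalstage;

-- Google Cloud Storage: storage integration plus external stage
-- (integration name, bucket, and path are illustrative)
CREATE STORAGE INTEGRATION gcs_int
  TYPE = EXTERNAL_STAGE
  STORAGE_PROVIDER = 'GCS'
  ENABLED = TRUE
  STORAGE_ALLOWED_LOCATIONS = ('gcs://mybucket/load/');

CREATE STAGE gcs_externalstage
  URL = 'gcs://mybucket/load/'
  STORAGE_INTEGRATION = gcs_int;

-- Microsoft Azure external stage with SAS token authentication,
-- using the URL from the example above; note the trailing slash
CREATE STAGE "sdc-externalstage"
  URL = 'azure://myaccount.blob.core.windows.net/mycontainer/load/sdc-externalstage/'
  CREDENTIALS = (AZURE_SAS_TOKEN = '<sas-token>');
```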
AWS Credentials
When the Snowflake Bulk origin reads data staged on Amazon S3, it must pass credentials to Amazon Web Services.
- Instance profile
- When the execution Data Collector runs on an Amazon EC2 instance that has an associated instance profile, Data Collector uses the instance profile credentials to automatically authenticate with AWS.
- AWS access key pair
- When the execution Data Collector does not run on an Amazon EC2 instance or when the EC2 instance doesn't have an instance profile, you can connect with an AWS access key pair.
To connect with an AWS access key pair, on the Staging tab, set the Authentication Method property to AWS Keys. Then define the access key ID and secret access key.
Tip: To secure sensitive information such as access key pairs, you can use runtime resources or credential stores.
Google Cloud Storage Credentials
Before processing data staged on Google Cloud Storage, the Snowflake Bulk origin must pass credentials to Google Cloud. You can provide credentials using one of the following options:
- Google Cloud default credentials
- Credentials in a file
- Credentials in a stage property
For details on how to configure each option, see Security in Google Cloud Stages.
Assign Privileges
The Snowflake Bulk origin requires a Snowflake role that grants all privileges needed to process data.
| Object Type | Privilege |
| --- | --- |
| External stage or internal Snowflake stage | READ, WRITE |
| Table | SELECT |
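For example, the required grants might be applied with statements along the following lines. This is a sketch under assumed names (the role, database, schema, table, and stage are placeholders); depending on your environment, the role typically also needs USAGE on the warehouse, database, and schema, which is not listed in the table above.

```sql
-- Placeholder names; adapt to your environment
CREATE ROLE IF NOT EXISTS sdc_bulk_role;

-- Privileges listed in the table above
GRANT SELECT ON TABLE mydb.public.mytable TO ROLE sdc_bulk_role;
GRANT READ, WRITE ON STAGE mydb.public.sdc_internalstage TO ROLE sdc_bulk_role;

-- Commonly needed in addition (assumption, not part of the table above)
GRANT USAGE ON WAREHOUSE mywh TO ROLE sdc_bulk_role;
GRANT USAGE ON DATABASE mydb TO ROLE sdc_bulk_role;
GRANT USAGE ON SCHEMA mydb.public TO ROLE sdc_bulk_role;
```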
Batch Processing
Unlike most Data Collector origins, the Snowflake Bulk origin performs batch processing only. After processing available data, the origin stops the pipeline rather than waiting for additional data, as with streaming pipelines. This design enables processing Snowflake data without incurring the costs of continuously connecting to Snowflake with streaming pipelines.
Define a Role
The Snowflake Bulk origin requires a Snowflake role that can read from Snowflake.
Before configuring the origin, ensure that you have granted the required privileges to a Snowflake role, as explained in Prerequisites. You can specify the role to use in the following ways:
- Assign the custom role as the default role
- In Snowflake, assign the custom role as the default role for the Snowflake user account specified in the stage. A Snowflake user account is associated with a single default role.
- Override the default role with the custom role
- In the stage, use the Role property to specify the name of the custom role. The custom role overrides the role assigned to the Snowflake user account specified in the stage.
- Use a Snowflake connection
- When working with Control Hub, you can configure a Snowflake connection to provide connection details for Snowflake stages.
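For the first option, assigning the custom role as the default role for the Snowflake user account might look like the following sketch; the user and role names are placeholders.

```sql
-- Grant the custom role to the user, then make it the user's default role
GRANT ROLE sdc_bulk_role TO USER sdc_user;
ALTER USER sdc_user SET DEFAULT_ROLE = sdc_bulk_role;
```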
Multithreaded Processing
The Snowflake Bulk origin performs parallel processing and enables the creation of a multithreaded pipeline.
When you start the pipeline, the Snowflake Bulk origin determines the tables to read based on the specified table configuration. The origin then uses multiple concurrent threads for processing based on the Connection Pool Size property on the Snowflake Connection Info tab.
As the pipeline runs, each thread processes one table: first copying data to the specified staging area, then downloading the data from the staging area. Each thread then creates batches based on the maximum batch size configured for the origin, and passes them to pipeline runners.
A pipeline runner is a sourceless pipeline instance - an instance of the pipeline that includes all of the processors, executors, and destinations in the pipeline and handles all pipeline processing after the origin. Each pipeline runner processes one batch at a time, just like a pipeline that runs on a single thread.
Multithreaded pipelines preserve the order of records within each batch, just like a single-threaded pipeline. But since batches are processed by different pipeline runners, the order that batches are written to destinations is not ensured.
For more information about multithreaded pipelines, see Multithreaded Pipeline Overview.
Table Configurations
When you configure the Snowflake Bulk origin, you specify the tables to read. You can optionally specify WHERE clauses to limit the records that the origin processes.
- One table
- To process a single table, specify the table name in the Table Name Pattern property.
- Multiple tables
- To process multiple tables, configure one or more sets of the following properties:
- Table Name Pattern - A single table name or a table name pattern that uses SQL LIKE syntax and evaluates to a set of tables.
- Exclusion Pattern - An optional name pattern that evaluates to table names to exclude from processing. Use it to exclude a subset of the tables defined by the Table Name Pattern property.
Define the property with a Java-based regular expression. For more information about using regular expressions, see Regular Expressions Overview.
- Where Clause - An optional WHERE clause that limits the records processed from a specified table or table name pattern. When used with a table name pattern, the WHERE clause is applied to every table associated with the pattern.
Use Snowflake SQL to define the WHERE clause, and omit the WHERE keyword from the specified clause. For example, to use a WHERE clause to read sales data and omit returns, you might enter the following in the Where Clause property: total > 0.
For information about WHERE clause syntax, see the Snowflake documentation.
Example
Say you have a set of marketing tables that all have a -Marketing suffix for the name. You want to process all data in all of the tables except the AU-Marketing table. And you want to include data from the Marketing-EU table where Feedback is not null.
To do this, you define two sets of table configuration properties as follows:
First set:
- Table Name Pattern: %-Marketing
- Exclusion Pattern: AU-Marketing

Second set:
- Table Name Pattern: Marketing-EU
- Where Clause: Feedback IS NOT NULL
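The following Snowflake SQL approximates how these two sets resolve. It is only a sketch: the origin does not necessarily issue these exact queries, the Table Name Pattern uses SQL LIKE syntax, and the Exclusion Pattern is a Java regular expression, approximated here with Snowflake's RLIKE.

```sql
-- First set: tables matching the name pattern, minus the exclusion pattern
SELECT table_name
FROM information_schema.tables
WHERE table_name LIKE '%-Marketing'         -- Table Name Pattern (SQL LIKE)
  AND table_name NOT RLIKE 'AU-Marketing';  -- Exclusion Pattern (Java regex, approximated)

-- Second set: records read from the named table with the WHERE clause applied
SELECT *
FROM "Marketing-EU"
WHERE Feedback IS NOT NULL;                 -- Where Clause property; the origin adds the WHERE keyword
```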
Error Handling
You can configure Snowflake error handling in the Snowflake Bulk origin. The error handling properties determine how the on_error option is defined in the Snowflake SQL query.
- Default - Does not set the on_error option. When not explicitly set, the Snowflake default on_error behavior is Abort Statement.
- Continue - Ignores errors and continues processing.
- Skip File - When encountering errors, skips reading the batch. When you use this option, you also configure a Skip File On Error property to specify when to skip the file:
- First - After discovering the first error record.
- Number - After discovering the specified number of error records in the batch.
- Percentage - After discovering the specified percentage of error records in the batch.
- Abort Statement - Skips reading the batch upon the first error.
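These options correspond to the values that Snowflake accepts for the ON_ERROR copy option. The following statement is only an illustration of that Snowflake syntax, with placeholder table and stage names; it is not the exact command that the origin issues.

```sql
-- ON_ERROR accepts CONTINUE, SKIP_FILE, SKIP_FILE_<num>, 'SKIP_FILE_<num>%', or ABORT_STATEMENT
COPY INTO mytable
  FROM @sdc_externalstage
  FILE_FORMAT = (TYPE = CSV)
  ON_ERROR = SKIP_FILE_10;   -- skip a file after 10 error records
```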
Record Header Attributes
The Snowflake Bulk origin includes basic Snowflake information in the record header attributes of the generated record.
- database - The database that contains the table that was processed.
- schema - The schema that contains the table that was processed.
- table - The table that contains the record that was processed.
Snowflake Data Types
The Snowflake Bulk origin converts Snowflake data types to Data Collector data types. The origin supports the following data types:
| Snowflake Data Type | Data Collector Data Type |
| --- | --- |
| Array | List |
| Bigint | Long |
| Binary | Byte Array |
| Boolean | Boolean |
| Byteint | Long |
| Char | String |
| Date | Date |
| Datetime | Datetime |
| Decimal | Long |
| Double | Double |
| Double Precision | Double |
| Float | Double |
| Geography | String |
| Int | Long |
| Integer | Long |
| Number | Long |
| Number(<precision>, 0) | Long |
| Number(<precision>, <scale>) | Long |
| Nchar | String |
| Nvarchar2 | String |
| Real | Double |
| Smallint | Long |
| Time | Time |
| Timestamp_LTZ | Zoned Datetime |
| Timestamp_NTZ | Datetime |
| Timestamp_TZ | Zoned Datetime |
| Tinyint | Long |
| Varchar | String |
| Variant | Conversion occurs based on content |
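As an illustration of the mapping, consider the following hypothetical table; the comments note the Data Collector type produced for each column according to the table above. The table and column names are placeholders.

```sql
-- Hypothetical table; comments show the resulting Data Collector type per the mapping above
CREATE TABLE mydb.public.orders (
  order_id   NUMBER(38,0),   -- Number(<precision>, 0) -> Long
  amount     DOUBLE,         -- Double -> Double
  comment    VARCHAR(255),   -- Varchar -> String
  created_at TIMESTAMP_NTZ,  -- Timestamp_NTZ -> Datetime
  updated_at TIMESTAMP_TZ,   -- Timestamp_TZ -> Zoned Datetime
  details    VARIANT         -- Variant -> converted based on content
);
```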
Configuring a Snowflake Bulk Origin
Configure a Snowflake Bulk origin to read all available data from Snowflake tables, and then stop the pipeline. Before you use the origin, complete the prerequisite tasks.