Amazon Redshift
The Amazon Redshift destination writes data to an Amazon Redshift table. Use the destination in Databricks, Dataproc, or EMR pipelines.
The Amazon Redshift destination stages data on Amazon S3 before writing it to Redshift. When you configure the destination, you specify the staging location on S3. You can optionally enable server-side encryption and have the destination delete objects from S3 after writing them to Redshift. You can also use a connection to configure the destination.
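For illustration, a minimal sketch of this stage-then-load pattern follows, assuming a psycopg2 connection; the cluster endpoint, staging path, table name, and IAM role ARN are placeholders, and the SQL that the destination actually generates may differ.

```python
# Minimal sketch of the stage-then-load pattern, for illustration only.
# All connection details, paths, and names below are placeholders.
import psycopg2

conn = psycopg2.connect(
    host="my-cluster.xxxx.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="dev", user="transformer_user", password="...",
)
with conn, conn.cursor() as cur:
    # Data has already been written to the S3 staging location;
    # COPY loads the staged objects into the Redshift table.
    cur.execute("""
        COPY public.sales
        FROM 's3://my-staging-bucket/transformer/run-42/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
        FORMAT AS CSV;
    """)
```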
You define the Amazon Redshift endpoint, schema, and table to write to. You can optionally define fields to partition by.
You specify the write mode to use: insert, merge, or delete. You can use compound keys for merges and deletes. When merging data, you can specify a distribution key to improve performance. When deleting data, you can specify the action to take for records with no match in the table.
You can optionally have the destination truncate the table before writing to it. You can also have the destination create the table. When creating the table, you specify the Redshift distribution style to use, and you specify the default length and any custom lengths that you want to use for Varchar fields.
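As a hypothetical illustration of those table-creation options, the sketch below shows Redshift DDL with a KEY distribution style, a default Varchar length, and one custom Varchar length; the table, column, and key names are made up, and the destination generates its own DDL.

```python
# Hypothetical DDL showing a distribution style plus default and custom Varchar lengths.
# Table, column, and distribution key names are made up.
create_table_ddl = """
CREATE TABLE IF NOT EXISTS public.sales (
    sale_id     BIGINT,
    customer_id BIGINT,
    region      VARCHAR(256),   -- default Varchar length
    notes       VARCHAR(1024)   -- custom Varchar length for this field
)
DISTSTYLE KEY
DISTKEY (customer_id);
"""
print(create_table_ddl)
```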
You configure security credentials and the database user for the write. When using AWS access keys, you can have the destination automatically create the user. You can also configure advanced properties such as performance-related properties and proxy server properties.
Before using the Amazon Redshift destination, verify if you need to install a JDBC driver.
This destination has been tested with Amazon Redshift versions 5.13.0 and 5.29.0.
Installing a JDBC Driver
- Instance profile
- When using an instance profile to authenticate with Redshift, you do not need to install a JDBC driver. The Amazon Redshift destination uses a JDBC driver included with the destination.
- AWS access keys
- When using AWS access keys to authenticate with Redshift, you must install an Amazon Redshift JDBC driver.
After you install the driver, Transformer bundles the JDBC driver into the launched Spark application so that the driver is available on each node in the cluster.
Credentials and Writing to Redshift
You specify the credentials that the destination uses to authenticate with Amazon Redshift. The credentials that you use determine the additional information that you must provide and whether you need to install a JDBC driver. Configure credentials on the Credentials tab of the destination.
The destination can authenticate using the following credentials:
- Instance profile
- When the EC2 nodes in the EMR cluster have an associated instance profile, Transformer uses the instance profile credentials to automatically authenticate with AWS. The IAM policies attached to the instance profile must have permissions to write to Amazon S3 and to the Redshift cluster.
- AWS access keys
- You can authenticate using an AWS access key pair. When using an AWS access key pair, you specify the access key ID and secret access key to use. The access key pair must have permissions to write to Amazon S3 and to the Redshift cluster. When you use AWS access keys, you also specify the following details to enable writing to Redshift (see the sketch after this list):
- DB User - Database user that Transformer impersonates when writing to the database. The user must have write permission for the database table.
- Auto-Create DB User - Enables creating a database user to write data to Redshift.
- DB Groups - Comma-delimited list of existing database groups for the database user to join for the duration of the pipeline run. The specified groups must have write permission for the database table.
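The sketch below shows one way these settings map to temporary database credentials in AWS, using the Redshift GetClusterCredentials API through boto3; the cluster, database, user, and group names are placeholders, and this is not necessarily how the destination implements the behavior.

```python
# Hedged sketch: how DB User, Auto-Create DB User, and DB Groups line up with
# temporary Redshift credentials. All identifiers below are placeholders.
import boto3

redshift = boto3.client("redshift", region_name="us-east-1")
creds = redshift.get_cluster_credentials(
    ClusterIdentifier="my-redshift-cluster",
    DbName="dev",
    DbUser="transformer_user",    # DB User
    AutoCreate=True,              # Auto-Create DB User
    DbGroups=["etl_writers"],     # DB Groups joined for the session
    DurationSeconds=900,
)
# The returned DbUser and DbPassword could then back a JDBC connection.
print(creds["DbUser"])
```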
Server-Side Encryption
You can configure the destination to use Amazon Web Services server-side encryption (SSE) to protect data staged on Amazon S3. When configured for server-side encryption, the destination passes required server-side encryption configuration values to Amazon S3. Amazon S3 uses the values to encrypt the data as it is written to Amazon S3.
- Amazon S3-Managed Encryption Keys (SSE-S3)
- When you use server-side encryption with Amazon S3-managed keys, Amazon S3 manages the encryption keys for you.
- AWS KMS-Managed Encryption Keys (SSE-KMS)
- When you use server-side encryption with AWS Key Management Service (KMS), you specify the Amazon resource name (ARN) of the AWS KMS master encryption key that you want to use.
- Customer-Provided Encryption Keys (SSE-C)
- When you use server-side encryption with customer-provided keys, you specify the Base64 encoded 256-bit encryption key.
For more information about using server-side encryption to protect data in Amazon S3, see the Amazon S3 documentation.
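For context, the sketch below shows what the SSE-KMS option looks like at the Spark/S3A layer, using the standard Hadoop S3A encryption properties; the KMS key ARN is a placeholder, and the destination passes the required configuration values for you, so you do not set these by hand.

```python
# Hedged sketch of SSE-KMS at the Spark/S3A layer, for illustration only. The KMS key
# ARN is a placeholder. SSE-S3 corresponds to "AES256" with no key, and SSE-C to
# "SSE-C" with a Base64-encoded key supplied instead of a KMS ARN.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sse-kms-sketch").getOrCreate()
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.server-side-encryption-algorithm", "SSE-KMS")
hadoop_conf.set(
    "fs.s3a.server-side-encryption.key",
    "arn:aws:kms:us-east-1:123456789012:key/11111111-2222-3333-4444-555555555555",
)
```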
Write Mode
The write mode determines how the Amazon Redshift destination writes to Redshift.
- Insert
- Inserts all data to the table.
- Merge
- Inserts new data and updates existing data in the table. When merging data, you specify one or more primary key columns to uniquely identify each record to be written. If the table includes a stable distribution key, you can also specify the distribution key column, which can improve performance.
- Delete
- Deletes rows with matching key columns. When deleting data, you specify one or more primary key columns to uniquely identify each record to be deleted. You also specify the action to take for records with no matching row in the table. A sketch of the staging-table pattern behind merges and deletes follows this list.
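The sketch below shows the staging-table pattern commonly used to implement merges and deletes on Redshift; the destination generates its own SQL, and the connection details, table names, and key column are placeholders.

```python
# Hedged sketch of the staging-table pattern behind merge and delete, for illustration
# only. Connection details, table names, and the key column are placeholders.
import psycopg2

conn = psycopg2.connect(
    host="my-cluster.xxxx.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="dev", user="transformer_user", password="...",
)
with conn, conn.cursor() as cur:
    # Merge: remove rows whose key matches the staged batch, then insert the batch.
    cur.execute("""
        DELETE FROM public.sales
        USING public.sales_staging s
        WHERE public.sales.sale_id = s.sale_id;
    """)
    cur.execute("INSERT INTO public.sales SELECT * FROM public.sales_staging;")
    # Delete mode uses the same matching DELETE without the trailing INSERT.
```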
Partitioning
Spark runs a Transformer pipeline just as it runs any other application, splitting the data into partitions and performing operations on the partitions in parallel.
When the pipeline starts processing a new batch, Spark determines how to split pipeline data into initial partitions based on the origins in the pipeline. Spark uses these partitions for the rest of the pipeline processing, unless a processor causes Spark to shuffle the data.
When staging data on Amazon S3, Spark creates one object for each partition. When you configure the Amazon Redshift destination, you can specify fields to partition by. You can alternatively use a Repartition processor earlier in the pipeline to partition by fields or to specify the number of partitions that you want to use.
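A minimal PySpark sketch of this behavior follows; the column names, sample rows, and staging path are placeholders, and writing to an s3a:// path assumes the S3A connector and credentials are already configured.

```python
# Minimal sketch of partitioning by a field before a write, for illustration only.
# Column names, rows, and the output path are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-sketch").getOrCreate()
df = spark.createDataFrame(
    [(1, "us-east"), (2, "us-west"), (3, "us-east")],
    ["sale_id", "region"],
)

# Rows with the same value of the partition field land in the same partition;
# each partition is then staged as its own object.
df.repartition("region").write.mode("overwrite").parquet(
    "s3a://my-staging-bucket/transformer/run-42/"
)
```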
Configuring an Amazon Redshift Destination
Configure an Amazon Redshift destination to write data to an Amazon Redshift table. Before using the destination, verify if you need to install a JDBC driver.