Amazon S3
The Amazon S3 destination writes objects to Amazon S3. The destination writes data based on the specified data format and creates a separate object for every partition.
Before you run a pipeline that uses the Amazon S3 destination, make sure to complete the prerequisite tasks.
When you configure the Amazon S3 destination, you specify the authentication method to use. You can specify Amazon S3 server-side encryption for the data. You can also use a connection to configure the destination.
You specify the output location and write mode to use. You can configure the destination to group partition objects by field values. If you configure the destination to overwrite related partitions, you must configure Spark to overwrite partitions dynamically. You can also configure the destination to drop unrelated master records when using the destination as part of a slowly changing dimension pipeline.
You select the data format to write and configure related properties.
You can also configure advanced properties such as performance-related properties and proxy server properties.
Prerequisites
- Verify permissions
- The user associated with the authentication credentials in effect must have WRITE permission on the S3 bucket.
- Perform prerequisite tasks for local pipelines
- To connect to Amazon S3, Transformer uses connection information stored in a Hadoop configuration file. Before you run a local pipeline that connects to Amazon S3, complete the prerequisite tasks.
URI Scheme
You can use the s3 or s3a URI scheme when you specify the bucket to write to. The URI scheme determines the underlying client that the destination uses to write to Amazon S3.
While both URI schemes are supported for EMR clusters, Amazon recommends using the s3 URI scheme with EMR clusters for better performance, security, and reliability. For all other clusters, use the s3a URI scheme.
For more information, see the Amazon documentation.
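For example, the same hypothetical bucket and path written with each scheme (the bucket name sales-data is only an illustration):
```
s3://sales-data/orders     # s3 scheme, recommended for EMR clusters
s3a://sales-data/orders    # s3a scheme, for all other clusters
```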
Authentication Method
You can configure the Amazon S3 destination to authenticate with Amazon Web Services (AWS) using an instance profile or AWS access keys. When accessing a public bucket, you can connect anonymously using no authentication.
For more information about the authentication methods and details on how to configure each method, see Amazon Security.
Server-Side Encryption
You can configure the destination to use Amazon Web Services server-side encryption (SSE) to protect data written to Amazon S3. When configured for server-side encryption, the destination passes required server-side encryption configuration values to Amazon S3. Amazon S3 uses the values to encrypt the data as it is written to Amazon S3.
- Amazon S3-Managed Encryption Keys (SSE-S3)
- When you use server-side encryption with Amazon S3-managed keys, Amazon S3 manages the encryption keys for you.
- AWS KMS-Managed Encryption Keys (SSE-KMS)
- When you use server-side encryption with AWS Key Management Service (KMS), you specify the Amazon Resource Name (ARN) of the AWS KMS master encryption key that you want to use.
- Customer-Provided Encryption Keys (SSE-C)
- When you use server-side encryption with customer-provided keys, you specify the Base64-encoded 256-bit encryption key (see the sketch below).
For more information about using server-side encryption to protect data in Amazon S3, see the Amazon S3 documentation.
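For example, a Base64-encoded 256-bit key of the kind SSE-C expects can be produced as follows. This is only a minimal sketch of one way to generate such a key, not a requirement of the destination:
```python
# Minimal sketch: generate a random 256-bit key and Base64 encode it
# for use as an SSE-C customer-provided key.
import base64
import os

raw_key = os.urandom(32)  # 32 bytes = 256 bits
encoded_key = base64.b64encode(raw_key).decode("ascii")
print(encoded_key)        # supply this value as the encryption key
```
If you use SSE-C, keep the key available, because you must provide the same key to read the objects back.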
Write Mode
The write mode determines how the Amazon S3 destination writes objects to Amazon S3. The names of the written objects are based on the selected data format.
- Overwrite files
- Before writing any data from a batch, the destination removes all objects from the specified bucket and path.
- Overwrite related partitions
- Before writing data from a batch, the destination removes the objects from subfolders for which a batch has data. The destination leaves a subfolder intact if the batch has no data for that subfolder.
- Write new files to new directory
- When the pipeline starts, the destination writes objects in the specified bucket and path. Before writing data for each subsequent batch during the pipeline run, the destination removes any objects from the bucket and path. The destination generates an error if the specified bucket and path contains objects when you start the pipeline.
- Write new or append to existing files
- The destination creates new objects if they do not exist or appends data to an existing object if the object exists in the same bucket and path.
Spark Requirement to Overwrite Related Partitions
If you set Write Mode to Overwrite Related Partitions, you must configure Spark to overwrite partitions dynamically.
Set the Spark configuration property spark.sql.sources.partitionOverwriteMode to dynamic.
You can configure the property in Spark, or you can configure the property in individual pipelines. For example, you might set the property to dynamic in Spark when you plan to enable the Overwrite Related Partitions mode in most of your pipelines, and then set the property to static in individual pipelines that do not use that mode.
To configure the property in an individual pipeline, add an extra Spark configuration property on the Cluster tab of the pipeline properties.
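In plain Spark, the same property can be set on the SparkSession. The following PySpark snippet only illustrates the Spark-level setting; in a pipeline, you add the same key and value as an extra Spark configuration property instead:
```python
# Illustration only: setting the partition overwrite mode in a standalone
# Spark application. In a Transformer pipeline, add the key/value pair
# spark.sql.sources.partitionOverwriteMode=dynamic on the Cluster tab.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("dynamic-partition-overwrite")
    .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
    .getOrCreate()
)
```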
Partition Objects
Pipelines process data in partitions. The Amazon S3 destination writes objects, which contain the processed data, in the configured bucket and path. The destination writes one object for each partition.
The destination groups the partition objects if you specify one or more fields in the Partition by Fields property. For each unique value of the fields specified in the Partition by Fields property, the destination creates a folder. In each folder, the destination writes one object for each partition that has the corresponding field and value. Because the folder name includes the field name and value, the object omits that data. With grouped partition objects, you can more easily find data with certain values, such as all the data for a particular city.
To overwrite folders that have updated data and leave other folders intact, set the Write Mode property to Overwrite Related Partitions. Then, the destination clears affected folders before writing, replacing objects in those folders with new objects.
If the Partition by Fields property lists no fields, the destination does not group partition objects and writes one object for each partition directly in the configured bucket and path.
Because the text data format only contains data from one field, the destination does not group partition objects for the text data format. Do not configure the Partition by Fields property if the destination writes in the text data format.
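The grouping behavior parallels Spark's DataFrameWriter.partitionBy. The following PySpark sketch shows the analogous Spark call; the bucket name, field, and sample data are hypothetical, and the job assumes an s3a connector configured with write access:
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-by-fields-sketch").getOrCreate()

# Hypothetical orders data with a City field.
df = spark.createDataFrame(
    [("o-1", "Oakland", 12.50), ("o-2", "Fremont", 8.00)],
    ["order_id", "City", "total"],
)

# Spark creates one City=<value> folder per unique value and omits the
# City column from the objects written inside each folder.
df.write.mode("overwrite").partitionBy("City").parquet("s3a://sales-data/orders")
```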
Example: Grouping Partition Objects
Suppose your pipeline processes orders. You want to write data in the Sales bucket and group the data by cities. Therefore, in the Bucket and Path property, you enter Sales, and in the Partition by Fields property, you enter City. Your pipeline is a batch pipeline that only processes one batch. You want to overwrite the entire Sales bucket each time you run the pipeline. Therefore, in the Write Mode property, you select Overwrite Files.
As the pipeline processes the batch, the origins and processors lead Spark to split the data into three partitions. The batch contains two values in the City field: Oakland and Fremont. Before writing the processed data in the Sales bucket, the destination removes any existing objects and folders and then creates two folders, City=Oakland and City=Fremont. In each folder, the destination writes one object for each partition that contains data for that city, as shown below:
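The resulting layout looks roughly like the following; the object names are placeholders, and the sketch assumes the second partition contains data for both cities:
```
Sales/
├── City=Oakland/
│   ├── object written from partition 1
│   └── object written from partition 2
└── City=Fremont/
    ├── object written from partition 2
    └── object written from partition 3
```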

Note that the written objects do not include the City field; instead, you infer the city from the folder name. The first partition does not include any data for Fremont; therefore, the City=Fremont folder does not contain an object from the first partition. Similarly, the third partition does not include any data for Oakland; therefore, the City=Oakland folder does not contain an object from the third partition.
Data Formats
The Amazon S3 destination writes records based on the specified data format.
- Avro
- The destination writes an object for each partition and includes the Avro schema in each object.
- Delimited
- The destination writes an object for each partition. It creates a header line for each file and uses \n as the newline character. You can specify a custom delimiter, quote, and escape character to use in the data.
- JSON
- The destination writes an object for each partition and writes each record on a separate line, as in the sample after this list. For more information, see the JSON Lines website.
- ORC
- The destination writes an object for each partition.
- Parquet
- The destination writes an object for each partition and includes the Parquet schema in every object.
- Text
- The destination writes an object for every partition and uses \n as the newline character.
- XML
- The destination writes an object for every partition. You specify the root and row tags to use in output files.
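For example, an object written in the JSON data format holds one record per line, along these lines (the field names and values are hypothetical):
```
{"order_id": "o-1", "city": "Oakland", "total": 12.5}
{"order_id": "o-2", "city": "Fremont", "total": 8.0}
```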