Amazon S3
The Amazon S3 destination writes data to Amazon S3. For information about supported versions, see Supported Systems and Versions in the Data Collector documentation.
To write data to an Amazon Kinesis Firehose delivery stream, use the Kinesis Firehose destination. To write data to Amazon Kinesis Streams, use the Kinesis Producer destination.
With the Amazon S3 destination, you configure the region, bucket, and common prefix to define where to write objects. You can use a partition prefix to specify the S3 partition to write to. You can configure a prefix and suffix for the object name, and a time basis and data time zone for the stage. You can also configure the destination to add tags to the Amazon S3 objects that it creates.
You configure the authentication method that the destination uses to connect to Amazon S3.
The Amazon S3 destination can write data asynchronously to improve performance when writing to multiple prefixes. You can configure advanced properties to tune performance.
You can configure the destination to use Amazon Web Services server-side encryption to protect the data written to Amazon S3. You can also use a proxy user and compress data with gzip when writing to Amazon S3.
The Amazon S3 destination creates an object for each batch of data written to Amazon S3.
You can also use a connection to configure the destination.
The destination can generate events for an event stream. For more information about the event framework, see Dataflow Triggers Overview.
Authentication Method
You can configure the Amazon S3 destination to authenticate with Amazon Web Services (AWS) using an instance profile or AWS access keys. When accessing a public bucket, you can connect anonymously using no authentication.
For more information about the authentication methods and details on how to configure each method, see Security in Amazon Stages.
Bucket
When you configure the bucket where records should be written, you can specify an exact bucket name or you can use an expression that evaluates to a bucket name.
For example, to write to buckets based on data in the Type field, you can use the following expression to define the bucket: ${record:value('/Type')}
With this expression, the destination writes records to buckets based on the data in the Type field. If an expression evaluates to a bucket that does not exist, the destination handles the record based on the error handling configured for the stage.
If you use datetime variables in the expression, be sure to configure the time basis for the stage.
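For illustration only, the following Python sketch mimics how a bucket expression such as ${record:value('/Type')} resolves per record, grouping a batch by the expression result before writing. It is not how the destination is implemented; the boto3 client, bucket names, object key, and record contents are assumptions for the example.

```python
# Illustrative sketch only: group records by the value of the Type field,
# mimicking what the bucket expression ${record:value('/Type')} resolves to.
# Bucket names, object key, and record contents are hypothetical.
import json
from collections import defaultdict

import boto3

records = [
    {"Type": "orders", "id": 1},
    {"Type": "returns", "id": 2},
    {"Type": "orders", "id": 3},
]

# Each distinct expression result becomes the target bucket for those records.
batches = defaultdict(list)
for record in records:
    batches[record["Type"]].append(record)

s3 = boto3.client("s3")
for bucket, batch in batches.items():
    body = "\n".join(json.dumps(r) for r in batch).encode("utf-8")
    s3.put_object(Bucket=bucket, Key="sdc-example-object.json", Body=body)
```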
Partition Prefix
You can use a partition prefix to organize objects by partitions. You can use the partition prefix to write to existing partitions or to create new partitions as needed. When a partition specified in the partition prefix does not exist, the destination creates the partition.
You can specify an exact partition name for the partition prefix, or you can use an expression that evaluates to a partition name.
For example, to write to partitions based on data in the Country field, you can use the following expression as the partition prefix: ${record:value('/Country')}
With this expression, the destination writes records to partitions based on the country data in the record, and creates partitions for countries that do not already have a partition.
If you use datetime variables in the expression, be sure to configure the time basis for the stage. You might also need to configure the Data Time Zone property.
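As an illustration, the following sketch shows how a partition prefix resolved from ${record:value('/Country')} combines with the common prefix to form an S3 object key. The bucket name, common prefix, and record values are hypothetical, and boto3 is used only to show the resulting request.

```python
# Illustrative sketch only: a partition prefix resolved from
# ${record:value('/Country')} is appended to the common prefix to form the
# object key. Bucket, prefix, and record values are hypothetical.
import boto3

common_prefix = "sales"
record = {"Country": "US", "amount": 100}

partition_prefix = record["Country"]  # result of the partition prefix expression
key = f"{common_prefix}/{partition_prefix}/sdc-1462405014177-1"

s3 = boto3.client("s3")
s3.put_object(Bucket="my-bucket", Key=key, Body=b'{"Country": "US", "amount": 100}')
# Writing a key under a new prefix effectively creates the "partition";
# S3 does not require a separate create step.
```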
Time Basis and Data Time Zone for Time-Based Buckets and Partition Prefixes
The time basis and the data time zone determine the time that the Amazon S3 destination uses to write records to a time-based bucket or partition prefix. When the configured bucket or partition prefix does not include time-based functions, you can ignore the time basis property.
A bucket or partition prefix has a time component when it includes datetime variables, such as ${YYYY()} or ${DD()}, or when it includes an expression that evaluates to a datetime value, such as ${record:valueOrDefault("/Timestamp")}. For details about datetime variables, see Datetime Variables.
- Processing Time
- When you use processing time as the time basis, the destination performs writes based on the processing time and the configured bucket and partition prefix. The processing time is the time associated with the Data Collector running the pipeline by default. You can specify a different time zone by configuring the Data Time Zone property. To use the processing time as the time basis, use the following expression: ${time:now()}. This is the default time basis.
- Record Time
- When you use the time associated with a record as the time basis, you specify a date field in the record. The destination writes data based on the datetimes associated with the records, adjusting for the value specified for the Data Time Zone property.
For example, say you define the partition prefix as follows: logs-${YYYY()}-${MM()}-${DD()}
If you use the time of processing as the time basis, the destination writes records to partitions based on when it processes each record. If you use the time associated with the data, such as a transaction timestamp, then the destination writes records to the partitions based on that timestamp. If a partition does not exist, the destination creates the needed partition.
Similarly, say you define the bucket as follows: ${YYYY()}-${MM()}
If you use the time of processing as the time basis, the destination writes records to buckets based on when it processes each record. If you use the time associated with the data, such as a transaction timestamp, then the destination writes records to the buckets based on that timestamp. If a bucket does not exist, the destination handles the record based on the error record handling configured for the stage.
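For illustration, the following sketch resolves the partition prefix logs-${YYYY()}-${MM()}-${DD()} using either the processing time or a datetime field from the record, adjusted to a data time zone. The field name, timestamp value, and time zone are assumptions for the example.

```python
# Illustrative sketch only: resolve logs-${YYYY()}-${MM()}-${DD()} from either
# the processing time or a record timestamp, adjusted to a data time zone.
# The Timestamp field, its value, and the time zone are hypothetical.
from datetime import datetime
from zoneinfo import ZoneInfo

DATA_TIME_ZONE = ZoneInfo("America/Los_Angeles")

def resolve_prefix(dt: datetime) -> str:
    return f"logs-{dt:%Y}-{dt:%m}-{dt:%d}"

# Processing time basis: the moment the pipeline processes the batch.
print(resolve_prefix(datetime.now(tz=DATA_TIME_ZONE)))

# Record time basis: a datetime carried in the record, converted to the
# configured data time zone before the prefix is resolved.
record = {"Timestamp": "2016-05-04T23:36:54+00:00"}
record_time = datetime.fromisoformat(record["Timestamp"])
print(resolve_prefix(record_time.astimezone(DATA_TIME_ZONE)))
```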
Object Names
The destination creates object names in the following format: <prefix>-<UTC timestamp>-<counter>
For example: sdc-1462405014177-1
You configure the object name prefix. The UTC timestamp is the time when the object is created, to the millisecond. The counter is used when multiple objects are created in the same millisecond.
When you configure a suffix, the destination adds it to the object name after the counter: <prefix>-<UTC timestamp>-<counter>.<optional suffix>
For example: sdc-1462405014177-1.txt
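The naming pattern can be reproduced with the short sketch below; it only illustrates the format described above, and the prefix, suffix, and counter values are chosen for the example.

```python
# Illustrative sketch only: build an object name in the
# <prefix>-<UTC timestamp>-<counter>.<optional suffix> pattern.
# The prefix and suffix values are hypothetical.
import time

prefix = "sdc"
suffix = "txt"
counter = 1                            # incremented when several objects are
                                       # created in the same millisecond
utc_millis = int(time.time() * 1000)   # creation time in UTC, to the millisecond
object_name = f"{prefix}-{utc_millis}-{counter}.{suffix}"
print(object_name)                     # e.g. sdc-1462405014177-1.txt
```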
Whole File Names
When the destination streams whole files, it names the written objects using the following convention: <prefix>-<results of the file name expression>
Add Tags to Objects
You can configure the Amazon S3 destination to add tags to the Amazon S3 objects that it creates. Tags are key-value pairs that you can use to categorize objects, such as product: <product>.
You can configure multiple tags. When you configure a tag, you can define a tag with just the key or specify a key and value.
For more information about tags, including Amazon S3 restrictions, see the Amazon S3 documentation.
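For illustration, the following sketch tags an existing object with key-value pairs using boto3; it shows what S3 object tags look like, not how the destination applies them. The bucket, key, and tag values are hypothetical.

```python
# Illustrative sketch only: add key-value tags to an existing S3 object.
# Bucket, key, and tag values are hypothetical.
import boto3

s3 = boto3.client("s3")
s3.put_object_tagging(
    Bucket="my-bucket",
    Key="sales/sdc-1462405014177-1.txt",
    Tagging={
        "TagSet": [
            {"Key": "product", "Value": "widgets"},  # key and value
            {"Key": "pipeline", "Value": ""},        # key-only tag (empty value)
        ]
    },
)
```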
Event Generation
The Amazon S3 destination can generate events that you can use in an event stream. When you enable event generation, the destination generates event records each time it completes writing to an object or streaming a whole file.
You can use the Amazon S3 events in any logical way. For example:
- With the Amazon S3 executor to add metadata to closed objects or whole files after receiving an event.
- With the Spark executor to run a Spark application after receiving an event.
- With the Email executor to send a custom email after receiving an event.
For an example, see Sending Email During Pipeline Processing.
- With a destination to store event information.
For an example, see Preserving an Audit Trail of Events.
For more information about dataflow triggers and the event framework, see Dataflow Triggers Overview.
Event Records
Event records generated by the Amazon S3 destination include the following event-related record header attributes:

| Record Header Attribute | Description |
| --- | --- |
| sdc.event.type | Event type. Uses one of the types described below. |
| sdc.event.version | Integer that indicates the version of the event record type. |
| sdc.event.creation_timestamp | Epoch timestamp when the stage created the event. |

The destination can generate the following types of event records:
- Object written
- The destination generates an object written event record when it completes writing to an object.
- Whole file processed
- The destination generates an event record when it completes streaming a whole file. Whole file event records have the sdc.event.type record header attribute set to wholeFileProcessed and include the following fields:

| Field | Description |
| --- | --- |
| sourceFileInfo | A map of attributes about the original whole file that was processed. The attribute names depend on the information provided by the origin system. |
| targetFileInfo | A map of attributes about the whole file written to the destination system. The attributes include: bucket - the bucket where the whole file is written; objectKey - the object key name that was written. |
| checksum | Checksum generated for the written file. Included only when you configure the destination to include checksums in the event record. |
| checksumAlgorithm | Algorithm used to generate the checksum. Included only when you configure the destination to include checksums in the event record. |
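To show how these pieces fit together, the following sketch lays out the general shape of a whole file processed event record as header attributes and fields; every value shown is hypothetical.

```python
# Illustrative sketch only: the general shape of a whole file processed event
# record, using the header attributes and fields described above.
# All values are hypothetical.
event_header = {
    "sdc.event.type": "wholeFileProcessed",
    "sdc.event.version": 1,
    "sdc.event.creation_timestamp": 1462405014177,
}

event_body = {
    "sourceFileInfo": {"file": "/tmp/archive/report.csv", "size": 1048576},
    "targetFileInfo": {
        "bucket": "my-bucket",
        "objectKey": "sales/sdc-1462405014177-1",
    },
    # Present only when the destination is configured to include checksums:
    "checksum": "0cc175b9c0f1b6a831c399e269772661",
    "checksumAlgorithm": "MD5",
}
```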
Server-Side Encryption
You can configure the stage to use Amazon Web Services server-side encryption (SSE) to protect data written to Amazon S3. When configured for server-side encryption, the stage passes required server-side encryption configuration values to Amazon S3. Amazon S3 uses the values to encrypt the data as it is written to Amazon S3. The stage supports the following methods of server-side encryption:
- Amazon S3-Managed Encryption Keys (SSE-S3)
- When you use server-side encryption with Amazon S3-managed keys, Amazon S3 manages the encryption keys for you.
- AWS KMS-Managed Encryption Keys (SSE-KMS)
- When you use server-side encryption with AWS Key Management Service (KMS), you specify the Amazon resource name (ARN) of the AWS KMS encryption key that you want to use. You can also specify key-value pairs to use for the encryption context.
- Customer-Provided Encryption Keys (SSE-C)
- When you use server-side encryption with customer-provided keys, you specify the following information:
- Base64 encoded 256-bit encryption key
- Base64 encoded 128-bit MD5 digest of the encryption key using RFC 1321
For more information about using server-side encryption to protect data in Amazon S3, see the Amazon S3 documentation.
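For illustration, the following sketch shows how the three encryption options map onto S3 PUT requests when made directly with boto3. The bucket, object keys, KMS key ARN, and key material are assumptions; the destination passes the equivalent values for you based on the stage configuration.

```python
# Illustrative sketch only: the three server-side encryption options expressed
# as direct S3 PUT requests. Bucket, object keys, KMS key ARN, and the customer
# key are hypothetical.
import os

import boto3

s3 = boto3.client("s3")
body = b"example data"

# SSE-S3: Amazon S3 manages the encryption keys.
s3.put_object(Bucket="my-bucket", Key="sse-s3-object", Body=body,
              ServerSideEncryption="AES256")

# SSE-KMS: specify the ARN of the AWS KMS key to use.
s3.put_object(Bucket="my-bucket", Key="sse-kms-object", Body=body,
              ServerSideEncryption="aws:kms",
              SSEKMSKeyId="arn:aws:kms:us-west-2:111122223333:key/example-key-id")

# SSE-C: provide a 256-bit customer key; the SDK Base64-encodes the key and
# adds the MD5 digest of the key to the request.
customer_key = os.urandom(32)
s3.put_object(Bucket="my-bucket", Key="sse-c-object", Body=body,
              SSECustomerAlgorithm="AES256",
              SSECustomerKey=customer_key)
```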
Data Formats
The Amazon S3 destination writes data to Amazon S3 based on the data format that you select.
- Avro
- The destination writes records based on the Avro schema. You can use one of several methods to specify the location of the Avro schema definition.
- Binary
- The stage writes binary data to a single field in the record.
- Delimited
- The destination writes records as delimited data. When you use this data format, the root field must be list or list-map.
- JSON
- The destination writes records as JSON data. You can use one of the following formats (see the example after this list):
- Array - Each file includes a single array. In the array, each element is a JSON representation of each record.
- Multiple objects - Each file includes multiple JSON objects. Each object is a JSON representation of a record.
- Parquet
- The destination writes an object for each partition and includes the Parquet schema in every object.
- Protobuf
- Writes a batch of messages in each file.
- SDC Record
- The destination writes records in the SDC Record data format.
- Text
- The destination writes data from a single text field to the destination system. When you configure the stage, you select the field to use.
- Whole File
- Streams whole files to the destination system. The destination writes the data to the file and location defined in the stage. If a file of the same name already exists, you can configure the destination to overwrite the existing file or send the current file to error.
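As a small illustration of the two JSON formats listed above, the following sketch writes the same records both ways; the record contents are hypothetical, and the one-object-per-line layout for the multiple objects format is an assumption for the example.

```python
# Illustrative sketch only: the same records written in the two JSON formats.
# Record contents are hypothetical.
import json

records = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]

# Array: the file contains a single JSON array whose elements are the records.
array_format = json.dumps(records)
# -> [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]

# Multiple objects: the file contains one JSON object per record, shown here
# written one object per line.
multiple_objects_format = "\n".join(json.dumps(r) for r in records)
# -> {"id": 1, "name": "a"}
#    {"id": 2, "name": "b"}
```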