Amazon S3
The Amazon S3 origin reads objects stored in Amazon Simple Storage Service, also known as Amazon S3. The objects must be fully written, include data of the same supported format, and use the same schema.
When reading multiple objects in a batch, the origin reads the oldest object first. Upon successfully reading an object, the origin can delete the object, move it to an archive directory, or leave it in the directory.
When the pipeline stops, the origin notes the last-modified timestamp of the last object that it processed and stores it as an offset. When the pipeline starts again, the origin continues processing from the last-saved offset by default. You can reset pipeline offsets to process all available objects.
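The exact offset handling is internal to the origin, but the general idea can be sketched with the AWS SDK for Python: list the objects under a prefix, keep only those modified after the saved offset, and process them oldest first. The bucket, prefix, and stored offset value below are hypothetical.

```python
# Conceptual sketch only (not Transformer internals): filtering S3 objects
# by a saved last-modified offset and processing the oldest object first.
from datetime import datetime, timezone
import boto3

s3 = boto3.client("s3")
saved_offset = datetime(2024, 1, 1, tzinfo=timezone.utc)  # assumed last-saved offset

resp = s3.list_objects_v2(Bucket="my-bucket", Prefix="sales/")
pending = [
    obj for obj in resp.get("Contents", [])
    if obj["LastModified"] > saved_offset          # only objects newer than the offset
]
pending.sort(key=lambda obj: obj["LastModified"])  # oldest object first

for obj in pending:
    print("would process:", obj["Key"])
    saved_offset = obj["LastModified"]             # advance the offset after each object
```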
Before you run a pipeline that uses the Amazon S3 origin, make sure to complete the prerequisite tasks.
When you configure the origin, you specify the authentication method to use. You define the bucket and path to the objects to read. The origin reads objects from the specified directory and its subdirectories. If the origin reads partitioned objects grouped by field, you must specify the partition base path so that the partition fields and field values are included in the data. You also specify the name pattern for the objects to read. You can optionally configure another name pattern to exclude objects from processing and define post-processing actions for successfully read objects.
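Transformer pipelines run on Spark, and the partition base path roughly corresponds to Spark's partition discovery. The following PySpark sketch, with a hypothetical bucket and year/month partition folders, shows how a base path causes the partition fields and their values to appear as columns in the data.

```python
# Sketch of the underlying Spark behavior (paths and fields are hypothetical):
# objects laid out as .../year=2024/month=01/part-*.parquet. Setting basePath
# to the directory above the partition folders makes Spark add the partition
# fields (year, month) and their values as columns.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-base-path-sketch").getOrCreate()

df = (spark.read
      .option("basePath", "s3a://my-bucket/sales/")         # partition base path
      .parquet("s3a://my-bucket/sales/year=2024/month=01/"))

df.printSchema()  # schema includes year and month columns derived from the path
```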
You can also use a connection to configure the origin.
You select the data format of the data and configure related properties. When processing delimited or JSON data, you can define a custom schema for reading the data and configure related properties. You can also configure advanced properties such as performance-related properties and proxy server properties.
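For reference, a custom schema for delimited data amounts to an explicit Spark schema that replaces inference. The following PySpark sketch uses hypothetical column names and an assumed S3 path.

```python
# Sketch (hypothetical columns and path): defining an explicit schema for
# delimited data instead of relying on schema inference.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType

spark = SparkSession.builder.appName("custom-schema-sketch").getOrCreate()

schema = StructType([
    StructField("order_id", IntegerType(), nullable=False),
    StructField("customer", StringType(), nullable=True),
    StructField("amount", DoubleType(), nullable=True),
])

df = (spark.read
      .schema(schema)                 # custom schema replaces inference
      .option("header", "true")
      .option("delimiter", ",")
      .csv("s3a://my-bucket/orders/"))
```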
You can configure the origin to load data only once and cache the data for reuse throughout the pipeline run. Or, you can configure the origin to cache each batch of data so the data can be passed to multiple downstream batches efficiently. You can also configure the origin to skip tracking offsets.
Prerequisites
Before reading from Amazon S3 with the Amazon S3 origin, complete the following prerequisite tasks:
- Verify permissions
- The user associated with the authentication credentials in effect must have the following Amazon S3 permissions (see the example policy after this list):
  - READ permission on the bucket
  - s3:ListBucket permission in an AWS Identity and Access Management (IAM) policy
- Use a single schema
- All objects processed by the Amazon S3 origin must have the same schema.
- Perform prerequisite tasks for local pipelines
- To connect to Amazon S3, Transformer uses connection information stored in a Hadoop configuration file. Before you run a local pipeline that connects to Amazon S3, complete the prerequisite tasks.
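As an illustration only, the following IAM policy grants list access on a hypothetical bucket and read access to its objects. Treat it as a starting point: the exact actions you need depend on your account setup, and post-processing actions such as deleting or moving objects typically require additional permissions.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ListBucket",
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": ["arn:aws:s3:::my-bucket"]
    },
    {
      "Sid": "ReadObjects",
      "Effect": "Allow",
      "Action": ["s3:GetObject"],
      "Resource": ["arn:aws:s3:::my-bucket/*"]
    }
  ]
}
```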
Authentication Method
You can configure the Amazon S3 origin to authenticate with Amazon Web Services (AWS) using an instance profile or AWS access keys. When accessing a public bucket, you can connect anonymously using no authentication.
For more information about the authentication methods and details on how to configure each method, see Amazon Security.
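Under the hood, Spark typically reaches S3 through the Hadoop S3A connector, and each authentication method maps to S3A credential settings. The sketch below shows those standard S3A properties through a Spark session builder; in Transformer you normally choose the method in the origin or connection properties rather than setting these directly, and the key values shown are placeholders.

```python
# Sketch of the S3A credential settings that correspond to each method
# (placeholder values; set the method in the origin or connection in practice).
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("s3-auth-sketch")
         # AWS access keys:
         .config("spark.hadoop.fs.s3a.access.key", "<access key id>")
         .config("spark.hadoop.fs.s3a.secret.key", "<secret access key>")
         # Or, for an instance profile, use the instance metadata credentials:
         # .config("spark.hadoop.fs.s3a.aws.credentials.provider",
         #         "com.amazonaws.auth.InstanceProfileCredentialsProvider")
         # Or, for a public bucket, connect anonymously with no authentication:
         # .config("spark.hadoop.fs.s3a.aws.credentials.provider",
         #         "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider")
         .getOrCreate())
```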
Partitioning
Spark runs a Transformer pipeline just as it runs any other application, splitting the data into partitions and performing operations on the partitions in parallel. When the pipeline starts processing a new batch, Spark determines how to split pipeline data into initial partitions based on the origins in the pipeline.
- Delimited, JSON, text, or XML
- When reading text-based data, Spark can split the object into multiple partitions for processing, depending on the underlying file system. Multiline JSON files cannot be split.
- Avro, ORC, or Parquet
- When reading Avro, ORC, or Parquet data, Spark can split the object into multiple partitions for processing.
Spark uses these partitions while the pipeline processes the batch unless a processor causes Spark to shuffle the data. To change the partitioning in the pipeline, use the Repartition processor.
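The sketch below, using a hypothetical S3 path, shows how to inspect the initial partitions Spark creates when reading splittable data and how an explicit repartition (roughly what the Repartition processor does) changes them.

```python
# Sketch (hypothetical path): inspecting the initial partitions Spark creates
# when reading a splittable columnar format, then forcing a shuffle into a
# different number of partitions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitioning-sketch").getOrCreate()

df = spark.read.parquet("s3a://my-bucket/events/")
print("initial partitions:", df.rdd.getNumPartitions())

repartitioned = df.repartition(8)   # shuffle the data into 8 partitions
print("after repartition:", repartitioned.rdd.getNumPartitions())
```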
Data Formats
The Amazon S3 origin generates records based on the specified data format.
- Avro
- The origin generates a record for every Avro record in the object. Each object must contain the Avro schema. The origin uses the Avro schema to generate records.
- Delimited
- The origin generates a record for each delimited line in the object. You can specify a custom delimiter, quote, and escape character used in the data.
- JSON
- By default, the origin generates a record for each line in the object. Each line must contain a complete, valid JSON object, as in the JSON Lines format. For details, see the JSON Lines website. A sketch of the Spark readers behind the delimited and JSON formats follows this list.
- ORC
- The origin generates a record for each Optimized Row Columnar (ORC) row in the object.
- Parquet
- The origin generates a record for every Parquet record in the object. The object must contain the Parquet schema. The origin uses the Parquet schema to generate records.
- Text
- The origin generates a record for each text line in the object. The object must use \n as the newline character.
- XML
- The origin generates a record for every row in the object. You specify the root tag used in files and the row tag used to define records.
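The sketch below, with hypothetical paths, shows the kind of Spark readers that sit behind two of these formats: delimited data with a custom delimiter, quote, and escape character, and JSON read either as JSON Lines or as multiline objects.

```python
# Sketch of the underlying Spark readers for two of the formats
# (hypothetical paths; the origin exposes these choices as data format properties).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("data-format-sketch").getOrCreate()

# Delimited data with a custom delimiter, quote, and escape character.
delimited = (spark.read
             .option("header", "true")
             .option("delimiter", "|")
             .option("quote", '"')
             .option("escape", "\\")
             .csv("s3a://my-bucket/delimited/"))

# JSON Lines data: one complete JSON object per line (the default).
json_lines = spark.read.json("s3a://my-bucket/jsonl/")

# Multiline JSON objects can also be read, but they cannot be split across partitions.
multiline = spark.read.option("multiLine", "true").json("s3a://my-bucket/json/")
```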
Configuring an Amazon S3 Origin
Configure an Amazon S3 origin to read data in Amazon S3. Before you run the pipeline, make sure to complete the prerequisite tasks.