Google Cloud Storage
The Google Cloud Storage origin reads objects stored in Google Cloud Storage. The objects must be fully written and reside in a single bucket. The object names must share a prefix pattern. For information about supported versions, see Supported Systems and Versions in the Data Collector documentation.
With the Google Cloud Storage origin, you define the bucket, prefix pattern, and optional common prefix. These properties determine the objects that the origin processes.
You also define the project ID and credentials to use when connecting to Google Cloud Storage. You can also use a connection to configure the origin.
After processing an object or upon encountering errors, the origin can keep, archive, or delete the object. When archiving, the origin can copy or move the object.
The origin can generate events for an event stream. For more information about dataflow triggers and the event framework, see Dataflow Triggers Overview.
Credentials
Before reading objects in Google Cloud Storage, the origin must pass credentials to Google Cloud Storage. You can provide credentials using one of the following options:
- Google Cloud default credentials
- Credentials in a file
- Credentials in a stage property
For details on how to configure each option, see Security in Google Cloud Stages.
Common Prefix, Prefix Pattern, and Wildcards
The Google Cloud Storage origin appends the prefix pattern to the common prefix to define the objects that the origin processes. You can specify an exact prefix pattern or you can use Ant-style path patterns to read multiple objects recursively.
You can use the following wildcards in a prefix pattern:
- Question mark (?) to match a single character
- Asterisk (*) to match zero or more characters
- Double asterisks (**) to match zero or more directories
For example, to process all log files in US/East/MD/ and all nested prefixes, you can use the following common prefix and prefix pattern:
Common Prefix: US/East/MD/
Prefix Pattern: **/*.log
To process log files in nested prefixes that appear earlier in the hierarchy, such as US/**/weblogs/, you can include the nested prefixes in the prefix pattern or define the entire hierarchy in the prefix pattern, as follows:
Common Prefix: US/
Prefix Pattern: **/weblogs/*.log
Common Prefix:
Prefix Pattern: US/**/weblogs/*.log
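The wildcard rules above can be illustrated with a minimal Python sketch. It is not the origin's implementation: the helper names `ant_to_regex` and `matches` are hypothetical, and the sketch assumes the usual Ant convention that `?` and `*` do not match across a `/` separator.

```python
import re


def ant_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate an Ant-style path pattern into a compiled regular expression."""
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            # Double asterisk followed by a slash: zero or more whole directories.
            parts.append(r"(?:[^/]+/)*")
            i += 3
        elif pattern.startswith("**", i):
            # Trailing double asterisk: anything, including slashes.
            parts.append(r".*")
            i += 2
        elif pattern[i] == "*":
            # Single asterisk: zero or more characters within one path segment (assumption).
            parts.append(r"[^/]*")
            i += 1
        elif pattern[i] == "?":
            # Question mark: exactly one character within a path segment (assumption).
            parts.append(r"[^/]")
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts))


def matches(common_prefix: str, prefix_pattern: str, object_name: str) -> bool:
    """Check an object name against the common prefix followed by the prefix pattern."""
    return ant_to_regex(common_prefix + prefix_pattern).fullmatch(object_name) is not None


# Both configurations from the second example select the same object.
print(matches("US/", "**/weblogs/*.log", "US/East/MD/weblogs/app.log"))
print(matches("", "US/**/weblogs/*.log", "US/East/MD/weblogs/app.log"))
```

Under these assumptions, splitting the hierarchy between the common prefix and the prefix pattern does not change which objects match, because the two values are joined before matching.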
Record Header Attributes
When the Google Cloud Storage origin processes Parquet data and Skip Union Indexes is not enabled, it generates an avro.union.typeIndex./id record header attribute identifying the index number of the element in a union that the data is read from.
Event Generation
The Google Cloud Storage origin can generate events when it completes processing all available data and the configured batch wait time has elapsed. You can use these events in any logical way. For example:
- With the Google Cloud Storage executor to perform tasks after writing an object or whole file.
- With the Pipeline Finisher executor to stop the pipeline and transition the pipeline to a Finished state when the origin completes processing available data.
When you restart a pipeline stopped by the Pipeline Finisher executor, the origin continues processing from the last-saved offset unless you reset the origin.
For an example, see Stopping a Pipeline After Processing All Available Data.
- With a destination to store event information.
For an example, see Preserving an Audit Trail of Events.
For more information about dataflow triggers and the event framework, see Dataflow Triggers Overview.
Event Records
Record Header Attribute | Description |
---|---|
sdc.event.type | Event type. Uses the following type: no-more-data |
sdc.event.version | Integer that indicates the version of the event record type. |
sdc.event.creation_timestamp | Epoch timestamp when the stage created the event. |
The Google Cloud Storage origin can generate the following event record:
- no-more-data
- The Google Cloud Storage origin generates a no-more-data event record when the origin completes processing all available records and the number of seconds configured for Batch Wait Time elapses without any new objects appearing to be processed.
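As a rough illustration of the event record described above, the following Python sketch models the three event header attributes and the condition under which a no-more-data event is produced. The function names, the integer value types, and the use of milliseconds for the timestamp are assumptions made for the example, not details taken from the origin.

```python
def build_no_more_data_headers(created_at_ms: int, version: int = 1) -> dict:
    """Return the event-related header attributes for a no-more-data event (hypothetical helper)."""
    return {
        "sdc.event.type": "no-more-data",
        "sdc.event.version": version,  # integer version of the event record type
        "sdc.event.creation_timestamp": created_at_ms,  # epoch timestamp; milliseconds is an assumption
    }


def should_emit_no_more_data(pending_objects: int, idle_seconds: float, batch_wait_time: float) -> bool:
    """True when nothing is left to process and the batch wait time has elapsed with no new objects."""
    return pending_objects == 0 and idle_seconds >= batch_wait_time


def is_no_more_data(headers: dict) -> bool:
    """Check whether a record's header attributes mark it as a no-more-data event."""
    return headers.get("sdc.event.type") == "no-more-data"


if should_emit_no_more_data(pending_objects=0, idle_seconds=12.0, batch_wait_time=10.0):
    event_headers = build_no_more_data_headers(created_at_ms=1700000000000)
    print(is_no_more_data(event_headers))
```

A downstream stage could route on `sdc.event.type` in the same way that `is_no_more_data` does here.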
Data Formats
- Avro
- Generates a record for every Avro record. Includes a precision and scale field attribute for each Decimal field.
- Binary
- Generates a record with a single byte array field at the root of the record.
- Delimited
- Generates a record for each delimited line.
- Excel
- Generates a record for every row in the file. Can process .xls or .xlsx files.
You can configure the origin to read from all sheets in a workbook or from particular sheets in a workbook. You can specify whether files include a header row and whether to ignore the header row. You can also configure the origin to skip cells that do not have a corresponding header value. A header row must be the first row of a file. Vertical header columns are not recognized.
The origin cannot process Excel files with large numbers of rows. You can save such files as CSV files in Excel, and then use the origin to process them with the delimited data format.
- JSON
- Generates a record for each JSON object. You can process JSON files that include multiple JSON objects or a single JSON array.
- Parquet
- Generates a record for every Parquet record in the file. The file must contain the Parquet schema. The origin uses the Parquet schema to generate records.
The stage includes the Parquet schema in a parquetSchema record header attribute.
When Skip Union Indexes is not enabled, the origin generates an avro.union.typeIndex./id record header attribute identifying the index number of the element in the union that the data is read from. If a schema contains many unions and the pipeline does not depend on index information, you can enable Skip Union Indexes to avoid long processing times associated with storing a large number of indexes.
- Log
- Generates a record for every log line.
- Protobuf
- Generates a record for every protobuf message.
- SDC Record
- Generates a record for every record. Use to process records generated by a Data Collector pipeline using the SDC Record data format.
- Text
- Generates a record for each line of text or for each section of text based on a custom delimiter.
- Whole File
- Streams whole files from the origin system to the destination system. You can specify a transfer rate or use all available resources to perform the transfer.
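To make the JSON data format entry above more concrete, here is a small Python sketch of reading either multiple JSON objects or a single JSON array and producing one record per object. The function name `iter_json_records` is hypothetical, and the sketch only approximates the behavior described in this section. It does not reflect how the origin is implemented.

```python
import json
from typing import Any, Iterator


def iter_json_records(text: str) -> Iterator[Any]:
    """Yield one record per JSON object, from a single array or from concatenated objects."""
    decoder = json.JSONDecoder()
    pos, end = 0, len(text)

    # raw_decode does not skip leading whitespace, so skip it manually.
    while pos < end and text[pos].isspace():
        pos += 1

    if pos < end and text[pos] == "[":
        # Single JSON array: each element becomes a record.
        array, _ = decoder.raw_decode(text, pos)
        yield from array
        return

    # Multiple JSON objects: decode them one after another.
    while pos < end:
        value, pos = decoder.raw_decode(text, pos)
        yield value
        while pos < end and text[pos].isspace():
            pos += 1


multiple_objects = '{"id": 1}\n{"id": 2}'
single_array = '[{"id": 1}, {"id": 2}]'
print(list(iter_json_records(multiple_objects)) == list(iter_json_records(single_array)))
```

Both inputs yield the same two records, which mirrors the choice between a file of multiple JSON objects and a file containing a single JSON array.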