Google Cloud Storage
The Google Cloud Storage destination writes data to objects in Google Cloud Storage. For information about supported versions, see Supported Systems and Versions.
The destination creates an object for each batch of data written to Google Cloud Storage.
With the Google Cloud Storage destination, you configure the bucket and common prefix to define where to write objects. You can use a partition prefix to specify the partition to write to. You can configure a prefix for the object name, and a time basis and data time zone for the stage. When using any data format except whole file, you can also configure a suffix for the object name and compress data with gzip before writing to Google Cloud Storage.
You also define the project ID and credentials to use when connecting to Google Cloud Storage.
You can also use a connection to configure the destination.
The destination can generate events for an event stream. For more information about the event framework, see Dataflow Triggers Overview.
Credentials
Before writing to Google Cloud Storage, the Google Cloud Storage destination must pass credentials to Google Cloud Storage. Use one of the following options to provide credentials:
- Google Cloud default credentials
- Credentials in a file
- Credentials in a stage property
For details on how to configure each option, see Security in Google Cloud Stages.
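As a point of reference, the sketch below shows what these three options roughly correspond to when using the google-cloud-storage Python client directly, outside of Data Collector. The project ID, key file path, and key string are placeholders; the destination handles all of this internally once configured.

```python
import json

from google.cloud import storage
from google.oauth2 import service_account

# Option 1: Google Cloud default credentials.
# The client resolves Application Default Credentials from the environment
# (GOOGLE_APPLICATION_CREDENTIALS, gcloud auth, or the GCE metadata server).
client = storage.Client(project="my-project-id")

# Option 2: credentials in a file (placeholder path to a service account key).
client = storage.Client.from_service_account_json(
    "/path/to/service-account-key.json"
)

# Option 3: credentials supplied as a value rather than a file, for example
# a key pasted into a stage property or pulled from a secret store.
key_json = '{"type": "service_account", "...": "..."}'  # placeholder content
credentials = service_account.Credentials.from_service_account_info(
    json.loads(key_json)
)
client = storage.Client(project="my-project-id", credentials=credentials)
```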
Partition Prefix
You can use a partition prefix to organize objects by partitions. You can use the partition prefix to write to existing partitions or to create new partitions as needed. When a partition specified in the partition prefix does not exist, the destination creates the partition.
You can specify an exact partition name for the partition prefix, or you can use an expression that evaluates to a partition name.
For example, to write to partitions based on data in the Country field, you can use the following expression as the partition prefix: ${record:value('/Country')}.
With this expression, the destination writes records to partitions based on the country data in the record, and creates partitions for countries that do not already have a partition.
If you use datetime variables in the expression, be sure to configure the time basis for the stage.
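To make the behavior concrete, here is a small Python illustration of how a record-based partition prefix such as ${record:value('/Country')} resolves per record. This only mimics the outcome of the expression; it is not Data Collector code, and the bucket, common prefix, and records are made up.

```python
# Illustrates how ${record:value('/Country')} resolves for each record.
records = [
    {"Country": "France", "id": 1},
    {"Country": "Japan", "id": 2},
    {"Country": "France", "id": 3},
]

bucket = "my-bucket"        # made-up bucket
common_prefix = "sales"     # made-up common prefix

for record in records:
    partition = record["Country"]   # result of ${record:value('/Country')}
    print(f"gs://{bucket}/{common_prefix}/{partition}/")

# gs://my-bucket/sales/France/   <- partition created on first use
# gs://my-bucket/sales/Japan/
# gs://my-bucket/sales/France/
```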
Time Basis, Data Time Zone, and Time-Based Partition Prefixes
The time basis and the data time zone together determine the time that the Google Cloud Storage destination uses to write records to a time-based partition prefix. When the configured partition prefix does not include time-based functions, you can ignore the time basis property.
A partition prefix has a time component when it includes datetime variables, such as ${YYYY()} or ${DD()}, or when it includes an expression that evaluates to a datetime value, such as ${record:value("/Timestamp")}.
For details about datetime variables, see Datetime Variables.
- Processing Time
- When you use processing time as the time basis, the destination performs writes based on the processing time and the configured partition prefix. By default, the processing time is the time associated with the Data Collector running the pipeline. You can specify a different time zone by configuring the Data Time Zone property. To use the processing time as the time basis, use the following expression: ${time:now()}. This is the default time basis.
- Record Time
- When you use the time associated with a record as the time basis, you specify a date field in the record. The destination writes data based on the datetimes associated with the records, adjusting for the value specified for the Data Time Zone property.
For example, say you define the partition prefix as logs-${YYYY()}-${MM()}-${DD()}. If you use the time of processing as the time basis, the destination writes records to partitions based on when it processes each record. If you use the time associated with the data, such as a transaction timestamp, then the destination writes records to partitions based on that timestamp. If a partition does not exist, the destination creates the needed partition.
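For illustration only, the following plain-Python sketch shows how that same prefix can resolve to different partitions under the two time bases, including the Data Time Zone shift. The template, zone, and record are made up, and this is not how Data Collector evaluates expressions internally.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# Stands in for the partition prefix logs-${YYYY()}-${MM()}-${DD()}.
prefix_template = "logs-{t:%Y}-{t:%m}-{t:%d}"
data_time_zone = ZoneInfo("America/New_York")   # the Data Time Zone property

# Processing time basis: ${time:now()}, shifted to the data time zone.
processing_time = datetime.now(tz=data_time_zone)
print(prefix_template.format(t=processing_time))

# Record time basis: a date field in the record, e.g. /Timestamp.
record = {"Timestamp": datetime(2024, 1, 1, 2, 30, tzinfo=timezone.utc)}
record_time = record["Timestamp"].astimezone(data_time_zone)
print(prefix_template.format(t=record_time))    # logs-2023-12-31 after the shift
```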
Object Names
The destination writes objects using the following name format: <prefix>-<UUID>. You configure the object name prefix. For example: sdc-c9a2db16-b5d0-44cb-b3f5-d0781cced760.
When using any data format except whole file, you can configure an object name suffix. When you do, the destination uses the following name format: <prefix>-<UUID>.<optional suffix>. For example: sdc-c9a2db16-b5d0-44cb-b3f5-d0781cced760.txt.
Whole File Names
When you use the whole file data format, the destination writes whole files using the following name format: <prefix>-<results of the file name expression>.
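A rough sketch of how both name formats above come together; the prefix, suffix, and file name expression result are placeholders:

```python
import uuid

prefix = "sdc"   # configured object name prefix

# Object name: <prefix>-<UUID>, plus the optional suffix when configured.
object_name = f"{prefix}-{uuid.uuid4()}"
suffix = "txt"
if suffix:
    object_name += f".{suffix}"
print(object_name)   # e.g. sdc-c9a2db16-b5d0-44cb-b3f5-d0781cced760.txt

# Whole file name: <prefix>-<results of the file name expression>.
file_name_expression_result = "invoice-2024-01.pdf"   # hypothetical result
print(f"{prefix}-{file_name_expression_result}")      # sdc-invoice-2024-01.pdf
```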
Event Generation
The Google Cloud Storage destination can generate events that you can use in an event stream. When you enable event generation, the destination generates event records each time it completes writing to an object or completes streaming a whole file.
Google Cloud Storage events can be used in any logical way. For example:
- With the Google Cloud Storage executor to perform tasks after writing an object or whole file.
- With the Email executor to send a custom email after receiving an event.
For an example, see Sending Email During Pipeline Processing.
- With a destination to store event information.
For an example, see Preserving an Audit Trail of Events.
For more information about dataflow triggers and the event framework, see Dataflow Triggers Overview.
Event Records
Event records generated by the Google Cloud Storage destination include the following event-related record header attributes:
Record Header Attribute | Description |
---|---|
sdc.event.type | Event type. Uses one of the types described in the event record descriptions below. |
sdc.event.version | Integer that indicates the version of the event record type. |
sdc.event.creation_timestamp | Epoch timestamp when the stage created the event. |
- Object written
- The destination generates an object written event record when it completes writing to an object.
- Whole file processed
- The destination generates an event record when it completes streaming a whole file. Whole file event records have the sdc.event.type record header attribute set to wholeFileProcessed and include the following fields:
Field | Description |
---|---|
sourceFileInfo | A map of attributes about the original whole file that was processed. The attribute names depend on the information provided by the origin system. |
targetFileInfo | A map of attributes about the whole file written to the destination system. The attributes include: bucket - the bucket where the whole file is written; objectKey - the object key name that was written. |
checksum | Checksum generated for the written file. Included only when you configure the destination to include checksums in the event record. |
checksumAlgorithm | Algorithm used to generate the checksum. Included only when you configure the destination to include checksums in the event record. |
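For orientation, a downstream consumer of these events might branch on the header attribute and use the whole file fields roughly as follows. The event below is a hand-built stand-in that mirrors the fields documented above, with placeholder bucket, object key, and data.

```python
import hashlib

# Hand-built stand-in for a whole file event record (placeholder values).
event = {
    "header": {
        "sdc.event.type": "wholeFileProcessed",
        "sdc.event.version": 1,
    },
    "fields": {
        "targetFileInfo": {"bucket": "my-bucket", "objectKey": "sdc-invoice.pdf"},
        "checksum": "9e107d9d372bb6826bd81d3542a419d6",
        "checksumAlgorithm": "MD5",
    },
}

if event["header"]["sdc.event.type"] == "wholeFileProcessed":
    info = event["fields"]["targetFileInfo"]
    print(f"whole file at gs://{info['bucket']}/{info['objectKey']}")
    # To verify integrity, recompute the checksum of the downloaded bytes
    # and compare it with the checksum carried in the event:
    data = b"...downloaded object bytes..."   # placeholder
    if event["fields"]["checksumAlgorithm"] == "MD5":
        computed = hashlib.md5(data).hexdigest()
        # computed == event["fields"]["checksum"] for the real object bytes
```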
Data Formats
- Avro
- The destination writes records based on the Avro schema. You can use one of several methods to specify the location of the Avro schema definition; configure the method in the stage properties.
- Delimited
- The destination writes records as delimited data. When you use this data format, the root field must be list or list-map.
- JSON
- The destination writes records as JSON data. You can use one of
the following formats:
- Array - Each file includes a single array. In the array, each element is a JSON representation of each record.
- Multiple objects - Each file includes multiple JSON objects. Each object is a JSON representation of a record.
- Parquet
- The destination writes a Parquet file for each partition and includes the Parquet schema in every file.
- Protobuf
- The destination writes a batch of messages in each file.
- SDC Record
- The destination writes records in the SDC Record data format.
- Text
- The destination writes data from a single text field to the destination system. When you configure the stage, you select the field to use.
- Whole File
- The destination streams whole files to the destination system. It writes the data to the file and location defined in the stage. If a file of the same name already exists, you can configure the destination to overwrite the existing file or send the current file to error.
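To illustrate the difference between the two JSON formats, and the gzip compression option mentioned earlier, here is a small sketch of the payload layout. It shows the shape of the written data only, not the destination's implementation.

```python
import gzip
import json

records = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]

# Array format: the object contains a single JSON array of records.
array_payload = json.dumps(records)
# -> [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]

# Multiple-objects format: one JSON object per record.
multi_payload = "\n".join(json.dumps(r) for r in records)
# -> {"id": 1, "name": "a"}
#    {"id": 2, "name": "b"}

# With gzip compression enabled, data is compressed before the upload.
compressed = gzip.compress(multi_payload.encode("utf-8"))
```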