Google BigQuery
The Google BigQuery target loads new data or change data capture (CDC) data to Google BigQuery. The target can compensate for data drift to support loading to new or existing datasets, tables, and columns. For information about supported versions, see Supported systems and versions.
To load data, the target first stages the flow data in CSV or Avro files in a staging area in Google Cloud Storage. Then, the target creates a BigQuery batch load job that copies the staged files into BigQuery. For information about the Google BigQuery quota policy for batch loading data, see the Google Cloud BigQuery documentation.
When you configure the target, you specify authentication information for Google BigQuery and for your Google Cloud Storage staging area. You can also use a connection to configure the target. You can optionally configure the target to connect to Google BigQuery through a proxy server.
You specify the name of the dataset and tables to load the data to. The target loads data from record fields to table columns based on matching names.
You can configure the target to compensate for data drift by creating new columns in existing tables when new fields appear in records or by creating new tables or datasets as needed. When creating new tables, you can configure how the target partitions the tables.
You can configure the root field for the row, and any first-level fields that you want to exclude from the record. You can specify characters to represent null values.
When processing CDC data, you can specify the key columns in the BigQuery table that the target uses to evaluate the merge condition.
You can configure the target to replace missing fields or fields with invalid data types with user-defined default values. You can also configure the target to replace newline characters and trim leading and trailing spaces.
Before you use the Google BigQuery target, you must complete some prerequisite tasks.
Prerequisites
Prepare the Google BigQuery data warehouse
About this task
Before configuring the Google BigQuery target, prepare your BigQuery data warehouse.
Procedure
Prepare the Google Cloud Storage staging area
About this task
Before configuring the Google BigQuery target, prepare a staging area in Google Cloud Storage. The target stages CSV or Avro files in the staging area before loading them to Google BigQuery.
Procedure
Credentials
When the Google BigQuery target stages CSV or Avro files in the Cloud Storage staging area and then loads the files to BigQuery, the target must pass credentials to Google Cloud Storage and then to Google BigQuery.
- Google Cloud default credentials
- Credentials in a file
- Credentials in a stage property
For details on how to configure each option, see Security in Google Cloud stages.
Staging file formats
The Google BigQuery target stages data in Avro, CSV, JSON, or Parquet files. For more information about staging file formats in BigQuery, see the BigQuery documentation.
Note the following staging file format limits:
- The target supports nested data types for Avro or JSON staging files only.
- For CSV staging files, the target does not support nested Data Collector data types. The target treats these fields as having an invalid data type.
For more information, see Row generation.
You specify the file format using the Staging File Format property on the Data tab.
Specifying datasets and tables
You can use the Google BigQuery target to load data to one or more tables in one or more datasets.
- Single dataset or table
- To write data to a single dataset, simply enter the dataset name as
follows:
<dataset_name> - Multiple datasets or tables
- To write data to multiple datasets and tables, specify a field in the record that defines the datasets and tables.
Use the Dataset and Table properties on the Table Definition tab to specify the datasets and tables to write to.
Enabling data drift handling
The Google BigQuery target can automatically compensate for changes in column, table, or dataset requirements, also known as data drift.
- Create new columns
-
The target can create new columns in existing tables when new fields appear in records. For example, if a record suddenly includes a new
Address2field, the target creates a newAddress2column in the target table. The target always creates new columns in existing tables as Nullable columns.By default, the target creates new columns based on the data in the new fields, such as creating an Int column for integer data. You can, however, configure the target to create all new columns as String.
To enable the automatic creation of new columns, select the Enable Data Drift property on the Table Definition tab. To create all new columns as String, select the Create New Columns as String property.
- Create new tables
- When data drift handling is enabled, you can also configure the target
to create new tables as needed. For example, say the target writes data
to tables based on the region name in the
Regionfield. When a newSW-3region shows up in a record, the target creates a newSW-3table and writes the record to the new table. - Create new datasets
- When data drift handling and the creation of new tables are enabled, you can
also configure the target to create new datasets as needed. For
example, say the target writes data to datasets based on the department
name in the
deptfield. When a newEngineeringdepartment shows up in a record, the target creates a newEngineeringdataset and writes the record to a new table in the new dataset.
Partitioning new tables
When creating a new table, the Google BigQuery target can also partition the table.
To enable partitioning for new tables, first enable data drift and select the Create Table property on the Table Definition tab. Then, select the Partition Table property and define partition configurations.
You can add multiple partition configurations so that the target can partition new tables in different ways. Because BigQuery can partition a table only once, the target uses a single partition configuration when it creates a table.
For each partition configuration, you specify whether the configuration should be applied to a specific table or whether the configuration is the default. Then, you specify the partition options.
Partition configurations
- To a specific table
- To create a partition configuration for a specific table, enter the dataset and table name to apply the partition configuration to.
- As the default
- To create the default partition configuration that the target uses when it cannot find a matching configuration for a new table, select Default Partition. You can create one default partition configuration.
When you enable the creation and partitioning of new tables, the target first looks for a matching partition configuration by dataset and table name. If the target cannot find a match, it uses the default configuration. If a default configuration does not exist, the target creates the table without partitioning.
Region field. The Region field can have one of the
following values: WestRegion, EastRegion, and
NorthRegion. You also enable the creation and partitioning of new
tables. You can set up partition configurations for the three new tables in the
following ways:- Different partitioning for each table - Create three partition
configurations, one for each table:
WestRegion,EastRegion, andNorthRegion. - Same partitioning for each table - Create a single default partition.
- Different partitioning for one table, same partitioning for the remaining
tables - Create a partition configuration for the
WestRegiontable. Then create a default partition configuration used for the remaining tables. - Partition one table, do not partition the remaining tables - Create a
partition configuration for the
WestRegiontable. Do not create a default partition configuration.
Partition options
- Date
- Datetime
- Timestamp
- Integer
- Ingestion
For the Date, Datetime, Timestamp, and Integer partition type, you also must specify the column to partition by. The data type of the column must match the selected partition type. For example, to partition by the Integer type, the data type of the column must be Integer.
You configure additional options for each partition type as required by BigQuery. For details about the available options for each partition type, see the Google Cloud BigQuery documentation.
Row generation
When writing a record to a table, the Google BigQuery target includes all record
fields in the resulting row, by default. The target uses the root field,
/, as the basis for the resulting row.
You can configure the Row Field property on the Data tab to specify a map or list-map field in the record as the basis for the row. The resulting record includes only the data from the specified map or list-map field and excludes all other record data. Use this functionality when the data that you want to write to BigQuery exists in a single map or list-map field within the record.
When you want to use the root field, but do not want to include all fields in the resulting row, configure the target to ignore all specified first-level fields.
The target supports nested data types for Avro or JSON staging files only.
/, as the basis for the resulting row. A record contains the
following
fields:{
"name": "Jane Smith",
"id": "557",
"address": {
"street": "101 3rd St",
"city": "Huntsville",
"state": "NC",
"zipcode": "27023"
}
}The target treats the address map field as a field with an invalid
data type, processing the field as an error record by default. You can configure the
target to ignore the field and process the remaining record data, as described in
Missing fields and fields with invalid types.
Missing fields and fields with invalid types
By default, the target treats records with missing fields or with invalid data types in fields as error records.
You can configure the target to ignore missing fields or fields with invalid data types, replacing the data in the field with an empty value.
The default for each data type is an empty string, which is the default null value in
BigQuery. You can specify a different default value to use for each data type on the Data Defaults
tab. For example, you might define the default value for a missing String field or a String field
with an invalid data type as none or not_applicable.
To configure the target to ignore missing fields and fields with invalid data types, select the Ignore Missing Fields and the Ignore Fields with Invalid Types properties on the Data Defaults tab.
Performance optimization
Use the following tips to optimize for performance and cost-effectiveness when using the Google BigQuery target:
- Increase the batch size
- The maximum batch size is determined by the source in the flow and typically has a default value of 1,000 records. To take advantage of the loading abilities that Google BigQuery provides, increase the maximum batch size in the flow source to 20,000-50,000 records. Be sure to increase the Data Collector java heap size, as needed.
- Use multiple threads
- You can use multiple threads to improve performance when you include a multithreaded source in the flow. When Data Collector resources allow, using multiple threads enables processing multiple batches of data concurrently.
- Enable additional connections to Google BigQuery
- When loading data into multiple tables or datasets, increase the number of connections that the target makes to Google BigQuery. Each additional connection allows the target to load data into an additional table, concurrently.
CRUD operation processing
The Google BigQuery target can insert, update, upsert, or delete data when you configure the target to process CDC data. When processing CDC data, the target merges data into BigQuery tables.
sdc.operation.type record header attribute. The target
performs operations based on the following numeric values:- 1 for INSERT
- 2 for DELETE
- 3 for UPDATE
- 4 for UPSERT
If
your flow has a CRUD-enabled source that processes changed data, the
target simply reads the operation type from the sdc.operation.type
header attribute that the source generates. If your flow has a non-CDC
source, you can use the Expression Evaluator processor or a scripting processor to
define the record header attribute. For more information about Data Collector changed data processing and a list of
CDC-enabled sources, see Processing changed data.
BigQuery data types
The Google BigQuery target converts Data Collector data types into BigQuery data types before writing data to BigQuery tables.
When you configure the target to compensate for data drift, you can also configure the target to create all new columns as String. However, by default, the target converts record data to the appropriate data type.
The target converts the following Data Collector data types into these BigQuery data types:
| Data Collector Data Type | BigQuery Data Type |
|---|---|
| Boolean | Boolean |
| Byte | Bytes |
| Byte Array | Bytes |
| Character | String |
| Date | Date |
| Datetime | Datetime or Timestamp |
| Decimal | Decimal |
| Double | Float |
| Float | Float |
| Integer | Integer |
| List |
Repeated mode column of the specified datatype. Supported for Avro staging files only. For more information, see Staging file formats. |
| List-Map |
Struct or Record Repeated mode column with subfields created from the key-value pair. Supported for Avro staging files only. For more information, see Staging file formats. |
| Long | Integer |
| Map |
Struct or Record Repeated mode column with subfields created from the key-value pair. Supported for Avro staging files only. For more information, see Staging file formats. |
| Short | Integer |
| String | String |
| Time | Time |
Configuring a Google BigQuery target
About this task
Configure a Google BigQuery target to load data to BigQuery tables. Before you use the target in a flow, complete the prerequisite tasks.