Google Big Query
The Google Big Query origin reads data from a Google BigQuery table or view. Use the origin in Databricks or Dataproc cluster pipelines only. To use the origin in Databricks clusters, you must configure specific Spark properties.
When you configure the origin, you specify the dataset and the name of the table or view. The origin reads the entire table or view by default. You can configure the origin to process only the specified columns. You can also limit the query by defining a filter condition to include in a WHERE clause.
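Conceptually, these properties combine into a single SELECT statement against the table or view. A minimal Python sketch of how the pieces fit together, using hypothetical property names for illustration:

```python
def build_query(dataset, table, columns=None, where=None):
    """Compose the kind of query the origin issues (illustrative only)."""
    cols = ", ".join(columns) if columns else "*"
    query = f"SELECT {cols} FROM `{dataset}.{table}`"
    if where:
        # The filter condition is included in a WHERE clause
        query += f" WHERE {where}"
    return query

# Read only two columns, limited by a filter condition
print(build_query("sales", "orders",
                  columns=["order_id", "total"],
                  where="total > 100"))
# SELECT order_id, total FROM `sales.orders` WHERE total > 100
```

With no columns or filter configured, the sketch falls back to `SELECT * FROM ...`, matching the default behavior of reading the entire table or view.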
You indicate whether the origin runs in incremental mode or full query mode. When running in incremental mode, you define the offset column and the initial offset.
You can specify the number of workers that the origin uses to read from BigQuery.
You can configure the origin to load data only once and cache the data for reuse throughout the pipeline run. Or, you can configure the origin to cache each batch of data so the data can be passed to multiple downstream batches efficiently. You can also configure the origin to skip tracking offsets.
Incremental and Full Query Mode
The Google Big Query origin can run in full query mode or in incremental mode. By default, the origin runs in full query mode.
When the origin runs in full query mode, the origin processes all available data with each batch.
When the origin runs in incremental mode, the origin reads all available data with the first batch of data, starting from the specified initial offset. For subsequent batches, the origin reads data that arrived since the prior batch. If you stop the pipeline, the origin starts processing data from the last-saved offset when you restart the pipeline, unless you reset the pipeline offsets.
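The incremental behavior described above can be sketched as a simple offset-tracking loop. This is an illustrative model, not the actual implementation; the row and column names are hypothetical:

```python
def next_batch(rows, offset_column, last_offset):
    """Return rows that arrived since the prior batch, plus the new saved offset.
    Sketch of incremental mode: each batch reads only data past the offset."""
    batch = [r for r in rows if r[offset_column] > last_offset]
    new_offset = max((r[offset_column] for r in batch), default=last_offset)
    return batch, new_offset

rows = [{"id": 1}, {"id": 2}, {"id": 3}]
# First batch: everything past the initial offset
first, offset = next_batch(rows, "id", last_offset=0)
# Subsequent batch: only data that arrived since the prior batch
later, offset = next_batch(rows + [{"id": 4}], "id", offset)
```

Restarting the pipeline corresponds to resuming from the last-saved `offset`; resetting the pipeline offsets corresponds to starting again from the initial offset.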
Offset Column and Supported Types
When you run the Google Big Query origin in incremental mode, you specify an offset column and initial offset.
The offset column should contain unique, incremental values and should not contain null values; the origin does not process records with null offset values. The offset column can be of one of the following data types:
- Date, Timestamp
- Any supported numeric type, including Numeric and Int64
- String
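The null-value rule above means that records without an offset value are simply skipped. A small sketch of that behavior, with hypothetical data:

```python
def eligible_records(rows, offset_column):
    """The origin does not process records with null offset values (sketch)."""
    return [r for r in rows if r.get(offset_column) is not None]

rows = [{"ts": 100}, {"ts": None}, {"ts": 200}]
print(eligible_records(rows, "ts"))  # only the two rows with non-null offsets
```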
BigQuery Data Types
The following table lists the BigQuery data types that the Google Big Query origin supports and the Spark data types that Transformer converts them to.
| BigQuery Data Type | Spark Data Type |
|---|---|
| Array | Array |
| Bool | Boolean |
| Bytes | Binary |
| Date | Date |
| Datetime | String |
| Float64 | Double |
| Int64 | Long |
| Numeric | Decimal |
| String | String |
| Struct | Struct |
| Time | Long, microseconds since midnight |
| Timestamp | Timestamp |
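The table above is a direct lookup from BigQuery type to Spark type. Restated as a Python dictionary, which can be handy when reasoning about downstream schemas:

```python
# BigQuery -> Spark type mapping, as listed in the table above
BIGQUERY_TO_SPARK = {
    "Array": "Array",
    "Bool": "Boolean",
    "Bytes": "Binary",
    "Date": "Date",
    "Datetime": "String",
    "Float64": "Double",
    "Int64": "Long",
    "Numeric": "Decimal",
    "String": "String",
    "Struct": "Struct",
    "Time": "Long",  # microseconds since midnight
    "Timestamp": "Timestamp",
}

print(BIGQUERY_TO_SPARK["Datetime"])  # Datetime values arrive in Spark as String
```

Note the two conversions that change representation: Datetime becomes a Spark String, and Time becomes a Long counting microseconds since midnight.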
Configuring a Google Big Query Origin
Configure a Google Big Query origin to read data from a BigQuery table or view.