Oracle JDBC Table
The Oracle JDBC Table origin reads data from one or more Oracle tables. The origin can read all of the columns from the tables or only specified columns from the tables. Pipelines can contain only one origin configured to read from multiple tables. To read from one or more tables using a custom query, use the JDBC Query origin.
When you configure the Oracle JDBC Table origin, you specify database connection information and any additional JDBC configuration properties you want to use. You can also use a connection to configure the origin.
You configure the tables to read, and optionally define the columns to read from the tables. You specify the offset column and the maximum number of partitions used to read from a database table. The data type of the offset column can limit the number of partitions that the origin can use. You can also configure an additional predicate for the query.
With the origin configured to read from multiple tables, the pipeline processes multiple batches. In batch mode, the pipeline sequentially processes one batch for each table that the origin is configured to read, and then stops. In streaming mode, the pipeline also sequentially processes one batch for each table, but then waits a specified amount of time before repeating the process, starting from the first table once again.
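The difference between the two modes can be sketched as a simple loop. This is only an illustration of the ordering described above, not the way Transformer schedules batches. The names `run_pipeline`, `process_batch`, `wait_seconds`, and `max_cycles` are hypothetical, and `max_cycles` exists only so that the streaming example terminates.

```python
import time
from typing import Callable, Iterable, List


def run_pipeline(
    tables: Iterable[str],
    process_batch: Callable[[str], None],
    streaming: bool = False,
    wait_seconds: float = 0.0,
    max_cycles: int = 1,
) -> None:
    """Process one batch per table, in the order the tables are listed.

    Batch mode: run through the tables once, then stop.
    Streaming mode: wait, then start again from the first table.
    max_cycles is a hypothetical limit so that this sketch ends.
    """
    table_list = list(tables)
    cycle = 0
    while True:
        for table in table_list:
            process_batch(table)
        cycle += 1
        if not streaming or cycle >= max_cycles:
            break
        time.sleep(wait_seconds)


seen: List[str] = []
run_pipeline(["ORDERS", "CUSTOMERS"], seen.append, streaming=True, max_cycles=2)
print(seen)  # ['ORDERS', 'CUSTOMERS', 'ORDERS', 'CUSTOMERS']
```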
When you configure the origin to read from exactly one table, you can configure the origin to load data only once. With this configuration, the origin only reads data during the first batch of a pipeline run and caches that data for reuse throughout the pipeline run. When you do not configure the origin to load data only once, you can configure the origin to cache data. With this configuration, the origin caches the data from each batch to pass that data efficiently to multiple downstream stages as the pipeline processes the batch. You can also configure the origin to skip tracking offsets.
You can optionally configure advanced properties such as specifying the fetch size, custom offset queries, and the JDBC driver to bundle with the pipeline.
Pipelines with the Oracle JDBC Table origin contain batch headers. In the header, the origin sets the jdbc.table attribute, which stores the name of the table that the origin reads for the batch. When the origin reads from multiple tables, you can use the attribute to determine the origin's data source for a batch.
Before you use the Oracle JDBC Table origin, install a JDBC driver.
This origin is tested with Oracle 11g and the Oracle 19.3.0 JDBC driver.
Installing the Oracle JDBC Driver
Before using the Oracle JDBC Table origin, you must install an Oracle JDBC driver. Install the driver as an external library for the JDBC stage library.
If you install a driver provided by Oracle, the origin automatically detects the JDBC driver class name from the configured JDBC connection string. If you install a third-party driver, you must specify the driver class name on the Advanced tab of the origin.
By default, Transformer bundles a JDBC driver into the launched Spark application so that the driver is available on each node in the cluster. If you prefer to manually install an appropriate JDBC driver on each Spark node, you can configure the stage to skip bundling the driver on the Advanced tab of the stage properties.
Offset Column
Unless you configure the origin to skip offset tracking, the Oracle JDBC Table origin tracks an offset for each table that the origin reads.
By default, the origin uses the primary key column for each table as the offset column. However, if any table has a composite key or the data type in a primary key column is not a supported offset data type, you must specify the offset column.
As an alternative to the default, you can configure one offset column for the origin. You must specify a column that exists in each table that the origin reads. The offset column should contain unique, incremental values and should not contain null values. The offset column must also be a supported offset data type.
The origin uses the offset column to perform two tasks:
- Track processing
- The origin tracks processing using values in the offset column. When reading the last row for a batch, the origin saves the value from the offset column. In the subsequent batch, the origin starts reading from the following row.
- Create partitions
- When tracking offsets and creating partitions, the origin determines the data to be processed and then divides the data into partitions based on ranges of offset values.
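As a rough illustration of dividing data into partitions by ranges of offset values, the sketch below splits an inclusive offset range into contiguous sub-ranges. It assumes integer offsets and an even split. The function name `split_offset_range` and the sample offsets are hypothetical, and the actual boundaries that the origin computes may differ.

```python
from typing import List, Tuple


def split_offset_range(
    low: int, high: int, max_partitions: int
) -> List[Tuple[int, int]]:
    """Split the inclusive range [low, high] into at most max_partitions
    contiguous, non-overlapping inclusive ranges of near-equal size."""
    if max_partitions < 1:
        raise ValueError("max_partitions must be at least 1")
    if high < low:
        return []  # nothing new to read
    total = high - low + 1
    parts = min(max_partitions, total)
    base, extra = divmod(total, parts)
    ranges = []
    start = low
    for i in range(parts):
        # The first `extra` partitions take one additional offset value.
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size - 1))
        start += size
    return ranges


# Hypothetical values: the last batch ended at offset 100 and the
# table now contains offsets up to 110.
last_saved_offset = 100
current_max_offset = 110
print(split_offset_range(last_saved_offset + 1, current_max_offset, 3))
# [(101, 104), (105, 107), (108, 110)]
```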
Supported Offset Data Types
The supported data types for an offset column differ based on the number of partitions that you want the origin to use when reading the data.
Partitions | Supported Offset Data Type |
---|---|
One partition | |
One or more partitions | |
Null Offset Value Handling
By default, the Oracle JDBC Table origin does not process records with null offset values. You can configure the origin to process those records by enabling the Partition for NULL property on the Advanced tab.
When you enable the Partition for NULL property, the origin queries the table for rows with null offset values, then groups the resulting records into a single partition. As a result, when the table includes null offset values, each batch of data contains a partition of records with null offset values.
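One way to picture the effect of this property is as an extra predicate alongside the range predicates for the regular partitions. The sketch below is an assumption about the general shape of such predicates, not the SQL that the origin generates. The function `partition_predicates` and the column name `ID` are hypothetical.

```python
from typing import List, Tuple


def partition_predicates(
    offset_column: str,
    ranges: List[Tuple[int, int]],
    partition_for_null: bool = False,
) -> List[str]:
    """Build one WHERE-clause predicate per partition.

    Each inclusive offset range becomes one predicate. When
    partition_for_null is True, a final predicate collects all rows
    whose offset value is null into a single extra partition.
    """
    predicates = [
        f"{offset_column} >= {low} AND {offset_column} <= {high}"
        for low, high in ranges
    ]
    if partition_for_null:
        predicates.append(f"{offset_column} IS NULL")
    return predicates


for predicate in partition_predicates("ID", [(1, 50), (51, 100)], True):
    print(predicate)
# ID >= 1 AND ID <= 50
# ID >= 51 AND ID <= 100
# ID IS NULL
```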
Default Offset Queries
The Oracle JDBC Table origin uses two offset queries to determine the offset values to use when querying the database. The default queries work for most cases. On the rare occasion when you want to produce different results, you can configure custom offset queries to override the default queries.
- Min/max offset query
- This query returns the minimum and maximum values in the offset column. The origin uses these values to read all existing data in a table.
- Max offset query
- This query returns the maximum offset in the offset column. The origin uses this value along with the last-saved offset to read new data that arrived since processing the last batch for a table.
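The following sketch shows how two queries of this kind could drive an initial read and a later incremental read. It uses an in-memory SQLite database as a stand-in for Oracle, and the table `orders`, the column `id`, and the SQL text are all assumptions made for illustration. The queries that the origin actually issues are not shown in this documentation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)", [(1, "a"), (2, "b"), (3, "c")]
)

# Min/max offset query: bounds for reading all existing data.
min_offset, max_offset = conn.execute(
    "SELECT MIN(id), MAX(id) FROM orders"
).fetchone()
first_batch = conn.execute(
    "SELECT id FROM orders WHERE id BETWEEN ? AND ? ORDER BY id",
    (min_offset, max_offset),
).fetchall()
last_saved_offset = first_batch[-1][0]  # offset of the last row read

# New rows arrive after the first batch.
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(4, "d"), (5, "e")])

# Max offset query: upper bound for reading only the new data.
(new_max_offset,) = conn.execute("SELECT MAX(id) FROM orders").fetchone()
second_batch = conn.execute(
    "SELECT id FROM orders WHERE id > ? AND id <= ? ORDER BY id",
    (last_saved_offset, new_max_offset),
).fetchall()

print(first_batch)   # [(1,), (2,), (3,)]
print(second_batch)  # [(4,), (5,)]
```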
Custom Offset Queries
Configure custom offset queries to override the default offset queries for the Oracle JDBC Table origin.
- Custom Min/Max Query
- Returns the minimum and maximum values to use as offsets when querying the database. Configure this query to override the default min/max query that determines the first set of records that the origin reads for each table.
- Custom Max Query
- Returns a maximum value to use as an offset when querying the database. Configure this query to override the default max query that the origin uses to read subsequent sets of records from each table.
Specify a custom max query along with the custom min/max query to define a range of data for the pipeline to process, such as data generated in 2019.
For example, say you want to process only the data with offsets 1000 to 8000, inclusive, and you want the first batch to process one thousand records. To do this, you configure the custom min/max query to return 1000 and 2000. This sets the lower boundary of the data that the origin reads and defines the number of records included in the first batch. To set the upper boundary of the data that the origin reads, you configure the custom max query to return 8000.
In the first batch, the origin reads records with offsets between 1000 and 2000, inclusive. In the second batch, the origin reads any new records with offsets between 2001 and 8000, inclusive. Now, say the last record in the second batch has an offset value of 2500. Then, in the third batch, the origin reads any new records with offsets between 2501 and 8000, and so on.
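The arithmetic in this example can be checked with a small sketch. It assumes that each later batch starts one past the offset of the last record read in the previous batch, which is how the example above behaves. The function `batch_ranges` and its parameter names are hypothetical.

```python
from typing import List, Tuple


def batch_ranges(
    min_max_result: Tuple[int, int],
    max_result: int,
    last_read_offsets: List[int],
) -> List[Tuple[int, int]]:
    """Return the inclusive offset range read by each batch.

    min_max_result: values returned by the custom min/max query,
        used as the range for the first batch.
    max_result: value returned by the custom max query, used as the
        upper bound for every later batch.
    last_read_offsets: offset of the last record read in each
        completed batch, in order.
    """
    ranges = [min_max_result]
    for last_offset in last_read_offsets:
        ranges.append((last_offset + 1, max_result))
    return ranges


# First batch ends at offset 2000, second batch ends at offset 2500.
print(batch_ranges((1000, 2000), 8000, [2000, 2500]))
# [(1000, 2000), (2001, 8000), (2501, 8000)]
```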
Partitioning
Spark runs a Transformer pipeline just as it runs any other application, splitting the data into partitions and performing operations on the partitions in parallel. When the pipeline starts processing a new batch, Spark determines how to split pipeline data into initial partitions based on the origins in the pipeline.
For the Oracle JDBC Table origin, Spark determines the number of partitions from the maximum number of partitions and the table's partition column. Spark splits the data from the table into the partitions and creates one connection to the database for each partition.
Spark uses these partitions while the pipeline processes the batch unless a processor causes Spark to shuffle the data. To change the partitioning in the pipeline, use the Repartition processor.
Consider the following when defining the maximum number of partitions to use:
- The size and configuration of the cluster.
- The amount of data being processed.
- The number of concurrent connections that can be made to the database.
If the pipeline fails because the origin encounters an out-of-memory error, you likely need to increase the number of partitions for the origin.
Partition Column Selection
By default, the Oracle JDBC Table origin uses the offset column for the table as a partition column to improve read performance.
- When configured, the origin uses the field specified in the Offset Column property to create partitions.
- When no offset column is specified and the table includes a single-column primary key, the origin uses the primary key column for partitioning.
- When no offset column is specified and the table has a compound primary key instead of a single-column primary key, the origin uses the first key column for partitioning.
- When no offset column is specified and the table has no primary key columns, the origin uses the first datetime or numeric column that is indexed for partitioning.
When the origin cannot locate a partition column, it reads the entire table in a single partition.
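The selection rules above can be read as a fallback chain, sketched below. The `Column` and `Table` classes, the function `choose_partition_column`, and the set of type names treated as datetime or numeric are all assumptions made for this illustration. They do not reflect the origin's internal metadata model.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Assumed set of type names treated as datetime or numeric.
DATETIME_OR_NUMERIC = {"DATE", "TIMESTAMP", "NUMBER"}


@dataclass
class Column:
    name: str
    type_name: str
    indexed: bool = False


@dataclass
class Table:
    columns: List[Column]
    primary_key: List[str] = field(default_factory=list)


def choose_partition_column(
    table: Table, offset_column: Optional[str] = None
) -> Optional[str]:
    """Pick a partition column by following the documented order."""
    # 1. A configured offset column always wins.
    if offset_column:
        return offset_column
    # 2. and 3. Single-column key, or first column of a compound key.
    if table.primary_key:
        return table.primary_key[0]
    # 4. First indexed datetime or numeric column.
    for column in table.columns:
        if column.indexed and column.type_name.upper() in DATETIME_OR_NUMERIC:
            return column.name
    # No candidate: the table is read in a single partition.
    return None


events = Table(
    columns=[
        Column("NAME", "VARCHAR2", indexed=True),
        Column("CREATED", "DATE", indexed=True),
    ]
)
print(choose_partition_column(events))  # CREATED
```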
Oracle Data Types
The following table lists the Oracle data types that the Oracle JDBC Table origin supports and the Transformer data types they are converted to.
Oracle data types not listed in the table are not supported.
Oracle Data Type | Transformer Data Type |
---|---|
Binary_Float | Float |
Binary_Double | Double |
Blob | Binary |
Char, Clob, Long, Nchar, Nclob, Nvarchar2, Varchar, Varchar2 | String |
Date, Timestamp, Timestamp with Timezone | Timestamp with time zone converted |
Number | Decimal |
Configuring an Oracle JDBC Table Origin
Configure an Oracle JDBC Table origin to read data from one or more Oracle database tables. Before you use the origin, install a JDBC driver.