JDBC Lookup
The JDBC Lookup processor uses a JDBC connection to perform lookups in a database table and pass the lookup values to fields. For information about supported versions, see Supported Systems and Versions in the Data Collector documentation.
Use the JDBC Lookup to enrich records with additional data. For example, you can configure the processor to use a department_ID field as the column to look up department name values in a database table, and pass the values to a new department_name output field.
When a lookup results in multiple matches, the JDBC Lookup processor can return the first matching value, all matching values in a list in a single record, or all matching values in separate records.
When you configure JDBC Lookup, you specify connection information and custom JDBC configuration properties to determine how the processor connects to the database. You configure the SQL query to define the data to look up in the database, specify the output fields to write the lookup values to, and choose the multiple match behavior.
You can specify the behavior for when the lookup returns no values and optionally configure a default value for the same situation. You can also specify the behavior for when the processor encounters data of an unexpected type.
You can configure the processor to locally cache the lookup values to improve performance. When caching lookup values, you can also enable using additional threads to prepopulate the lookup cache and further increase performance.
By default, the JDBC Lookup processor requires all table columns defined in the column mappings to exist when the pipeline starts. You can configure the processor to skip this validation.
The processor generates JDBC field attributes that provide additional information about each field.
To use a JDBC version older than 4.0, you can specify the driver class name and define a health check query.
You can also use a connection to configure the processor.
Database Vendors and Drivers
The JDBC Lookup processor can perform lookups of database data from multiple database vendors.
Database Vendor | Supported Versions | Tested Versions |
---|---|---|
MySQL | MySQL 5.7 and later |
|
Oracle |
Also supported:
|
|
PostgreSQL | PostgreSQL 9.x and later |
|
Microsoft SQL Server |
|
|
MySQL Data Types
The JDBC Lookup processor converts MySQL data types into Data Collector data types.
MySQL Data Type | Data Collector Data Type |
---|---|
Bigint | Long |
Bigint Unsigned | Decimal |
Binary | Byte Array |
Blob | Byte Array |
Char | String |
Date | Date |
Datetime | Datetime |
Decimal | Decimal |
Double | Double |
Enum | String |
Float | Float |
Int | Integer |
Int Unsigned | Long |
Json | String |
Linestring | Byte Array |
Medium Int | Integer |
Medium Int Unsigned | Long |
Numeric | Decimal |
Point | Byte Array |
Polygon | Byte Array |
Set | String |
Smallint | Short |
Smallint Unsigned | Integer |
Text | String |
Time | Time |
Timestamp | Datetime |
Tinyint, Tinyint Unsigned | Short |
Varbinary | Byte Array |
Varchar | String |
Year | Date |
Oracle Data Types
The JDBC Lookup processor converts Oracle data types into Data Collector data types.
Oracle Data Type | Data Collector Data Type |
---|---|
Number | Decimal |
Char | String |
Varchar, Varchar2 | String |
Nchar, NvarChar2 | String |
Binary_float | Float |
Binary_double | Double |
Date | Datetime |
Timestamp | Datetime |
Timestamp with time zone | Zoned_datetime |
Timestamp with local time zone | Zoned_datetime |
Long | String |
Blob | Byte_array |
Clob | String |
Nclob | String |
XMLType | String |
PostgreSQL Data Types
The JDBC Lookup processor converts PostgreSQL data types into Data Collector data types.
PostgreSQL Data Type | Data Collector Data Type |
---|---|
Bigint | Long |
Boolean | Boolean |
Bytea | Byte Array |
Char | String |
Date | Date |
Decimal | Decimal |
Double Precision | Double |
Enum | String |
Integer | Integer |
Money | Double |
Numeric | Decimal |
Real | Float |
Smallint | Short |
Text | String |
Time, Time with Time Zone | Time |
Timestamp, Timestamp with Time Zone | Time |
Varchar | String |
SQL Server Data Types
The JDBC Lookup processor converts SQL Server data types into Data Collector data types.
SQL Server Data Type | Data Collector Data Type |
---|---|
Bigint | Long |
Binary | Byte_Array |
Bit | Boolean |
Char | String |
Date | Date |
Datetime, Datetime2 | Datetime |
Datetimeoffset | Zoned_datetime |
Decimal | Decimal |
Float | Double |
Image | Byte_Array |
Int | Integer |
Money | Decimal |
Nchar | String |
Ntext | String |
Numeric | Decimal |
Nvarchar | String |
Real | Float |
Smalldatetime | Datetime |
Smallint | Short |
Smallmoney | Decimal |
Text | String |
Time | Time |
Tinyint | Short |
Varbinary | Byte_Array |
Varchar | String |
XML | String |
Unsupported Data Types
- Stops the pipeline
- If the stage encounters an unsupported data type, the stage stops the
pipeline after completing the processing of the previous records and
displays the following
error:
JDBC_37 - Unsupported type 1111 for column.
- Converts to string
- If the stage encounters an unsupported data type, the stage converts the data to string when possible, and then continues processing. Not all unsupported data types can successfully be converted to string. When using this option, verify that the data is converted to string as expected.
Installing the JDBC Driver
You install the driver into the JDBC stage library, streamsets-datacollector-jdbc-lib
, which includes the processor.
To use the JDBC driver
with multiple stage libraries, install the driver into each stage library associated
with the stages.
For example, if you want to use a MySQL JDBC driver with the JDBC Lookup processor
and with the MySQL Binary Log origin, you install the driver as an external library
for the JDBC stage library, streamsets-datacollector-jdbc-lib
, and for the MySQL Binary Log stage library, streamsets-datacollector-mysql-binlog-lib
.
For information about installing additional drivers, see Install External Libraries in the Data Collector documentation.
Lookup Cache
To improve pipeline performance, you can configure the JDBC Lookup processor to locally cache the values returned from a database table.
The processor caches values until the cache reaches the maximum size or the expiration time. When the first limit is reached, the processor evicts values from the cache.
- Size-based eviction
- Configure the maximum number of values that the processor caches. When the maximum number is reached, the processor evicts the oldest values from the cache.
- Time-based eviction
- Configure the amount of time that a value can remain in the cache without being written to or accessed. When the expiration time is reached, the processor evicts the value from the cache. The eviction policy determines whether the processor measures the expiration time since the last write of the value or since the last access of the value.
When you stop the pipeline, the processor clears the cache.
Using Additional Threads
When using local caching, you can increase the number of threads that the JDBC Lookup processor uses to prepopulate the lookup cache. After the cache is populated, the additional threads are released. This can substantially increase the performance of the processor.
By default, the Minimum Idle Connections property, on the Advanced tab, determines the minimum number of connections to the database that Data Collector creates and maintains.
When you enable using a local lookup cache, the Minimum Idle Connections property can also determine the number of available cores on the Data Collector machine that the processor uses for threads to prepopulate the cache.
- The configured Minimum Idle Connections property.
- The number of available cores on the Data Collector machine, minus one. The processor never uses all available cores.
By increasing the setting for the Minimum Idle Connections property, you can enable the processor to use almost all available cores on the Data Collector machine for additional threads to prepopulate the lookup cache.
For example, say you have 8 available cores on the Data Collector machine when you start a pipeline, and the JDBC Lookup has local caching enabled and the Minimum Idle Connections property set to 8. Then, the JDBC Lookup can use 7 of the available cores for threads to prepopulate the lookup cache. This might be ideal if you have complex lookup processing that you want to accomplish as quickly as possible and do not need to reserve resources for other processing.
To reserve machine resources for other processing, you can limit the cores that are used by the processor by setting the Minimum Idle Connections property to a lower number. For example, if you set Minimum Idle Connections to 5, then the processor can use up to 4 available cores for threads.
Retry Lookups for Missing Values
When you enable local caching, the processor also caches the configured default value when a lookup for a given column fails and a default value is defined for that column in the Column Mappings. The processor then always returns the default value for the column to avoid unnecessary lookups.
You can configure the processor to retry lookups for known missing values by enabling the Retry on Missing Value property. Configure the processor to retry lookups when the lookup table might be updated as the pipeline runs.
For example, if you expect that new values will be inserted in the table as the pipeline runs, you’d want to configure the processor to retry the request rather than returning the cached default value.
JDBC Field Attributes
The JDBC Lookup processor generates field attributes for columns converted to the Decimal or Datetime data types in Data Collector. The attributes provide additional information about each field.
- Decimal and Numeric data types are converted to the Data Collector Decimal data type, which does not store scale and precision.
- The Timestamp data type is converted to the Data Collector Datetime data type, which does not store nanoseconds.
Data Collector Data Type | Generated Field Attribute | Description |
---|---|---|
Decimal | precision | Provides the original precision for every decimal or numeric column. |
Decimal | scale | Provides the original scale for every decimal or numeric column. |
Datetime | nanoSeconds | Provides the original nanoseconds for every timestamp column. |
You can use the record:fieldAttribute
or
record:fieldAttributeOrDefault
functions to access the information
in the attributes. For more information about working with field attributes, see Field Attributes.
Configuring a JDBC Lookup Processor
Configure a JDBC Lookup processor to perform lookups in a database table.