Kudu Lookup
The Kudu Lookup processor performs lookups in a Kudu table and passes the lookup values to fields. For information about supported versions, see Supported Systems and Versions in the Data Collector documentation.
Use the Kudu Lookup processor to enrich records with additional data. For example, you can configure the processor to map a department_ID field to the primary key column of a Kudu table, look up the matching department name values, and pass the values to a new department_name output field.
When you configure Kudu Lookup, you specify the connection information for one or more Kudu primary nodes and define the table to use. You define the key columns to look up and define the output fields to write the lookup values to. You can also enable Kerberos authentication.
When a lookup results in multiple matches, the Kudu Lookup processor can return the first matching value or return all matching values in separate records.
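The two behaviors for multiple matches can be sketched in Python as follows. This is an illustrative sketch only, not the processor's implementation: the function name `apply_lookup` and the behavior labels `"first"` and `"split"` are hypothetical, and passing an unmatched record through unchanged is an assumption made for the example.

```python
from typing import Any


def apply_lookup(record: dict[str, Any],
                 matches: list[dict[str, Any]],
                 behavior: str) -> list[dict[str, Any]]:
    """Enrich `record` with lookup results (hypothetical helper).

    behavior "first": keep only the first matching row, one record out.
    behavior "split": emit one enriched record per matching row.
    """
    if not matches:
        # Assumption for this sketch: no match leaves the record unchanged.
        return [dict(record)]
    if behavior == "first":
        return [{**record, **matches[0]}]
    if behavior == "split":
        return [{**record, **row} for row in matches]
    raise ValueError(f"unknown behavior: {behavior!r}")


record = {"department_ID": 7}
matches = [{"department_name": "Sales"}, {"department_name": "Sales EMEA"}]
first_only = apply_lookup(record, matches, "first")
all_matches = apply_lookup(record, matches, "split")
```

With "first", one record comes out regardless of how many rows match. With "split", the number of output records equals the number of matching rows.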
To improve pipeline performance, you can configure the Kudu Lookup processor to locally cache the Kudu table information and the lookup values returned from a Kudu table.
You can also configure operation timeouts and the maximum number of worker threads to use.
You can also use a connection to configure the processor.
Column Mappings
When you configure the Kudu Lookup processor, you define the following column mappings:
- Key Columns Mapping: Define the incoming fields in the record that map to the primary key column or columns in the Kudu table. The primary key for a Kudu table can be either a simple key consisting of a single column or a compound key consisting of multiple columns. Click the Add icon to add multiple columns for a compound primary key.
- Columns to Output Fields Mapping: Define the columns to look up and the fields in the record to map the column values to. You can optionally define a default value to use when the lookup does not return a value for the field.
For example, the following image shows a Kudu Lookup processor that looks up values in a clients table that has a compound primary key consisting of id and name. The processor maps the incoming client_id and client_name record fields to the primary keys in the table. The processor returns the values of the address and start_year columns, and passes the values to the new client_address and client_start_year output fields in the record:
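The same mapping can be approximated in a few lines of Python. This is a minimal sketch that uses an in-memory dictionary in place of a real Kudu table. The sample row, the default value `"unknown"`, and the names `CLIENTS`, `KEY_COLUMNS`, `OUTPUT_FIELDS`, and `lookup` are all hypothetical.

```python
# Stand-in for the Kudu "clients" table, keyed by the compound key (id, name).
CLIENTS = {
    (1, "Acme"): {"address": "12 Main St", "start_year": 2015},
}

# Key Columns Mapping: incoming record field -> primary key column.
KEY_COLUMNS = {"client_id": "id", "client_name": "name"}

# Columns to Output Fields Mapping: column -> (output field, default value).
OUTPUT_FIELDS = {
    "address": ("client_address", "unknown"),
    "start_year": ("client_start_year", None),
}


def lookup(record: dict) -> dict:
    """Return a copy of `record` enriched with the mapped column values."""
    # Build the compound key in (id, name) order from the record fields.
    key = tuple(record[field] for field in KEY_COLUMNS)
    row = CLIENTS.get(key, {})
    enriched = dict(record)
    for column, (field, default) in OUTPUT_FIELDS.items():
        # Fall back to the default when the lookup returns no value.
        enriched[field] = row.get(column, default)
    return enriched


found = lookup({"client_id": 1, "client_name": "Acme"})
missing = lookup({"client_id": 2, "client_name": "Zenith"})
```

The second call has no matching row, so the output fields receive the default values defined in the mapping.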
Kudu Data Types
The Kudu Lookup processor converts Kudu data types to the following compatible Data Collector data types:
Kudu Data Type | Data Collector Data Type |
---|---|
Binary | Byte Array |
Bool | Boolean |
Decimal | Decimal |
Double | Double |
Float | Float |
Int8 | Byte |
Int16 | Short |
Int32 | Integer |
Int64 | Long |
String | String |
Unixtime_micros | Datetime. The Kudu Unixtime_micros data type stores microsecond values. When converting to the Data Collector Datetime data type, the processor divides the field value by 1,000 to convert the value to milliseconds, and then converts the value to Datetime. |
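The Unixtime_micros conversion described in the table can be illustrated with a short Python sketch. The function name is hypothetical, and the use of integer division and the UTC time zone are assumptions made for the example. The point is that sub-millisecond precision is lost in the conversion.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def unixtime_micros_to_datetime(micros: int) -> datetime:
    """Convert microseconds since the epoch to a millisecond-precision datetime."""
    millis = micros // 1000  # divide by 1,000; the microsecond remainder is dropped
    return EPOCH + timedelta(milliseconds=millis)


converted = unixtime_micros_to_datetime(1_500_000_000_123_456)
# The trailing 456 microseconds do not survive the conversion.
```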
Lookup Cache
To improve pipeline performance, you can configure the Kudu Lookup processor to locally cache Kudu table information and the lookup values returned from a Kudu table.
When you stop the pipeline, the processor clears both caches.
Cache Table Information
By default, the Kudu Lookup processor locally caches information about each Kudu table to look up, including the table name and schema.
You can configure the maximum number of tables that the processor caches information for. When the maximum number is reached, the processor evicts the oldest values from the cache.
Disable the caching of table information only when you expect the Kudu table schemas to change frequently. In this situation, you want the processor to fetch the updated schemas from Kudu, rather than use an outdated schema in the cache.
To configure the maximum number of tables that can be cached, configure the Maximum Table Entries to Cache property. To disable caching Kudu tables, clear the Enable Table Caching property. Both properties are on the Lookup tab of the processor.
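To show how a bounded table-information cache behaves, the following Python sketch evicts the oldest entry once the maximum is exceeded. It is not the processor's implementation: the class `TableInfoCache` and the `fetch_schema` callback are hypothetical, and "oldest" is assumed here to mean the entry that was cached first.

```python
from collections import OrderedDict
from typing import Callable


class TableInfoCache:
    """Bounded cache of table name -> schema (illustrative sketch)."""

    def __init__(self, max_entries: int, fetch_schema: Callable[[str], dict]):
        self._max_entries = max_entries
        self._fetch_schema = fetch_schema  # stand-in for a call to Kudu
        self._entries: OrderedDict[str, dict] = OrderedDict()

    def get(self, table_name: str) -> dict:
        if table_name in self._entries:
            return self._entries[table_name]  # cache hit, no fetch
        schema = self._fetch_schema(table_name)
        self._entries[table_name] = schema
        if len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)  # evict the oldest entry
        return schema


fetched = []


def fake_fetch(name: str) -> dict:
    fetched.append(name)
    return {"table": name}


cache = TableInfoCache(max_entries=2, fetch_schema=fake_fetch)
for name in ["a", "b", "a", "c", "a"]:
    cache.get(name)
```

In this run, the second request for "a" is served from the cache, adding "c" evicts "a", and the final request for "a" has to fetch the schema again. A cached schema is never refreshed, which is why disabling the cache makes sense when schemas change frequently.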
Cache Lookup Values
By default, the processor does not cache the lookup values returned from a Kudu table. To improve pipeline performance, you can enable the processor to locally cache the lookup values.
When enabled, the processor caches values until the cache reaches the maximum size or a value reaches the expiration time. The processor evicts values from the cache as soon as either limit is reached, whichever comes first. You can configure the following types of eviction:
- Size-based eviction: Configure the maximum number of values that the processor caches. When the maximum number is reached, the processor evicts the oldest values from the cache.
- Time-based eviction: Configure the amount of time that a value can remain in the cache without being written to or accessed. When the expiration time is reached, the processor evicts the value from the cache. The eviction policy determines whether the processor measures the expiration time since the last write of the value or since the last access of the value.
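The interaction of the two eviction types can be sketched in Python as follows. This is an illustrative model under stated assumptions, not the processor's implementation: the class `LookupValueCache`, the policy labels `"write"` and `"access"`, and the injectable `clock` are hypothetical, and expired values are removed lazily on read.

```python
import time
from collections import OrderedDict


class LookupValueCache:
    """Lookup-value cache with size-based and time-based eviction (sketch)."""

    def __init__(self, max_size, expiry_seconds, policy="write", clock=time.monotonic):
        self._max_size = max_size
        self._expiry = expiry_seconds
        self._policy = policy  # "write": expire after last write; "access": after last access
        self._clock = clock
        self._entries = OrderedDict()  # key -> (value, timestamp)

    def put(self, key, value):
        self._entries.pop(key, None)  # a rewritten key becomes the newest entry
        self._entries[key] = (value, self._clock())
        if len(self._entries) > self._max_size:
            self._entries.popitem(last=False)  # size-based: evict the oldest value

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        value, stamp = entry
        now = self._clock()
        if now - stamp >= self._expiry:
            del self._entries[key]  # time-based: the value has expired
            return None
        if self._policy == "access":
            self._entries[key] = (value, now)  # reading resets the expiration timer
        return value


# A controllable clock makes the expiration behavior easy to follow.
now = [0.0]
by_write = LookupValueCache(2, 10, "write", clock=lambda: now[0])
by_access = LookupValueCache(2, 10, "access", clock=lambda: now[0])
by_write.put("k", 1)
by_access.put("k", 1)
now[0] = 6.0
read_at_6 = (by_write.get("k"), by_access.get("k"))
now[0] = 11.0
read_at_11 = (by_write.get("k"), by_access.get("k"))
```

Both caches return the value at 6 seconds. At 11 seconds the write-based cache has expired the value, while the access-based cache still holds it because the read at 6 seconds reset its timer.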
Kerberos Authentication
You can use Kerberos authentication to connect to a Kudu cluster. When you use Kerberos authentication, Data Collector uses the Kerberos principal and keytab to connect to Kudu. By default, Data Collector uses the user account that started it to connect.
The Kerberos principal and keytab are defined in the Data Collector configuration file, $SDC_CONF/sdc.properties. To use Kerberos authentication, configure all Kerberos properties in the Data Collector configuration file.
For more information about enabling Kerberos authentication for Data Collector, see Kerberos Authentication in the Data Collector documentation.
Configuring a Kudu Lookup Processor
Configure a Kudu Lookup processor to perform lookups in Kudu.