HBase Lookup
The HBase Lookup processor performs key-value lookups in HBase and passes the lookup values to fields. For information about supported versions, see Supported Systems and Versions in the Data Collector documentation.
Use the HBase Lookup processor to enrich records with additional data. For example, you can configure the processor to use a department_ID field as the key to look up department name values in HBase, and pass the values to a new department_name output field.
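The enrichment described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the processor's implementation: the DEPARTMENTS dictionary is a hypothetical stand-in for an HBase table, the enrich function and its parameter names are invented for this example, and leaving the record unchanged when the key is not found is an assumption.

```python
# Hypothetical stand-in for an HBase table: row key -> department name.
DEPARTMENTS = {"D10": "Finance", "D20": "Engineering"}


def enrich(record, table, key_field="department_ID",
           output_field="department_name"):
    """Return a copy of the record with the looked-up value added.

    Assumption: when the key is not in the table, the record is
    returned without the output field.
    """
    enriched = dict(record)
    key = record.get(key_field)
    if key in table:
        enriched[output_field] = table[key]
    return enriched


record = {"employee": "A. Smith", "department_ID": "D20"}
print(enrich(record, DEPARTMENTS))
```

The input record is copied rather than modified in place, so the original record stays available for comparison.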
When you configure the HBase Lookup processor, you specify whether the processor performs a bulk lookup of all keys in a batch, or performs an individual lookup of each key in a record. You define the key to look up in HBase, and specify the output field to write the lookup values to.
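To illustrate the difference between the two lookup modes, the following Python sketch counts round trips against a hypothetical in-memory table. The CountingTable class and both lookup functions are invented for this example. How the processor batches requests internally is not described here, so treat this as a picture of the trade-off only.

```python
class CountingTable:
    """Hypothetical in-memory table that counts round trips."""

    def __init__(self, rows):
        self.rows = rows
        self.round_trips = 0

    def get(self, key):
        # One round trip per key.
        self.round_trips += 1
        return self.rows.get(key)

    def get_many(self, keys):
        # One round trip for the whole set of keys.
        self.round_trips += 1
        return {k: self.rows[k] for k in keys if k in self.rows}


def individual_lookup(batch, table, key_field):
    """Look up each record's key with its own request."""
    return [table.get(record[key_field]) for record in batch]


def bulk_lookup(batch, table, key_field):
    """Collect the distinct keys in the batch and look them up together."""
    keys = {record[key_field] for record in batch}
    found = table.get_many(keys)
    return [found.get(record[key_field]) for record in batch]


batch = [{"department_ID": "D10"}, {"department_ID": "D20"},
         {"department_ID": "D10"}]
table = CountingTable({"D10": "Finance", "D20": "Engineering"})
print(bulk_lookup(batch, table, "department_ID"), table.round_trips)
```

In this sketch both modes return the same values. The bulk mode issues one request per batch, while the individual mode issues one request per record.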
You can configure the processor to locally cache the key-value pairs to improve performance.
You also specify the HBase configuration properties, including the ZooKeeper Quorum, parent znode, and table name. When necessary, you can enable Kerberos authentication, specify an HBase user, and add other HBase configuration properties.
Lookup Key
When you define the lookup key, you specify the row and optionally the column and timestamp to look up in HBase.
The following table describes each lookup parameter that you can use to define the lookup key:
Lookup Parameter | Description |
---|---|
Row | The row to look up in HBase. |
Column | The column of the row to use. The column must use the following format: <COLUMNFAMILY>:<QUALIFIER> |
Timestamp | The timestamp associated with the row and column. The timestamp must be a Datetime type. |
The processor returns different values depending on the lookup parameters that you define:
- Row, Column, and Timestamp
- When you define all of the lookup parameters, the HBase Lookup processor returns the value of the specified row, column, and timestamp. The processor passes a single String value to the output field.
- Row and Column
- When you define the row and column lookup parameters, the HBase Lookup processor returns the value of the specified row and column with the most recent timestamp. The processor passes a single String value to the output field.
- Row and Timestamp
- When you define the row and timestamp lookup parameters, the HBase Lookup processor looks up all values of the row in all columns with the specified timestamp. The processor passes a map of String values that contain the HBase column family, qualifier, and value for the specified row.
- Row
- When you define only the row lookup parameter, the HBase Lookup processor looks up all values of the row in all columns with the most recent timestamp. The processor passes a map of String values that contain the HBase column family, qualifier, and value for the specified row.
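The four parameter combinations can be sketched with a small in-memory model. Everything in this example is hypothetical: the nested cells dictionary (row, then column, then timestamp) stands in for an HBase table, the lookup function is invented, and the column names assume a family:qualifier style. The sketch also assumes that "most recent timestamp" means the latest version of each column.

```python
from datetime import datetime


def lookup(cells, row, column=None, timestamp=None):
    """Return a single value or a map, depending on the parameters given.

    cells: {row: {column: {timestamp: value}}} (hypothetical model).
    """
    columns = cells.get(row, {})
    if column is not None:
        versions = columns.get(column, {})
        if timestamp is not None:
            # Row, column, and timestamp: one exact value.
            return versions.get(timestamp)
        # Row and column: the value with the most recent timestamp.
        return versions[max(versions)] if versions else None
    result = {}
    for name, versions in columns.items():
        if timestamp is not None:
            # Row and timestamp: every column that has this timestamp.
            if timestamp in versions:
                result[name] = versions[timestamp]
        elif versions:
            # Row only: the most recent value of every column.
            result[name] = versions[max(versions)]
    return result


t1 = datetime(2024, 1, 1)
t2 = datetime(2024, 6, 1)
cells = {"row1": {"cf:name": {t1: "old", t2: "new"},
                  "cf:city": {t1: "Oslo"}}}
print(lookup(cells, "row1", "cf:name"))
print(lookup(cells, "row1", timestamp=t1))
```

Defining a column yields a single value, and omitting it yields a map keyed by column, which mirrors the single String and map outputs described above.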
Lookup Cache
To improve pipeline performance, you can configure the HBase Lookup processor to locally cache the key-value pairs returned from HBase.
The processor caches key-value pairs until the cache reaches the maximum size or the expiration time. When the first limit is reached, the processor evicts key-value pairs from the cache.
- Size-based eviction
- Configure the maximum number of key-value pairs that the processor caches. When the maximum number is reached, the processor evicts the oldest key-value pairs from the cache.
- Time-based eviction
- Configure the amount of time that a key-value pair can remain in the cache without being written to or accessed. When the expiration time is reached, the processor evicts the key-value pair from the cache. The eviction policy determines whether the processor measures the expiration time since the last write of the value or since the last access of the value.
When you stop the pipeline, the processor clears the cache.
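The caching behavior described in this section can be sketched as a small Python class. This is an illustrative sketch, not the processor's implementation: the LookupCache class, the policy names expire_after_write and expire_after_access, and the injectable clock are all assumptions made for this example. Expired entries are removed lazily on the next read, which is a simplification.

```python
import time
from collections import OrderedDict


class LookupCache:
    """Hypothetical cache with size-based and time-based eviction."""

    def __init__(self, max_size, expiration_seconds,
                 policy="expire_after_write", clock=time.monotonic):
        if policy not in ("expire_after_write", "expire_after_access"):
            raise ValueError("unknown eviction policy: " + policy)
        self.max_size = max_size
        self.expiration_seconds = expiration_seconds
        self.policy = policy
        self._clock = clock
        self._entries = OrderedDict()  # key -> (value, reference time)

    def put(self, key, value):
        # Re-inserting moves the key to the newest position.
        self._entries.pop(key, None)
        self._entries[key] = (value, self._clock())
        # Size-based eviction: drop the oldest entries first.
        while len(self._entries) > self.max_size:
            self._entries.popitem(last=False)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        value, stamp = entry
        now = self._clock()
        # Time-based eviction: drop the entry once it has expired.
        if now - stamp >= self.expiration_seconds:
            del self._entries[key]
            return None
        if self.policy == "expire_after_access":
            # Each access restarts the expiration timer.
            self._entries[key] = (value, now)
        return value

    def clear(self):
        # Corresponds to clearing the cache when the pipeline stops.
        self._entries.clear()


cache = LookupCache(max_size=2, expiration_seconds=60)
cache.put("D10", "Finance")
print(cache.get("D10"))
```

Passing the clock as a parameter makes the expiration logic easy to exercise without waiting for real time to pass.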
Kerberos Authentication
You can use Kerberos authentication to connect to HBase. By default, Data Collector connects as the user account that started it. When you use Kerberos authentication, Data Collector instead uses the Kerberos principal and keytab to connect to HBase.
The Kerberos principal and keytab are defined in the Data Collector configuration file, $SDC_CONF/sdc.properties. To use Kerberos authentication, configure all Kerberos properties in the Data Collector configuration file.
For more information about enabling Kerberos authentication for Data Collector, see Kerberos Authentication in the Data Collector documentation.
Using an HBase User
Data Collector can either use the currently logged in Data Collector user or a user configured in the processor to look up data in HBase.
You can set a Data Collector configuration property that requires using the currently logged in Data Collector user. When this property is not set, you can specify a user in the processor. For more information about Hadoop impersonation and the Data Collector property, see Hadoop Impersonation Mode in the Data Collector documentation.
Note that the processor uses a different user account to connect to HBase. By default, Data Collector uses the user account that started it to connect to external systems. When using Kerberos, Data Collector uses the Kerberos principal.
To look up data as an HBase user specified in the processor, perform the following tasks:
- On HBase, configure the connecting user as a proxy user and authorize it to impersonate the HBase user. For more information, see the HBase documentation.
- In the HBase Lookup processor, enter the HBase user name.
HBase Properties and Configuration File
You can configure the HBase Lookup processor to use individual HBase properties or an HBase configuration file:
- HBase configuration file
- You can use the following HBase configuration file:
- hbase-site.xml
- Individual properties
- You can configure individual HBase properties in the HBase Lookup processor. To add an HBase property, you specify the exact property name and the value. The HBase Lookup processor does not validate the property names or values.
Note: Individual properties override properties defined in the HBase configuration file.
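The override rule can be sketched as a simple merge. The following Python example is illustrative only: the function names, the host name, and the property values are made up, and the way the processor actually loads hbase-site.xml is not described here. The sketch assumes the usual Hadoop-style layout of property elements with name and value children.

```python
import xml.etree.ElementTree as ET

# Hypothetical hbase-site.xml content with made-up values.
SITE_XML = """<configuration>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>zk1.example.com</value>
  </property>
  <property>
    <name>hbase.client.retries.number</name>
    <value>10</value>
  </property>
</configuration>"""


def read_site_properties(xml_text):
    """Parse name/value pairs from Hadoop-style configuration XML."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        if name is not None:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


def effective_properties(site_xml, individual):
    """Merge both sources; individual properties take precedence."""
    merged = read_site_properties(site_xml)
    merged.update(individual)  # No validation of names or values.
    return merged


print(effective_properties(SITE_XML, {"hbase.client.retries.number": "3"}))
```

Because names are not validated, a misspelled property name in this sketch is added as a new entry rather than overriding the intended one.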
Configuring an HBase Lookup Processor
Configure an HBase Lookup processor to perform key-value lookups in HBase.