Configuring sparse lookup operations

You can configure an HBase connector stage to perform a sparse (direct) lookup operation on an HBase table.

Before you begin

  • To specify the format of the data records that the HBase connector reads from an HBase table, set up column definitions on a link.
  • Configure the HBase connector as a source for the reference data.

About this task

In a sparse lookup, the connector fetches one record from the HBase table for each record that arrives on the input link to the Lookup stage. The input link column definitions must include exactly one column whose name and data type match those of the primary key column that is defined on the HBase reference link. Because you can choose any name for the primary key column, matching the HBase table row key with the corresponding column on the input link is straightforward. The result of the lookup is routed as one record through the reference link from the HBase connector stage back to the Lookup stage, and from the Lookup stage to its output link. A sparse lookup is also known as a direct lookup because the lookup is performed directly on the data source.
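
The per-record behavior described above can be illustrated with a minimal Python sketch. It does not use the HBase connector or any HBase client library; FakeHBaseTable, its get method, and sparse_lookup are hypothetical names, and an in-memory dictionary stands in for the HBase table.

```python
from typing import Dict, Iterable, Iterator, Optional, Tuple

Row = Dict[str, str]


class FakeHBaseTable:
    """Hypothetical stand-in for an HBase table: row key -> column values."""

    def __init__(self, rows: Dict[str, Row]) -> None:
        self._rows = rows
        self.get_calls = 0  # counts point reads, to show one read per input record

    def get(self, row_key: str) -> Optional[Row]:
        """Return the row stored under row_key, or None if there is no such row."""
        self.get_calls += 1
        return self._rows.get(row_key)


def sparse_lookup(
    input_records: Iterable[Row], table: FakeHBaseTable, key_column: str
) -> Iterator[Tuple[Row, Optional[Row]]]:
    """Yield (input record, reference row or None), reading the table once per record."""
    for record in input_records:
        # The input column named key_column carries the row key to look up.
        yield record, table.get(record[key_column])


if __name__ == "__main__":
    table = FakeHBaseTable({"r1": {"name": "Ada"}})
    for record, match in sparse_lookup([{"id": "r1"}, {"id": "r9"}], table, "id"):
        print(record, match)
    print("point reads:", table.get_calls)
```

The table is never loaded into memory as a whole: each input record triggers a single read by key, which is the trade-off that distinguishes a sparse lookup from an in-memory lookup.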

Typically, you use a sparse lookup when the reference table is too large to fit in memory. If you use a parallel read option and processing is performed on many player nodes, you must ensure that the input data set is also adequately partitioned in relation to the values in the lookup key column.
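
One possible reading of "adequately partitioned" is that records with the same lookup key value are always sent to the same node. The following Python sketch shows hash partitioning on the key column under that assumption; partition_by_key and node_count are hypothetical names, and the sketch does not reflect how the parallel engine itself partitions data.

```python
import zlib
from collections import defaultdict
from typing import Dict, Iterable, List

Row = Dict[str, str]


def partition_by_key(
    records: Iterable[Row], key_column: str, node_count: int
) -> Dict[int, List[Row]]:
    """Assign each record to a node number derived from its lookup key value."""
    partitions: Dict[int, List[Row]] = defaultdict(list)
    for record in records:
        key_bytes = record[key_column].encode("utf-8")
        # crc32 is stable across runs, unlike the built-in hash() for strings.
        node = zlib.crc32(key_bytes) % node_count
        partitions[node].append(record)
    return dict(partitions)


if __name__ == "__main__":
    rows = [{"id": "a"}, {"id": "b"}, {"id": "a"}, {"id": "c"}]
    for node, part in sorted(partition_by_key(rows, "id", 3).items()):
        print(node, part)
```

Because the node number depends only on the key value, all records that share a key end up in the same partition.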

Procedure

  1. Add a Lookup stage to the job design canvas, and then create a reference link from the HBase Connector stage to the Lookup stage.
  2. Double-click the HBase Connector stage.
  3. From the Lookup Type list, select Sparse.
  4. To save the changes, click OK.
  5. Double-click the Lookup stage.
  6. Ensure that the input link and the reference link have matching columns. The columns that join the input data with the lookup data are chosen automatically and cannot be set by dragging. The column from the input link contains the values that are used as input values for the lookup operation.
  7. Map the input link and reference link columns to the output link columns and specify conditions for a lookup failure:
    1. Drag or copy the columns from the input link and reference link to your output link.
    2. To define conditions for a lookup failure, click the Constraints icon in the menu.
    3. In the Lookup Failure column, select a value, and then click OK. If you select Reject, you must have a reject link from the Lookup stage and a target stage in your job configuration to capture the rejected records.
    4. Click OK.
  8. Save, compile, and run the job.
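
The effect of the Lookup Failure setting in step 7 can be sketched in Python. This is an illustration only, not the Lookup stage implementation: apply_lookup is a hypothetical function, a dictionary stands in for the reference data, and the option names Continue, Drop, and Fail are assumed here alongside Reject, which is the only value named in the procedure.

```python
from typing import Dict, Iterable, List, Tuple

Row = Dict[str, str]


def apply_lookup(
    input_records: Iterable[Row],
    reference: Dict[str, Row],
    key_column: str,
    on_failure: str = "Reject",
) -> Tuple[List[Row], List[Row]]:
    """Return (output records, rejected records) for the chosen failure option."""
    output: List[Row] = []
    rejects: List[Row] = []
    for record in input_records:
        match = reference.get(record[key_column])
        if match is not None:
            # Lookup succeeded: combine input columns with reference columns.
            output.append({**record, **match})
        elif on_failure == "Continue":
            # Assumed behavior: pass the input record on without reference columns.
            output.append(dict(record))
        elif on_failure == "Drop":
            continue  # assumed behavior: discard the record silently
        elif on_failure == "Reject":
            rejects.append(dict(record))  # needs a reject link to capture these
        elif on_failure == "Fail":
            raise LookupError(f"no reference row for key {record[key_column]!r}")
        else:
            raise ValueError(f"unknown lookup failure option: {on_failure}")
    return output, rejects


if __name__ == "__main__":
    ref = {"r1": {"name": "Ada"}}
    print(apply_lookup([{"id": "r1"}, {"id": "r9"}], ref, "id", "Reject"))
```

With Reject, unmatched records are collected separately, which mirrors the requirement for a reject link and a target stage to capture them.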