Delta Lake Lookup
The Delta Lake Lookup processor performs a lookup on a Delta Lake table. The processor can return the first matching row, all matching rows, a count of matching rows, or a boolean value that indicates whether a match was found.
When you configure the Delta Lake Lookup processor, you specify the path to the lookup table, and you can enable time travel to query older versions of the table. You configure the record field to use and the table column to match against. You also specify the operator to use. You select the information to return, then configure related properties.
When returning one or more records, you specify the columns to return and optionally define a prefix for the resulting field names to prevent adding duplicate fields to the record. You can specify columns to sort by and the sort order. When returning multiple rows, you can specify a maximum number of rows to return.
When returning a count or boolean value, you define a name for the field to contain the results. If the field does not exist, the processor creates it.
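The four return behaviors described above can be pictured with a small, plain-Python sketch. It is an illustration only, not the processor's implementation: the function `delta_lookup`, its parameter names, the default result field name `lookup_result`, and the choice to emit nothing for an unmatched record are all assumptions made for this example.

```python
import operator

def delta_lookup(record, table, field, column, op=operator.eq,
                 behavior="first", columns=(), prefix="",
                 sort_by=None, descending=False, max_rows=None,
                 result_field="lookup_result"):
    """Hypothetical model of the lookup: returns a list of output records."""
    # Compare the record field to the table column with the chosen operator.
    # The argument order (record value first) is an assumption of this sketch.
    matches = [row for row in table if op(record[field], row[column])]

    if behavior == "count":
        # The result field is created if it does not already exist.
        return [{**record, result_field: len(matches)}]
    if behavior == "exists":
        return [{**record, result_field: bool(matches)}]

    if sort_by is not None:
        matches.sort(key=lambda row: row[sort_by], reverse=descending)
    if behavior == "first":
        matches = matches[:1]
    elif max_rows is not None:
        matches = matches[:max_rows]

    # The prefix keeps returned columns from colliding with existing fields.
    # Unmatched records yield no output here (an assumption of this sketch).
    return [{**record, **{prefix + c: row[c] for c in columns}}
            for row in matches]


table = [
    {"id": 1, "name": "a", "score": 10},
    {"id": 1, "name": "b", "score": 30},
    {"id": 2, "name": "c", "score": 20},
]
record = {"id": 1, "name": "x"}
first = delta_lookup(record, table, "id", "id", behavior="first",
                     columns=["name", "score"], prefix="lk_",
                     sort_by="score", descending=True)
```

Without the `lk_` prefix, the returned `name` column would overwrite the record's existing `name` field, which is the duplicate-field problem that the prefix is meant to prevent.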
You configure the storage system for the table. When using a table stored on Azure Data Lake Storage (ADLS) Gen2, you also specify connection-related details. For a table on Amazon S3 or HDFS, Transformer uses connection information stored in a Hadoop configuration file. You can configure security for connections to Amazon S3.
If the lookup table is static, you can configure the processor to load the table only once, enabling the processor to cache and reuse the data for the duration of the pipeline run.
If you do not load the table only once and the processor passes data to multiple stages, you can enable caching to improve pipeline performance.
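As a rough analogy for the load-once option, the sketch below reads a table through a loader function and reuses the result when loading only once. The class `LookupTableSource` and its `load_once` flag are hypothetical names for this example. How Transformer caches data internally is not described here, so treat this only as a picture of the trade-off: a static table can safely be read a single time, while a changing table must be reloaded.

```python
class LookupTableSource:
    """Hypothetical wrapper that reloads a table or reuses one loaded copy."""

    def __init__(self, loader, load_once):
        self._loader = loader        # callable that reads the table
        self._load_once = load_once  # True when the table is static
        self._cached = None
        self.loads = 0               # number of times the table was read

    def rows(self):
        # Reuse the cached copy for the rest of the run when loading once.
        if self._load_once and self._cached is not None:
            return self._cached
        self.loads += 1
        data = self._loader()
        if self._load_once:
            self._cached = data
        return data


def read_table():
    # Stand-in for reading the Delta Lake table.
    return [{"id": 1, "name": "a"}]

static_source = LookupTableSource(read_table, load_once=True)
for _ in range(3):
    static_source.rows()
```

In this sketch, `static_source.loads` stays at 1 after three reads, whereas a source created with `load_once=False` would read the table on every call.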
To access a table stored on ADLS Gen2, complete the necessary prerequisites before you run the pipeline. Also, before you run a local pipeline for a table on ADLS Gen2 or Amazon S3, complete these additional prerequisite tasks.
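Storage Systems
The Delta Lake Lookup processor can perform lookups on a table stored on one of the following storage systems: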
- Amazon S3
- Azure Data Lake Storage (ADLS) Gen2
- HDFS
- Local file system
ADLS Gen2 Prerequisites
- If necessary, create a new Azure Active Directory application for Transformer.
For information about creating a new application, see the Azure documentation.
- Ensure that the Azure Active Directory Transformer application has the appropriate access control to perform the necessary tasks.
To read from Azure, the Transformer application requires Read and Execute permissions. If also writing to Azure, the application requires Write permission as well.
For information about configuring Gen2 access control, see the Azure documentation.
- Install the Azure Blob File System driver on the cluster where the pipeline runs.
Most recent cluster versions include the Azure Blob File System driver, azure-datalake-store.jar. However, older versions might require installing it. For more information about Azure Data Lake Storage Gen2 support for Hadoop, see the Azure documentation.
- Retrieve Azure Data Lake Storage Gen2 authentication information from the Azure portal for configuring the processor.
You can skip this step if you want to use Azure authentication information configured in the cluster where the pipeline runs.
- Before using the stage in a local pipeline, ensure that Hadoop-related tasks are complete.
Retrieve Authentication Information
The Delta Lake Lookup processor provides several ways to authenticate connections to ADLS Gen2. Depending on the authentication method that you use, the processor requires different authentication details.
If the cluster where the pipeline runs has the necessary Azure authentication information configured, then that information is used by default. However, data preview is not available when using Azure authentication information configured in the cluster.
You can also specify Azure authentication information in stage properties. Any authentication information specified in stage properties takes precedence over the authentication information configured in the cluster.
- OAuth
- When connecting using OAuth authentication, the processor requires the following information:
  - Application ID - Application ID for the Azure Active Directory Transformer application. Also known as the client ID.
    For information on accessing the application ID from the Azure portal, see the Azure documentation.
  - Application Key - Authentication key for the Azure Active Directory Transformer application. Also known as the client key.
    For information on accessing the application key from the Azure portal, see the Azure documentation.
  - OAuth Token Endpoint - OAuth 2.0 token endpoint for the Azure Active Directory v1.0 application for Transformer. For example: https://login.microsoftonline.com/<uuid>/oauth2/token
- Managed Service Identity
- When connecting using Managed Service Identity authentication, the processor requires the following information:
  - Application ID - Application ID for the Azure Active Directory Transformer application. Also known as the client ID.
    For information on accessing the application ID from the Azure portal, see the Azure documentation.
  - Tenant ID - Tenant ID for the Azure Active Directory Transformer application. Also known as the directory ID.
    For information on accessing the tenant ID from the Azure portal, see the Azure documentation.
- Shared Key
- When connecting using Shared Key authentication, the processor requires the following information:
  - Account Shared Key - Shared access key that Azure generated for the storage account.
    For more information on accessing the shared access key from the Azure portal, see the Azure documentation.
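For readers who configure Azure authentication in the cluster rather than in stage properties, the sketch below shows how the details listed above line up with the Hadoop ABFS connector's configuration properties. The property names and token provider classes come from the Hadoop ABFS connector. The helper `abfs_auth_properties`, its parameter names, and the method labels `oauth`, `msi`, and `shared_key` are hypothetical, and whether Transformer sets exactly these properties from the stage configuration is an assumption.

```python
def abfs_auth_properties(account, method, *, application_id=None,
                         application_key=None, token_endpoint=None,
                         tenant_id=None, shared_key=None):
    """Hypothetical helper: build Hadoop ABFS auth properties for one account."""
    host = f"{account}.dfs.core.windows.net"
    provider_pkg = "org.apache.hadoop.fs.azurebfs.oauth2"

    if method == "oauth":
        props = {
            f"fs.azure.account.auth.type.{host}": "OAuth",
            f"fs.azure.account.oauth.provider.type.{host}":
                f"{provider_pkg}.ClientCredsTokenProvider",
            f"fs.azure.account.oauth2.client.id.{host}": application_id,
            f"fs.azure.account.oauth2.client.secret.{host}": application_key,
            f"fs.azure.account.oauth2.client.endpoint.{host}": token_endpoint,
        }
    elif method == "msi":
        props = {
            f"fs.azure.account.auth.type.{host}": "OAuth",
            f"fs.azure.account.oauth.provider.type.{host}":
                f"{provider_pkg}.MsiTokenProvider",
            f"fs.azure.account.oauth2.client.id.{host}": application_id,
            f"fs.azure.account.oauth2.msi.tenant.{host}": tenant_id,
        }
    elif method == "shared_key":
        props = {
            f"fs.azure.account.auth.type.{host}": "SharedKey",
            f"fs.azure.account.key.{host}": shared_key,
        }
    else:
        raise ValueError(f"unknown authentication method: {method}")

    # Fail early if a detail required by the chosen method was not supplied.
    missing = [key for key, value in props.items() if value is None]
    if missing:
        raise ValueError(f"missing values for: {missing}")
    return props


# Placeholder values only; substitute the details retrieved from the Azure portal.
oauth_props = abfs_auth_properties(
    "mystorageaccount", "oauth",
    application_id="<application-id>",
    application_key="<application-key>",
    token_endpoint="https://login.microsoftonline.com/<uuid>/oauth2/token",
)
```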
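Amazon S3 Credential Mode
When the table is stored on Amazon S3, you specify how the processor authenticates with Amazon S3 using one of the following credential modes: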
- Instance profile
- When Transformer runs on an Amazon EC2 instance that has an associated instance profile, Transformer uses the instance profile credentials to automatically authenticate with AWS.
- AWS access keys
- When Transformer does not run on an Amazon EC2 instance or when the EC2 instance doesn’t have an instance profile, you can authenticate using an AWS access key pair. When using an AWS access key pair, you specify the access key ID and secret access key to use.
- None
- When accessing a public bucket, you can connect anonymously using no authentication.
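The three credential modes correspond roughly to credential providers of the Hadoop S3A connector. The sketch below maps each mode to S3A properties. The property names and provider classes are taken from the S3A connector, and the appropriate instance profile provider class can vary by Hadoop version. The helper `s3a_credential_properties` and its mode labels are hypothetical, and the sketch does not claim that Transformer applies these exact settings.

```python
def s3a_credential_properties(mode, access_key_id=None,
                              secret_access_key=None):
    """Hypothetical helper: map a credential mode to Hadoop S3A properties."""
    provider_key = "fs.s3a.aws.credentials.provider"

    if mode == "instance_profile":
        # Credentials come from the EC2 instance profile; no keys are needed.
        return {provider_key:
                "com.amazonaws.auth.InstanceProfileCredentialsProvider"}
    if mode == "access_keys":
        if not access_key_id or not secret_access_key:
            raise ValueError("access key ID and secret access key are required")
        return {
            provider_key:
                "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider",
            "fs.s3a.access.key": access_key_id,
            "fs.s3a.secret.key": secret_access_key,
        }
    if mode == "none":
        # Anonymous access works only for public buckets.
        return {provider_key:
                "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider"}
    raise ValueError(f"unknown credential mode: {mode}")


# Placeholder values only; never hard-code real keys in pipeline code.
key_props = s3a_credential_properties(
    "access_keys",
    access_key_id="<access-key-id>",
    secret_access_key="<secret-access-key>",
)
```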
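Using a Local File System
To perform lookups on a table stored on the local file system, configure the pipeline and the stage as follows: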
- On the Cluster tab of the pipeline properties, set Cluster Manager Type to None (Local).
- On the General tab of the stage properties, set Stage Library to Delta Lake Transformer-provided libraries.
- On the Delta Lake tab, for the Table Directory Path property, specify the directory to use.
- On the Storage tab, set Storage System to HDFS.
Configuring a Delta Lake Lookup Processor
Configure a Delta Lake Lookup processor to perform lookups on a Delta Lake table.
Complete the necessary prerequisites before performing lookups on a table stored on ADLS Gen2. Also, before you run a local pipeline for a table on ADLS Gen2 or Amazon S3, complete these additional prerequisite tasks.