ADLS Gen2
The ADLS Gen2 origin reads data from Microsoft Azure Data Lake Storage Gen2. Every file must be fully written, include data of the same supported format, and use the same schema.
When reading multiple files in a batch, the origin reads the oldest file first. Upon successfully reading a file, the origin can delete the file, move it to an archive directory, or leave it in the directory.
When the pipeline stops, the origin notes the last-modified timestamp of the last file that it processed and stores it as an offset. When the pipeline starts again, the origin continues processing from the last-saved offset by default. When needed, you can reset pipeline offsets to process all available files.
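To picture this behavior, here is a minimal sketch of timestamp-based file selection, assuming a Hadoop FileSystem handle; the helper name and parameters are illustrative, not Transformer's API:

```scala
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}

// Illustrative sketch only, not Transformer's API: select files modified
// after the last-saved offset and order them oldest first.
def filesToProcess(fs: FileSystem, dir: Path, lastSavedOffset: Long): Array[FileStatus] =
  fs.listStatus(dir)
    .filter(_.getModificationTime > lastSavedOffset) // skip already-processed files
    .sortBy(_.getModificationTime)                   // read the oldest file first
```

Resetting pipeline offsets corresponds to clearing lastSavedOffset, so every available file matches again.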
Before you use the origin, you must perform some prerequisite tasks.
When you configure the ADLS Gen2 origin, you specify the Azure authentication method to use and related properties. Or, you can have the origin use Azure authentication information configured in the cluster where the pipeline runs.
You configure the directory path to use and a name pattern for the files to read. The origin reads the files with matching names in the specified directory and its subdirectories. If the origin reads partition files grouped by field, you must specify the partition base path to include the fields and field values in the data. You can also configure a file name pattern for a subset of files to exclude from processing. You specify the data format of the data, related data format properties, and how to process successfully read files. When needed, you can define a maximum number of files to read in a batch.
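For orientation, ADLS Gen2 directory paths use the abfss URI scheme. The values below are hypothetical examples of a directory path, a name pattern, and an exclusion pattern:

```scala
// Hypothetical example values. ADLS Gen2 paths follow the abfss scheme:
//   abfss://<filesystem>@<storage-account>.dfs.core.windows.net/<path>
val directoryPath  = "abfss://sales@myaccount.dfs.core.windows.net/orders"
val namePattern    = "*.csv"      // glob pattern for file names to read
val excludePattern = "*_tmp.csv"  // glob pattern for file names to skip
```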
You select the data format of the data and configure related properties. When processing delimited or JSON data, you can define a custom schema for reading the data and configure related properties.
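As an illustration of a custom schema for delimited data, here is a minimal sketch using Spark's StructType API, assuming a spark-shell session where spark is in scope; the field names and path are placeholders:

```scala
import org.apache.spark.sql.types._

// Sketch: an explicit schema applied to delimited data.
// Field names and the path are placeholders.
val schema = StructType(Seq(
  StructField("order_id", LongType,   nullable = false),
  StructField("customer", StringType, nullable = true),
  StructField("total",    DoubleType, nullable = true)
))
val orders = spark.read
  .schema(schema)
  .option("header", "true")
  .csv("abfss://sales@myaccount.dfs.core.windows.net/orders/*.csv")
```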
You can configure the origin to load data only once and cache the data for reuse throughout the pipeline run. Or, you can configure the origin to cache each batch of data so the data can be passed to multiple downstream batches efficiently. You can also configure the origin to skip tracking offsets.
Prerequisites
- If necessary, create a new Azure Active Directory application for Transformer.
For information about creating a new application, see the Azure documentation.
- Ensure that the Azure Active Directory Transformer application has the appropriate access control to perform the necessary tasks.
To read from Azure, the Transformer application requires Read and Execute permissions. If also writing to Azure, the application requires Write permission as well.
For information about configuring Gen2 access control, see the Azure documentation.
- Install the Azure Blob File System driver on the cluster where the pipeline runs.
Most recent cluster versions include the Azure Blob File System driver, azure-datalake-store.jar. However, older versions might require installing it; see the classpath check after this list. For more information about Azure Data Lake Storage Gen2 support for Hadoop, see the Azure documentation.
- Retrieve Azure Data Lake Storage Gen2 authentication information from the Azure portal for configuring the origin.
You can skip this step if you want to use Azure authentication information configured in the cluster where the pipeline runs.
- Before using the stage in a local pipeline, ensure that Hadoop-related tasks are complete.
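One way to run the classpath check mentioned above is from a spark-shell session on the cluster; the class name below is the ABFS filesystem implementation shipped in the hadoop-azure library:

```scala
// Resolves the ABFS filesystem implementation class. Throws
// ClassNotFoundException if the Azure Blob File System driver is missing.
Class.forName("org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem")
```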
Retrieve Authentication Information
The ADLS Gen2 origin provides several ways to authenticate connections to Azure. Depending on the authentication method that you use, the origin requires different authentication details.
If the cluster where the pipeline runs has the necessary Azure authentication information configured, the origin uses that information by default. However, data preview is not available when using Azure authentication information configured in the cluster.
You can also specify Azure authentication information in stage properties. Any authentication information specified in stage properties takes precedence over the authentication information configured in the cluster.
- OAuth
- When connecting using OAuth authentication, the origin requires the following information:
- Application ID - Application ID for the Azure Active Directory Transformer application. Also known as the client ID.
For information on accessing the application ID from the Azure portal, see the Azure documentation.
- Application Key - Authentication key for the Azure Active Directory Transformer application. Also known as the client key.
For information on accessing the application key from the Azure portal, see the Azure documentation.
- OAuth Token Endpoint - OAuth 2.0 token endpoint for the Azure Active Directory v1.0 application for Transformer. For example: https://login.microsoftonline.com/<uuid>/oauth2/token. See the configuration sketch after these authentication methods.
- Managed Service Identity
- When connecting using Managed Service Identity authentication, the origin requires the following information:
- Application ID - Application ID for the Azure Active Directory Transformer application. Also known as the client ID.
For information on accessing the application ID from the Azure portal, see the Azure documentation.
- Tenant ID - Tenant ID for the Azure Active Directory Transformer application. Also known as the directory ID.
For information on accessing the tenant ID from the Azure portal, see the Azure documentation.
- Shared Key
- When connecting using Shared Key authentication, the origin requires the following information:
- Account Shared Key - Shared access key that Azure generated for the storage account.
For more information on accessing the shared access key from the Azure portal, see the Azure documentation.
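As a reference point, here is a minimal sketch of how OAuth details like these map onto the Hadoop ABFS client properties that Spark reads for abfss:// paths. The storage account name and credential values are placeholders; this illustrates the underlying client configuration, not Transformer's stage properties:

```scala
// Sketch: OAuth authentication expressed as Hadoop ABFS client properties.
// "myaccount" and the bracketed values are placeholders.
val hc = spark.sparkContext.hadoopConfiguration
hc.set("fs.azure.account.auth.type.myaccount.dfs.core.windows.net", "OAuth")
hc.set("fs.azure.account.oauth.provider.type.myaccount.dfs.core.windows.net",
  "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
hc.set("fs.azure.account.oauth2.client.id.myaccount.dfs.core.windows.net",
  "<application-id>")
hc.set("fs.azure.account.oauth2.client.secret.myaccount.dfs.core.windows.net",
  "<application-key>")
hc.set("fs.azure.account.oauth2.client.endpoint.myaccount.dfs.core.windows.net",
  "https://login.microsoftonline.com/<uuid>/oauth2/token")
```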
Schema Requirement
All files processed by the ADLS Gen2 origin must have the same schema.
When files have different schemas, the resulting behavior depends on the data format and the version of Spark that you use. For example, the origin might skip processing delimited files with a different schema, but add null values to Parquet files with a different schema.
Partitioning
Spark runs a Transformer pipeline just as it runs any other application, splitting the data into partitions and performing operations on the partitions in parallel. When the pipeline starts processing a new batch, Spark determines how to split pipeline data into initial partitions based on the origins in the pipeline.
- Delimited, JSON, text, or XML
- When reading text-based files, Spark can split the files into multiple partitions for processing, depending on the underlying file system. Multiline JSON files cannot be split, so they are processed in a single partition.
- Avro, ORC, or Parquet
- When reading Avro, ORC, or Parquet files, Spark can split each file into multiple partitions for processing.
Spark uses these partitions while the pipeline processes the batch unless a processor causes Spark to shuffle the data. To change the partitioning in the pipeline, use the Repartition processor.
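For illustration, the effect of the Repartition processor is comparable to Spark's own repartition operation; the path and partition count below are placeholders:

```scala
// Sketch: inspect the initial partitioning, then force a shuffle into a
// chosen number of partitions. Path and count are placeholders.
val df = spark.read.option("header", "true")
  .csv("abfss://sales@myaccount.dfs.core.windows.net/orders/*.csv")
println(df.rdd.getNumPartitions) // initial partitions chosen by Spark
val reshaped = df.repartition(8) // shuffles the data into 8 partitions
```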
Data Formats
The ADLS Gen2 origin generates records based on the specified data format.
The origin can read the following data formats:
- Avro
- The origin generates a record for every Avro record in an Avro container file. Each file must contain the Avro schema. The origin uses the Avro schema to generate records.
- Delimited
- The origin generates a record for each line in a delimited file. You can specify a custom delimiter, quote, and escape character used in the data, as shown in the sketch after this list.
- JSON
- By default, the origin generates a record for each line in a JSON Lines file. Each line in the file should contain a valid JSON object. For details, see the JSON Lines website.
- ORC
- The origin generates a record for each row in an Optimized Row Columnar (ORC) file.
- Parquet
- The origin generates a record for every Parquet record in the file. The file must contain the Parquet schema. The origin uses the Parquet schema to generate records.
- Text
- The origin generates a record for each line in a text file. The file must use \n as the newline character.
- XML
- The origin generates a record for every row defined in an XML file. You specify the root tag used in files and the row tag used to define records.
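To make the delimited and JSON behavior above concrete, here is a hedged sketch using Spark's own reader options (sep, quote, escape, multiLine); the paths and option values are placeholders:

```scala
// Sketch: Spark equivalents of the delimited and JSON options described above.
// Paths and option values are placeholders.
val delimited = spark.read
  .option("sep", "|")     // custom delimiter
  .option("quote", "\"")  // custom quote character
  .option("escape", "\\") // custom escape character
  .csv("abfss://data@myaccount.dfs.core.windows.net/in/*.csv")

// JSON Lines by default: one record per line. Setting multiLine to true
// reads multiline JSON, which cannot be split across partitions.
val jsonLines = spark.read
  .json("abfss://data@myaccount.dfs.core.windows.net/in/*.json")
```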
Configuring an ADLS Gen2 Origin
Configure an ADLS Gen2 origin to read files in Azure Data Lake Storage Gen2. Before you use the origin in a pipeline, complete the required prerequisites. Before using the origin in a local pipeline, complete the additional prerequisites for local pipelines.