Microsoft Azure Data Lake Storage connection

To access your data in Microsoft Azure Data Lake Storage, create a connection asset for it.

Azure Data Lake Storage (ADLS) is a scalable data storage and analytics service that is hosted in Azure, Microsoft's public cloud. The Microsoft Azure Data Lake Storage connection supports access to both Gen1 and Gen2 Azure Data Lake Storage repositories.

Create a connection to Microsoft Azure Data Lake Storage

To create the connection asset, you need these connection details:

  • WebHDFS URL: The WebHDFS URL for accessing HDFS.
    To connect to a Gen2 ADLS repository, use the format https://<account-name>.dfs.core.windows.net/<file-system>
    where <account-name> is the name that you used when you created the ADLS instance
    and <file-system> is the name of the container that you created. For more information, see the Microsoft Data Lake Storage Gen2 documentation.

  • Tenant ID: The Azure Active Directory tenant ID.
  • Client ID: The client ID for authorizing access to Microsoft Azure Data Lake Storage.
  • Client secret: The authentication key that is associated with the client ID for authorizing access to Microsoft Azure Data Lake Storage.
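
As an illustration of the Gen2 URL format described above, here is a minimal Python sketch that assembles the WebHDFS URL from an account name and a file system (container) name. The function name build_adls_gen2_url and the sample values are hypothetical, and the input checks are deliberately loose; the sketch is not part of the product.

```python
def build_adls_gen2_url(account_name: str, file_system: str) -> str:
    """Assemble https://<account-name>.dfs.core.windows.net/<file-system>."""
    account_name = account_name.strip()
    # Drop stray slashes so the container name joins cleanly.
    file_system = file_system.strip().strip("/")
    if not account_name or not file_system:
        raise ValueError("Both the account name and the file system name are required.")
    return f"https://{account_name}.dfs.core.windows.net/{file_system}"


# Hypothetical values, for illustration only.
url = build_adls_gen2_url("mystorageaccount", "mycontainer")
print(url)
```

The resulting string is what you would enter in the WebHDFS URL field for a Gen2 repository.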

Select Server proxy to access the Azure Data Lake Storage data source through a proxy server. Depending on its setup, a proxy server can provide load balancing, increased security, and privacy. The proxy server settings are independent of the authentication credentials and the personal or shared credentials selection. The proxy server settings cannot be stored in a vault.

  • Proxy host: The proxy URL. For example, https://proxy.example.com.
  • Proxy port number: The port number to connect to the proxy server. For example, 8080 or 8443.
  • Proxy protocol (optional): HTTP or HTTPS.
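
Many HTTP clients expect the proxy as a single address of the form scheme://host:port. The following sketch shows one way the three proxy fields could be combined. It is an assumption about how the values relate, not a description of how the platform processes them. The helper build_proxy_address is hypothetical, and falling back to http when no protocol is given is also an assumption.

```python
from typing import Optional
from urllib.parse import urlsplit


def build_proxy_address(proxy_host: str, proxy_port: int,
                        protocol: Optional[str] = None) -> str:
    """Combine proxy host, port, and optional protocol into scheme://host:port."""
    # Prefix "//" when no scheme is present so urlsplit parses the host correctly.
    parts = urlsplit(proxy_host if "://" in proxy_host else f"//{proxy_host}")
    # An explicit protocol wins; otherwise use the scheme from the host, else http (assumed default).
    scheme = (protocol or parts.scheme or "http").lower()
    if scheme not in ("http", "https"):
        raise ValueError("Proxy protocol must be HTTP or HTTPS.")
    if not parts.hostname:
        raise ValueError("Proxy host is required.")
    if not 1 <= proxy_port <= 65535:
        raise ValueError("Proxy port number must be between 1 and 65535.")
    return f"{scheme}://{parts.hostname}:{proxy_port}"


print(build_proxy_address("https://proxy.example.com", 8443))
print(build_proxy_address("proxy.example.com", 8080, "HTTP"))
```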

For Credentials, you can use secrets if a vault is configured for the platform and the service supports vaults. For more information, see Using secrets from vaults in connections.

Choose the method for creating a connection based on where you are in the platform:

In a project
Click Assets > New asset > Data access tools > Connection. See Adding a connection to a project.
In a catalog
Click Add to catalog > Connection. See Adding a connection asset to a catalog.
In a deployment space
Click Add to space > Connection. See Adding connections to a deployment space.
In the Platform assets catalog
Click New connection. See Adding platform connections.

Next step: Add data assets from the connection

Where you can use this connection

You can use Microsoft Azure Data Lake Storage connections in the following workspaces and tools:

Projects

  • Data quality rules (Watson Knowledge Catalog)
  • DataStage (DataStage service). See Connecting to a data source in DataStage.
  • Metadata enrichment (Watson Knowledge Catalog)
  • Metadata import (Watson Knowledge Catalog)
  • Notebooks (Watson Studio). Click Read data on the Code snippets pane to get the connection credentials and load the data into a data structure. See Load data from data source connections.
  • SPSS Modeler (SPSS Modeler service)

Catalogs

  • Platform assets catalog
  • Other catalogs (Watson Knowledge Catalog)

Watson Query service
You can connect to this data source from Watson Query. Only the following file types are supported with this connection in Watson Query:
  • CSV
  • TSV
  • ORC
  • Parquet
  • JSON

Federal Information Processing Standards (FIPS) compliance

The Microsoft Azure Data Lake Storage connection is compliant with FIPS. However, SSL certificates that you paste into the SSL certificate field are not supported. As a workaround, you can add the certificate to the OpenShift secret named connection-ca-certs. See Using a CA certificate to connect to internal servers from the platform for the procedure.

Azure Data Lake Storage authentication setup

To set up authentication, you need a tenant ID, client (or application) ID, and client secret.
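
For orientation, these three values correspond to the inputs of the OAuth 2.0 client credentials grant on the Microsoft identity platform. The sketch below only builds the token request and does not send it. It illustrates how the values fit together for Azure Storage (Gen2) and is an assumption about what happens behind the connection, not the platform's documented implementation. The function name and the placeholder IDs are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_token_request(tenant_id: str, client_id: str, client_secret: str) -> Request:
    """Build (but do not send) a client credentials token request for Azure Storage."""
    token_url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token"
    form = urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        # Scope that requests access to Azure Storage resources.
        "scope": "https://storage.azure.com/.default",
    }).encode("utf-8")
    return Request(
        token_url,
        data=form,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


# Placeholder values, for illustration only.
request = build_token_request(
    "00000000-0000-0000-0000-000000000000",
    "11111111-1111-1111-1111-111111111111",
    "example-secret",
)
print(request.full_url)
```

If a request like this fails with real values, the usual suspects are a wrong tenant ID, a mistyped client ID, or an expired client secret.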

Supported file types

The Microsoft Azure Data Lake Storage connection supports these file types: Avro, CSV, Delimited text, Excel, JSON, ORC, Parquet, SAS, SAV, SHP, and XML.
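
To check a file against this list before you add it as a data asset, a small lookup by file extension can help. In the sketch below, the extension-to-type mapping is an assumption made for illustration (the list above names file types, not extensions), and supported_file_type is a hypothetical helper.

```python
from pathlib import PurePosixPath
from typing import Optional

# Assumed mapping from common file extensions to the file types listed above.
# The extensions are illustrative and are not taken from the product documentation.
EXTENSION_TO_TYPE = {
    ".avro": "Avro",
    ".csv": "CSV",
    ".txt": "Delimited text",
    ".xls": "Excel",
    ".xlsx": "Excel",
    ".json": "JSON",
    ".orc": "ORC",
    ".parquet": "Parquet",
    ".sas7bdat": "SAS",
    ".sav": "SAV",
    ".shp": "SHP",
    ".xml": "XML",
}


def supported_file_type(path: str) -> Optional[str]:
    """Return the matching file type for a path, or None if the extension is not in the mapping."""
    suffix = PurePosixPath(path).suffix.lower()
    return EXTENSION_TO_TYPE.get(suffix)


print(supported_file_type("sales/2023/orders.parquet"))  # Parquet
print(supported_file_type("docs/readme.md"))  # None
```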

Learn more

Azure Data Lake

Parent topic: Supported connections