Apache HDFS connection

To access your data in Apache HDFS, create a connection asset for it.

Apache Hadoop Distributed File System (HDFS) is a distributed file system that is designed to run on commodity hardware. The Apache HDFS connection was formerly named Hortonworks HDFS.

Prerequisite for Kerberos authentication

To use Kerberos authentication, the data source must be configured for Kerberos, and the service in which you plan to use the connection must support Kerberos. For information, see Enabling platform connections to use Kerberos authentication.

Create a connection to Apache HDFS

To create the connection asset, you need the following connection details. Only the WebHDFS URL is required.
The properties that are available in the connection form depend on whether you select Connect to Apache Hive, an option that lets you write tables to the Hive data source.

  • WebHDFS URL: The URL to access HDFS.
  • Hive host: Hostname or IP address of the Apache Hive server.
  • Hive database: The database in Apache Hive.
  • Hive port number: The port number of the Apache Hive server. The default value is 10000.
  • Hive HTTP path: The path of the endpoint, such as gateway/default/hive, when the server is configured for HTTP transport mode.
  • SSL certificate: Needed only if the Apache Hive server requires it.
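As an illustration of how the connection details above fit together, the following minimal Python sketch builds a WebHDFS REST URL and a HiveServer2 JDBC URL from those values. The helper names (webhdfs_url, hive_jdbc_url), the example hostnames, and the NameNode port 9870 are assumptions made for this example, not values from the platform. The exact URL that the connection form expects can differ in your deployment.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host, port, path, op, scheme="http", **params):
    """Build a WebHDFS REST URL: scheme://host:port/webhdfs/v1/<path>?op=<OP>."""
    # Hypothetical helper; "op" is the WebHDFS operation, such as LISTSTATUS.
    query = urlencode({"op": op, **params})
    return f"{scheme}://{host}:{port}/webhdfs/v1/{quote(path.lstrip('/'))}?{query}"


def hive_jdbc_url(host, database, port=10000, http_path=None, ssl=False):
    """Build a HiveServer2 JDBC URL from the Hive connection details."""
    # Hypothetical helper; 10000 is the default Hive port that the form lists.
    url = f"jdbc:hive2://{host}:{port}/{database}"
    if http_path:
        # HTTP transport mode needs the endpoint path, e.g. gateway/default/hive.
        url += f";transportMode=http;httpPath={http_path}"
    if ssl:
        url += ";ssl=true"
    return url


# Example hosts and port are placeholders, not real endpoints.
print(webhdfs_url("namenode.example.com", 9870, "/user/alice", "LISTSTATUS"))
print(hive_jdbc_url("hive.example.com", "default",
                    http_path="gateway/default/hive", ssl=True))
```

The sketch only assembles strings and does not contact a cluster, so you can use it to check the values before you enter them in the connection form.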

Authentication method

You can choose Kerberos credentials or Username and password.

  • For Kerberos credentials, you must complete the prerequisite for Kerberos authentication and you need the following connection details. You also need these details for Hive if you selected Connect to Apache Hive:

    • Service principal name (SPN) that is configured for the data source.
    • User principal name to connect to the Kerberized data source.
    • The keytab file for the user principal name that is used to authenticate to the Key Distribution Center (KDC).
  • For Username and password, you also provide values for the Hive user and password if you connect to Apache Hive.
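Kerberos principals conventionally follow the form primary/instance@REALM, where the instance part is optional for user principals. The following minimal Python sketch splits a principal into those parts so that you can check a value before you enter it. The function name parse_principal and the example principals are hypothetical and are not part of the platform.

```python
def parse_principal(principal):
    """Split 'primary[/instance]@REALM' into (primary, instance, realm)."""
    # Split on the last "@" so the realm is everything after it.
    name, sep, realm = principal.rpartition("@")
    if not sep or not name or not realm:
        raise ValueError(f"missing name or realm in principal: {principal!r}")
    # The instance (often a host name for service principals) is optional.
    primary, _, instance = name.partition("/")
    return primary, instance or None, realm


# Placeholder principals for illustration only.
print(parse_principal("hdfs/namenode.example.com@EXAMPLE.COM"))
print(parse_principal("alice@EXAMPLE.COM"))
```

A service principal name typically includes the instance part, while a user principal name often consists of only the primary and the realm.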

For Credentials and Certificates, you can use secrets if a vault is configured for the platform and the service supports vaults. For information, see Using secrets from vaults in connections.

Choose the method for creating a connection based on where you are in the platform

  • In a project: Click Assets > New asset > Data access tools > Connection. See Adding a connection to a project.
  • In a catalog: Click Add to catalog > Connection. See Adding a connection asset to a catalog.
  • In a deployment space: Click Add to space > Connection. See Adding connections to a deployment space.
  • In the Platform assets catalog: Click New connection. See Adding platform connections.

Next step: Add data assets from the connection

Where you can use this connection

You can use Apache HDFS connections in the following workspaces and tools:

Projects

  • Data quality rules (Watson Knowledge Catalog)
  • Data Refinery (Watson Studio or Watson Knowledge Catalog)
  • DataStage (DataStage service). See Connecting to a data source in DataStage.
  • Metadata enrichment (Watson Knowledge Catalog)
  • Metadata import (Watson Knowledge Catalog)
  • SPSS Modeler (SPSS Modeler service)

Catalogs

  • Platform assets catalog
  • Other catalogs (Watson Knowledge Catalog)

Data discovery and Data quality

  • Automated discovery

Apache HDFS setup

Install and set up a Hadoop cluster

Supported file types

The Apache HDFS connection supports these file types: Avro, CSV, Delimited text, Excel, JSON, ORC, Parquet, SAS, SAV, SHP, and XML.

Learn more

Apache HDFS Users Guide