Refining HDFS data using Data Refinery

Refine data on the Hadoop Distributed File System (HDFS) in the Hadoop cluster.

Prerequisite

Create the connection to the Hadoop cluster. See HDFS via Execution Engine for Hadoop connection.

Restriction

The Data Refinery flow source and target and the Hadoop runtime environment must reference the same Hadoop system.
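As an illustration only, the restriction above can be expressed as a simple consistency check. The following Python sketch is not part of Data Refinery or Execution Engine for Hadoop; the `FlowConfig` class, its field names, and the system identifiers are all hypothetical and stand in for whatever identifies the Hadoop system in your environment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FlowConfig:
    # Hypothetical fields: each holds an identifier for the Hadoop system
    # that the flow source, flow target, and runtime environment reference.
    source_system: str
    target_system: str
    runtime_system: str


def references_same_system(cfg: FlowConfig) -> bool:
    """Return True when source, target, and runtime name the same Hadoop system."""
    return len({cfg.source_system, cfg.target_system, cfg.runtime_system}) == 1


# Example values are made up for illustration.
valid = FlowConfig("hadoop-a", "hadoop-a", "hadoop-a")
invalid = FlowConfig("hadoop-a", "hadoop-b", "hadoop-a")

print(references_same_system(valid))
print(references_same_system(invalid))
```

A check like this could be run before submitting a job, so that a mismatch between the target and the runtime environment is caught early.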

Procedure

  1. Add data from the HDFS via Execution Engine for Hadoop connection. See Adding data to Data Refinery.
  2. Create the Data Refinery flow by applying operations to the data.
  3. Save the Data Refinery flow and run a job for it. See Managing Data Refinery flows.
    For the target output, you can use the HDFS via Execution Engine for Hadoop connection or a connected data asset from an HDFS via Execution Engine for Hadoop connection.
    Important: You must specify the file format for the output file.

Known issues

See Troubleshooting Hadoop environments.

Learn more

Parent topic: Refining data on the Hadoop cluster