Refining HDFS data using Data Refinery

Refine data on the Hadoop Distributed File System (HDFS) in the Hadoop cluster.

Prerequisites

Create the connection to the Hadoop cluster. See HDFS via Execution Engine for Hadoop connection.

The Data Refinery flow source and target and the Hadoop runtime environment must reference the same Hadoop system.

1. Add data from the HDFS via Execution Engine for Hadoop connection. See Adding data to Data Refinery.
2. Create the Data Refinery flow by applying operations to the data:
   - GUI operations
   - Interactive code templates
3. Save the Data Refinery flow and run a job for it. See Managing Data Refinery flows.
   For the target output, you can use the HDFS via Execution Engine for Hadoop connection or a connected data asset from an HDFS via Execution Engine for Hadoop connection.
   Important: You must specify the file format for the output file.
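
The two constraints on this page (the source, target, and Hadoop runtime environment must reference the same Hadoop system, and the output file format must be specified) can be written down as a small pre-flight checklist. The following Python is a minimal sketch for illustration only. `FlowPlan`, its fields, and `preflight_problems` are hypothetical names, not part of any Data Refinery or Execution Engine for Hadoop API, and the system names and the `csv` value are placeholder assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FlowPlan:
    """Hypothetical record of what you intend to configure in the UI.

    This is not a Data Refinery object; it only mirrors the settings
    named on this page so they can be checked before running a job.
    """
    source_hadoop_system: str
    target_hadoop_system: str
    runtime_hadoop_system: str
    output_file_format: Optional[str] = None


def preflight_problems(plan: FlowPlan) -> List[str]:
    """Return human-readable problems; an empty list means both checks pass."""
    problems: List[str] = []

    # Source, target, and runtime environment must all name the same system.
    systems = {
        plan.source_hadoop_system,
        plan.target_hadoop_system,
        plan.runtime_hadoop_system,
    }
    if len(systems) != 1:
        problems.append(
            "source, target, and runtime must reference the same Hadoop system"
        )

    # The output file format must be set (None or blank counts as missing).
    if not (plan.output_file_format or "").strip():
        problems.append("output file format is not specified")

    return problems


if __name__ == "__main__":
    # Placeholder names: the runtime points at a different system and
    # no output format is given, so both problems are reported.
    plan = FlowPlan("hadoop-a", "hadoop-a", "hadoop-b")
    for problem in preflight_problems(plan):
        print(problem)
```

The sketch only compares names that you supply. It does not contact the cluster or validate that a connection works.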
