Refining HDFS data using Data Refinery
Refine data on the Hadoop Distributed File System (HDFS) in the Hadoop cluster.
Prerequisite
Create the connection to the Hadoop cluster. See HDFS via Execution Engine for Hadoop connection.
Restriction
The Data Refinery flow source, the flow target, and the Hadoop runtime environment must all reference the same Hadoop system.
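The restriction above can be read as a simple consistency check. The following is a minimal sketch in plain Python, not part of Data Refinery or any IBM API: the `FlowConfig` class, its field names, and the system labels such as `hadoop-a` are all hypothetical and exist only to illustrate the rule that the source, target, and runtime must name the same Hadoop system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FlowConfig:
    # Hypothetical fields for illustration only; these are not
    # Data Refinery or Execution Engine for Hadoop API names.
    source_system: str
    target_system: str
    runtime_system: str


def references_same_hadoop_system(cfg: FlowConfig) -> bool:
    """Return True when source, target, and runtime name the same system."""
    return len({cfg.source_system, cfg.target_system, cfg.runtime_system}) == 1


# Hypothetical example values.
ok = FlowConfig("hadoop-a", "hadoop-a", "hadoop-a")
bad = FlowConfig("hadoop-a", "hadoop-b", "hadoop-a")

print(references_same_hadoop_system(ok))
print(references_same_hadoop_system(bad))
```

In this sketch, a flow that reads from one Hadoop system and writes to another fails the check, which mirrors the situation the restriction rules out.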
Procedure
- Add data from the HDFS via Execution Engine for Hadoop connection. See Adding data to Data Refinery.
- Create the Data Refinery flow by applying operations to the data.
- Save the Data Refinery flow and run a job for it. See Managing Data Refinery flows.
For the target output, you can use the HDFS via Execution Engine for Hadoop connection or a connected data asset from an HDFS via Execution Engine for Hadoop connection.
Important: You must specify the file format for the output file.