Refining data on the Hadoop cluster in Data Refinery
Take advantage of Hadoop support for large data sets by refining data directly on the Hadoop cluster.
To run Data Refinery jobs on a Hadoop cluster, you must select a Hadoop environment for the job. With a Hadoop environment, Data Refinery runs the flow directly on the Hadoop cluster. For more information, see Execution Engine for Apache Hadoop environments.
If you use a Hadoop environment, you can use the following Hadoop Execution Engine connections for refining data:
- HDFS via Execution Engine for Hadoop for Hadoop Distributed File System (HDFS) files
- Hive via Execution Engine for Hadoop for data that is stored in tables in a Hive warehouse
- Impala via Execution Engine for Hadoop for data that is stored in Impala tables on the Hadoop cluster
See also Troubleshooting Hadoop environments.
Parent topic: Refining data