Analyzing Apache Hadoop data (Execution Engine for Apache Hadoop)
You can build and train models on a Hadoop cluster. If you have data in Hive or HDFS storage on a Hadoop cluster, you can work with that data directly on the cluster, where it resides.
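For example, a notebook that runs in a Hadoop environment can typically use Spark to query Hive tables or read HDFS files in place. The following is a minimal sketch, assuming PySpark is available in the environment; the database name `sales_db`, table name `transactions`, and HDFS path are placeholders, not names from this documentation:

```python
from pyspark.sql import SparkSession

# Start (or reuse) a Spark session with Hive support so that Hive tables
# on the Hadoop cluster can be queried in place.
spark = (
    SparkSession.builder
    .appName("hadoop-data-exploration")
    .enableHiveSupport()
    .getOrCreate()
)

# Query a Hive table directly on the cluster.
# `sales_db.transactions` is a placeholder name for illustration.
hive_df = spark.sql("SELECT * FROM sales_db.transactions WHERE amount > 100")
hive_df.show(5)

# Read a file directly from HDFS; the path is a placeholder.
hdfs_df = spark.read.option("header", "true").csv("hdfs:///data/customers.csv")
print(hdfs_df.count())
```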
The Execution Engine for Apache Hadoop service is not available by default. An administrator must install the service on the IBM Cloud Pak for Data platform. To determine whether it is installed, open the Services catalog and check whether the service is enabled.
In a Watson Studio project, you can find Hadoop environment templates on the Environments page. See Hadoop environments.
You can use Hadoop environments in these ways:
- You can train a model on the Hadoop cluster by selecting a Hadoop environment in a Jupyter notebook (see the sketch after this list).
- You can manage a model on the Hadoop cluster by running Hadoop integration utility methods within a Jupyter notebook.
- You can run Data Refinery flows on the Hadoop cluster by selecting a Hadoop environment for the Data Refinery job.
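As an illustration of the first item, a notebook attached to a Hadoop environment could train a Spark MLlib model on data that never leaves the cluster. This is a minimal sketch, not the Hadoop integration utility API; the table, column, and path names are placeholders:

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Load training data from a Hive table on the Hadoop cluster.
# The table and column names below are placeholders for illustration.
train_df = spark.table("sales_db.labeled_transactions")

# Assemble feature columns into a single vector and fit a classifier.
assembler = VectorAssembler(inputCols=["amount", "num_items"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")
pipeline = Pipeline(stages=[assembler, lr])

model = pipeline.fit(train_df)

# Save the trained pipeline model to HDFS on the cluster (placeholder path).
model.write().overwrite().save("hdfs:///models/transaction_lr")
```

Because the Spark session runs on the Hadoop cluster, both the data access and the model training happen there; only the notebook interface runs on Cloud Pak for Data.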
This diagram shows how data scientists working in a project on a Cloud Pak for Data cluster can run a notebook that trains a model on a Hadoop cluster, using data that resides on the Hadoop cluster.
Parent topic: Analyzing data and building models