Execution Engine for Apache Hadoop environments

You can create Hadoop environment definitions and Jupyter Enterprise Gateway (JEG) sessions in Watson Studio analytic projects to run jobs on the Hadoop cluster.

Execution Engine for Apache Hadoop environments are not available by default. An administrator must install the Execution Engine for Apache Hadoop service on the IBM Cloud Pak for Data platform. To determine whether the service is installed, open the Services catalog and check whether the service is enabled.

Use a Hadoop environment when you want to do the following tasks:

Note: Hadoop environments can be used with Jupyter notebooks, but not with notebooks in JupyterLab.

  • Train a model on the Hadoop cluster in a Jupyter notebook.
  • Manage a model on the Hadoop cluster with Hadoop integration utility methods within a Jupyter notebook.
  • Preview and refine Hadoop data (HDFS, Hive and Impala) in Watson Studio.
  • Run Data Refinery flows on the Hadoop cluster.
  • Schedule Python or R scripts as jobs to run on Hadoop clusters remotely.
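For example, previewing HDFS data ultimately comes down to a read against the cluster's WebHDFS REST API. The following is a minimal sketch, not the Watson Studio implementation; the NameNode host, port, and file path are hypothetical:

```python
from urllib.parse import urlencode

# Hypothetical NameNode WebHDFS endpoint; substitute your cluster's host and port.
WEBHDFS = "http://namenode:9870/webhdfs/v1"

def preview_url(hdfs_path, length=1024):
    """Build a WebHDFS OPEN URL that reads the first `length` bytes of a file."""
    params = urlencode({"op": "OPEN", "length": length})
    return f"{WEBHDFS}{hdfs_path}?{params}"

url = preview_url("/warehouse/sales.csv")
# → http://namenode:9870/webhdfs/v1/warehouse/sales.csv?op=OPEN&length=1024
# urllib.request.urlopen(url) would stream the bytes when the cluster is reachable
```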

If you are using Livy to run Spark jobs on Hadoop, don't use a Hadoop environment. Instead, use an environment that runs locally in the Cloud Pak for Data cluster.
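For context, Livy exposes a REST API for submitting Spark batch jobs from outside the cluster. The following is a rough sketch of such a submission; the Livy host and script path are placeholders, not values from this product:

```python
import json
from urllib import request

# Hypothetical Livy endpoint; substitute your cluster's host and port.
LIVY_URL = "http://livy-server:8998"

def build_batch_payload(app_path, queue="default"):
    """Build the JSON body for Livy's POST /batches endpoint."""
    return {
        "file": app_path,  # HDFS path to the Spark application
        "conf": {"spark.yarn.queue": queue},
    }

payload = build_batch_payload("hdfs:///apps/score.py", queue="analytics")
req = request.Request(
    f"{LIVY_URL}/batches",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# request.urlopen(req)  # uncomment when a Livy server is reachable
```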

Hadoop environment definitions

To create a Hadoop environment definition:

  1. From the Environments tab in your project, click New environment definition.
  2. Enter a name and a description.
  3. Select the Hadoop environment configuration type.
  4. Select one of the registered Hadoop systems in the Hadoop configuration drop-down list.
    • If the job requires Python packages that were pushed from Watson Studio Local or Cloud Pak for Data to the Hadoop cluster, you can choose the pushed image in the Software version field. Images are pushed from the Hadoop registration page.
  5. Use the YARN Queue field to select the queue in which your environment will run. This setting applies to both jobs and notebook runs.
    • By default, all executions are submitted to the default YARN queue. The Hadoop admin can configure and expose the list of available YARN queues that you can use.
    • The Hadoop admin must grant the Watson Studio user the necessary permissions to submit jobs to the specified YARN queue.
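Under the hood, the selected queue corresponds to the standard YARN queue option on Spark submissions. As a sketch of the equivalent command line (the queue and script names here are illustrative, not product defaults):

```python
import shlex

def spark_submit_cmd(app, queue="default", deploy_mode="cluster"):
    """Assemble a spark-submit invocation that targets a specific YARN queue."""
    return [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", deploy_mode,
        "--queue", queue,  # must be a queue the user is permitted to submit to
        app,
    ]

cmd = spark_submit_cmd("train.py", queue="analytics")
print(shlex.join(cmd))
# → spark-submit --master yarn --deploy-mode cluster --queue analytics train.py
```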
  6. Select the size of the environment for running your notebooks or jobs.
  7. After you save the new environment, you can select it as an environment for notebooks and jobs.