Execution Engine for Apache Hadoop environments

You can create Hadoop environment definitions and Jupyter Enterprise Gateway (JEG) sessions in Watson Studio analytic projects to run jobs on the Hadoop cluster.

Execution Engine for Apache Hadoop environments are not available by default. An administrator must install the Execution Engine for Apache Hadoop service on the IBM Cloud Pak for Data platform. To determine whether the service is installed, open the Services catalog and check whether the service is enabled.

Use a Hadoop environment when you want to run your notebooks and jobs on the Hadoop cluster.

Note: A Hadoop environment can be used with Jupyter notebooks, but it does not apply to notebooks in JupyterLab.

If you are using Livy to execute Spark jobs on Hadoop, don’t use the Hadoop environment. Instead, use the environment that executes locally in the Cloud Pak for Data cluster.

Hadoop environment definitions

To create a Hadoop environment definition:

  1. From the Environments tab in your project, click New environment definition.
  2. Enter a name and a description.
  3. Select the Hadoop environment configuration type.
  4. Select one of the registered Hadoop systems from the Hadoop configuration drop-down list.
    • If the job needs the Python packages that were pushed from Watson Studio Local or Cloud Pak for Data to the Hadoop cluster, you can select the image that was pushed from the Hadoop registration page in the Software version field.
  5. Use the YARN Queue field to select the queue that your environment runs in. This applies to both jobs and notebook runs; see the sketch after these steps for the Spark property that the queue selection corresponds to.
    • By default, all executions are submitted to the default YARN queue. The Hadoop admin can configure and expose the list of available YARN queues that you can use.
    • The Hadoop admin must grant the Watson Studio Local user the necessary permissions to submit jobs to the specified YARN queue.
  6. Select the size of the environment that you'll use to run your notebooks or jobs.
  7. After you save the new environment, you can select it as an environment for notebooks and jobs.
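In Spark on YARN, the queue that work is submitted to is typically specified through the standard Spark property spark.yarn.queue. The following PySpark sketch is not part of the Watson Studio UI; it only illustrates how a queue selection maps onto a Spark session, assuming a hypothetical queue named analytics that your Hadoop admin has exposed and granted you access to.

```python
from pyspark.sql import SparkSession

# Illustrative only: in Watson Studio the queue is chosen in the environment
# definition, not in notebook code. The queue name "analytics" and the
# application name are hypothetical values.
spark = (
    SparkSession.builder
    .appName("hadoop-environment-example")
    .config("spark.yarn.queue", "analytics")  # queue selected in the environment definition
    .getOrCreate()
)

print(spark.sparkContext.getConf().get("spark.yarn.queue"))
```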

Adding user settings

When you are working with large data sets in Hadoop or you need to fine-tune your Spark session, use User defined session variables. The variables are parameters that define additional Spark options, which are applied when you launch a notebook or run a job.

Before you can use the variables, your Hadoop admin must first define the list of available options and the value range for each option as part of configuring Hadoop. Contact your Hadoop admin to learn which options are available for you to configure. After you add new options, they take effect the next time you launch a notebook or run a job.
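The option names and allowed value ranges are whatever your Hadoop admin exposes, so the following is only a minimal sketch. It assumes the admin has exposed common Spark resource properties such as spark.executor.memory, spark.executor.cores, and spark.driver.memory, and shows the kind of Spark session settings that such session variables typically map to.

```python
from pyspark.sql import SparkSession

# Hypothetical values: the available option names and their allowed ranges are
# defined by your Hadoop admin. These common Spark resource properties are
# shown only to illustrate what session variables typically map to.
session_variables = {
    "spark.executor.memory": "4g",   # memory per executor
    "spark.executor.cores": "2",     # CPU cores per executor
    "spark.driver.memory": "2g",     # memory for the Spark driver
}

builder = SparkSession.builder.appName("session-variables-example")
for option, value in session_variables.items():
    builder = builder.config(option, value)
spark = builder.getOrCreate()
```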

To add new parameters to your Hadoop environment definition:

  1. In the User defined session variables section, click New session variable.
  2. Select the parameters and values. You can verify the applied values from a notebook, as shown in the sketch below.
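To check that the variables took effect, you can inspect the active Spark configuration from a notebook that runs in the Hadoop environment. A minimal sketch, assuming a SparkSession named spark is already available in the notebook, as is typical for Spark-backed kernels:

```python
# A sketch for verifying that session variables were applied, assuming a
# SparkSession named `spark` already exists in the notebook.
for key, value in sorted(spark.sparkContext.getConf().getAll()):
    if key.startswith(("spark.executor", "spark.driver", "spark.yarn.queue")):
        print(f"{key} = {value}")
```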