Execution Engine for Apache Hadoop environments

This service is not available by default. An administrator must install this service on the Watson Studio platform. To determine whether the service is installed, open the Services catalog and check whether the service is enabled.

Use the Hadoop environment when you want to do the following tasks:

If you are using Livy to execute Spark jobs on Hadoop, don’t use the Hadoop environment. Instead, use the environment that executes locally in the Watson Studio Local cluster.
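For reference, a Livy batch submission against the Hadoop cluster looks like the following minimal sketch. The endpoint URL and application path are placeholder assumptions; the payload keys (`file`, `queue`, `numExecutors`, `executorCores`, `executorMemory`) are standard fields of Livy's `POST /batches` API, and the helper function itself is hypothetical.

```python
import json

# Hypothetical Livy endpoint and application path; replace with your
# cluster's actual values.
LIVY_URL = "http://livy-host:8998/batches"

def build_livy_batch(app_file, queue="default", executors=1,
                     cores=1, memory="1g"):
    """Build the JSON payload for a Livy POST /batches request."""
    return {
        "file": app_file,            # application to run on the cluster
        "queue": queue,              # YARN queue to submit against
        "numExecutors": executors,
        "executorCores": cores,
        "executorMemory": memory,
    }

payload = build_livy_batch("hdfs:///jobs/wordcount.py")
print(json.dumps(payload, indent=2))
# Submit with, for example:
#   requests.post(LIVY_URL, json=payload,
#                 headers={"Content-Type": "application/json"})
```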

Hadoop environment definitions

To create a Hadoop environment definition:

  1. From the Environments tab in your project, click New environment definition.
  2. Enter a name and a description.
  3. Select the Hadoop environment configuration type.
  4. Select one of the registered Hadoop systems in the Hadoop configuration drop-down list.
    • If an image with the Python packages from Watson Studio Local or Cloud Pak for Data was pushed to the Hadoop cluster from the Hadoop registration page, you can choose that image in the Software version field.
  5. If you want the job to execute against a specific YARN queue, you can specify that in the YARN Queue field.
    • If you don't specify a queue, jobs are submitted against the default YARN queue.
    • The Hadoop admin must grant the Watson Studio Local user the necessary permissions to submit jobs against the specified YARN queue.
  6. Select the executors. By default, jobs use one executor with one core and 1 GB of memory for both the driver and the executor. Depending on the resource requirements and the parallelism your job needs, you can increase the number of executors, cores, and memory for the environment.
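The queue and executor settings in steps 5 and 6 correspond to standard spark-submit options on YARN. As a sketch, assuming a job submitted this way (the flag names are standard Spark options; the helper function and its defaults, which mirror step 6, are hypothetical):

```python
def spark_submit_args(executors=1, cores=1, memory="1g", queue="default"):
    """Map the environment definition's settings to spark-submit flags."""
    return [
        "--master", "yarn",
        "--queue", queue,                   # YARN Queue field (step 5)
        "--num-executors", str(executors),  # executors (step 6)
        "--executor-cores", str(cores),     # cores per executor (step 6)
        "--executor-memory", memory,        # executor memory (step 6)
        "--driver-memory", memory,          # driver also defaults to 1 GB
    ]

# Defaults match the environment definition's defaults: one executor,
# one core, 1 GB for driver and executor.
print("spark-submit " + " ".join(spark_submit_args()))
```

Increasing the executors, cores, or memory in the environment definition is equivalent to raising these values here.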