Machine learning models on a remote Apache Hadoop cluster in Jupyter Python

You can build and train models by using distributed Spark compute on a remote Hadoop cluster with secure access to the data. Access to the Hadoop cluster must be set up through the Execution Engine for Apache Hadoop service. You can use either Jupyter Enterprise Gateway (JEG) or Livy to access Hadoop Spark to build and train models.

Tasks that you can perform:

  • Building models with JEG
  • Building models with Livy

Using the Python environment

For both JEG and Livy, you can use the model frameworks that are available in Watson Studio when you build models on Apache Hadoop. Your administrator must have pushed the Python environment image to the Hadoop cluster.

Hadoop utilities library

The hadoop_lib_utils and hi_core_utils libraries provide useful functions that can be called within a Jupyter notebook to help you work with models. See Using Hadoop utilities for Python for more information.

Building models with JEG

Before you build models with JEG, you must create a Hadoop environment. The settings of the environment control:

  • Target Hadoop system
  • Pushed Python environment used for the execution
  • YARN queue against which the Spark YARN job is submitted
  • Spark resources that are allocated to the session

Perform the following tasks to build and train a model using JEG:

Start a Jupyter notebook

Open or create a Jupyter notebook. In the Select Runtime drop-down list, choose the Hadoop environment. This starts a JEG kernel securely with your user identity, so access to the data and resources on the Hadoop cluster is restricted based on your authorization.

If you've upgraded to Cloud Pak for Data versions 4.7.0, 4.7.1, or 4.7.2, you must run the following commands as the first cell after the kernel starts:

# Run as the first cell after the kernel starts to re-create the Spark session and context
from pyspark.sql import SparkSession
from pyspark import SparkContext

spark = SparkSession.builder.getOrCreate()
sc = SparkContext.getOrCreate()
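
As an optional check (not part of the original procedure), you can confirm that the session is active and connected to the remote cluster:

# Confirm that the Spark session is available and note the YARN application ID
print(spark.version)
print(sc.applicationId)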

By using the shell action (lines that start with !), you can run commands on the YARN NodeManager where the kernel runs. For example, you can use the following commands:

!hostname -f
!hdfs dfs -ls /user

Work with data

You can read data from an HDFS file or a Hive table by using the Spark session.

HDFS file

df_data_1 = spark.read.format(
    "org.apache.spark.sql.execution.datasources.csv.CSVFileFormat").option(
    "header", "true").option("inferSchema", "true").load("hdfs:///user/cars.csv")

Hive table

	df_data_1 = spark.sql("select * from cars")

You can perform operations on the Spark DataFrame to transform and manipulate data.
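
For example, a minimal sketch of a transformation, assuming that the cars data set has hypothetical mpg and cylinders columns:

# Filter and aggregate the DataFrame; the column names are illustrative
df_data_1.filter(df_data_1["mpg"] > 20).groupBy("cylinders").count().show()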

Build and train a model

You can split the data for training and evaluation and choose the model framework and algorithm to build and evaluate the model.
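
The following is one possible sketch that uses Spark MLlib with a random split and a linear regression model; the feature and label column names (cylinders, weight, mpg) are hypothetical:

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

# Split the data for training and evaluation
train_df, test_df = df_data_1.randomSplit([0.8, 0.2], seed=42)

# Assemble hypothetical feature columns into a single features vector
assembler = VectorAssembler(inputCols=["cylinders", "weight"], outputCol="features")
lr = LinearRegression(featuresCol="features", labelCol="mpg")

# Train on the training split and evaluate on the held-out split
model = lr.fit(assembler.transform(train_df))
print(model.evaluate(assembler.transform(test_df)).rootMeanSquaredError)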

Save the model to HDFS

You can save the model to HDFS by calling the write_model_to_hdfs function in the hi_core_utils library:

hi_core_utils.write_model_to_hdfs(model=model, model_name="modelname")

This call outputs the location where the model is saved on HDFS. A tar.gz archive of the model is also saved in that location.

Transfer models to Watson Studio

To transfer the model that is saved on HDFS, open a notebook with a default environment and call the Hadoop utility function to transfer the file from HDFS to Watson Studio.

# webhdfs_endpoint is the WebHDFS endpoint of the registered Hadoop system
hdfs_model_dir = "/user/{your username}/.dsxhi/models/{modelname}/{modelversion}/model.tar.gz"
wsl_model_path = "/project_data/data_asset/model.tar.gz"
hadoop_lib_utils.download_hdfs_file(webhdfs_endpoint, hdfs_model_dir, wsl_model_path)
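
After the transfer, you can unpack the archive in the project file system with the standard tarfile module; how you then load the model depends on the framework that you used, and the extraction directory shown here is only an example:

import tarfile

# Unpack the downloaded model archive; the extraction directory is illustrative
with tarfile.open(wsl_model_path, "r:gz") as archive:
    archive.extractall("/project_data/data_asset/model")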

Building models with Livy

Start a Jupyter notebook

Open or create a Jupyter notebook. In the Select Runtime drop-down list, choose the Default Python 3.10 environment. Do not choose a Hadoop environment.

Create a remote Livy session

Import the hadoop_lib_utils Python library and call the get_dsxhi_info utility method to get information about the registered Hadoop systems on the Watson Studio cluster, the available services, and the Python environments that are pushed to HDFS. Use the following command:

HI_SYSTEMS = hadoop_lib_utils.get_dsxhi_info(showSummary=True)

You can create a config object to specify additional Spark configurations, such as the YARN queue, driver memory, executor memory, and number of executors. For example:

myConfig = {
  "queue": "default",
  "driverMemory": "1G",
  "numExecutors": 2
}

Call the setup_livy_sparkmagic function to set up the authentication, Hadoop cluster information, and additional configs for the Livy connection. If you want to use a pushed Python environment, specify it in the imageId parameter:

HI_CONFIG = hadoop_lib_utils.setup_livy_sparkmagic(
  system="systemName",
  livy="livyspark2",
  addlConfig=myConfig,
  imageId="jupyter-231n-py")

# (Re-)load sparkmagic to apply the new configs.
%reload_ext sparkmagic.magics

To create a Livy session, run:

session_name = "mysession"
livy_endpoint = HI_CONFIG['LIVY']
%spark add -s $session_name -l python -k -u $livy_endpoint
Note: Hadoop Execution Engine (HEE) does not support any Livy Spark sessions that are created using the Runtime 24.1 on Python 3.11 for CPD 5.0.0.

This starts a YARN application securely with your user identity, so access to the data and resources on the Hadoop cluster is restricted based on your authorization. If the call returns a 500 error, check the Resource Manager UI on the Hadoop cluster. If the YARN application was created successfully, you can ignore the 500 error.

After a Livy session is established, to run code on the Hadoop cluster, the first line in the cell must be %%spark -s $session_name. Cells that do not start with %%spark run locally on the Watson Studio cluster.
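
For example, the following cell runs remotely in the Livy session, while a cell without the magic runs locally:

%%spark -s $session_name
# This cell runs on the remote Hadoop cluster inside the Livy session
print(spark.version)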

Work with data

You can read data from an HDFS file or a Hive table by using the Spark session.

HDFS file

%%spark -s $session_name
df_data_1 = spark.read.format(
    "org.apache.spark.sql.execution.datasources.csv.CSVFileFormat").option(
    "header", "true").option("inferSchema", "true").load("hdfs:///user/cars.csv")

Hive table

%%spark -s $session_name
df_data_1 = spark.sql("select * from cars")

You can perform operations on the Spark DataFrame to transform and manipulate data.
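
For example, a minimal transformation that runs inside the Livy session; the column names are again hypothetical:

%%spark -s $session_name
# Filter and aggregate on the remote cluster; the column names are illustrative
df_data_1.filter(df_data_1["mpg"] > 20).groupBy("cylinders").count().show()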

Build and train a model

You can split the data for training and evaluation and choose the model framework and algorithm to build and evaluate the model.
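
A sketch similar to the JEG example, run inside the Livy session with hypothetical feature and label columns:

%%spark -s $session_name
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

# Split the data, assemble hypothetical feature columns, and fit a model remotely
train_df, test_df = df_data_1.randomSplit([0.8, 0.2], seed=42)
assembler = VectorAssembler(inputCols=["cylinders", "weight"], outputCol="features")
model = LinearRegression(featuresCol="features", labelCol="mpg").fit(assembler.transform(train_df))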

Save the model to HDFS

You can save the model to HDFS by calling the write_model_to_hdfs function in the hi_core_utils library:

%%spark -s $session_name
hi_core_utils.write_model_to_hdfs(model=model, model_name="modelname")

This call outputs the location where the model is saved on HDFS.

Transfer models to Watson Studio

You can use the Hadoop utility function to transfer the model from HDFS to Watson Studio. Note that this cell runs locally on the Watson Studio cluster and must not start with %%spark.

hdfs_model_dir = "/user/{your username}/.dsxhi/models/{modelname}/{modelversion}/model.tar.gz"
wsl_model_path = "/project_data/data_asset/model.tar.gz"
hadoop_lib_utils.download_hdfs_file(webhdfs_endpoint, hdfs_model_dir, wsl_model_path)

Clean up the Livy session

Clean up the Livy session to free the YARN resources that are held on the remote Hadoop system by calling %spark cleanup.
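
For example:

# Delete the Livy session and release the YARN resources that it holds
%spark cleanup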

High Availability

If a node fails, any active JEG or Livy sessions on that node must be restarted, and their work must be run again.

Load Balancing

JEG and Livy sessions are allocated with sticky sessions, following an active/passive approach. A session runs on the same Execution Engine for Apache Hadoop service node until a failure is detected, at which point all new sessions are allocated to the next available Execution Engine for Apache Hadoop service node.

Parent topic: Apache Hadoop