Machine learning models on a remote Apache Hadoop cluster in Jupyter Python

You can build and train models by leveraging distributed Spark compute on a remote Hadoop cluster with secure access to the data. Access to the Hadoop cluster must be set up with the Execution Engine for Apache Hadoop service. You can use either Jupyter Enterprise Gateway (JEG) or Livy to access Hadoop Spark to build and train models.


Using the Python environment

For both JEG and Livy, you can use the model frameworks available in Watson Studio when you build models on Apache Hadoop. Your admin must first push the image for the Python environment to the Hadoop cluster.

Hadoop utilities library

The hadoop_lib_utils and hi_core_utils libraries provide functions that you can call within a Jupyter notebook to help you work with models. See Using Hadoop utilities for Python for more information.

Building models with JEG

Before you build models with JEG, you must create a Hadoop environment. The environment settings control how the notebook runtime runs on the remote Hadoop cluster.

Perform the following tasks to build and train a model using JEG:

Start a Jupyter notebook

Open or create a Jupyter notebook. In the Select Runtime drop-down, choose the Hadoop environment. This starts a JEG kernel securely with your user identity, so access to data and resources on the Hadoop cluster is restricted based on your authorization. Using the shell action (the ! prefix), you can run commands on the YARN NodeManager. For example, you can run the following commands:

!hostname -f
!hdfs dfs -ls /user

Work with data

You can read data from an HDFS file or a Hive table using the Spark session.

HDFS file

    df_data_1 = spark.read.format(
        "org.apache.spark.sql.execution.datasources.csv.CSVFileFormat").option(
        "header", "true").option("inferSchema", "true").load("hdfs:///user/cars.csv")

Hive table

    df_data_1 = spark.sql("select * from cars")

You can perform operations on the Spark DataFrame to transform and manipulate data, as in the sketch below.
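For example, a minimal sketch of a few DataFrame transformations, assuming the df_data_1 DataFrame loaded above; the column names mpg and weight are illustrative assumptions about the cars.csv schema, not part of the original example:

    from pyspark.sql import functions as F

    # Hypothetical transformations; mpg and weight are assumed column names.
    df_clean = (df_data_1
                .dropna()                                            # drop rows with missing values
                .withColumn("weight_kg", F.col("weight") * 0.4536)   # add a derived column
                .filter(F.col("mpg") > 0))                           # keep only valid rows
    df_clean.show(5)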

Build and train a model

You can split the data for training and evaluation and choose the model framework and algorithm to build and evaluate the model, as in the sketch below.
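As a hedged illustration, the following sketch continues from the df_clean DataFrame above, splits it for training and evaluation, and fits a Spark ML model; the LinearRegression estimator and the feature and label columns (cylinders, weight, mpg) are assumptions chosen for this example, not prescribed by these steps:

    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.regression import LinearRegression
    from pyspark.ml.evaluation import RegressionEvaluator

    # Assemble the assumed feature columns into a single vector column.
    assembler = VectorAssembler(inputCols=["cylinders", "weight"], outputCol="features")
    df_features = assembler.transform(df_clean).select("features", "mpg")

    # Split the data for training and evaluation.
    train_df, test_df = df_features.randomSplit([0.8, 0.2], seed=42)

    # Fit the model and evaluate it on the held-out data.
    model = LinearRegression(labelCol="mpg", featuresCol="features").fit(train_df)
    evaluator = RegressionEvaluator(labelCol="mpg", metricName="rmse")
    print("RMSE:", evaluator.evaluate(model.transform(test_df)))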

Save the model to HDFS

You can save the model to HDFS by calling the write_model_to_hdfs Python function in the hi_core_utils library:

hi_core_utils.write_model_to_hdfs(model=model, model_name="modelname")

This call outputs the location where the model is saved on HDFS. A tar.gz archive of the model is also saved in that location.

Transfer models to Watson Studio

To transfer the model saved on HDFS, open a notebook with a Default environment and call the Hadoop utility function to download the file from HDFS to Watson Studio:

# webhdfs_endpoint is the WebHDFS URL of the registered Hadoop system.
hdfs_model_dir = "/user/{your username}/.dsxhi/models/{modelname}/{modelversion}/model.tar.gz"
wsl_model_path = "/project_data/data_asset/model.tar.gz"
hadoop_lib_utils.download_hdfs_file(webhdfs_endpoint, hdfs_model_dir, wsl_model_path)

Building models with Livy

Start a Jupyter notebook

Open or create a Jupyter notebook. In the Select Runtime drop-down, choose the Default Python 3.6 environment. Do not choose a Hadoop environment.

Create a remote Livy session

Import the hadoop_lib_utils Python library and call the get_dsxhi_info utility method to get information about the registered Hadoop systems on the Watson Studio cluster, the available services, and the Python environments that are pushed to HDFS:

import hadoop_lib_utils

HI_SYSTEMS = hadoop_lib_utils.get_dsxhi_info(showSummary=True)

You can create a config object to specify additional Spark configurations, such as the YARN queue, driver memory, executor memory, and number of executors. For example:

myConfig = {
    "queue": "default",
    "driverMemory": "1G",
    "numExecutors": 2
}

Call the setup_livy_sparkmagic function to set up the authentication, Hadoop cluster information, and additional configurations for the Livy connection. If you want to use a pushed Python environment, specify it in the imageId parameter:

HI_CONFIG = hadoop_lib_utils.setup_livy_sparkmagic(
  system="systemName",
  livy="livyspark2",
  addlConfig=myConfig,
  imageId="jupyter-py36")

# (Re-)load sparkmagic to apply the new configs.
%reload_ext sparkmagic.magics

To create a Livy session, run:

session_name = "mysession"
livy_endpoint = HI_CONFIG['LIVY']
%spark add -s $session_name -l python -k -u $livy_endpoint

This starts a YARN application securely with your user identity, so access to data and resources on the Hadoop cluster is restricted based on your authorization. If the call returns a 500 error, check the Resource Manager UI on the Hadoop cluster; if the YARN application was created successfully, you can ignore the 500 error.

After a Livy session is established, to run code on the Hadoop cluster, the first line of the cell must be %%spark -s $session_name. Cells that do not start with %%spark run locally on the Watson Studio cluster, as in the example below.
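For example, a minimal sketch of a remote cell followed by a local cell; the print statements are illustrative only:

%%spark -s $session_name
# This cell runs remotely on the Hadoop cluster inside the Livy session.
print(spark.version)

# This cell has no %%spark magic, so it runs locally on the Watson Studio cluster.
import platform
print(platform.node())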

Work with data

You can read data from an HDFS file or a Hive table using the Spark session.

HDFS file

%%spark -s $session_name
df_data_1 = spark.read.format(
    "org.apache.spark.sql.execution.datasources.csv.CSVFileFormat").option(
    "header", "true").option("inferSchema", "true").load("hdfs:///user/cars.csv")

Hive table

%%spark -s $session_name 
df_data_1 = spark.sql("select * from cars")

You can perform operations on the Spark DataFrame to transform and manipulate data.

Build and train a model

You can split the data for training and evaluation and choose the model framework and algorithm to build and evaluate the model, as in the sketch below.
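As a hedged illustration, the following sketch mirrors the JEG example but runs inside the Livy session; the LinearRegression estimator and the column names (mpg, cylinders, weight) are assumptions about the cars data, not part of the original steps:

%%spark -s $session_name
# Minimal Spark ML sketch in the remote session; column names are assumed.
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

assembler = VectorAssembler(inputCols=["cylinders", "weight"], outputCol="features")
df_features = assembler.transform(df_data_1.dropna()).select("features", "mpg")

train_df, test_df = df_features.randomSplit([0.8, 0.2], seed=42)
model = LinearRegression(labelCol="mpg", featuresCol="features").fit(train_df)
print("Training RMSE:", model.summary.rootMeanSquaredError)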

Save the model to HDFS

You can save the model to HDFS by calling the write_model_to_hdfs Python function in the hi_core_utils library:

%%spark -s $session_name 
hi_core_utils.write_model_to_hdfs(model=model, model_name="modelname")

This call outputs the location where the model is saved on HDFS.

Transfer models to Watson Studio

You can use the Hadoop utility function to transfer the model from HDFS to Watson Studio. Note that this cell runs locally on the Watson Studio cluster and must not start with %%spark.

# webhdfs_endpoint is the WebHDFS URL of the registered Hadoop system.
hdfs_model_dir = "/user/{your username}/.dsxhi/models/{modelname}/{modelversion}/model.tar.gz"
wsl_model_path = "/project_data/data_asset/model.tar.gz"
hadoop_lib_utils.download_hdfs_file(webhdfs_endpoint, hdfs_model_dir, wsl_model_path)

Clean up the Livy session

Clean up the Livy session to free the YARN resources that are held on the remote Hadoop system:

%spark cleanup

High Availability

If an Execution Engine for Apache Hadoop service node fails, any active JEG or Livy sessions that resided on that node must be restarted and run again.

Load Balancing

JEG and Livy sessions are allocated with sticky sessions, following an active/passive approach. A session runs on the same Execution Engine for Apache Hadoop service node until a failure is detected, at which point all new sessions are allocated to the next available Execution Engine for Apache Hadoop service node.

Parent topic: Apache Hadoop