Hadoop integration
The Watson Studio Local Hadoop Integration Service is a registration service that can be installed on a Hadoop edge node to allow Watson Studio Local Version 1.2 or later clusters to securely access data residing on the Hadoop cluster, submit interactive Spark jobs, build models, and schedule jobs that run as a YARN application on the Hadoop cluster.
A Watson Studio Local administrator can then perform the following tasks in the Admin Console:
- Register a Hadoop cluster
- View details about the registered Hadoop cluster
- Push runtime images to the registered Hadoop cluster
- Work with images on the Hadoop cluster
Register a Hadoop cluster
In the Admin Console, click the menu icon and then click Hadoop Integration to register your Hadoop clusters and create images for virtual runtime environments on them. When a Hadoop registration service is registered in Watson Studio Local, HDFS and Hive data sources are created automatically, depending on the configuration of the registered Hadoop system. Watson Studio Local users can then list the registered Hadoop Livy endpoints for use with Jupyter, RStudio, and Zeppelin, work with the data sources and Livy, and select registered Hadoop images as workers to submit Python jobs to them remotely.
To register the endpoint for a new Hadoop cluster installation, click Add Registration. Name the registration and provide the authentication information. If the registration fails, check the kubectl logs of the utils-api pod for errors.
View details about the registered Hadoop cluster
In the Details page of each registration, you can view its endpoints and runtimes.
Depending on the services that are exposed when installing and configuring the Hadoop registration service on the Hadoop edge node, the details page lists the WebHDFS, WebHCAT, Livy for Spark and Livy for Spark 2 URLs exposed by the Hadoop registration.
- If the WebHDFS or WebHCAT services (or both) are exposed in conjunction with the Livy service, Watson Studio Local users can work with the data sources associated with these services without having to explicitly create them (see the sketch after this list).
- If the Livy for Spark or Livy for Spark 2 services (or both) are exposed, Watson Studio Local users can list these endpoints through dsx_core_utils and dsxCoreUtilsR, and use them as defined Livy endpoints in Jupyter, RStudio, and Zeppelin notebooks.
Python syntax:
%python
import dsx_core_utils
dsx_core_utils.list_dsxhi_livy_endpoints()
R syntax:
library('dsxCoreUtilsR')
dsxCoreUtilsR::listDSXHILivyEndpoints()
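To illustrate the first case, the WebHDFS URL shown on the details page can also be called directly with the standard WebHDFS REST API. The following is a minimal sketch: the gateway host, topology, directory path, and credentials are placeholders, and the authentication details depend on how the Hadoop registration service is secured.
import requests
# Placeholder WebHDFS URL copied from the registration's details page
webhdfs_url = "https://<gateway-host>:8443/gateway/<topology>/webhdfs/v1"
# List an HDFS directory with the WebHDFS LISTSTATUS operation
resp = requests.get(webhdfs_url + "/user/<hdfs-user>?op=LISTSTATUS",
                    auth=("<user>", "<password>"), verify=False)
print(resp.json())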
If a registered endpoint changes, you can refresh the registration.
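The registered systems and the endpoints they expose can also be listed from a notebook. A minimal sketch, using the same dsx_core_utils call that appears in the larger example at the end of this section:
import dsx_core_utils
# Print a summary of the registered Hadoop systems and their exposed endpoints
dsx_core_utils.get_dsxhi_info(showSummary=True)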
Push runtime images to the registered Hadoop cluster
- To push Jupyter GPU images, Watson Studio Local Version 1.2.3.1 Patch 12 must be installed.
- The Watson Studio Local cluster and the Hadoop cluster must be on the same platform architecture to work with the runtime images on Hadoop.
The Watson Studio Local administrator can view the default images and the custom images created by Watson Studio Local users, and can push or replace each image on the Hadoop cluster. To push a runtime image to the registered Hadoop cluster, click Push next to the image on the registration's details page. Note that pushing an image can take a long time. If you modified a runtime image locally, you can update it on the remote cluster by clicking Replace Image next to it.
Runtimes can have the following statuses:
- Available on Watson Studio Local, but not pushed to the registered Hadoop cluster. Users can either push or refresh the environment to the registered Hadoop cluster.
- Pending transfer from Watson Studio Local to the registered Hadoop cluster.
- Failed transfer from Watson Studio Local to the registered Hadoop cluster.
- Available on the registered Hadoop cluster. Watson Studio Local users can select the remote image as a worker, select a Target host, and submit jobs to it.
Work with images on the Hadoop cluster
Watson Studio Local users can work with the pushed images in the notebooks and jobs environment. Example scenario:
- The Watson Studio Local user configures their worker page to select the custom image for the worker.
- The Watson Studio Local user creates a script run job using the customized worker, and selects the registered Hadoop system as the target host.
- The Watson Studio Local user runs the job, selecting the registered Hadoop system as the target host.
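For notebook work, a user can also connect to a registered Livy endpoint directly with sparkmagic and a pushed image, and then clean up the remote session when finished, as in the following example: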
# List the registered Hadoop systems and the endpoints they expose
import dsx_core_utils
dsx_core_utils.get_dsxhi_info(showSummary=True)

# Additional Livy session properties to pass to the remote session
myConfig = {
  "queue": "default",
  "driverMemory": "4G",
  "numExecutors": 3,
  "conf": {"livy.rsc.server.connect.timeout": "300s"}
}

# Configure sparkmagic to use the registered Livy for Spark 2 endpoint and the pushed image
dsx_core_utils.setup_livy_sparkmagic(system='edge', livy='livyspark2',
    imageId='arrow-730-dsx-scripted-ml-python2', addlConfig=myConfig)

# Reload sparkmagic so the new configuration takes effect, and allow 300 seconds for session startup
%reload_ext sparkmagic.magics
import sparkmagic.utils.configuration as sm_conf
sm_conf.override(sm_conf.livy_session_startup_timeout_seconds.__name__, 300)

# Create a remote Spark session on the registered Hadoop cluster
%spark add -s session01 -l python -u https://9.87.654.321:8443/gateway/9.87.654.322/livy2/v1 -k

%%spark
...data analysis...

# Delete the session when the analysis is done (same session name and Livy URL as above)
%spark delete -s session01 -u https://9.87.654.321:8443/gateway/9.87.654.322/livy2/v1 -k