
IBM Data Science Experience Local and other external notebook integrations with IBM Spectrum Conductor

Technical Blog Post



Use case: Running workload from external notebooks in IBM Spectrum Conductor with Spark (hereafter called IBM Spectrum Conductor)

 

Notebooks are a very popular way to develop new applications. IBM Spectrum Conductor provides an extensive range of built-in notebooks, as well as a pluggable interface for extended and custom notebook packages. In some cases, however, users work with external notebooks that run as standalone servers or inside other products. Those users can still submit Spark workload from the external notebook to IBM Spectrum Conductor and take advantage of the product's cluster performance and scalability in a secure multi-tenant environment.

 

IBM Spectrum Conductor provides native remote network interfaces for Spark workload submission and monitoring in “client” mode (the interactive mode that most notebooks require). External notebooks cannot always use client mode directly, because this approach requires a Spark instance group-specific remote client deployment to be installed on the external host and to be locally accessible to the external notebook server. To avoid exporting the Spark instance group client to the external host, you can instead use the RESTful servers provided by the Apache Livy project to create a RESTful end-point per individual Spark instance group, and then call that end-point’s RESTful API from inside the notebooks.
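
To illustrate the direct RESTful usage, the following is a minimal sketch that drives a Livy end-point with the Python requests library. The livy_url value and the sample statement are hypothetical placeholders; substitute the livy_URL value that your Livy application instance reports (see the procedure below):

# A minimal sketch of calling the Apache Livy REST API directly with the
# Python "requests" library. The livy_url value is a hypothetical placeholder.
import json
import time
import requests

livy_url = "http://livyhost:8998"  # placeholder for your livy_URL value
headers = {"Content-Type": "application/json"}

# Create an interactive PySpark session.
resp = requests.post(livy_url + "/sessions",
                     data=json.dumps({"kind": "pyspark"}), headers=headers)
session_id = resp.json()["id"]

# Wait for the session to become idle (the Spark driver is starting).
while requests.get("%s/sessions/%s" % (livy_url, session_id)).json()["state"] != "idle":
    time.sleep(5)

# Submit a statement and poll until its output is available.
resp = requests.post("%s/sessions/%s/statements" % (livy_url, session_id),
                     data=json.dumps({"code": "sc.parallelize(range(1000)).count()"}),
                     headers=headers)
statement_id = resp.json()["id"]
result = requests.get("%s/sessions/%s/statements/%s" % (livy_url, session_id, statement_id)).json()
while result["state"] != "available":
    time.sleep(1)
    result = requests.get("%s/sessions/%s/statements/%s" % (livy_url, session_id, statement_id)).json()
print(result["output"])

# Delete the session to release its Spark driver and resources.
requests.delete("%s/sessions/%s" % (livy_url, session_id), headers=headers)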

 

This blog describes the procedure and provides end-to-end examples of how to enable the external notebook use case described above. The image below is an architectural diagram of the use case: integrating external notebooks from IBM DSX Local and submitting Spark workload to a Spark instance group in an IBM Spectrum Conductor cluster.

 

IBM DSX Local provides a great front end, and IBM Spectrum Conductor is an enterprise-grade Spark engine that supports high scale, multi-tenancy, and intelligent scheduling policies.

[Image: Architectural diagram of IBM DSX Local notebooks submitting Spark workload through Apache Livy to a Spark instance group in an IBM Spectrum Conductor cluster]

 

Procedure:

1. You can use either an existing Spark instance group or create a new one to be used with the external notebook (or group of notebooks). Note that workload submitted by all external notebooks to the same RESTful end-point (Apache Livy server) is assigned to a single corresponding Spark instance group and follows that group's user authentication and user impersonation settings, defined resource sharing policies, and so on. You can therefore deploy multiple Apache Livy servers as different Livy application instances in IBM Spectrum Conductor, each assigned to a separate Spark instance group.

 

2. Use the sample “IBM Conductor application for integration with Apache Livy” from the IBM Spectrum Conductor public GitHub on IBM Cloud. Follow its readme (README.md) to deploy a Livy application instance that is assigned to a specific existing Spark instance group. As a result, you should have a Livy instance up and running, and its outputs value (available on the Overview tab in the cluster management console) shows the end-point location as livy_URL. For example:

 

[Image: Overview tab of the Livy application instance in the cluster management console, showing the livy_URL output value]
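
As a quick check (a suggestion of this blog, not a step in the sample's readme), you can verify from a Python prompt that the end-point responds; Livy's GET /sessions call returns the current session list. The URL below is a placeholder for your livy_URL value:

import requests

# GET /sessions lists active Livy sessions; on a fresh instance it returns
# something like {"from":0,"total":0,"sessions":[]}.
print(requests.get("http://livyhost:8998/sessions").json())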

 

3. The Apache Livy RESTful API can be used either directly or programmatically. For external notebook usage, the common and easy way is to install the RESTful API integration in the notebook as a “sparkmagic” extension (see the open-source project at https://github.com/jupyter-incubator/sparkmagic). The extension is a prebuilt library package that is available from external open-source repositories (for example, PyPI) and, in most common notebook cases, can be installed directly from inside the notebook.

 

For example, in a Jupyter notebook running a Python kernel, sparkmagic can be installed from inside the notebook by using the following command:

!pip install sparkmagic

 

4. To access the Spark instance group through the Apache Livy end-point, you need to load the client library, create a Livy session, and then use the session for Spark job submission. The “sparkmagic” commands help to automate this process.

 

For example, for a Jupyter notebook, use the following steps:

a) Load the “sparkmagic” extension.

%load_ext sparkmagic.magics

 

b) Create a Livy session by using the livy_URL value from the application instance. You can perform this either through the notebook interface by typing:

                %manage_spark

           or explicitly in a command line by entering, for example:

%spark add -s <session> -l python -u <livy_URL> -a u -k
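
For example, assuming a hypothetical end-point of http://livyhost:8998 and an arbitrary session name of session1, the command might look like:

%spark add -s session1 -l python -u http://livyhost:8998 -a u -k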

 

c) After the Livy session is created, you can use it to run any Spark code in a notebook cell by starting the cell with the magic %%spark -l <language> (where the language for the Livy session can be one of Python, Scala, or R).

For example:

%%spark

sc.parallelize(range(1000)).count()

 

d) Finally, clean up the Livy session to release the associated resources:

%spark cleanup

 

e) To see all the available options for the sparkmagic extension inside the notebook, type:

      %spark?

 

f) The GitHub repository for the “sparkmagic” project includes a few Jupyter notebooks that illustrate its usage. The IBM DSX Local product documentation also provides a manual and sample notebooks for connecting to a remote Spark cluster, which can be used to communicate with IBM Spectrum Conductor through this Livy application instance integration.

 

g) The Apache Livy server has its own security settings (HTTPS, and so on), which are disabled by default and can be enabled in the livy-0.4.0-incubating-bin/conf/livy.conf file under the Livy application instance deployment location in IBM Spectrum Conductor (see the Apache Livy documentation for details). On the IBM Spectrum Conductor side, the security mode enabled in the Spark instance group associated with the corresponding Livy application instance might require passing the Spark user’s authentication parameters through IBM Spectrum Conductor-specific Spark properties, such as spark.ego.uname, spark.ego.passwd, or spark.ego.credentials. For the “sparkmagic” extension, these Spark properties can be easily configured either in the notebook server startup configuration or interactively in the notebook itself. For example, in Jupyter, the authentication properties can be added to the Livy session configuration in JSON format:

            %%spark config

            {"conf": {"spark.ego.uname": "Admin","spark.ego.passwd": "Admin" }}

 

5. On the IBM Spectrum Conductor cluster side, the list of Livy sessions and their details can be accessed in the Livy UI. To get directly to the Livy UI, click the livy_URL value string on the Overview tab of the Livy application instance in the cluster management console. The Livy UI provides full details on each interaction inside a session, including the input and the output returned:

[Image: Livy UI showing the session list and per-statement input and output]
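
The same per-statement details are also available programmatically; the following is a small sketch (reusing the hypothetical livy_url placeholder from earlier) that lists each statement's input code and returned output through the REST API:

import requests

livy_url = "http://livyhost:8998"  # placeholder for your livy_URL value
session_id = 0                     # hypothetical session id from the Livy UI

# GET /sessions/<id>/statements returns every statement in the session,
# including the submitted code and the output it produced.
for stmt in requests.get("%s/sessions/%s/statements" % (livy_url, session_id)).json()["statements"]:
    print(stmt["code"], "->", stmt["output"])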

 

6. Each Livy session starts its own Spark driver in the corresponding Spark instance group, so all Spark metrics for the application can be viewed through the Spark instance group in the cluster management console, as well as in the open-source Spark UI.

[Image: Spark application details for the Livy session driver in the cluster management console]

 

 

  

[{"Business Unit":{"code":"BU059","label":"IBM Software w\/o TPS"},"Product":{"code":"SS4H63","label":"IBM Spectrum Conductor"},"Component":"","Platform":[{"code":"PF025","label":"Platform Independent"}],"Version":"","Edition":"","Line of Business":{"code":"LOB10","label":"Data and AI"}}]

UID

ibm16163515