Technical Blog Post
Abstract
Lifecycle of notebook services in Platform Conductor
Body
Notebooks are popular open-source web applications that provide data integration, visualization, and statistical modeling capabilities. IBM Platform Conductor for Spark (Platform Conductor) integrates notebook functionality and takes advantage of a notebook's graphical user interface to manipulate and visualize data.
Platform Conductor is designed to run notebooks as long-running services. Once a user is assigned a notebook, such as Zeppelin or IPython-Jupyter, the user can use notebook features such as data analytics and data visualization.
So how does a notebook work in Platform Conductor?
The flow chart below demonstrates the lifecycle of a Spark instance group, from creating a notebook type to deleting that notebook type. In this post, we are going to take a closer look at the process within the dotted lines: assigning and unassigning users to notebooks, and using notebooks.
When a user is assigned to a notebook, Platform Conductor allocates resources to create the notebook service, then dispatches that service to an available node in the cluster. When the notebook service is first created, it is listed in the "DEFINED" state. The diagram below illustrates the lifecycle of a notebook service: states are shown inside each circle (in white), and the action that triggers each state transition is shown outside the circle (in green).
Once started, the notebook service quickly transitions to the "STARTED" state, meaning that the notebook daemon started successfully and the notebook is ready for the assigned user. You can now launch the notebook from the browser.
Here’s a sample interface of the built-in Zeppelin notebook:
Notebooks introduce the notion of "interpreters": plug-ins for specific languages and data processing back ends. Zeppelin notebooks, for example, support Scala, Python, SparkSQL, Hive, Markdown, and Shell interpreters.
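For example, each Zeppelin paragraph selects its interpreter with a leading directive. A minimal sketch of a mixed note follows; the %md, %sh, and %sql directives are standard Zeppelin syntax, while the bank table is a hypothetical name assumed to have been registered earlier in the note:

```
%md
This paragraph renders **Markdown** documentation.

%sh
echo "this paragraph runs in the Shell interpreter"

%sql
-- this paragraph runs a SparkSQL query against a previously registered table
select age, count(1) from bank where age < 30 group by age
```

Paragraphs without a directive run in the note's default interpreter, which in the built-in Zeppelin configuration is typically Scala with a bound SparkContext.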
Platform Conductor provides high availability (HA) for failover. When HA is enabled, the user must provide a "shared file system" location. As the user interacts with the notebook interface, Platform Conductor saves all notebook data to this shared file system, which ensures a continuous session for the user: even if the notebook browser is closed, or the host on which the notebook service runs goes down, the notebook content can be retrieved by opening the same notebook again.
Sample code for Zeppelin notebook
The following is an example of a Spark application that estimates the value of Pi using the Monte Carlo method:
val count = sc.parallelize(1 to 1000000000).map { i =>
  val x = Math.random()
  val y = Math.random()
  if (x * x + y * y < 1) 1 else 0
}.reduce(_ + _)
println("Pi is roughly " + 4.0 * count / 1000000000)
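The factor of 4 in the last line comes from a simple area argument: each point (x, y) is drawn uniformly from the unit square, and the test x*x + y*y < 1 keeps the points that fall inside the quarter of the unit disk, so

```
P(x^2 + y^2 < 1) = (area of quarter disk) / (area of unit square) = (pi/4) / 1

pi ≈ 4 * count / N,  where N is the number of sampled points
```

Because this is a Monte Carlo estimate, its error shrinks roughly as 1/sqrt(N); with N = 10^9 samples the printed value is typically accurate to about four decimal places.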
More sample code can be found in the following Zeppelin tutorial:
https://zeppelin.incubator.apache.org/docs/0.5.0-incubating/tutorial/tutorial.html
After a short while, the result is output as follows:
Note that Platform Conductor continuously monitors cluster CPU, memory, executor, host, and other usage statistics for analytics. These capabilities are enabled through integration with the ELK (Elasticsearch, Logstash, Kibana) stack.
To learn more about notebooks, check out the IBM Knowledge Center. To learn more about integrating and using a third-party notebook with Platform Conductor, see the IPython-Jupyter notebook integration with Platform Conductor blog post.
If you'd like to try out Platform Conductor, download an evaluation version from our Service Management Connect page. If you have any questions or require other notebook samples, please leave a comment in our forum!
UID
ibm16163833




