Accelerating data science with Jupyter Notebooks and Apache Spark

The newly integrated notebooks within the IBM Analytics for Apache Spark service aim to provide analysts and data scientists with an iterative, flexible environment that supports end-to-end analysis. Melissa Rodriguez Zynda, IBM Offering Manager for Analytics Platform Services, has previously worked as a design researcher for IBM Watson Analytics and Social Media Analytics. A designer by training with a background in anthropology, Melissa explains how notebooks transform the way users interact with technologies such as Apache Spark to rapidly analyze data.

Melissa, thanks for speaking with us. What is the “Analytic Experience” project, and how did it come about?

The Analytic Experience, the team that I’m a member of, helps drive a wider conversation that we are having within IBM around the user experience for our cloud data and analytic services. Bluemix already offers a huge range of cloud services, and one of the next steps is to coordinate those services into comprehensive workflows that map to the evolving ways that people work.

Within the data and analytics space, we approached it by thinking about the ecosystem as a whole. We wanted to make it as easy as possible for users to find appropriate services for their needs, access tutorials and examples to help them get up and running, and integrate everything in a seamless journey. This was where the Analytic Experience project began and where we started looking for ways to make the process of analysis feel natural and organic. As a first step, we wanted to lower the barrier to entry for users who wanted to leverage one of the hottest technologies in the big data space – our managed Spark service, IBM Analytics for Apache Spark.

What does the Analytic Experience deliver for people using Spark?

Apache Spark is an incredibly powerful technology for processing data at scale using in-memory technology that splits workloads across a distributed architecture and returns the results very quickly. It’s powerful, but it’s also a pain to set up on your own, so our Spark service takes that headache away. While working with the Spark team, we realized we had an opportunity to make the user experience richer by providing IPython or Jupyter notebooks together with the Spark service, as a way to enable people to practice analytics in an intuitive, iterative, robust way.
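The idea of splitting a workload across workers and then combining the partial results can be sketched in a few lines. The example below is not Spark and does not use the Spark API; it is a minimal illustration using Python's standard-library thread pool on one machine, and the helper names `partition` and `sum_of_squares` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(values, num_partitions):
    """Split a list into num_partitions interleaved groups."""
    return [values[i::num_partitions] for i in range(num_partitions)]


def sum_of_squares(chunk):
    """Work done independently on a single partition."""
    return sum(x * x for x in chunk)


values = list(range(1, 101))
chunks = partition(values, 4)

# Each partition is processed by its own worker (the "map" step).
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_results = list(pool.map(sum_of_squares, chunks))

# The partial results are combined into one answer (the "reduce" step).
total = sum(partial_results)
print(total)
```

A cluster framework applies the same partition, process, and combine pattern across many machines rather than across threads in one process.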

What is a Jupyter Notebook, and why are notebooks such a good way of interacting with Spark?

Jupyter Notebooks, an iteration of IPython Notebooks, are web applications that allow you to combine code, visualizations, text, and rich media into a single document that can be easily shared with other people in your organization. Effectively, these notebooks facilitate rapid, iterative end-to-end analysis in a single artifact. Instead of working with Python or Scala on the command line, you can perform your analysis within the code cells of the notebook, generate visualizations, and annotate and document your work – all within a web browser interface.
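To make the "single document" idea concrete: a notebook file is stored as JSON containing an ordered list of cells. The sketch below builds a minimal notebook by hand with Python's standard library, following the publicly documented nbformat version 4 layout. The cell contents are made-up examples, not taken from the IBM service.

```python
import json

# A minimal notebook: one markdown cell for narrative text and one
# code cell for analysis. The cell contents are illustrative only.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": ["# Energy demand\n", "Notes on the analysis go here."],
        },
        {
            "cell_type": "code",
            "execution_count": None,  # not yet run
            "metadata": {},
            "outputs": [],            # filled in when the cell is executed
            "source": ["total = sum([3, 4, 5])\n", "print(total)"],
        },
    ],
}

# Serialize the way an .ipynb file is written, then read it back.
serialized = json.dumps(notebook, indent=1)
restored = json.loads(serialized)

cell_types = [cell["cell_type"] for cell in restored["cells"]]
print(cell_types)
```

Because text, code, and outputs live in the same file, sharing the file shares the whole analysis.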

When you change your code and re-run the code cell, it updates the output – so it’s very easy to experiment with data and develop prototype analyses. And with the power of our managed Spark service, analyses can run up to 100 times faster than on a traditional disk-based distributed framework, even when you’re working with a huge data set.

When you’re ready to share your findings with a wider audience, you can either leave the code cells visible – so it’s very easy for other data scientists to understand and reproduce your methods – or remove them, if you’re presenting to a less technical or executive-level audience.
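One way to prepare a copy for a less technical audience is to flag each code cell's input as hidden while keeping its outputs. The sketch below works on the notebook's JSON structure and sets the `jupyter.source_hidden` cell metadata key described in the nbformat documentation. Whether a particular front end or export tool honors that flag is an assumption here, and the function name `hide_code_inputs` is hypothetical.

```python
import copy


def hide_code_inputs(notebook):
    """Return a copy of a notebook dict with code cell inputs flagged as hidden."""
    presentation = copy.deepcopy(notebook)
    for cell in presentation["cells"]:
        if cell["cell_type"] == "code":
            # setdefault preserves any metadata the cell already carries.
            jupyter_meta = cell.setdefault("metadata", {}).setdefault("jupyter", {})
            jupyter_meta["source_hidden"] = True
    return presentation


# Illustrative notebook with one text cell and one code cell.
original = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["Key findings"]},
        {
            "cell_type": "code",
            "execution_count": None,
            "metadata": {},
            "outputs": [],
            "source": ["print('analysis')"],
        },
    ],
}

shared = hide_code_inputs(original)
print(shared["cells"][1]["metadata"])
```

The original notebook is left untouched, so the version with visible code stays available for other data scientists.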

So that’s what you get with Jupyter as standard – but what has your team done to further enrich this experience?

In line with the whole philosophy of Analytic Experience, we’re really focusing on making the user journey seamless – from data management and curation, through the iterative processes of analysis and data science, to operationalizing the analysis output.

Part of what we’ve done to make the actual process of analysis easier for our users is the introduction of the palettes panel to the notebooks. Here you can do things like edit the notebook metadata, see the environment variables, and take advantage of having many of the more popular libraries – like matplotlib and pandas – pre-installed and ready to use in your notebook. The panel also allows users to connect and reference data sources with the same ease – so I can use the panel to connect my notebook to any supported data source, and then use the “insert to code” function to insert my credentials and start using that data. If I’m not a user of any of those data sources, I can easily use the Bluemix Object Storage service and start dragging and dropping data right into my notebook panel.

We’re also continuing to add to a repository of sample and tutorial notebooks, accessible directly in the UI and from our Learning Center, to help people get started and illustrate the breadth of what can be done in a notebook. There are also a host of data sets available in our Data Exchange that you can access and use to supplement an analysis, such as energy demand statistics or weather data.

Why did you choose Jupyter Notebooks specifically as an entry point for Spark?

Jupyter Notebooks are a mature, open source technology, and they have become extremely popular with data scientists: there are already over 150,000 Jupyter notebooks on GitHub that people can use as a starting point for their own analyses. Having a vast community of support is really important for new adopters, because there are so many resources available, both inside and outside IBM, that can help you take your first steps with notebooks and Spark and raise your analytics game!

Great! So what’s the next step if I want to learn more about how to leverage these integrated notebooks and Spark on Bluemix?

Just try it! When you sign up on Bluemix, you get a 30-day free trial of many of our services, including Spark with Jupyter notebooks. If you’re looking for inspiration, our Learning Center contains sample notebooks as well as sample applications that take you from the basics to making a fully-fledged application using other services like Watson Tone Analyzer. We also attend Spark Summit regularly and host IBM Datapalooza, so come say hi if you’re in the neighborhood!
