Apache Spark: Upgrade and speed-up your analytics

One of the best things about Apache Spark is that it makes real-time analytics of vast unstructured datasets – such as the content of social media sites – feasible and affordable for companies of all sizes. But what are the practicalities of performing this kind of analysis? And how would you get started? Chetna Warade, Developer Advocate at IBM, is a software engineer who works in research and product development. She has been issued two patents by the USPTO, has three more pending, and earned the IBM First Invention Plateau Award in 2010. We spoke to Chetna about a recent project to demonstrate the potential of Spark for social media analytics, which focused on the popular “Ask Me Anything” (AMA) section of the social news and entertainment site, Reddit.

Chetna, thank you for joining us. Could you give us a bit of background about the project? What was IBM’s motivation in demonstrating how to analyze Reddit AMAs?

Reddit is an example of a social media site where users’ conversations are moderated by the users themselves. It is often used as a forum for discussing products and services – so many businesses, particularly in the US, are interested in using Reddit as a source of unbiased, unsolicited customer feedback for brand marketing, market research and quality management purposes.

The AMA subreddit is a space where companies can establish a presence and have a spokesperson interact directly with their end customers – but this can be a high-risk strategy. Many corporate AMAs are very well received, but the Reddit community can also react negatively if spokespeople don’t engage with them in an open and approachable way. That’s why social sentiment analysis of the AMA subreddit seemed like a great choice for a demo: it would not only showcase the technical aspects of social media analytics – it would also help us look at how people react to different types of AMA and what kind of behavior gets a positive response. We hoped to gain a better understanding of how companies can use Reddit AMAs to engage with their customers and spread their messages effectively.

How did you put together the demo? What technologies were involved?

To make the demo possible, we needed to bring together a number of data services and analytics tools. IBM Cloud Data Services offers a rich ecosystem of IBM and open source technologies that we can easily integrate to create innovative cloud-based apps and services, coordinated through the IBM Bluemix cloud application development platform.

To get the data, we wrote a simple data pipe connector in Node.js that allowed us to connect to Reddit and fetch a specified number of comments from an AMA post. We then passed the text to IBM Watson Tone Analyzer – an app on Bluemix that can be easily integrated with Apache Spark using a simple API – and stored the results in a Cloudant database. You can find the connector on GitHub – it is still a work in progress, but it was sufficient for the purposes of our demo. We then used the Cloudant Spark Connector – which is also open source and available on GitHub – to load the data into Spark. There is a great video tutorial about the Cloudant Spark Connector in our Apache Spark Learning Center, if anyone wants to learn more.
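To make the flow of that pipeline concrete, here is a minimal Python sketch of the three steps: fetch a limited number of comments, score their tone, and store the results. It is not the actual connector, which was written in Node.js. Every name in it (`fetch_comments`, `score_tone`, `run_pipeline`, the "positive"/"negative" tone labels and the sample data) is hypothetical, the tone scorer is a toy keyword counter standing in for a real tone analysis service, and a plain dictionary stands in for the Cloudant database.

```python
from dataclasses import dataclass, asdict


@dataclass
class ScoredComment:
    comment_id: str
    body: str
    tones: dict  # hypothetical tone label -> score between 0 and 1


def fetch_comments(ama_post, limit):
    """Stand-in for the Reddit connector: return at most `limit` comments."""
    return ama_post["comments"][:limit]


def score_tone(text):
    """Toy scorer (assumption): a real pipeline would call a tone analysis service here."""
    positive_words = {"great", "thanks", "love"}
    negative_words = {"dodging", "bad", "disappointed"}
    words = [w.strip(".,!?").lower() for w in text.split()]
    total = max(len(words), 1)
    return {
        "positive": sum(w in positive_words for w in words) / total,
        "negative": sum(w in negative_words for w in words) / total,
    }


def run_pipeline(ama_post, limit, store):
    """Fetch, score and store comments; `store` is a dict standing in for a document database."""
    for comment in fetch_comments(ama_post, limit):
        doc = ScoredComment(comment["id"], comment["body"], score_tone(comment["body"]))
        store[doc.comment_id] = asdict(doc)
    return store


# Made-up sample data for illustration only.
sample_post = {
    "id": "ama-1",
    "comments": [
        {"id": "c1", "body": "Thanks for doing this, love the product!"},
        {"id": "c2", "body": "You keep dodging the question."},
        {"id": "c3", "body": "What is next on the roadmap?"},
    ],
}

store = run_pipeline(sample_post, limit=2, store={})
```

Each stored document keeps the comment text alongside its tone scores, so a later analysis step can work from the database alone without going back to Reddit.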

As a front-end, we used a Jupyter Notebook, which allowed us to analyze the results of the whole dataset interactively using Python code. The notebook made it easy to generate statistics and visualizations to gain insight into the tone of the discussion around each AMA post, and assess the overall success of the AMA. The engine for the whole solution is our Spark-as-a-Service offering, IBM Analytics for Apache Spark, which gives us the power to perform rapid, real-time analytics even on the massive datasets that are typical in social media analytics use-cases. The managed Spark service makes it really easy to jump straight in – you can just start up a notebook and begin performing iterative analysis without any complex setup or up-front costs.
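As a rough illustration of the kind of statistics a notebook cell might compute, the sketch below averages tone scores across the comments of one AMA and counts which tone dominates each comment. It uses plain Python rather than Spark so that it runs anywhere. The record layout, the tone labels and the function names are assumptions made for this example, not the format used in the demo.

```python
from collections import Counter
from statistics import mean

# Made-up records in an assumed layout: one document per scored comment.
records = [
    {"comment_id": "c1", "tones": {"positive": 0.6, "negative": 0.1}},
    {"comment_id": "c2", "tones": {"positive": 0.2, "negative": 0.7}},
    {"comment_id": "c3", "tones": {"positive": 0.5, "negative": 0.3}},
]


def mean_tone_scores(records):
    """Average each tone's score over all comments (a missing tone counts as 0)."""
    names = sorted({name for r in records for name in r["tones"]})
    return {n: mean(r["tones"].get(n, 0.0) for r in records) for n in names}


def dominant_tone_counts(records):
    """Count how many comments have each tone as their highest-scoring one."""
    return Counter(max(r["tones"], key=r["tones"].get) for r in records)


summary = mean_tone_scores(records)
counts = dominant_tone_counts(records)

for tone, score in summary.items():
    print(f"{tone}: mean score {score:.2f}, dominant in {counts[tone]} comment(s)")
```

On a dataset too large for one machine, the same two aggregations could be expressed as grouped averages and counts over a Spark DataFrame, and the resulting summary plotted directly in the notebook.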

What did the demo prove, from a technical perspective and a business perspective?

First, we demonstrated that the IBM Cloud Data Services portfolio now brings together all the services you need to capture and analyze data from Reddit in real time. With our Spark service, the results are available almost instantly, so you could even potentially analyze an AMA session while it was in progress! This could make it a very valuable tool for businesses that want to use AMAs as a way of interacting directly with their customers.

Second, we showed how the combination of tools such as notebooks and Watson Tone Analyzer with Spark provides a relatively simple way for data scientists to analyze and report on social media data, sidestepping all of the major obstacles of trying to do this kind of analysis with a traditional programming stack. Instead of having to perform a complex data modeling process up-front, you can experiment and iterate your analysis dynamically in the notebook interface, and still get quick results because of the power of Spark in dealing with unstructured data.

There’s an analogy I sometimes like to use: the traditional tools that developers use – integrated development environments like Eclipse and Visual Studio – are like a Ford truck. They’re great for heavy-duty jobs and day-to-day activities. But they aren’t built for speed, and that’s what data scientists need when they’re dealing with complex data and need to deliver results to the business quickly. By contrast, our Apache Spark service is like a Porsche. It’s not just that the Spark engine is powerful – it’s that the whole experience is designed to make it easy for users to go as fast as they need to.

The ability to offer a user experience that is designed for data scientists, not purely for traditional Java stack developers, is a real step forward in opening up big data analytics to a much wider audience. In my own development and data science work, I really feel like this is a huge upgrade – it is so much easier to code and work fast because our Spark service provides a user experience that enables me to get the job done quickly.

From a business perspective, I think the main lesson is that we’re approaching a stage where business users will be able to do this kind of analysis on their own, without support from IT. Of course, it’s a journey: notebooks are still the domain of the data scientist at present, not the average business user. But the key message is that social sentiment analysis need not require a major IT implementation any more – it’s something that your data scientists can start working on without needing a big investment in software or infrastructure.

And finally, where can I learn more about how to leverage Spark and other technologies for social media analytics?

You can read our guide to Getting started with Analytics for Apache Spark and sign up for IBM Bluemix (it’s free!). Once you see the potential, we can scale up our Spark service to whatever size your business needs.
