- Getting Twitter credentials
- Creating a Bluemix account
- Creating a Data Science Experience (DSX) service
- Problem statement
- Overview of tools and services
- Solution architecture
- Data ingestion and exploration
- Data enrichment
- Machine learning for user segmentation and personalized messaging
- Downloadable resources
- Related topics
Extract insights from social media posts with Watson and Spark in Data Science Experience
Drive value by acquiring, curating, cleansing, analyzing, visualizing, and enriching data
According to statistics from Excelacom, 20.8 million WhatsApp messages, 2.4 million Google searches, and 347,222 tweets occur in every internet minute (as of April 2016). Much of the massive amount of data that is generated and consumed is unstructured: text, speech, images, and video. Managing and using this data calls for a new computing paradigm, cognitive computing, that helps extract insights from big data.
Cognitive computing systems are defined as systems that learn and interact naturally with humans to help them analyze, understand, and extract insights from big data. The IBM Watson Developer Cloud is a platform that offers a wide variety of cognitive services that are designed to extract knowledge from unstructured data in all possible formats: text, speech, and images. Several cognitive solutions have used the Watson Developer Cloud services to address various business problems such as:
- Providing virtual assistants (or chat bots)
- Social media listening
- Marketing campaign analytics
- Audience segmentation and matching
- Discovering insights from large amounts of data
In several cognitive solutions, we find that the most impactful results are achieved by combining Watson Developer Cloud services with analytics solutions that are optimized for big data. In this tutorial, we explain how to develop a cognitive solution that combines Watson Developer Cloud services with custom machine learning solutions using IBM Data Science Experience (DSX). This tutorial references the Python TwitterInsightsWatsonDSX notebook, which you upload and run in DSX.
To be able to complete the tutorial, you will need:
- A Bluemix account where you will provision the Watson services, database services, and other analytics services including DSX. In DSX, you will step through a Python notebook for acquiring data; curating, ingesting, and enriching the acquired data; and running various analysis and machine learning techniques on the data
- A Twitter account to collect tweets that will serve as the data source for the notebook
Getting Twitter credentials
You will need the following Twitter credentials to run through this tutorial.
- Consumer Key (API Key)
- Consumer Secret (API Secret)
- Access Token
- Access Token Secret
If you already have these Twitter credentials, you can skip this section. If not, here are quick instructions to get these credentials. To begin, you must have a Twitter account. If you don't have an account, you can sign up for one at https://twitter.com/signup. After you have a Twitter account, run the following steps:
- Point your browser to https://apps.twitter.com/.
- Log in with your Twitter account user name and password.
- Click Create New App.
- Provide details on your application such as the name, description, and website. Check that you have read and agree to the Twitter Developer Agreement. Click Create your Twitter application.
This creates an application.
- On the application page, navigate to the Keys and Access Tokens tab. On this page, copy the Consumer Key (API Key) and Consumer Secret (API Secret) because you'll need them in the tutorial.
- To get the access tokens, click Create my access token. This generates Access Token and Access Token Secret credentials. Copy these credentials.
At this point, you should have the required credentials to connect to Twitter's APIs.
Creating a Bluemix account
To work with this tutorial, you must have a Bluemix account so that you can provision Watson and database cloud services. To create a Bluemix account:
- Point your browser to https://bluemix.net.
- Click Create a free account and follow the steps to provide the required information such as email, name, and password.
Now that you have a Bluemix account, you can run the following steps from your terminal to create the required services for this tutorial, namely Natural Language Understanding and Personality Insights.
- Open a Terminal window.
- Download and install Cloud Foundry CLI.
- Run the following commands in a Terminal window.
cf login
cf create-service natural-language-understanding free dsxnlu
cf create-service-key dsxnlu svcKey
cf service-key dsxnlu svcKey
cf create-service personality_insights lite dsxpi
cf create-service-key dsxpi svcKey
cf service-key dsxpi svcKey
These commands connect you to your Bluemix account, create a Natural Language Understanding service and a Personality Insights service on free plans (named dsxnlu and dsxpi), and return the user name and password for each service.
- Copy the credentials because they're needed in the notebook to access the services.
Creating a Data Science Experience (DSX) service
To create a DSX service:
- Log in to your Bluemix account.
- Click the Catalog tab in the top right (highlighted with an arrow in the following figure).
- Select the Data & Analytics category in the left navigation.
- Select the Data Science Experience service (highlighted with an oval in the following figure).
- Name your service (optional), select the Free plan and click Create.
- Click Get Started to launch Data Science Experience.
Alternatively, you can launch DSX by pointing your browser to https://datascience.ibm.com, selecting Sign In (top right), and logging in with your Bluemix user name and password.
Problem statement
This tutorial looks at the problems of brand analytics, user segmentation, and personalized messaging. My solution involves collecting social media posts that reference a brand, understanding the sentiment toward the brand, and segmenting the consumers based on multiple parameters such as number of followers, number of posts, sentiment, and personality profile. Given the more granular segmentation, the brand manager and marketing teams can then provide targeted messaging and marketing to reach consumers in a more personal way.
Because we want to keep this tutorial more generic, we won't reference any specific brand. Instead, I'll collect data on three popular musicians and run my analysis on that data.
Overview of tools and services
Before we dive into the details of the solution, I'll describe the various tools and services that I'll use. Specifically, we rely on Twitter, the Watson Developer Cloud, Db2 Warehouse on Cloud, and Data Science Experience. In this section, I'll cover each of these tools and services individually.
Twitter
Social media are computer-mediated technologies that make it easier to create and share information, ideas, and thoughts by using virtual communities and networks. Some of the popular social media platforms include Facebook, Twitter, LinkedIn, Pinterest, and Snapchat.
It's become a common practice for brands to connect with their consumers and better understand how these consumers perceive the brand by listening to what's being said on social media. Social media listening refers to collecting social media posts from various platforms and analyzing them to understand overall consumer perception.
In this tutorial, we collect Twitter data on three popular musicians (@katyperry, @justinbieber, and @taylorswift13). It is worth noting that my approach can work with other social media platforms or other data sources where consumers share their opinions on brands, events, or entities of interest.
Watson Developer Cloud
The IBM Watson Developer Cloud is a platform of cognitive services that enable you to build cognitive solutions to extract insights from big data. Watson Developer Cloud services offer a wide range of capabilities to understand and extract insights from unstructured data, including text, speech, and images. In this tutorial, we use Natural Language Understanding to extract the sentiment and keywords that are expressed in tweets. We also use Personality Insights to extract the personality profile of the users sharing the tweets.
Db2 Warehouse on Cloud
IBM Db2 Warehouse on Cloud is a database that is designed for performance and scale and is compatible with a wide range of tools. The massively parallel processing (MPP) options enable increased performance and scale by adding more servers to your cluster. The dynamic in-memory columnar store technology minimizes I/O and delivers an order-of-magnitude speedup when compared to row-store databases.
Data Science Experience
IBM Data Science Experience (DSX) is a cloud-based social workspace that helps you create, consolidate, and collaborate on building solutions for capturing insights from data across multiple open source tools such as R, Python, and Scala. IBM DSX helps data explorers use a rich set of open source capabilities to analyze large data sets and collaborate with colleagues in a social collaborative data-driven environment.
Your DSX account includes an Apache Spark service (provisioned on Bluemix) by default. Apache Spark is a fast open source cluster computing engine for efficient large-scale data processing. Apache Spark technology enables programs to run up to 100 times faster than Hadoop MapReduce in memory or 10 times faster on disk. Spark consists of multiple components:
- Spark Core is the underlying computation engine with the fundamental programming abstraction called resilient distributed datasets (RDDs).
- Spark SQL provides a data abstraction called DataFrames for structured data processing with SQL and a domain-specific language.
- MLlib is a scalable machine learning framework that delivers distributed algorithms for mining big data.
- Spark Streaming uses Spark's scheduling capability to perform real-time analysis on streams of new data.
- GraphX is the graph processing framework for the analysis of graph-structured data.
In Data Science Experience, you can use Spark for your Python, Scala, or R notebooks.
DSX includes a rich set of community-contributed resources such as data science articles, sample notebooks, public data sets, and various tutorials that make it easy to use DSX and Apache Spark.
Additionally, your DSX account includes an Object Storage service that is provisioned on Bluemix under a free plan that includes one service instance with a limit of 5 GB of storage. (The Object Storage plan can be upgraded without disruption.) The Object Storage service provides an unstructured cloud data store where you can store your files, including images, documents, and more.
To summarize, DSX provides a social collaborative environment where you can upload large data sets into an Object Storage service and use the fast Apache Spark computing engine to efficiently explore, analyze, visualize, and extract insights from large structured and unstructured data sets. It also offers an easy and seamless connection to GitHub where you can upload and share your notebooks. DSX's community feature makes it easy to share and explore various notebooks, data sets, and tutorials that are built by all DSX community members.
In this tutorial, we focus on using DSX to build Python notebooks to analyze Twitter data and integrate that data with Watson Developer Cloud services.
The following two notebooks will help you quickly get started with Python and Apache Spark:
For reference, a Jupyter notebook is a web-based environment for interactive computing. You can run code and view results of your computation interactively. Notebooks include all building blocks needed to work with data, including the data, the code to process the data, visualization of results, and text and rich media to document your solution and enhance your understanding.
Solution architecture
The following image shows the solution architecture, in which tweets are collected from Twitter and saved into a Cloudant database. The Cloudant database is saved into a Db2 Warehouse on Cloud warehouse, which is then imported into Object Storage. The notebook in DSX ingests data from Object Storage and uses Spark for data curation, analysis, and visualization. Furthermore, the notebook connects to Watson services (Natural Language Understanding and Personality Insights) to enrich the tweets and extract sentiment, keywords, and user personality traits. Finally, the notebook uses Spark MLlib to cluster the users based on several features including personality traits.
Given these user clusters, the application can identify the right message to send to users.
Data ingestion and exploration
The first step is always to acquire relevant data, understand it, and process it into the right format. As mentioned earlier, for this tutorial we collect social media data, specifically Twitter data that references three musicians: @katyperry, @justinbieber, and @taylorswift13. Next, we explore the data to get a better understanding of what it represents. We look at the schema and visualize the data to understand it better. After that, we run some preprocessing to get the data in an adequate format for further processing.
There are various third-party services for acquiring Twitter data such as Twitter GNIP.
In this tutorial, we use Twitter Streaming APIs to collect tweets mentioning "@katyperry," "@justinbieber," or "@taylorswift13" and process them to capture metadata of interest before saving them to a Cloudant database as described in the https://github.com/joe4k/twitterstreams notebook.
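As a rough illustration of "capturing metadata of interest," the following sketch filters a tweet payload by the tracked handles and flattens it into a record. It is not the code from the referenced notebook. The input field names follow the Twitter v1.1 tweet JSON, the TEXT and BRANDS column names are hypothetical, and the remaining column names mirror those used later in this tutorial.

```python
# Illustrative sketch only, not the code from the referenced notebook.
# Input fields follow the Twitter v1.1 tweet payload.
TRACKED_HANDLES = {"katyperry", "justinbieber", "taylorswift13"}


def mentioned_handles(tweet):
    """Return the tracked handles that the tweet mentions (lowercased, sorted)."""
    mentions = tweet.get("entities", {}).get("user_mentions", [])
    names = {m.get("screen_name", "").lower() for m in mentions}
    return sorted(names & TRACKED_HANDLES)


def extract_metadata(tweet):
    """Flatten a tweet into the metadata of interest, or None if it is off-topic."""
    brands = mentioned_handles(tweet)
    if not brands:
        return None
    user = tweet.get("user", {})
    return {
        "CREATED_AT": tweet.get("created_at"),
        "TEXT": tweet.get("text", ""),  # hypothetical column name
        "BRANDS": brands,  # hypothetical column name
        "USER_SCREEN_NAME": user.get("screen_name"),
        "USER_FOLLOWERS_COUNT": user.get("followers_count", 0),
        "USER_STATUSES_COUNT": user.get("statuses_count", 0),
    }


sample_tweet = {
    "created_at": "Wed Jul 05 14:02:11 +0000 2017",
    "text": "Loving the new single from @katyperry!",
    "entities": {"user_mentions": [{"screen_name": "katyperry"}]},
    "user": {"screen_name": "example_fan", "followers_count": 120, "statuses_count": 3400},
}
record = extract_metadata(sample_tweet)
```

Returning None for tweets that mention none of the tracked handles keeps the filtering and the flattening in one place, so a caller can simply skip empty results before writing to the database.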
For a general approach on using Node-RED, Cloudant, and Db2 Warehouse on Cloud to collect and store Twitter data, look at this video.
After the tweets are in a Cloudant database, we follow the instructions that are referenced in the previous video to create a Db2 Warehouse on Cloud warehouse for that Cloudant database. To proceed with this tutorial, you must have a Db2 Warehouse on Cloud service instance that is populated with the tweets you collected mentioning "@katyperry," "@justinbieber," or "@taylorswift13."
Make sure to note the name of the Db2 Warehouse on Cloud service instance you're using as a warehouse for your Cloudant database to host all the tweets.
Assuming that you have collected tweets in a Db2 Warehouse on Cloud database, you can proceed by running the following steps:
- Log in to DSX or launch DSX from Bluemix.
- Click Data Services and select Connections.
- Click the + sign to create a new connection.
- Provide a connection name (dashdbsingers) and select Data Service for the Service Category:
- Data Service would connect to any Bluemix data service you have provisioned under your account.
- External would connect to external data services such as Amazon S3, Microsoft SQL Server, MySQL, Netezza, and several others. Note that you can also use the External category to connect to other IBM databases such as Cloudant, DB2, or Db2 Warehouse on Cloud if provisioned under someone else's Bluemix account.
- Select the Db2 Warehouse on Cloud service instance that you created earlier as a Cloudant warehouse (in the Data ingestion section).
- Select the BLUDB database and click Create.
- Click the Projects tab and navigate to View All Projects.
- Create a new project.
- Name the project dsxwdc. You can also provide an optional description.
- Select a Spark Service and Storage Type to associate with the project. The drop-down menu provides a list of the Spark services that have been provisioned under your Bluemix account.
- Navigate to the Overview page and click your project name.
This loads all notebooks, data assets, and bookmarks associated with that project. Because this is a new project, it will be empty.
- Click the Find and add data icon in the upper right corner to expand available Files and Connections.
- Click Connections and find your dashdbsingers connection. Select it and click Apply.
This makes the data that is stored in that Db2 Warehouse on Cloud database accessible for use in the notebooks under that project.
- Click add notebooks to add a notebook.
- Name the notebook twitter_insights_watson_dsx. You can also add an optional description.
- Select From URL to specify a URL from where to copy the notebook.
For the notebook URL, specify https://github.com/joe4k/dsxwdc/blob/master/twitter_insights_watson_dsx.ipynb.
So far, you've created a new project in DSX and created a connection to a Db2 Warehouse on Cloud service instance that includes approximately 200,000 tweets, posted between 5 and 12 July 2017, that mention "@katyperry," "@justinbieber," or "@taylorswift13." In this tutorial, I'm limiting the number of tweets. In practice, you can collect millions of tweets and run the analysis on those.
Next, I'll run some analytics to evaluate and explore the tweets data.
- In your notebook, run Step 1 to install the required libraries to run the notebook.
- Run Step 2 to load the data from Db2 Warehouse on Cloud. To do so, click the Find and Add Data icon.
This opens the Files and Connections window. Click Connections, which should show the Db2 Warehouse on Cloud connection (dashdbsingers).
- Click Insert to code and select Insert SparkSession DataFrame.
- Specify the table as DSX_CLOUDANT_SINGERS_TWEETS.
This creates the Python code that is needed to connect to the Db2 Warehouse on Cloud service and loads the DSX_CLOUDANT_SINGERS_TWEETS table into a Spark Dataframe called data_df_1 (the name might be different) by using the spark.read.jdbc function.
- Copy the created Spark Dataframe data_df_1 into a new dataframe brandTweetsDF.
- Run Step 3 to explore and curate the data. Some common commands include:
Print the top two rows in the Spark DataFrame to get a sense for the Twitter data you're using.
toPandas provides a nice print of the data in a table format.
Print the schema of the Spark DataFrame to make sure that you have all the fields that you expect.
Drop the unneeded columns.
You can also create functions to process the data and add fields to the Spark DataFrame, such as extracting the "day" information from the CREATED_AT field to plot tweet trends over time.
- Extract a random sample of the data to process. In practice, you want to skip this step and apply the rest of the notebook to all the data. We take a sample here only so that the enrichment calls stay within the free plan of the Watson services.
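The "day" extraction mentioned in Step 3 can be sketched in plain Python as follows. This is a minimal sketch, assuming that CREATED_AT holds Twitter's classic timestamp format (for example, "Wed Jul 05 14:02:11 +0000 2017"); the table in your warehouse might store timestamps differently. The function names are hypothetical, and in the notebook the same logic would typically be applied to the Spark DataFrame rather than to a list of dictionaries.

```python
from collections import Counter
from datetime import datetime

# Assumption: CREATED_AT uses Twitter's classic timestamp format.
TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def extract_day(created_at):
    """Return the calendar day (YYYY-MM-DD) of a tweet timestamp."""
    return datetime.strptime(created_at, TWITTER_TIME_FORMAT).date().isoformat()


def tweets_per_day(rows):
    """Count tweets per day; rows are dicts with a CREATED_AT key."""
    return Counter(extract_day(row["CREATED_AT"]) for row in rows)


rows = [
    {"CREATED_AT": "Wed Jul 05 14:02:11 +0000 2017"},
    {"CREATED_AT": "Wed Jul 05 23:59:59 +0000 2017"},
    {"CREATED_AT": "Thu Jul 06 00:00:01 +0000 2017"},
]
trend = tweets_per_day(rows)
```

The per-day counts in trend are the kind of series that a timeline plot of tweet volume is built from.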
Data enrichment
Now that we have collected relevant tweets and ingested them into a Spark DataFrame, I'll focus on enriching the data by using Watson Developer Cloud services.
Sentiment and keyword enrichment using Natural Language Understanding
In particular, we extract sentiment and keywords in tweets by using the Watson Natural Language Understanding service. We also use Watson Personality Insights to extract personality profiles for the users sharing these tweets.
- Run Step 5 of the notebook to read the credentials for Twitter and the Natural Language Understanding and Personality Insights services. The notebook assumes that you have these credentials in a JSON file (sample_creds.json) of the following format.
Step 5 shows the instructions to upload and read the sample_creds.json file to Object Storage and parse the credentials.
- Run Step 6 in the notebook to extract sentiment and keywords from tweets by calling the Natural Language Understanding service. To do so, we use the Python SDK for Watson Developer Cloud services and the credentials for Natural Language Understanding:
import watson_developer_cloud
nlu = watson_developer_cloud.NaturalLanguageUnderstandingV1(version=nlu_version, username=nlu_username, password=nlu_password)
To get Natural Language Understanding credentials, you must provision a Watson Natural Language Understanding service on Bluemix as explained previously. For reference, you can find detailed instructions on the Natural Language Understanding Getting started page.
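Once the service returns a result, the sentiment and keywords still have to be pulled out of the response. The following is a minimal sketch of that step, not the notebook's code. It assumes the response is a dictionary with the document-level sentiment under sentiment.document and a keywords list with text and relevance entries. The helper name, the KEYWORDS column name, and the neutral default for a missing sentiment are hypothetical choices.

```python
def parse_nlu_response(response, max_keywords=3):
    """Pull document sentiment and the most relevant keywords out of an NLU result."""
    document = response.get("sentiment", {}).get("document", {})
    keywords = sorted(
        response.get("keywords", []),
        key=lambda kw: kw.get("relevance", 0.0),
        reverse=True,
    )
    return {
        # Assumption: treat a missing sentiment as neutral with score 0.0
        "SENTIMENT_LABEL": document.get("label", "neutral"),
        "SENTIMENT": document.get("score", 0.0),
        "KEYWORDS": [kw["text"] for kw in keywords[:max_keywords]],  # hypothetical column
    }


sample_response = {
    "sentiment": {"document": {"score": -0.62, "label": "negative"}},
    "keywords": [
        {"text": "tour dates", "relevance": 0.71},
        {"text": "new album", "relevance": 0.93},
    ],
}
enriched = parse_nlu_response(sample_response)
```

Keeping the parsing separate from the service call makes it easy to test the enrichment logic without spending calls against the free plan.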
Data visualization (sentiment and keywords)
After extracting sentiment and keywords from the unstructured data (tweets), we can use these enrichments to visualize tweet trends, sentiment, and keywords. We can do this for each brand separately to provide insights to the brand manager and marketing team on consumers' perceptions toward the brand. We can also compare and contrast the results across brands.
Run Step 7 of the notebook to plot sentiment and trends of the tweets over time.
We separate tweets by brand so that we can plot sentiment and trends for each brand separately because it would be useful to compare and contrast trends, sentiments, and keywords for the three musicians.
Here are some of the visualizations that we can produce with the data we have after enriching with Watson services.
The following figure shows the overall sentiment distribution (positive, negative, neutral) of the tweets toward the three musicians.
The timeline plot in the following figure shows the trend (number of tweets) for all three brands (musicians). It also shows the positive, negative, and total number of tweets for each brand.
The keywords word cloud plot in the following figure shows the most relevant keywords that are mentioned in the tweets for the brands.
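The numbers behind a sentiment distribution chart are simple per-brand shares. As a hedged sketch in plain Python (the notebook would compute this on the Spark DataFrame), the following groups enriched rows by brand and converts the label counts to fractions. The BRAND column name and the function name are hypothetical; SENTIMENT_LABEL is the column used elsewhere in this tutorial.

```python
from collections import Counter, defaultdict


def sentiment_distribution(rows):
    """Return {brand: {label: share}}, where shares sum to 1 per brand."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row["BRAND"]][row["SENTIMENT_LABEL"]] += 1
    distribution = {}
    for brand, labels in counts.items():
        total = sum(labels.values())
        distribution[brand] = {label: n / total for label, n in labels.items()}
    return distribution


rows = [
    {"BRAND": "@katyperry", "SENTIMENT_LABEL": "positive"},
    {"BRAND": "@katyperry", "SENTIMENT_LABEL": "positive"},
    {"BRAND": "@katyperry", "SENTIMENT_LABEL": "negative"},
    {"BRAND": "@katyperry", "SENTIMENT_LABEL": "neutral"},
    {"BRAND": "@taylorswift13", "SENTIMENT_LABEL": "neutral"},
]
shares = sentiment_distribution(rows)
```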
User personality enrichment using Personality Insights
Next, we focus on the users sharing these tweets. Traditional segmentation methods might focus on creating clusters of users based on the number of tweets that they post or the number of followers that they have. In this notebook, we show how you can enrich the users' information with their personality profile, which in turn allows you to create finer segmentation that accounts for users' personality profiles.
To do so, we first identify all unique users who are contributing posts to the list of tweets we collected. We use Watson Personality Insights to create personality profiles for the users based on their tweets. This tutorial explains how to extract unique users based on the USER_SCREEN_NAME. Then, for each user, it shows how to use Twitter to collect enough tweets for that user, which are then passed to Personality Insights to obtain the personality profile. We limit the analysis to 100 users simply to illustrate the approach. In practice, you want to create personality profiles for all users (or maybe all users with a certain number of followers or posts). Furthermore, for each user, you want to collect a large enough sample of tweets for accurate Personality Insights results as explained in the Personality Insights documentation. In this tutorial, we limit it to 100 tweets per user.
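The two limits described above (100 users, 100 tweets per user) can be sketched as follows. This is an illustration in plain Python with hypothetical helper names, not the notebook's implementation. It assumes that the collected rows carry a USER_SCREEN_NAME field and that the tweets for one user are concatenated into a single text document before profiling.

```python
def unique_users(rows, max_users=100):
    """Return distinct USER_SCREEN_NAME values in first-seen order, capped at max_users."""
    names = dict.fromkeys(row["USER_SCREEN_NAME"] for row in rows)
    return list(names)[:max_users]


def profile_text(user_tweets, max_tweets=100):
    """Join up to max_tweets tweet texts into one document for profiling."""
    return "\n".join(user_tweets[:max_tweets])


rows = [
    {"USER_SCREEN_NAME": "alice"},
    {"USER_SCREEN_NAME": "bob"},
    {"USER_SCREEN_NAME": "alice"},
    {"USER_SCREEN_NAME": "carol"},
]
users = unique_users(rows, max_users=2)
text = profile_text(["first tweet", "second tweet", "third tweet"], max_tweets=2)
```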
Run Step 8 in the notebook to extract useful information about users such as the number of unique users in the given data set of tweets and which users expressed negative sentiment versus positive sentiment. Some useful commands include:
df.groupBy('FIELD_NAME'): This command is useful to group records by the specific FIELD_NAME.
df.orderBy('FIELD_NAME'): This command is useful to rank records by the specific FIELD_NAME.
df.where(col('SENTIMENT_LABEL')=="negative"): Use this command to extract records with negative sentiment.
df.where(col('SENTIMENT_LABEL')=="positive"): Use this command to extract records with positive sentiment.
df.sample(False, fraction, seed): This command extracts a random sample from the data without replacement.
Step 8 in the notebook also extracts the Big 5 personality traits (also referred to as OCEAN) for each unique user in the sample of users you're working with. These personality traits (Openness, Conscientiousness, Extraversion, Agreeableness, and Neuroticism) are helpful in better understanding the users and reaching out to them.
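To make the Big 5 extraction concrete, here is a minimal sketch that maps a Personality Insights profile to the five column names used later for clustering. It assumes the profile is a dictionary whose personality list holds entries with a trait_id such as big5_openness and a percentile score. The helper name is hypothetical, and the sample values are made up for illustration.

```python
BIG5_TRAITS = (
    "OPENNESS",
    "CONSCIENTIOUSNESS",
    "EXTRAVERSION",
    "AGREEABLENESS",
    "NEUROTICISM",
)


def extract_big5(profile):
    """Map a personality profile to {TRAIT_NAME: percentile} for the Big 5 traits."""
    scores = {}
    for trait in profile.get("personality", []):
        # Assumption: trait ids look like "big5_openness"
        name = trait.get("trait_id", "").replace("big5_", "").upper()
        if name in BIG5_TRAITS:
            scores[name] = trait.get("percentile")
    return scores


sample_profile = {
    "personality": [
        {"trait_id": "big5_openness", "percentile": 0.81},
        {"trait_id": "big5_neuroticism", "percentile": 0.34},
    ]
}
big5 = extract_big5(sample_profile)
```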
Machine learning for user segmentation and personalized messaging
Now that I've collected and enriched relevant social media posts regarding the brand, I can use the rich set of machine learning algorithms available with Spark MLlib for user segmentation.
Data prep and Kmeans clustering
In particular, we use a Kmeans clustering algorithm to group users based on their personality profile, number of followers, and number of posts. To illustrate the difference, we actually run Kmeans clustering using two different feature sets, one without personality traits and one with personality traits:
- FeatureSet 1: (SENTIMENT, USER_FOLLOWERS_COUNT, USER_STATUSES_COUNT)
- FeatureSet 2: (SENTIMENT, USER_FOLLOWERS_COUNT, USER_STATUSES_COUNT, OPENNESS, CONSCIENTIOUSNESS, EXTRAVERSION, AGREEABLENESS, NEUROTICISM)
Run Step 9 in the notebook to get the data into the correct format and then run the Kmeans algorithm for clustering the users. Some of the useful commands include:
Transform a column into a vector:
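from pyspark.ml.feature import VectorAssembler
# Wrap the column in a vector column, because the scaler expects vector input
assembler_field = VectorAssembler(inputCols=["FIELD_NAME"], outputCol="vector_field_name")
assembled_field = assembler_field.transform(df)
assembled_field = assembled_field.select("FIELD_NAME_1", "FIELD_NAME", "vector_field_name")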
Scale a field using MinMaxScaler:
from pyspark.ml.feature import MinMaxScaler
scaler_field = MinMaxScaler(inputCol="vector_field_name", outputCol="scaled_field_name")
scalerModel_field = scaler_field.fit(assembled_field)
scaledData_field = scalerModel_field.transform(assembled_field)
df_scaled = scaledData_field.select("FIELD_NAME_1", "scaled_field_name")
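For intuition about what min-max scaling does to a column such as the follower count, the following plain-Python sketch applies the same linear rescaling to a small list. It is an illustration under stated assumptions, not Spark's implementation. The function name is hypothetical, and the handling of a constant column (mapping to the midpoint of the target range) is an assumption modeled on the behavior described in the Spark documentation.

```python
def min_max_scale(values, lower=0.0, upper=1.0):
    """Rescale values linearly so the smallest maps to lower and the largest to upper."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant column: map every value to the midpoint of the target range.
        return [0.5 * (lower + upper) for _ in values]
    return [lower + (v - lo) / (hi - lo) * (upper - lower) for v in values]


followers = [0, 250, 1000]
scaled_followers = min_max_scale(followers)
```

Scaling matters here because raw follower and status counts span far larger ranges than sentiment or personality scores, and distance-based clustering would otherwise be dominated by the largest-valued features.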
Select specific features for clustering and map to a Vector:
from pyspark.mllib.linalg import Vectors
# Feature set without personality traits
df_noPI = df_scaled.select('SENTIMENT', 'SCALED_USER_FOLLOWERS_COUNT', 'SCALED_USER_STATUSES_COUNT')
# Feature set with personality traits
df_wPI = df_scaled.select('SENTIMENT', 'SCALED_USER_FOLLOWERS_COUNT', 'SCALED_USER_STATUSES_COUNT',
                          'OPENNESS', 'CONSCIENTIOUSNESS', 'EXTRAVERSION', 'AGREEABLENESS', 'NEUROTICISM')
# Map each row to a dense vector
df_noPI = df_noPI.rdd.map(lambda x: Vectors.dense([c for c in x]))
df_wPI = df_wPI.rdd.map(lambda x: Vectors.dense([c for c in x]))
Kmeans Clustering (base and PI_ENRICHED):
from pyspark.ml.clustering import KMeans
## Define model parameters and set the seed
baseKMeans = KMeans(featuresCol="BASE_FEATURES", predictionCol="BASE_PREDICTIONS").setK(5).setSeed(206)
piKMeans = KMeans(featuresCol="PI_ENRICHED_FEATURES", predictionCol="PI_PREDICTIONS").setK(5).setSeed(206)
## Fit model on the feature vectors
baseClustersFit = baseKMeans.fit(userPersonalityDF.select("BASE_FEATURES"))
enrichedClustersFit = piKMeans.fit(userPersonalityDF.select("PI_ENRICHED_FEATURES"))
## Get the cluster IDs for each user
userPersonalityDF = baseClustersFit.transform(userPersonalityDF)
userPersonalityDF = enrichedClustersFit.transform(userPersonalityDF)
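To show what the clustering step does conceptually, here is a small plain-Python sketch of the classic k-means loop (assign each point to its nearest centroid, then recompute the centroids). It is not how Spark MLlib implements KMeans. In particular, it swaps Spark's seeded initialization for a deterministic farthest-point initialization so that the toy result is reproducible, and it uses k=2 on six made-up, already scaled feature vectors instead of k=5 on real users.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_point_init(points, k):
    """Deterministic init: start at the first point, then repeatedly add the
    point that is farthest from all centroids chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(squared_distance(p, c) for c in centroids))
        )
    return centroids


def kmeans(points, k, max_iter=20):
    """Return (labels, centroids) after iterating until assignments stop changing."""
    centroids = [list(c) for c in farthest_point_init(points, k)]
    labels = []
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids will not move
        labels = new_labels
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[j] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids


# Made-up scaled features for six users: two obvious groups
users = [(0.0, 0.0), (0.0, 0.1), (0.1, 0.0), (1.0, 1.0), (1.0, 0.9), (0.9, 1.0)]
labels, centroids = kmeans(users, k=2)
```

The returned labels play the role of the prediction columns (BASE_PREDICTIONS and PI_PREDICTIONS) that the fitted Spark models add to the DataFrame.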
Data visualization (User clusters with and without personality traits)
After creating user clusters based on both structured metadata (such as the number of followers and the number of posts) and enriched metadata that is extracted from unstructured data (such as the sentiment of the tweet and the personality traits of the users), we can run some visualizations to understand the differences between the segmentation solutions.
At a simplistic level, to illustrate that the results are different, we can plot a pie chart that shows the number of users in each cluster for both scenarios, without and with personality traits.
Run Step 10 in the notebook to show some visualizations of the Kmeans clustering solution for both scenarios, without and with personality traits extracted with Personality Insights.
The pie chart in the following figure shows the number of users in each cluster with and without personality traits. This is a very simplistic visualization to show that the clustering solutions are different when including personality traits.
Typically, you would visualize clusters by plotting some aggregate measure of the data and then coloring the data points based on the cluster ID. However, in the absence of aggregate metrics, we can use principal component analysis (PCA) to compress the data set down to two dimensions. After performing PCA, we can plot the values of the two components on the X and Y axes to form a scatterplot. The following figures show the clustering results with base features only (figure 1) and with both base features and personality traits (figure 2). Note that clustering results for your run might be different.
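The projection to two dimensions can be sketched without any libraries. The following is a minimal illustration of the technique, not the notebook's code: it centers the data, builds the covariance matrix, and finds the top two principal components by power iteration with deflation. In practice you would more likely rely on a library implementation such as pyspark.ml.feature.PCA. All function names and the toy data are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def covariance_matrix(centered):
    """Sample covariance of mean-centered rows."""
    n, d = len(centered), len(centered[0])
    return [
        [sum(row[i] * row[j] for row in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]


def dominant_eigenvector(matrix, iterations=200):
    """Power iteration: repeatedly multiply a start vector and renormalize."""
    size = len(matrix)
    v = [1.0 / math.sqrt(size)] * size
    for _ in range(iterations):
        w = [dot(row, v) for row in matrix]
        norm = math.sqrt(dot(w, w))
        if norm == 0.0:  # nothing left to amplify; stop early
            break
        v = [x / norm for x in w]
    return v


def pca_2d(rows):
    """Return (projected rows, components) for the top two principal components."""
    d = len(rows[0])
    means = [sum(row[i] for row in rows) / len(rows) for i in range(d)]
    centered = [[row[i] - means[i] for i in range(d)] for row in rows]
    cov = covariance_matrix(centered)
    components = []
    for _ in range(2):
        v = dominant_eigenvector(cov)
        eigenvalue = dot(v, [dot(row, v) for row in cov])
        components.append(v)
        # Deflate: remove the variance explained by v before the next pass.
        cov = [
            [cov[i][j] - eigenvalue * v[i] * v[j] for j in range(d)]
            for i in range(d)
        ]
    projected = [[dot(row, c) for c in components] for row in centered]
    return projected, components


# Toy data: large spread on the first axis, small on the second, none on the third
data = [
    [-10.0, 1.0, 3.0],
    [-5.0, -1.0, 3.0],
    [0.0, 0.0, 3.0],
    [5.0, -1.0, 3.0],
    [10.0, 1.0, 3.0],
]
projected, components = pca_2d(data)
```

Each row of projected is an (x, y) pair that can be drawn on the scatterplot and colored by the cluster ID. Note that the sign of a principal component is arbitrary, so a plot can appear mirrored between runs or libraries.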
Given these user clusters, the brand manager and marketing teams can craft personalized messages to reach out to these users. They can also track these user clusters over time to see how each cluster behaves on metrics such as purchase history, click patterns, or the response to different ad campaigns.
In this tutorial, we explained how you can go through the complete journey of acquiring data, curating and cleansing the data, analyzing and visualizing the data, and enriching the data to drive value. In my example scenario, the value was in delivering better personalized messaging to consumers by understanding their personalities and their social media presence. Although we used small data samples in this analysis, the referenced technology (DSX, Spark, Object Storage) scales to handle big data efficiently.