July 27, 2015 | Written by: Frederic Lavigne
In a previous post, I introduced SOmusic, an app that searches Twitter for Spotify links. SOmusic gathers this information and produces rankings. It's your daily dose of what's popular on social networks.
The app combines several Bluemix services. One key service is the IBM Insights for Twitter service. How does it work and how is SOmusic using it?
10% of Twitter data at your fingertips
The IBM Insights for Twitter service gives developers search access to the Twitter Decahose (a 10% random sample of the real-time Twitter Firehose) and to historical content dating back to November 2013. The Decahose is a lot of data, and that is the key: without the data, SOmusic and many other apps would not exist. You can build the best user interaction or the most complex algorithm, but if there is no data to feed them, you have nothing. This is what makes the Insights for Twitter service critical in this context. Once provisioned and bound to your app, it gives you access to real-time and historical tweets. It goes further, too: the tweet data is enriched with sentiment and other insights for multiple languages, based on deep natural language processing algorithms from IBM Social Media Analytics.
Historical data means you can request tweets from as far back as 2013 and mine them, for example to detect trends over the last two years. In my case, I used this to build the SOmusic index for March 2015, although the site was actually developed in April. Another use would be to drop the current SOmusic statistics and reindex everything from the start, for example after modifying the source code to collect or aggregate data differently.
IBM Insights for Twitter does the heavy lifting
Working with the service is simple. The API is small and exposes three endpoints:
- /v1/messages/count: returns the number of Tweets found for a given query.
- /v1/messages/search: returns a set of Tweets for a given query. To retrieve additional data, API clients should use the URL provided in related.next.href. The query language can search not only the tweet message but also advanced attributes such as location, language, country, followers count, whether the user has children, whether the user is married, posted time, and sentiment.
- /v1/messages/check: determines whether a list of messages complies with the intentions of Twitter and its users at the time that the query was issued.
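To make the endpoints above more concrete, here is a minimal sketch of how a client might build the count and search URLs for a query. The paths come from the list above; the base URL is a placeholder, and the `q` query parameter name is my assumption here, so check the API specification before relying on it.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Sketch only: builds request URLs for the endpoints listed above.
final class InsightsUrls {

    // Assumption: the query text is passed in a parameter named "q".
    static String searchUrl(String baseUrl, String query) {
        return baseUrl + "/v1/messages/search?q=" + encode(query);
    }

    static String countUrl(String baseUrl, String query) {
        return baseUrl + "/v1/messages/count?q=" + encode(query);
    }

    // Form-encodes the query so spaces and '#' are safe in a URL.
    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 support is mandatory on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical base URL; the real one comes from the service credentials.
        String base = "https://insights.example.invalid/api";
        System.out.println(searchUrl(base, "somusic bluemix"));
        System.out.println(countUrl(base, "somusic bluemix"));
    }
}
```

Calling the count endpoint first is a cheap way to see how many tweets a query would return before paging through the search results.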
Here is an extract of the results you receive when you query the service for “somusic bluemix”:
[gist https://gist.github.com/l2fprod/5bf5c9b6a73c0b8ab016 file=”insights-for-twitter-results.json”]
The first tweet (starting at line 11) says “SOmusic – music from social networks #music #bluemix #cloudant http://t.co/0auJetjZZx http://t.co/ymXVSxDqLT” in the body. It shows:
- a unique ID ("id": "tag:search.twitter.com,2005:597277951177003009"). Useful to avoid processing the same tweet twice.
- a geo location – the tweet was made from the Netherlands
- a list of expanded URLs – the two links were shortened when the tweet was posted. Just looking at them, you don't really know what you are dealing with. The Decahose automatically expands those URLs and makes them available in its output, ready to use.
- the tweet language – it is in English
- the Insights enrichment – the tweet is "NEUTRAL" and the gender "UNKNOWN" – well, it is a plant (@Bluemix_Plant) tweeting!
- and more content about the author and the hashtags. That is a lot of data you can mine.
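As an illustration of the first point, the unique ID can serve as a deduplication key. The sketch below keeps seen IDs in memory; the `SeenTweets` class is hypothetical and is not how SOmusic itself does it (a database-backed check would be needed to survive restarts).

```java
import java.util.HashSet;
import java.util.Set;

// Sketch only: remembers tweet IDs so each tweet is processed once.
final class SeenTweets {
    private final Set<String> ids = new HashSet<>();

    // Returns true the first time an ID is offered, false on repeats.
    boolean firstTime(String tweetId) {
        return ids.add(tweetId);
    }

    public static void main(String[] args) {
        SeenTweets seen = new SeenTweets();
        String id = "tag:search.twitter.com,2005:597277951177003009";
        System.out.println(seen.firstTime(id)); // first sighting
        System.out.println(seen.firstTime(id)); // duplicate
    }
}
```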
The second tweet (starting at line 241) says “awajeet: PCFLB: RT flocalvez: You can review the cool Somusic demo at #IBM booth dotScale & win #Bluemix 30d free … http://t.co/PzrnAjOtaW” in the body and includes more interesting enrichment content:
- the sentiment is POSITIVE
- the gender is male
The related element gives a link to the next search page. You should follow this link to get more results out of the search. Check the API specification and the service documentation for the query language and the output format.
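The paging logic can be sketched independently of HTTP and JSON. In the sketch below, `Page` and `fetchPage` are hypothetical: `fetchPage` stands in for whatever code calls the service and extracts the tweets and the related.next.href value from the response. The `maxPages` cap is my own addition to keep a broad query from running forever.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Sketch only: follows "next" links until none is left.
final class SearchPager {

    // One page of results; nextHref is null on the last page.
    static final class Page {
        final List<String> tweets;
        final String nextHref;

        Page(List<String> tweets, String nextHref) {
            this.tweets = tweets;
            this.nextHref = nextHref;
        }
    }

    static List<String> collectAll(String firstUrl,
                                   Function<String, Page> fetchPage,
                                   int maxPages) {
        List<String> all = new ArrayList<>();
        String url = firstUrl;
        for (int i = 0; url != null && i < maxPages; i++) {
            Page page = fetchPage.apply(url);
            all.addAll(page.tweets);
            url = page.nextHref; // null ends the loop
        }
        return all;
    }

    public static void main(String[] args) {
        // In-memory stand-in for the service: two pages linked together.
        Map<String, Page> fake = new HashMap<>();
        fake.put("page1", new Page(Arrays.asList("tweet A", "tweet B"), "page2"));
        fake.put("page2", new Page(Arrays.asList("tweet C"), null));
        System.out.println(collectAll("page1", fake::get, 10));
    }
}
```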
SOmusic and the TwitterCollector
In SOmusic, the collector (CollectItems) is responsible for querying the IBM Insights for Twitter service, looking for references to music links.
CollectItems currently searches for references to "spotify", but this is only the beginning: there are other music providers with a developer API! The more (data) the merrier. Once done, it stores the resulting tweets in an IBM Cloudant database to be processed by another task.
TwitterCollector is the class implementing the call to the service API endpoint. It builds the full endpoint path by looking at the VCAP_SERVICES for the Insights for Twitter credentials, then building the query and paging through the query results until there are no more to look at:
[gist https://gist.github.com/l2fprod/5bf5c9b6a73c0b8ab016 file=”TwitterCollector.java”]
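For readers who have not worked with VCAP_SERVICES before, here is a rough, separate sketch of the credential lookup step. It is not the code from TwitterCollector. It assumes the service appears under a "twitterinsights" label and that its credentials contain a "url" field; both names are assumptions to verify against your own environment. It also uses a regular expression only to stay dependency-free, so it ignores JSON escapes and would pick up a later service's "url" if the target service had none. A real app should use a JSON parser.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch only: pulls a service URL out of the VCAP_SERVICES JSON text.
final class VcapCredentials {

    // Matches "url": "<value>" with optional whitespace around the colon.
    private static final Pattern URL_FIELD =
            Pattern.compile("\"url\"\\s*:\\s*\"([^\"]*)\"");

    // Returns the first "url" value found after the given service label.
    static Optional<String> serviceUrl(String vcapJson, String serviceLabel) {
        if (vcapJson == null) {
            return Optional.empty();
        }
        int start = vcapJson.indexOf("\"" + serviceLabel + "\"");
        if (start < 0) {
            return Optional.empty();
        }
        Matcher matcher = URL_FIELD.matcher(vcapJson);
        return matcher.find(start) ? Optional.of(matcher.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Assumption: "twitterinsights" is the label of the bound service.
        Optional<String> url =
                serviceUrl(System.getenv("VCAP_SERVICES"), "twitterinsights");
        // Avoid printing the URL itself, since it may embed credentials.
        System.out.println(url.isPresent() ? "credentials found" : "no credentials found");
    }
}
```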
If you want to look at SOmusic source code, the project is now public on Bluemix DevOps:
What about you, any plans to dig into this big dataset for your next project?