December 26, 2016 | Written by: Darren Shaw
Categorized: Community | Compute Services | Data Analytics
Building an analytics application with realtime data, when you have no control over that data, can create hours of worry for a developer. Are the data volume estimates accurate? What will happen if the volume suddenly spikes? Can we scale up, or will our application be overloaded?
IBM Message Hub is a developer-friendly service that can help manage these problems, leaving developers free to focus on the analytics. This article explains how Message Hub was used to handle realtime social media content during Wimbledon 2016.
Wimbledon’s 2016 social media sports challenge
Since 1990 IBM has been the official supplier of Information Technology to Wimbledon and is responsible for providing its full digital platform (website, mobile apps, data and social media analytics).
These digital platforms are Wimbledon’s primary communication channel with fans during the championships, and social networks make up an important—and ever increasing—part of that channel. In order to best serve that audience, Wimbledon needs to listen to those networks. For 2016, IBM built a Cognitive Social Command Centre that carried out real-time analysis of social media to give Wimbledon’s media team a live view of what was being said about The Championships.
The early summer of 2016 provided a packed sporting calendar: Wimbledon, Euro 2016, the Tour de France, several F1 Grands Prix and the build-up to the Rio Olympics. Wimbledon wanted to analyze not just social media content relating to tennis and its own tournament, but comparative content across all those major sports events.
Worrying about unpredictable data volumes
Perhaps the biggest challenge in developing the Cognitive Social Command Centre was in handling the volume of social content that would be generated during the peak moments of any of the sports events: the winning moment of a tennis match; scoring a goal in a Euro final; or a big crash in the Grand Prix. Conversely, the background level of social media posts surrounding, but not during, the action can be relatively low. We saw around a 40x increase in message rates from just a few minutes before the action started, to the peak during the event.
Message rates on social media can be unpredictable. Experience during previous tournaments gave us hints about what to expect, but there was always the possibility of something we had failed to anticipate. If Justin Bieber had turned up on the F1 grid, or Taylor Swift had commented on a Euro 2016 penalty decision, we would have seen a sudden (and difficult to predict) burst of activity.
Wimbledon’s requirement was for the Cognitive Social Command Centre to run in realtime, so we needed to be able to scale the analytics dynamically to cope with demand. Given that requirement, we also wanted to fail gracefully if messages were arriving faster than we could process them. We needed to avoid the application crashing or losing data however high the input rate reached.
IBM Message Hub was one component that helped us address this challenge.
Getting started with Message Hub
IBM Message Hub is a publish-subscribe messaging engine managed by Bluemix, based on Apache Kafka. Message Hub allows fast routing of large volumes of data. It’s used in different environments to meet different use cases. Here I am going to talk about how we used it in Wimbledon’s Cognitive Social Command Centre for buffering and load balancing.
Message Hub has several fundamental concepts that are useful to understand: producers, topics, partitions, consumers and consumer groups.
There are three different APIs that can be used with Message Hub: Java, MQ Light and REST. For Wimbledon (and the purposes of this post), we used the Java API.
Producers are responsible for putting data on to Message Hub. In the Cognitive Social Command Centre, as soon as we received a new message from a social network, it was put on to the Message Hub queue. This happened before any analysis of the content had taken place, keeping the processing requirements of the producer to a minimum.
Producers publish messages to a specific topic. If that topic has multiple partitions, Message Hub will decide which partition (see below) to use for each message (this can be manually overridden, but for the Cognitive Social Command Centre we let Message Hub allocate partitions). The producer, when in its default configuration, will batch up messages before sending to Message Hub. It does this dynamically, without user input, to efficiently balance throughput against delay.
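To make the batching behaviour described above more concrete, here is a minimal sketch of the producer settings that control it. The property keys are standard Apache Kafka client configuration names. The broker address, the class name and the chosen values are illustrative assumptions, not the settings used for Wimbledon, and the security settings a hosted service requires are left out.

```java
import java.util.Properties;

final class ProducerConfigSketch {

    // Builds producer settings using standard Apache Kafka client keys.
    // The broker address passed in is a placeholder, not a real endpoint.
    static Properties producerProperties(String bootstrapServers) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", bootstrapServers);
        props.setProperty("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.setProperty("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        // Batching knobs (illustrative values): wait up to 5 ms for more
        // messages, and cap each per-partition batch at 16 KB.
        props.setProperty("linger.ms", "5");
        props.setProperty("batch.size", "16384");
        return props;
    }

    public static void main(String[] args) {
        // Hypothetical broker address, used only to show the resulting settings.
        Properties props = producerProperties("broker-host:9093");
        props.forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

A Properties object like this would normally be handed to the Kafka producer constructor. Leaving out the two batching keys keeps the client defaults, which is what the paragraph above describes.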
Topics are used to separate messages into different queues. You send messages to a specific topic with the producer and receive messages for a specific topic from the consumer. Topics need to be configured in the Message Hub administration panel before being used.
When creating a topic, you need to specify two properties: the maximum time period messages will stay on that queue and how many partitions should be used for the topic (see below).
Partitions are used to subdivide a topic. A partition effectively gives you another queue, adding scalability (storage and throughput) to a topic without the developer having to deal with the underlying infrastructure. There are several reasons to add partitions to a topic. In the Cognitive Social Command Centre we used partitions for load balancing (see below), but adding partitions also increases the storage space available to a topic, and can be used to increase redundancy.
Consumers are responsible for taking messages from Message Hub. Different parts of the Cognitive Social Command Centre application took data from different topics through consumers. Consumers receive messages from a specific topic or set of topics.
Message Hub can be used in different configurations. In one setup you may want many consumers to receive all the messages in a queue. An example would be a news application where headlines are sent out to the clients via Message Hub. In this case all the clients need to be able to receive all the messages.
In an alternative setup, you may want to distribute messages to a set of consumers, but only have each message go to just one consumer. An example of this would be if you want to use Message Hub to load balance across a set of processing consumers (see Load Balancing).
Consumer groups allow Message Hub to be used for both these scenarios. Each consumer assigns itself a consumer group name. Messages will be delivered to one consumer instance within each consumer group. If all consumers are in the same group, each message will be delivered to just one consumer. If all consumers are in different groups, each message will be delivered to all consumers.
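The two delivery scenarios can be illustrated with a small simulation. This is a toy model written for this post rather than Message Hub or Kafka client code: the class, the method and the consumer names are hypothetical, and the round-robin choice within a group only stands in for the partition-based allocation the service actually performs.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ConsumerGroupSketch {

    // Returns the consumers that receive message number messageIndex.
    // groups maps a consumer group name to the consumers in that group.
    // Exactly one consumer per non-empty group receives each message.
    static List<String> recipients(Map<String, List<String>> groups, int messageIndex) {
        List<String> result = new ArrayList<>();
        for (List<String> members : groups.values()) {
            if (!members.isEmpty()) {
                result.add(members.get(messageIndex % members.size()));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Load balancing: all consumers share one group.
        Map<String, List<String>> sameGroup = new LinkedHashMap<>();
        sameGroup.put("analytics", List.of("a1", "a2", "a3"));

        // Broadcast: every consumer is in a group of its own.
        Map<String, List<String>> separateGroups = new LinkedHashMap<>();
        separateGroups.put("g1", List.of("c1"));
        separateGroups.put("g2", List.of("c2"));
        separateGroups.put("g3", List.of("c3"));

        for (int i = 0; i < 3; i++) {
            System.out.println("message " + i
                    + ": same group -> " + recipients(sameGroup, i)
                    + ", separate groups -> " + recipients(separateGroups, i));
        }
    }
}
```

With a single shared group each message reaches one consumer, while with one group per consumer every message reaches all three.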
Using buffering to spread the load
One of the main reasons for using Message Hub in building the Cognitive Social Command Centre was to establish a buffer between receiving social media messages and processing them. New content was pushed on to a Message Hub queue and processing nodes took that content from the queue for analysis. As the rate of messages coming in was unpredictable, having a buffer meant that even if we weren’t able to analyze the data as fast as it was arriving, we would not lose data. The buffer built up while data was arriving faster than it was processed, but was drained as soon as the input rate was lower than the processing rate.
One thing to be careful of with buffering is the storage limit in Message Hub. A partition can store up to 1GB of data, so you need to make sure that the backlog on a partition never reaches that size, or messages will be lost.
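A rough way to reason about that limit is to estimate how long a partition can absorb a burst before its backlog reaches 1GB. The sketch below is back-of-the-envelope arithmetic only. It assumes the backlog starts empty, takes 1GB as 10^9 bytes, and uses hypothetical message rates and sizes rather than figures measured at Wimbledon.

```java
final class BufferHeadroomSketch {

    // Per-partition storage limit from the text, taken here as 10^9 bytes (assumption).
    static final long PARTITION_LIMIT_BYTES = 1_000_000_000L;

    // Seconds until the unprocessed backlog on one partition reaches the limit,
    // given per-partition arrival and processing rates in messages per second.
    // Returns positive infinity when the backlog is not growing.
    static double secondsUntilFull(double inPerSecond, double outPerSecond, double avgMessageBytes) {
        double growthBytesPerSecond = (inPerSecond - outPerSecond) * avgMessageBytes;
        if (growthBytesPerSecond <= 0) {
            return Double.POSITIVE_INFINITY;
        }
        return PARTITION_LIMIT_BYTES / growthBytesPerSecond;
    }

    public static void main(String[] args) {
        // Hypothetical burst: 1500 messages/s arriving, 500 messages/s processed,
        // 2000 bytes per message on average.
        double seconds = secondsUntilFull(1500, 500, 2000);
        System.out.println("Seconds of headroom during the burst: " + seconds);
    }
}
```

If the estimated headroom is shorter than the longest burst you expect, the usual options are to add processing instances or to spread the topic over more partitions.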
Scaling horizontally to increase throughput
In building the Cognitive Social Command Centre we also needed to scale the analytics processes. The analytics were running in Bluemix using Java Buildpack instances. This meant we were easily able to increase (and decrease) the number of these instances, and therefore change the total processing capacity of our application, through the Bluemix dashboard. Once we had multiple processing instances running, the challenge was in distributing the input data evenly between them. This was achieved by using partitions and consumer groups.
We had a single producer that took social media messages in realtime and added them to a Message Hub topic. This topic was configured to use multiple partitions.
Each analytic instance connected to Message Hub using the same consumer group. This meant that a social media message would only ever be sent to one analytic instance.
We needed to have at least as many partitions as the maximum number of distributed analytic processes we ever wanted to run. Having more partitions than needed does not cause a problem. In such a situation a consumer will connect to multiple partitions to ensure all messages are processed.
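The relationship between partition count and consumer count can be sketched with a simplified assignment function. This is not the assignment algorithm Kafka or Message Hub uses internally. It is a hypothetical modulo-based spread that only illustrates the two cases described above: spare partitions are shared out among the consumers, while spare consumers have nothing to read.

```java
import java.util.ArrayList;
import java.util.List;

final class PartitionAssignmentSketch {

    // Spreads partitions 0..partitionCount-1 over consumerCount consumers.
    // Element i of the result lists the partitions read by consumer i.
    static List<List<Integer>> assign(int partitionCount, int consumerCount) {
        if (consumerCount <= 0) {
            throw new IllegalArgumentException("consumerCount must be positive");
        }
        List<List<Integer>> assignment = new ArrayList<>();
        for (int c = 0; c < consumerCount; c++) {
            assignment.add(new ArrayList<>());
        }
        for (int p = 0; p < partitionCount; p++) {
            assignment.get(p % consumerCount).add(p);
        }
        return assignment;
    }

    public static void main(String[] args) {
        // More partitions than consumers: some consumers read two partitions.
        System.out.println("6 partitions, 4 consumers: " + assign(6, 4));
        // Fewer partitions than consumers: the last consumer sits idle.
        System.out.println("2 partitions, 3 consumers: " + assign(2, 3));
    }
}
```

In the second case the third consumer receives an empty list, which is why the partition count sets an upper bound on how far the analytics can usefully scale out.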
Figure: Message Hub architecture
Partitions cannot be added to an existing topic dynamically; the topic has to be recreated. It is therefore better to create more partitions than you currently need, bearing in mind that Bluemix charges a fee per partition.
How to try it out
The Cognitive Social Command Centre was a complex system, relying on many Bluemix APIs and services, but Message Hub proved to be the critical means of integration and allowed us to cope with the scale and rates of data throughout the two weeks of The Championships. Message Hub can also be useful for smaller applications as a means of load balancing, buffering or integration. For example code and more details on the different ways of configuring Message Hub, see Message Hub Example Code – Tutorial.
To explore the catalogue of APIs on Bluemix and start building for free, you can sign up for a 30-day trial today.
Not a developer but want to explore how we can help transform your ideas into apps? Then the Bluemix Garage could be a great place to start. They are a consultancy with the DNA of a startup that helps companies build engaging applications with IBM Design Thinking. Request a meeting with our Bluemix Garage today.