The core features comprising Watson Data Platform, Data Science Experience and Data Catalog on IBM Cloud, along with additional embedded AI services, including machine learning and deep learning, are now available in Watson Studio and Watson Knowledge Catalog. Get started for free at https://ibm.co/watsonstudio.
With the advent of sensors in all walks of life, the Internet of Things is on a path to generate the Biggest Big Data our planet has known. How can we successfully harness this ocean of data? We need end-to-end IoT data pipelines that collect, store, and analyze the data. In this blog, we introduce the Message Hub Object Storage bridge and show how it can enable just such an end-to-end IoT data pipeline on IBM Cloud.
An end-to-end IoT data pipeline
We all know we can’t boil the ocean. That’s why Big Data warrants specialized tools for collection, storage, and analysis that are extremely scalable and cost effective. Object Storage is well suited to storing massive amounts of IoT data at low cost, and it allows analytics frameworks to access the data directly. So how do we get the data into Object Storage, and how do we analyze it? The answer is to use best-of-breed open source frameworks such as Apache Kafka and Apache Spark. Instead of deploying and managing these services in house, why not benefit from a hosted solution, managed by experts in each of these areas? IBM Cloud provides services based on these frameworks, called Message Hub and Spark as a Service, respectively. With the Object Storage bridge, data pipelines from Message Hub to Object Storage can be easily set up and managed to generate analytics-ready data, which can be analyzed directly in the Data Science Experience using Spark as a Service. Moreover, the Watson IoT Platform can be used to capture IoT device data and send it to Message Hub; more information is available here.
The COSMOS Madrid Traffic Use Case
Let’s consider a real-life use case that we explored in the context of the EU-funded COSMOS research project: Madrid Council manages 3000 sensors across Madrid, publishing traffic speed, intensity, and other data via a web service every 5 minutes. The council runs control rooms where traffic engineers continuously monitor the captured traffic data; this can be slow and costly. Our aim was to help traffic engineers respond more quickly and efficiently to traffic problems. By comparing current traffic conditions with historical data for the same location and time of day/day of week, we were able to automatically detect traffic anomalies in real time.
Our Solution on IBM Cloud
The diagram below shows the architecture of our solution on IBM Cloud. Madrid Council publishes traffic sensor data using a web service; we used Node-RED to collect sensor readings from this service and publish them directly to Message Hub. These readings are collected by the Object Storage bridge and periodically uploaded to the Object Storage service. Spark as a Service accesses this data to calculate historical data patterns for each sensor and derive thresholds. Real-time sensor readings that cross these thresholds are considered anomalous; Node-RED detects them and raises an alarm.
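To make the threshold idea concrete, here is a minimal sketch in plain Python. It is not the project's Spark job or Node-RED flow. The field names (`sensor_id`, `timestamp`, `speed`), the grouping by weekday and hour, and the "mean plus or minus k standard deviations" rule are all assumptions chosen for illustration; the post does not say how the thresholds were actually derived.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean, pstdev


def build_thresholds(readings, k=3.0, min_samples=2):
    """Derive (low, high) speed bounds per (sensor, weekday, hour) from history.

    `readings` is an iterable of dicts with hypothetical keys
    "sensor_id", "timestamp" (ISO 8601 string) and "speed".
    """
    groups = defaultdict(list)
    for r in readings:
        ts = datetime.fromisoformat(r["timestamp"])
        groups[(r["sensor_id"], ts.weekday(), ts.hour)].append(r["speed"])

    thresholds = {}
    for key, values in groups.items():
        if len(values) < min_samples:
            continue  # too little history to say what is normal
        mu, sigma = mean(values), pstdev(values)
        thresholds[key] = (mu - k * sigma, mu + k * sigma)
    return thresholds


def is_anomalous(reading, thresholds):
    """Return True if the reading falls outside the bounds for its time slot."""
    ts = datetime.fromisoformat(reading["timestamp"])
    key = (reading["sensor_id"], ts.weekday(), ts.hour)
    if key not in thresholds:
        return False  # no historical baseline for this slot
    low, high = thresholds[key]
    return not (low <= reading["speed"] <= high)
```

In this sketch, a live reading is compared only with history from the same sensor, weekday, and hour, which mirrors the "same location and time of day/day of week" comparison described for the use case.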
Close Up View of the Object Storage Bridge
The Object Storage bridge is a new Message Hub feature which completes our IoT data pipeline. With a few clicks, an IBM Cloud user can create a new bridge, connecting a Message Hub topic to an Object Storage container. In this way, Message Hub messages are aggregated and uploaded to Object Storage. The data can be easily organized there in such a way that it can be directly analyzed by the Spark service using SQL queries.
Options for bridge creation include:
- How often to upload new objects. One can use a time-based policy (e.g., every 60 minutes) or a size-based policy (e.g., every 10 MB). Whichever event happens first triggers an upload.
- Whether to organize the data in Object Storage according to Kafka offsets or dates. Organizing data according to dates allows Spark SQL queries involving date ranges to access only relevant objects.
For our Madrid Traffic application we chose to upload objects every 15 minutes and organize the data according to dates. This means that all messages in an object have the same date and the date appears as part of the object name.
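The two options above can be modeled in a few lines of Python. This is only a sketch of the behavior as described, not the bridge's implementation or configuration API. The class and function names are hypothetical, and the object name layout (`topic/dt=YYYY-MM-DD/...`) is an assumption; the post states only that the date appears in the object name, not the exact format.

```python
from datetime import date


class UploadPolicy:
    """Trigger an upload when either the time limit or the size limit is hit."""

    def __init__(self, max_seconds, max_bytes):
        self.max_seconds = max_seconds
        self.max_bytes = max_bytes

    def should_upload(self, seconds_since_last_upload, buffered_bytes):
        # Whichever condition is met first triggers the upload.
        return (seconds_since_last_upload >= self.max_seconds
                or buffered_bytes >= self.max_bytes)


def object_name(topic, day, first_offset):
    """Build a date-partitioned object name (hypothetical layout)."""
    return f"{topic}/dt={day.isoformat()}/offset-{first_offset:012d}.json"


def object_date(name):
    """Extract the date from a name produced by object_name(), or None."""
    for part in name.split("/"):
        if part.startswith("dt="):
            return date.fromisoformat(part[3:])
    return None


def objects_in_range(names, start, end):
    """Keep only objects whose date lies in [start, end], inclusive."""
    selected = []
    for name in names:
        day = object_date(name)
        if day is not None and start <= day <= end:
            selected.append(name)
    return selected
```

Filtering by name is what makes date-based organization useful: a query over a date range can skip every object outside that range without reading its contents.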
A bridge doesn’t apply any format transformations, and messages are concatenated using newline characters. For example, our Madrid Traffic application bridge consumes JSON messages from Message Hub, and generates objects consisting of multiple JSON messages separated by newlines.
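Reading such an object back is straightforward. The sketch below assumes each message is a compact JSON document that contains no raw newline characters, so one line corresponds to one message. The function name and the sample field names are hypothetical.

```python
import json


def parse_bridge_object(text):
    """Parse newline-separated JSON messages into a list of Python objects."""
    records = []
    for line in text.split("\n"):
        line = line.strip()
        if line:  # skip blank lines, such as one after a trailing newline
            records.append(json.loads(line))
    return records


# Example object content: two JSON messages separated by a newline.
sample = '{"sensor_id": "s1", "speed": 50}\n{"sensor_id": "s2", "speed": 30}\n'
records = parse_bridge_object(sample)
```

This one-message-per-line layout is also what Spark's JSON reader expects by default, which is why the objects can be queried without a conversion step.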
The Message Hub service manages bridges and monitors them. If a bridge fails, it is automatically restarted. Typically, messages are delivered once to Object Storage, although some failure cases may result in duplicate messages.
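Because duplicates are possible, downstream analysis may want to drop repeated messages before aggregating. The following is one simple way to do that in Python, assuming each message carries fields that together identify it uniquely. The field names `sensor_id` and `timestamp` are hypothetical; the post does not describe the message schema.

```python
def deduplicate(records, key_fields=("sensor_id", "timestamp")):
    """Return records with repeats removed, keeping the first occurrence.

    Two records count as duplicates when all `key_fields` values match.
    """
    seen = set()
    unique = []
    for record in records:
        key = tuple(record.get(field) for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique
```

The same effect can be achieved in a Spark job by dropping duplicate rows on the identifying columns.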
The Object Storage bridge publishes metrics that can be displayed on the IBM Cloud platform Grafana dashboards. Metrics of interest include the bridge’s rate of consumption, and the maximal Kafka offset lag for the bridge. These metrics help indicate whether the bridge is keeping up with the producers for its topic.
Our demo application bridge consumes data roughly every 5 minutes, as new readings arrive from the Madrid Council web service. Its consumption pattern can be clearly seen in the above Grafana dashboard showing its bytes-consumed-rate.
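For readers unfamiliar with offset lag, the sketch below shows the usual Kafka definition: for each partition, the lag is the latest offset in the log minus the offset the consumer has reached, and the maximum is taken across partitions. How the bridge computes its published metric internally is not documented in this post, so treat the function and its inputs as illustrative assumptions.

```python
def max_offset_lag(log_end_offsets, consumed_offsets):
    """Return the largest per-partition lag.

    Both arguments map partition number -> offset. A partition with no
    consumed offset yet is treated as consumed up to offset 0.
    """
    lags = []
    for partition, end_offset in log_end_offsets.items():
        consumed = consumed_offsets.get(partition, 0)
        lags.append(max(end_offset - consumed, 0))
    return max(lags, default=0)
```

A lag that stays near zero suggests the bridge is keeping up with producers; a lag that grows steadily suggests it is falling behind.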
Find out more
Using the Object Storage bridge, complete IoT data pipelines can be deployed and managed on IBM Cloud without writing application code to do the plumbing. Instead, you can focus on your application logic. Find out more information via the IBM Cloud Message Hub documentation.