
Building End-to-End IoT Data Pipelines in the IBM Cloud using the Message Hub Object Storage Bridge


The core features comprising Watson Data Platform, Data Science Experience and Data Catalog on IBM Cloud, along with additional embedded AI services, including machine learning and deep learning, are now available in Watson Studio and Watson Knowledge Catalog. Get started for free at https://ibm.co/watsonstudio.

With the advent of sensors in all walks of life, the Internet of Things is on a path to generate the Biggest Big Data our planet has known. How can we successfully harness this ocean of data? We need end-to-end IoT data pipelines that collect, store, and analyze the data. In this blog, we introduce the Message Hub Object Storage bridge and show how it can enable just such an end-to-end IoT data pipeline on IBM Cloud.

An end-to-end IoT data pipeline

We all know we can’t boil the ocean; that’s why Big Data warrants specialized tools for collection, storage, and analysis that are extremely scalable and cost-effective. Object Storage is perfectly suited to storing massive amounts of IoT data at low cost, and allows analytics frameworks to access the data directly. So how do we get the data into Object Storage, and how do we analyze it? The answer is to use best-of-breed open source frameworks such as Apache Kafka and Apache Spark. Instead of deploying and managing these services in-house, why not benefit from a hosted solution, managed by experts in each of these areas? IBM Cloud provides services based on these frameworks, called Message Hub and Spark as a Service, respectively. With the Object Storage bridge, data pipelines from Message Hub to Object Storage can be easily set up and managed to generate analytics-ready data, which can be analyzed directly by the Data Science Experience using Spark as a Service. Moreover, the Watson IoT Platform can be used to capture IoT device data and send it to Message Hub – more information is available here.

[Figure: the end-to-end IoT data pipeline]

The COSMOS Madrid Traffic Use Case

Let’s consider a real-life use case that we explored in the context of the EU-funded COSMOS research project. Madrid Council manages 3,000 sensors across Madrid, publishing traffic speed, intensity, and other data via a web service every 5 minutes. The council runs control rooms where traffic engineers continuously monitor the captured traffic data; this manual monitoring can be slow and costly. Our aim was to help traffic engineers respond more quickly and efficiently to traffic problems. By comparing current traffic conditions with historical data for the same location, time of day, and day of week, we were able to automatically detect traffic anomalies in real time.
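
The thresholding idea can be sketched in a few lines of Python. This is illustrative only, not the COSMOS implementation, and the field names ('sensor', 'weekday', 'hour', 'speed') are hypothetical: group historical readings by sensor and time slot, derive a mean ± k·stddev band, and flag current readings that fall outside it.

```python
import statistics
from collections import defaultdict

def build_thresholds(history, k=3.0):
    """Derive per-(sensor, weekday, hour) speed bands from historical readings.

    history: iterable of dicts with hypothetical keys
             'sensor', 'weekday', 'hour', 'speed'.
    Returns {(sensor, weekday, hour): (low, high)}.
    """
    groups = defaultdict(list)
    for r in history:
        groups[(r['sensor'], r['weekday'], r['hour'])].append(r['speed'])
    thresholds = {}
    for key, speeds in groups.items():
        mean = statistics.mean(speeds)
        sd = statistics.pstdev(speeds)
        thresholds[key] = (mean - k * sd, mean + k * sd)
    return thresholds

def is_anomalous(reading, thresholds):
    """Flag a current reading whose speed falls outside its historical band."""
    key = (reading['sensor'], reading['weekday'], reading['hour'])
    band = thresholds.get(key)
    if band is None:
        return False  # no history for this slot; don't raise an alarm
    low, high = band
    return not (low <= reading['speed'] <= high)
```

In the real pipeline, the historical patterns are computed by Spark over data in Object Storage, while the comparison against live readings happens in real time.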

[Figure: Madrid Council traffic sensors]

Our Solution on the IBM Bluemix Platform

The diagram below shows the architecture of our solution on IBM Cloud. Madrid Council publishes traffic sensor data using a web service; we used Node-RED to collect sensor readings from this service and publish them directly to Message Hub. These readings are collected by the Object Storage bridge and periodically uploaded to the Object Storage service. Spark as a Service accesses this data to calculate historical data patterns for each sensor and derive thresholds. Real-time sensor readings that cross these thresholds are considered anomalous; Node-RED detects them and raises an alarm.
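
As a rough sketch of the collection step (we actually used Node-RED; this Python version is only illustrative, and the response shape with 'timestamp' and 'sensors' keys is an assumption): each poll of the web service is split into one JSON message per sensor, ready to publish to a Message Hub (Kafka) topic.

```python
import json

def readings_to_messages(payload):
    """Split one web-service response into one JSON message per sensor.

    payload: dict with the hypothetical shape
             {'timestamp': '...', 'sensors': [{'id': ..., 'speed': ..., 'intensity': ...}, ...]}
    Returns a list of JSON strings, one per sensor reading.
    """
    messages = []
    for sensor in payload['sensors']:
        # Stamp each per-sensor record with the poll timestamp.
        record = {'timestamp': payload['timestamp'], **sensor}
        messages.append(json.dumps(record, sort_keys=True))
    return messages
```

Publishing each reading as its own message keeps the topic fine-grained, so downstream consumers can process sensors independently.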

[Figure: solution architecture on IBM Cloud]

Close Up View of the Object Storage Bridge

The Object Storage bridge is a new Message Hub feature which completes our IoT data pipeline. With a few clicks, an IBM Cloud user can create a new bridge, connecting a Message Hub topic to an Object Storage container. In this way, Message Hub messages are aggregated and uploaded to Object Storage. The data can be easily organized there in such a way that it can be directly analyzed by the Spark service using SQL queries.

Options for bridge creation include:

  • How often to upload new objects. One can use a time-based policy (e.g., every 60 minutes) or a size-based policy (e.g., every 10 MB). Whichever event happens first triggers an upload.
  • Whether to organize the data in Object Storage according to Kafka offsets or dates. Organizing data according to dates allows Spark SQL queries involving date ranges to access only relevant objects.
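A minimal sketch of the "whichever happens first" upload policy described above (this models the documented behaviour, not the bridge's actual code; the class and parameter names are hypothetical):

```python
import time

class UploadPolicy:
    """Trigger an upload when either the time window elapses or the
    buffered size reaches the limit, whichever happens first."""

    def __init__(self, max_seconds=3600, max_bytes=10 * 1024 * 1024,
                 clock=time.monotonic):
        self.max_seconds = max_seconds
        self.max_bytes = max_bytes
        self.clock = clock          # injectable for testing
        self.buffered = 0
        self.window_start = clock()

    def record(self, message_bytes):
        """Account for a newly consumed message."""
        self.buffered += message_bytes

    def should_upload(self):
        """True once either the size or the time threshold is crossed."""
        return (self.buffered >= self.max_bytes or
                self.clock() - self.window_start >= self.max_seconds)

    def reset(self):
        """Start a fresh window after an upload completes."""
        self.buffered = 0
        self.window_start = self.clock()
```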

For our Madrid Traffic application we chose to upload objects every 15 minutes and to organize the data according to dates. This means that all messages in an object have the same date, and the date appears as part of the object name.
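
To illustrate date-based organization, here is one possible naming scheme in Python (the exact format the bridge uses may differ; the offset-range suffix is an assumption). Embedding the date in the name lets a Spark SQL query over a date range read only the matching object prefixes:

```python
from datetime import datetime

def object_name(topic, first_offset, last_offset, when):
    """Build a date-partitioned object name for an uploaded batch.

    topic:        the Message Hub (Kafka) topic the messages came from
    first_offset, last_offset: the Kafka offset range in this object
    when:         a datetime giving the messages' common date
    """
    date = when.strftime('%Y-%m-%d')
    return f"{topic}/{date}/{first_offset:012d}-{last_offset:012d}.json"
```

Zero-padding the offsets keeps lexicographic order equal to offset order, which makes object listings easy to scan.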

[Figure: date-organized objects in the Object Storage container]

A bridge doesn’t apply any format transformations, and messages are concatenated using newline characters. For example, our Madrid Traffic application bridge consumes JSON messages from Message Hub, and generates objects consisting of multiple JSON messages separated by newlines.
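
The newline-delimited layout is easy to reproduce and parse. A small Python sketch (illustrative, not the bridge's code) shows both directions: concatenating messages into an object body, and reading the body back record by record, as an analytics framework like Spark would:

```python
import json

def messages_to_object(messages):
    """Concatenate Kafka messages into one object body, separated by
    newlines; no other format transformation is applied."""
    return "\n".join(messages)

def object_to_records(body):
    """Parse a newline-delimited JSON object body back into records."""
    return [json.loads(line) for line in body.splitlines() if line]
```

This is exactly the JSON-lines layout that Spark's JSON reader consumes directly, which is what makes the uploaded objects analytics-ready.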

[Figure: object contents, newline-separated JSON messages]

The Message Hub service manages and monitors bridges. If a bridge fails, it is automatically restarted. Typically, each message is delivered to Object Storage exactly once, although some failure cases may result in duplicate messages.
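
Because occasional duplicates are possible, a downstream reader can tolerate them cheaply if object names carry Kafka offset ranges. One way to do this (a sketch under that assumption, not a feature of the bridge itself):

```python
def dedupe_by_offset(objects):
    """Drop objects whose Kafka offset range was already fully covered
    by an earlier object, so duplicated uploads are read only once.

    objects: iterable of hypothetical (first_offset, last_offset, name)
             tuples describing uploaded objects.
    Returns the object names to actually read, in offset order.
    """
    seen_up_to = -1
    kept = []
    for first, last, name in sorted(objects):
        if last <= seen_up_to:
            continue  # this range is a duplicate of what we already read
        kept.append(name)
        seen_up_to = max(seen_up_to, last)
    return kept
```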

The Object Storage bridge publishes metrics that can be displayed on the IBM Cloud platform Grafana dashboards. Metrics of interest include the bridge’s rate of consumption and the maximal Kafka offset lag for the bridge. These metrics help indicate whether the bridge is keeping up with the producers for its topic.
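
The offset-lag metric has a simple definition. As a sketch (the function and parameter names are ours, not the bridge's): for each partition, lag is the distance between the newest offset on the topic and the bridge's committed position, and the metric reports the maximum across partitions.

```python
def max_offset_lag(latest_offsets, committed_offsets):
    """Maximal Kafka offset lag across partitions.

    latest_offsets:    {partition: newest offset produced to the topic}
    committed_offsets: {partition: offset the bridge has committed}
    A lag of zero on every partition means the bridge is keeping up
    with the producers for its topic.
    """
    return max(latest_offsets[p] - committed_offsets.get(p, 0)
               for p in latest_offsets)
```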

[Figure: Grafana dashboard showing the bridge's bytes-consumed rate]

Our demo application bridge consumes data from the Madrid Council web service roughly every 5 minutes. This consumption pattern is clearly visible in the Grafana dashboard above, which shows the bridge's bytes-consumed rate.

Find out more

Using the Object Storage bridge, complete IoT data pipelines can be deployed and managed on IBM Cloud without writing application code to do the plumbing. Instead, you can focus on your application logic. Find out more in the IBM Cloud Message Hub documentation.
