Analyze and visualize open data with Apache Spark

Many government agencies and public administrations publish their data as open data. With IBM Watson Studio, Jupyter Notebooks, and Apache Spark, you can retrieve, combine, and analyze data from different sources, then visualize the results. This IBM Cloud solution tutorial shows what it takes.

Architecture: Open Data Analytics

Overview

In the tutorial, you use IBM Watson Studio to organize all required resources. Watson Studio serves as the glue between the data, Cloud Object Storage, Apache Spark as the compute platform, and Jupyter Notebooks. Jupyter Notebook is an open-source web application for creating documents that contain live code, equations, visualizations, and narrative text.

You are going to combine open data about country population, life expectancy, and country ISO codes. First, you load the data into so-called data frames. Then, because data from different sources may come in different formats, you transform the frames so that they can be combined. After that, you analyze the data using SQL. With the PixieDust library, visualizations are easy to create, too. The following screenshot shows how life expectancy by country can be depicted on a zoomable map.

Mapping Life Expectancy
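
The load, transform, and SQL steps described above can be sketched without a Spark cluster. The following is a minimal stand-in, not the tutorial's code: it uses only Python's standard library (csv and sqlite3) in place of Spark data frames and Spark SQL, and the country names, column names, and values are invented placeholders rather than real open data.

```python
import csv
import io
import sqlite3

# Hypothetical sample inputs; names and values are made up for illustration.
POPULATION_CSV = """Country Name,Country Code,Population
Atlantis,ATL,1000
Borduria,BOR,2500
"""

LIFE_EXPECTANCY_CSV = """country,life_expectancy
atlantis,70.5
borduria,65.0
"""

ISO_CSV = """name,alpha2,alpha3
Atlantis,AT,ATL
Borduria,BO,BOR
"""


def read_rows(text):
    """Parse CSV text into a list of dicts (stand-in for loading a data frame)."""
    return list(csv.DictReader(io.StringIO(text)))


def build_database():
    """Load the three sources, normalize them, and register them as SQL tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE population (alpha3 TEXT, population INTEGER)")
    conn.execute("CREATE TABLE life (name_key TEXT, life_expectancy REAL)")
    conn.execute("CREATE TABLE iso (name_key TEXT, name TEXT, alpha2 TEXT, alpha3 TEXT)")

    # Transform step: pick columns, convert types, and normalize the join keys,
    # because each source uses a different format.
    conn.executemany(
        "INSERT INTO population VALUES (?, ?)",
        [(r["Country Code"], int(r["Population"])) for r in read_rows(POPULATION_CSV)],
    )
    conn.executemany(
        "INSERT INTO life VALUES (?, ?)",
        [(r["country"].strip().lower(), float(r["life_expectancy"]))
         for r in read_rows(LIFE_EXPECTANCY_CSV)],
    )
    conn.executemany(
        "INSERT INTO iso VALUES (?, ?, ?, ?)",
        [(r["name"].strip().lower(), r["name"], r["alpha2"], r["alpha3"])
         for r in read_rows(ISO_CSV)],
    )
    return conn


def life_expectancy_by_country(conn):
    """Combine the three tables through the ISO lookup table using SQL."""
    query = """
        SELECT iso.alpha2, iso.name, population.population, life.life_expectancy
        FROM iso
        JOIN population ON population.alpha3 = iso.alpha3
        JOIN life ON life.name_key = iso.name_key
        ORDER BY life.life_expectancy DESC
    """
    return conn.execute(query).fetchall()


combined = life_expectancy_by_country(build_database())
for row in combined:
    print(row)
```

In the notebook itself, Spark data frames and Spark SQL take the place of the in-memory tables used here, and the combined result is what a visualization library such as PixieDust would then plot by country code.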

Conclusions

In just a few steps, you can retrieve open data sets from different sources, combine and analyze them in a Jupyter Notebook in Watson Studio, and visualize the results. Try it yourself by following the tutorial "Analyze and visualize open data with Apache Spark". Also, check out the other IBM Cloud solution tutorials in the IBM Cloud documentation.

If you have feedback, suggestions, or questions about this post, please reach out to me on Twitter (@data_henrik) or LinkedIn.

Technical Offering Manager / Developer Advocate
