How to develop a simple data pipeline on IBM Cloud
Are you looking for ways to reduce the operational effort and cost of your ETL/ELT processes? In this tutorial, we demonstrate how to develop a simple data pipeline on IBM Cloud in a few steps. You will learn how to use managed cloud services to build serverless data pipelines that are billed by consumption. This lets you retire your on-premises load infrastructure and leverage Kubernetes jobs and Spark as a Service to build and automate ETL/ELT processes.
What is a serverless data pipeline?
A simple definition of serverless is building and running an application without managing the underlying servers. For a deep dive into all things serverless, check out our Learn Hub article "What is Serverless Computing?"
A serverless data pipeline, then, can be defined as a workflow that transfers (or ingests) data from one place to another without managing the servers.
To follow this tutorial, you must have the following services instantiated on your IBM Cloud account:

- IBM Cloudant
- IBM Cloud Object Storage
- IBM Cloud SQL Query
- IBM Db2 on Cloud
Once the resources are provisioned, the steps below walk you through building a simple data pipeline from a JSON document database into a CSV file or a Db2 warehouse:
As part of this tutorial, we created a simple database called users on IBM Cloudant with the following structure:
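The exact schema is not reproduced here; a hypothetical users document might look like the following, where all field names and values are illustrative assumptions apart from the Cloudant-generated _id and _rev fields:

```json
{
  "_id": "7f9d2b1c4e5a8d3f6b0c9e2a1d4f7b8c",
  "_rev": "1-a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6",
  "name": "Jane Doe",
  "email": "jane.doe@example.com",
  "created_date": "2020-06-01"
}
```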
We are using the _all_docs endpoint of Cloudant to extract the entire database. To keep this exercise simple, we use a curl command to extract the data into a file named extract.json:
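The original command is not shown here. A minimal sketch, assuming legacy service credentials exported as CLOUDANT_USER and CLOUDANT_PASSWORD and the instance hostname as CLOUDANT_HOST (all placeholders; an IAM-only Cloudant instance would need an IAM bearer token instead of basic auth):

```shell
# Fetch every document in the users database, including the full
# document bodies, and save the JSON response to extract.json.
curl -s -u "$CLOUDANT_USER:$CLOUDANT_PASSWORD" \
  "https://$CLOUDANT_HOST/users/_all_docs?include_docs=true" \
  -o extract.json
```

The include_docs=true parameter is what makes _all_docs return the document bodies rather than just the IDs and revisions.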
Depending on the database size and your choice of programming language, you could instead write a program to extract the data from Cloudant.
Because the data is in a JSON structure, it cannot be visualized directly, and performing analysis on it is difficult. Analysis is commonly done with relational databases such as Db2 or Db2 Warehouse on Cloud, or with visualization tools such as Cognos Analytics.
Data upload to IBM Cloud Object Storage (COS)
The next step is to make this data available to IBM Cloud SQL Query so that it can be transformed into a structure suitable for analysis. To do this, the data has to be pushed into a Cloud Object Storage service. We use the ibmcloud cos CLI to push the data file into a COS bucket called cos-demo. The command and its output are shown below:
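The exact command and output are not reproduced here; a minimal sketch, assuming the cloud-object-storage CLI plugin is installed (ibmcloud plugin install cloud-object-storage) and you are logged in to the target account:

```shell
# Upload the extracted file into the cos-demo bucket.
# The object key mirrors the local file name so that SQL Query
# can reference the object as extract.json.
ibmcloud cos upload --bucket cos-demo --key extract.json --file extract.json
```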
Once the file is available in COS, the following SQL queries can be used to transform the data into a CSV file or a Db2 table.
Data transformation in SQL Query to CSV
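The original query is not shown above. Below is a sketch of what such a transformation might look like in SQL Query, assuming the cos-demo bucket on the us-geo endpoint and the illustrative fields name, email, and created_date from earlier; because _all_docs wraps every document inside a rows array, the Spark SQL explode function is used to flatten it:

```sql
-- Flatten the _all_docs export (rows[].doc) and write it to COS as CSV.
SELECT r.doc._id AS id,
       r.doc.name AS name,
       r.doc.email AS email,
       r.doc.created_date AS created_date
FROM (SELECT explode(rows) AS r
      FROM cos://us-geo/cos-demo/extract.json STORED AS JSON)
INTO cos://us-geo/cos-demo/users_result STORED AS CSV
```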
Data transformation in SQL Query to Db2 table
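Again, a sketch rather than the original query: SQL Query can also write results directly into a Db2 table. The <DB2-INSTANCE-CRN> placeholder below stands in for your own Db2 on Cloud instance CRN, and the target table name USERS is an assumption:

```sql
-- Same flattening as the CSV variant, but the result lands in a Db2 table.
-- <DB2-INSTANCE-CRN> is a placeholder for the Db2 on Cloud instance CRN,
-- and USERS is an assumed target table name.
SELECT r.doc._id AS id,
       r.doc.name AS name,
       r.doc.created_date AS created_date
FROM (SELECT explode(rows) AS r
      FROM cos://us-geo/cos-demo/extract.json STORED AS JSON)
INTO <DB2-INSTANCE-CRN>/USERS
```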
Data analysis in Db2 on Cloud
Once the data is available in Db2, it can be queried directly or connected to a visualization tool. As an example, the query below shows the total number of users created on each day:
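The original query is not reproduced above; a plain-SQL sketch matching that description, assuming the transformed table is named USERS and has a CREATED_DATE column (both assumptions):

```sql
-- Total number of users created on each day.
SELECT CREATED_DATE,
       COUNT(*) AS USER_COUNT
FROM USERS
GROUP BY CREATED_DATE
ORDER BY CREATED_DATE
```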
Note that we are using Db2 in this example, and a free Lite plan is available. The same query can be run against Db2 Warehouse as well.
The above metrics were collected on a local computer using an IBM Cloud Free (Lite) account. Upload and download speeds vary depending on network bandwidth. Also, these metrics are only meant to show how quickly a serverless data pipeline can be developed; they should not be used to benchmark the services.
Sign up for a free IBM Cloud account and try out the above example.