Building Your First Serverless Data Pipeline on IBM Cloud

How to develop a simple data pipeline on IBM Cloud.

Are you looking for options to reduce the operations effort and cost of ETL/ELT processes? In this tutorial, we demonstrate how to develop a simple data pipeline in a few steps on IBM Cloud. You will learn how to use managed cloud services to build serverless data pipelines that are billed by consumption. This lets you retire your on-premises load infrastructure and leverage Kubernetes jobs and Spark as a Service to build and automate ETL/ELT processes.

What is a serverless data pipeline?

A simple definition of serverless is developing and running an application without managing the servers. For a deep dive into all things serverless, check out our Learn Hub article "What is Serverless Computing?"

A serverless data pipeline, then, can be defined as a workflow that transfers (or ingests) data from one place to another without managing the servers.

Prerequisites

The following services must be provisioned on your IBM Cloud account before you follow this tutorial: IBM Cloudant, IBM Cloud Object Storage, IBM Cloud SQL Query, and IBM Db2 on Cloud (or Db2 Warehouse on Cloud).

Once the resources are provisioned, the following process builds a simple data pipeline from a JSON-based database to a CSV file or a Db2 warehouse:

Data extraction  

As part of this tutorial, we created a simple database called users on IBM Cloudant.

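A representative users document, using the field names that the SQL transformation later in this tutorial refers to (the values shown are illustrative placeholders, not data from the original database), looks like this:

{
  "_id": "0c2a9f4e8d1b4c6f9a3e5d7b1f2c4e6a",
  "username": "jdoe",
  "firstname": "John",
  "lastname": "Doe",
  "created": "2021-06-15"
}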

We use the _all_docs endpoint of Cloudant to extract the entire database. To keep this exercise simple, we use a curl command to extract the data into a file called extract.json:

curl --location --request GET 'https://<hostname>/users/_all_docs?include_docs=true' --header 'Content-Type: application/json' -u'<username>:<password>' | jq -c . > extract.json

Depending on the database size and the choice of programming language, one can write a program to extract the data from Cloudant instead. 
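For illustration, the following is a minimal Python sketch of the same extraction using the requests library against the _all_docs endpoint shown above; the hostname, credentials, and the choice of requests are assumptions for this example rather than part of the original tutorial:

# Minimal sketch: extract every document from the Cloudant users database and
# write the response to extract.json, mirroring the curl command above.
# <hostname>, <username> and <password> are placeholders.
import json
import requests

response = requests.get(
    "https://<hostname>/users/_all_docs",
    params={"include_docs": "true"},
    auth=("<username>", "<password>"),
)
response.raise_for_status()

with open("extract.json", "w") as f:
    json.dump(response.json(), f)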

As this is a JSON structure, the data cannot be visualized directly, and performing an analysis is difficult. Common tools used for analysis are relational databases like Db2 and Db2 Warehouse on Cloud, or visualization tools like Cognos Analytics.

Data upload to IBM Cloud Object Storage (COS) 

The next step in the process is to make this data available for IBM Cloud SQL Query to transform it into a structure that can be analyzed. To perform this step, the data has to be pushed into a Cloud Object Storage service. We are using the ibmcloud cos CLI to push the data file into a COS bucket called cos-demo. The command is shown below:

ibmcloud cos object-put --bucket cos-demo --key <target-file-name> --body <source-file-name> --content-encoding utf-8

Data transformation

Once the file is available in COS, the following SQL queries can be used to transform the data into a CSV file or a Db2 table.

Data transformation in SQL Query to CSV

SELECT
  usr.columns.doc._id,
  usr.columns.doc.username,
  usr.columns.doc.created,
  usr.columns.doc.lastname,
  usr.columns.doc.firstname
FROM
  cos://<cos_file_path>/<file_name> STORED AS JSON AS users
  LATERAL VIEW EXPLODE(users.rows) usr AS columns
-- INTO clause was automatically added based on the default target
INTO cos://<target_file_path>/ STORED AS CSV

Data transformation in SQL Query to a Db2 table

SELECT
  usr.columns.doc._id,
  usr.columns.doc.username,
  usr.columns.doc.created,
  usr.columns.doc.lastname,
  usr.columns.doc.firstname
FROM
  cos://<cos_file_path>/<file_name> STORED AS JSON AS users
  LATERAL VIEW EXPLODE(users.rows) usr AS columns
INTO <db2-service-crn>/<schemaname>.<tablename>
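
Either of the transformations above can also be submitted programmatically rather than through the SQL Query UI. The following is a minimal sketch assuming the ibmcloudsql Python SDK; the API key, instance CRN, and bucket paths are placeholders, and the exact method names should be verified against the SDK version you use:

# Minimal sketch: submit the CSV transformation to IBM Cloud SQL Query via the
# ibmcloudsql SDK. All credentials, CRNs and paths below are placeholders.
import ibmcloudsql

sql_client = ibmcloudsql.SQLQuery(
    "<api-key>",                    # IBM Cloud API key
    "<sql-query-instance-crn>",     # CRN of the SQL Query instance
    "cos://<target_file_path>/",    # default Cloud Object Storage target for results
)
sql_client.logon()

sql = """
SELECT usr.columns.doc._id, usr.columns.doc.username, usr.columns.doc.created,
       usr.columns.doc.lastname, usr.columns.doc.firstname
FROM cos://<cos_file_path>/<file_name> STORED AS JSON AS users
LATERAL VIEW EXPLODE(users.rows) usr AS columns
INTO cos://<target_file_path>/ STORED AS CSV
"""

job_id = sql_client.submit_sql(sql)          # submit the query as an asynchronous job
sql_client.wait_for_job(job_id)              # block until the job finishes
result_df = sql_client.get_result(job_id)    # fetch the result as a pandas DataFrame
print(result_df.head())

The same pattern applies to the Db2 target by swapping in the INTO clause from the second query.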

Data analysis in Db2 on Cloud

Once the data is available in Db2, it can be queried or connected to a visualization tool. As an example, the query below shows the total number of users created on each day:

SELECT DATE("created") AS date_created, COUNT(1) AS total FROM CRC40397.USERS_10000 GROUP BY DATE("created")

Please note that we are using Db2 in the example, and there is a Lite version available. The same query can be run against Db2 Warehouse, as well.

Metrics


The metrics for these steps were collected on a local computer using an IBM Cloud free (Lite) account. The upload and download speeds vary depending on network bandwidth. The metrics are only meant to illustrate how quickly a serverless data pipeline can be developed; they should not be used to benchmark the services.

Sign up for a free IBM Cloud account and try out the above example.

If you have feedback, suggestions or questions about this post, please reach out to us on LinkedIn (@santhosh or @christian).
