Data Scraping Made Easy Thanks to IBM Code Engine Jobs

How to use IBM Cloud Code Engine with cron-scheduled jobs to build your data lake.

Recently, I began a new side project. It involves mobility data and its analysis and visualization — consider it a data science project. After identifying the right data providers and open data APIs, the next step was to build the data lake. That requires regularly downloading data from various sources and then uploading it to a Cloud Object Storage/S3 bucket.

Traditionally, I would have to set up a virtual machine to run the scheduled data scraping jobs. Thanks to serverless compute offerings like IBM Cloud Code Engine, I can cut costs and environmental impact. My scripts are still run based on a cron-controlled schedule, but I only use compute resources for a few seconds per hour. Thus, I pay only a fraction of the earlier costs. All data is uploaded to Cloud Object Storage (COS). From there, it can be easily accessed by data science projects and notebooks hosted in IBM Watson Studio or queried by the SQL Query service (see diagram below).

In the following post, I provide an overview of the project and its components. Thereafter, I discuss technical details like the Dockerfile used to containerize my script:

My script runs in Code Engine as a job, based on a cron schedule.

Combining Code Engine, cron and IBM Cloud Object Storage

Instead of utilizing a virtual machine, the scripts for data scraping/retrieving data from open data APIs are deployed to IBM Cloud Code Engine. Code Engine is a fully managed, serverless platform for containerized workloads. That means my scripts need to run within containers. Code Engine distinguishes between applications and jobs. Applications (apps) serve HTTP requests, whereas jobs run one time and then exit — kind of a batch job. This means that a Code Engine job is a good fit for retrieving data and uploading it to storage.

To run the job regularly, I can configure a periodic timer (cron) as an event producer that triggers the job run. The job is the scripting, which contacts APIs or websites to retrieve data, possibly postprocesses it, and then uploads the data to a Cloud Object Storage bucket (see diagram above). There, the data can later be accessed by SQL Query or by scripts in a notebook of an IBM Watson Studio analytics project. Accessing the data through other means (e.g., from other apps or scripts inside or outside IBM Cloud) is possible, too.
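Such a cron event producer can be set up with the Code Engine CLI plugin. As a sketch — the subscription and job names below are hypothetical, and the job is assumed to already exist:

```shell
# Create a cron (ping) subscription that triggers the job every two hours.
# "scraper-schedule" and "scraper-job" are placeholder names.
ibmcloud ce subscription cron create --name scraper-schedule \
  --destination-type job --destination scraper-job \
  --schedule '0 */2 * * *'
```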

Technical details

Independent of where the scraper script runs and which source site it targets, the structure is always the same (see script below). We first determine the name of the data file, which ideally includes the current date and time. Thereafter, we retrieve the data and store the result in a file. The data retrieval might require query parameters and an API key. I usually compress the data with gzip before storing it. There are different ways of uploading a file to COS. An easy approach is to utilize the IBM Cloud CLI plugin and its object-put command. It requires you to be logged in to IBM Cloud (using an API key) with a region and a resource group set:

#!/bin/bash
set -ex
# use current date and time for file name
DATE=`date "+%Y%m%d_%H%M"`

# retrieve the data and store it to a file
curl -s -X GET "https://data-platform.example.com/v1/someAPI?${MY_PARAMETERS}" -H "x-api-key: ${MY_DATA_API_KEY}" > ${DATE}.json

# compress the file
gzip ${DATE}.json

# IBM Cloud login and data upload to COS
IBMCLOUD_API_KEY=${IBMCLOUD_APIKEY} ibmcloud login -g default -r us-south
ibmcloud cos object-put --bucket scooter --key ${DATE}.json.gz --body ${DATE}.json.gz

Script to fetch data from an API and upload it to Cloud Object Storage as part of the data lake.

In order to run the above script in a container, we need the IBM Cloud CLI environment. Thus, our Dockerfile is mainly composed of a chain of commands to update the base operating system and then to install the IBM Cloud CLI and the COS plugin. Thereafter, it copies over the above script, which is also run by default:

# Small base image
FROM alpine
# Upgrade the OS, install some common tools and then
# the IBM Cloud CLI and Cloud Object Storage plugin
RUN apk update && apk upgrade && apk add bash curl jq git ncurses && \
    curl -fsSL https://clis.cloud.ibm.com/install/linux | bash && \
    ln -s /usr/local/bin/ibmcloud /usr/local/bin/ic && \
    ibmcloud plugin install cloud-object-storage

# Copy over the script and make sure it is executable
COPY script.sh /script.sh
RUN chmod +x /script.sh
WORKDIR /app
ENTRYPOINT [ "/script.sh" ]

Dockerfile.

With those definitions in place, everything is ready to build the container image and push it to the IBM Cloud Container Registry. Then, it is straightforward to create a job based on it and schedule job runs. For the job, I created environment variables to pass in the IBM Cloud API key and the necessary parameters and key for the data retrieval.
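The steps above could look roughly as follows with the Docker and Code Engine CLIs; the registry namespace, image name, and job name are hypothetical placeholders, and the environment variable values are deliberately elided:

```shell
# Build the container image and push it to IBM Cloud Container Registry
# ("us.icr.io/my-namespace" is a placeholder registry namespace).
docker build -t us.icr.io/my-namespace/scraper:latest .
docker push us.icr.io/my-namespace/scraper:latest

# Create the job, passing the API keys and parameters
# as environment variables.
ibmcloud ce job create --name scraper-job \
  --image us.icr.io/my-namespace/scraper:latest \
  --env IBMCLOUD_APIKEY=... \
  --env MY_DATA_API_KEY=... \
  --env MY_PARAMETERS=...
```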

Once everything was in place, I used the IBM Cloud CLI with the COS plugin to check that the automatic, serverless data retrieval was working as expected. The files use a UTC timestamp and have been uploaded every two hours, as configured in my test case:
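Listing the bucket contents can be done with the same COS plugin, for example:

```shell
# List the uploaded objects in the bucket used by the scraper script.
ibmcloud cos objects --bucket scooter
```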

Listing uploaded data in a Cloud Object Storage bucket.

Conclusions

Using a serverless container platform like IBM Cloud Code Engine, it is possible to easily set up data scraping and retrieval jobs for building up a data lake. By avoiding "always-on" virtual machines and only using computing power when needed, unnecessary resource consumption and costs are avoided. From my experience, it is also easier to set up and it reduces maintenance and security-related work.

If you are interested in learning more about Code Engine, I recommend the following tutorials and blogs:

If you have feedback, suggestions, or questions about this post, please reach out to me on Twitter (@data_henrik) or LinkedIn.
