October 27, 2021 By Henrik Loeser 4 min read

How to use IBM Cloud Code Engine with cron-scheduled jobs to build your data lake.

Recently, I began a new side project. It involves mobility data and its analysis and visualization — consider it a data science project. After identifying the right data providers and open data APIs, the next step is to build the data lake: regularly download data from various sources, then upload it to a Cloud Object Storage/S3 bucket.

Traditionally, I would have to set up a virtual machine to run the scheduled data scraping jobs. Thanks to serverless compute offerings like IBM Cloud Code Engine, I can cut costs and environmental impact. My scripts are still run based on a cron-controlled schedule, but I only use compute resources for a few seconds per hour. Thus, I pay only a fraction of the earlier costs. All data is uploaded to Cloud Object Storage (COS). From there, it can be easily accessed by data science projects and notebooks hosted in IBM Watson Studio or queried by the SQL Query service (see diagram below).

In the following post, I provide an overview of the project and its components. Thereafter, I discuss technical details like the Dockerfile to containerize my scripting:

My script runs in Code Engine as a job, based on a cron schedule.

Combining Code Engine, cron and IBM Cloud Object Storage

Instead of utilizing a virtual machine, the scripts for data scraping/retrieving data from open data APIs are deployed to IBM Cloud Code Engine. Code Engine is a fully managed, serverless platform for containerized workloads. That means my scripts need to run within containers. Code Engine distinguishes between applications and jobs. Applications (apps) serve HTTP requests, whereas jobs run one time and then exit — kind of a batch job. This means that a Code Engine job is a good fit for retrieving data and uploading it to storage.
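Assuming the container image is already built and pushed, registering such a job could look like the following sketch. The project, job, and image names are placeholders I made up for illustration:

```shell
# Hypothetical sketch: register a containerized scraper script as a
# Code Engine job. Project, job, and image names are placeholders.
ibmcloud ce project select --name my-scraper-project

# Create the job from an image in the IBM Cloud Container Registry;
# small CPU/memory values keep the per-run cost low
ibmcloud ce job create --name scraper-job \
  --image us.icr.io/my-namespace/scraper:latest \
  --cpu 0.25 --memory 0.5G
```

A single run can then be triggered manually with `ibmcloud ce jobrun submit --job scraper-job` to verify the job works before scheduling it.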

To regularly run the job, I can configure a periodic timer (cron) as event producer that triggers the job run. The job is the scripting, which contacts APIs or websites to retrieve data, possibly postprocesses it and then uploads the data to a Cloud Object Storage bucket (see diagram above). There, the data can later be accessed by SQL Query or by scripts in a notebook of an IBM Watson Studio analytics project. Accessing the data through other means (e.g., from other apps or scripts inside or outside IBM Cloud) is possible, too.
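Wiring the timer to the job is done with a cron subscription. A sketch, assuming a job named scraper-job and an every-two-hours schedule:

```shell
# Hypothetical sketch: create a cron (periodic timer) event producer
# that triggers a run of the job every two hours, on the hour.
# The subscription and job names are placeholders.
ibmcloud ce sub cron create --name scraper-timer \
  --destination-type job \
  --destination scraper-job \
  --schedule '0 */2 * * *'
```

The schedule uses standard five-field cron syntax (minute, hour, day of month, month, day of week), so existing cron expressions carry over unchanged.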

Technical details

Independent of where the scraper script is run and for which source site, the structure is always the same (see script below). We need to determine the name of the data file, which ideally includes the current date and time. Thereafter, we can retrieve the data and store the result in a file. The data retrieval might require parameters and providing an API key. I usually compress data with gzip before storing it. There are different ways of uploading a file to COS. An easy approach is to utilize the IBM Cloud CLI plugin and its object-put command. It requires you to be logged in to IBM Cloud (using an API key) with a region and a resource group set:

#!/bin/bash
set -ex
# use current date and time for the file name
DATE=`date "+%Y%m%d_%H%M"`

# retrieve the data and store it in a file
curl -s -X GET "https://data-platform.example.com/v1/someAPI?${MY_PARAMETERS}" -H "x-api-key: ${MY_DATA_API_KEY}" > ${DATE}.json

# compress the file
gzip ${DATE}.json

# IBM Cloud login and data upload to COS
IBMCLOUD_API_KEY=${IBMCLOUD_APIKEY} ibmcloud login -g default -r us-south
ibmcloud cos object-put --bucket scooter --key ${DATE}.json.gz --body ${DATE}.json.gz

Script to fetch data from an API and upload it to Cloud Object Storage as a data lake.

In order to run the above script in a container, we need the IBM Cloud CLI environment. Thus, our Dockerfile is mainly composed of a chain of commands to update the base operating system and then to install the IBM Cloud CLI and the COS plugin. Thereafter, it copies over the above script, which is also run by default:

# Small base image
FROM alpine
# Upgrade the OS, install some common tools and then
# the IBM Cloud CLI and Cloud Object Storage plugin
RUN apk update && apk upgrade && apk add bash curl jq git ncurses && \
    curl -fsSL https://clis.cloud.ibm.com/install/linux | bash && \
    ln -s /usr/local/bin/ibmcloud /usr/local/bin/ic && \
    ibmcloud plugin install cloud-object-storage

COPY script.sh /script.sh
ENTRYPOINT [ "/script.sh" ]


With those definitions in place, everything is ready to build the container image and push it to the IBM Cloud Container Registry. Then, it is straightforward to create a job based on it and schedule job runs. For the job, I created environment variables to pass in the IBM Cloud API key and the necessary parameters and key for the data retrieval.
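The steps above could be sketched as follows. The namespace, image, secret, and variable names are placeholders, and storing the keys in a Code Engine secret is one of several ways to pass them to the job:

```shell
# Hypothetical sketch: build the image, push it to the IBM Cloud
# Container Registry, and pass the API keys to the job via a secret.
docker build -t us.icr.io/my-namespace/scraper:latest .
ibmcloud cr login
docker push us.icr.io/my-namespace/scraper:latest

# Store the keys as a Code Engine secret and expose them
# to the job as environment variables
ibmcloud ce secret create --name scraper-keys \
  --from-literal IBMCLOUD_APIKEY=... \
  --from-literal MY_DATA_API_KEY=...
ibmcloud ce job update --name scraper-job --env-from-secret scraper-keys
```

Keeping the keys in a secret rather than as plain environment variables means they are not exposed in the job configuration and can be rotated without redeploying the image.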

Once everything was in place, I used the IBM Cloud CLI with the COS plugin to check that the automatic, serverless data retrieval was working as expected. The files use a UTC timestamp and were uploaded every two hours — as configured in my test case:

Listing uploaded data in a Cloud Object Storage bucket.
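The check shown in the screenshot can be reproduced on the command line. The bucket name is taken from the script above; the object key is a made-up example that follows the script's naming scheme:

```shell
# List the uploaded objects to verify the two-hour schedule is working
ibmcloud cos list-objects --bucket scooter

# Spot-check a single file by downloading and decompressing it
# (the object key is a hypothetical example)
ibmcloud cos object-get --bucket scooter \
  --key 20211027_1400.json.gz 20211027_1400.json.gz
gzip -d 20211027_1400.json.gz
```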


Using a serverless container platform like IBM Cloud Code Engine, it is possible to easily set up data scraping and retrieval jobs for building up a data lake. By avoiding "always-on" virtual machines and only using compute power when needed, unnecessary resource consumption and costs are avoided. In my experience, it is also easier to set up and reduces maintenance and security-related work.

If you are interested in learning more about Code Engine, I recommend the following tutorials and blogs:

If you have feedback, suggestions, or questions about this post, please reach out to me on Twitter (@data_henrik) or LinkedIn.
