Automate Serverless Data Pipelines for Your Data Warehouse or Data Lakes

4 min read

By: Josh Mintz and Torsten Steinbach

IBM Cloud SQL Query now supports automating the transport and transformation of data from IBM Cloud Object Storage to Db2 on Cloud.

Building with flexible, scalable, cloud native services is fun, right? You know what isn’t fun though? Capturing, storing, filtering, and analyzing all the terabytes of data produced by your applications round the clock, such as audit trails, clickstream logs, or IoT sensor streams.

The lack of a cost-efficient architecture to extract value from your data can offset the beneficial economics of the cloud and leave game-changing insights on the table. That’s why here at IBM Cloud, we want to enable architectures that give you precise control over your storage costs and analytical capabilities. 

New capabilities for IBM Cloud SQL Query

IBM Cloud SQL Query offers users the power of serverless SQL over data stored in IBM Cloud Object Storage. With this latest release, we have added new capabilities to SQL Query that make it easier than ever to build extensible data pipelines to stem the rising tide of data.

Data professionals can now take advantage of the power and scalability of Apache Spark-based SQL processing and native time series and location data processing in SQL Query to run highly parallelized data pipelines into their IBM Cloud data services.

Data pipelines, in general, reflect an ETL (Extract, Transform, Load) process that allows you to obtain more value from your data across multiple sources via data extraction and data transformation to downstream cloud services.

IBM Cloud SQL Query now automates the pipelining of data from IBM Cloud Object Storage into IBM Cloud databases such as IBM Db2 Warehouse on Cloud or Db2 on Cloud. Because Db2's native options for loading data from object storage are constrained (only the CSV format is supported), SQL Query's latest release opens up a broad range of data formats stored in IBM Cloud Object Storage, including Parquet, to be loaded seamlessly into Db2.
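As a sketch, such a pipeline can be expressed as a single SQL Query statement that reads Parquet objects from Cloud Object Storage and writes the result into a Db2 table. This is illustrative only: the bucket name, columns, Db2 instance CRN, schema, and table below are placeholders, and the exact target URI form may vary, so consult the SQL Query documentation for your instance:

```sql
-- Illustrative sketch: all object names, columns, and the CRN are placeholders.
-- Read Parquet objects from Cloud Object Storage, filter them,
-- and load the result directly into a Db2 on Cloud table.
SELECT order_id, customer_id, order_ts, total_amount
  FROM cos://us-geo/my-landing-bucket/orders/ STORED AS PARQUET
 WHERE order_ts >= '2020-01-01'
  INTO crn:v1:bluemix:public:dashdb-for-transactions:us-south:a/<account-id>:<instance-id>::/MYSCHEMA.ORDERS
```

Because SQL Query is serverless, submitting a statement like this (interactively or via the REST API) runs the Spark-based execution on demand; there is no cluster to provision or manage.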

Serverless analytics and data transformation, such as those provided by SQL Query, are foundational capabilities of a cloud native data lake. Data warehousing is an established solution for analytics on big data; by connecting your Db2 Warehouse or Db2 on Cloud deployment to your IBM Cloud Object Storage-based data lake, you gain more precise control over data quality and lifecycle at orders-of-magnitude lower cost. For example, you can archive petabytes of data into IBM Cloud Object Storage and keep only the most relevant subset of data in a high-performing database like Db2 Warehouse on Cloud.

Pipedreams or pipelines?

Data pipelines are the lifeline of analytics. The more versatile and automated they are, the more easily you can on-ramp the necessary data to your analytic tools and services. With data pipelines, you can connect your data silos to analytics.

All too often, data acquisition and preparation are still a manual data-wrangling process as teams attempt to break down those silos. With a powerful serverless and SQL-based data pipeline, data extraction and data transformation processes can now be easily created, automated, and scaled to your daily data volumes.

Use case: Application log processing

In practice, many customers struggle with archiving and analyzing access and audit logs over the lifetime of compliance-mandated retention windows. To improve accessibility and lower costs, one might turn to a serverless data lake. Here is how it would look.

Most services on the IBM Cloud platform, like IBM Cloud Foundry Enterprise Edition, IBM Cloud Kubernetes Service, and IBM Cloudant, already integrate automatically with Activity Tracker with LogDNA. This audit logging solution can automatically archive data into IBM Cloud Object Storage. Once it lands there, SQL Query can transform it into a more query-friendly format like Apache Parquet and enrich it with other data sources, such as your database logs from IBM Cloudant or third-party data sources. Once your audit data is prepared and enriched, it is available for performant ad-hoc analytics, and SQL Query can also move it into a more performant warehousing service, such as Db2 Warehouse on Cloud, where it can then be visualized with any JDBC/ODBC-capable tool, like Cognos Analytics.
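The archive-to-Parquet step above can be a one-statement conversion job. Again a hedged sketch: the bucket names and prefixes are placeholders for your LogDNA archive location and a curated target prefix of your choosing:

```sql
-- Illustrative sketch: bucket names and prefixes are placeholders.
-- Convert archived JSON audit events into Parquet so that
-- downstream ad-hoc queries scan far less data.
SELECT *
  FROM cos://us-geo/activity-tracker-archive/ STORED AS JSON
  INTO cos://us-geo/audit-curated/events/ STORED AS PARQUET
```

The same pattern extends to enrichment: join the archived events with another Cloud Object Storage data set in the SELECT before writing the curated Parquet output.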

This ETL flow is an example of an effective pipeline architecture for teams building serverless security intelligence stacks with IBM Cloud SQL Query, Cloud Object Storage, and Db2 Warehouse.

Get the tutorial.

Opening the aperture: Advantages of the cloud

These new SQL-driven ETL capabilities make it easier for data professionals to acquire and prepare new data sources in the cloud (such as machine-generated or telemetry data) and to automatically load the data into Db2 to augment and extend their Db2-based analytic solutions.

This provides fully elastic and automated data on-ramp pipelines, from the durability and cost-effectiveness of Object Storage to the performance of the Db2 ecosystem. In practice, users can combine the elasticity and cost benefits of data acquisition and preparation in object storage with the quality of service of enterprise analytics.

Get started with IBM Cloud SQL Query

Learn more about Data Lakes in the IBM Cloud
