October 23, 2020 | By Torsten Steinbach | 2 min read

A critical factor for smart business decisions is learning from the behaviour of your applications and users.

The information that fuels such learning is available in the logs generated by your solution stack. Too often, however, it is still hard to consume these logs with analytic frameworks and algorithms.

In IBM Cloud, we have established a pattern of using serverless SQL jobs to process and analyze log data that has been archived to cloud object storage. IBM Log Analysis with LogDNA is the standard logging service of IBM Cloud, and it supports archiving to object storage out of the box. The auditing service, IBM Cloud Activity Tracker, is built on top of LogDNA and lets you archive and analyze audit records in the same way.
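
To give a concrete sense of this pattern, here is a minimal sketch of a serverless SQL query over an archived log bucket, using the ibmcloudsql Python client. The API key, instance CRN, bucket URLs, and the JSON field name are placeholders that depend on your own setup and archive schema; they are assumptions for illustration, not values from this article.

    # Minimal sketch: query archived logs in Cloud Object Storage with
    # the serverless SQL service via the ibmcloudsql Python client.
    # Angle-bracket values, bucket URLs, and the 'app' field are placeholders.
    from ibmcloudsql import SQLQuery

    sql_client = SQLQuery(
        api_key="<IBM_CLOUD_API_KEY>",
        instance_crn="<SQL_QUERY_INSTANCE_CRN>",
        target_cos_url="cos://us-geo/my-results-bucket/",  # where query results are written
    )
    sql_client.logon()

    # Count archived log lines per application (assumes an 'app' field in the JSON records)
    result_df = sql_client.run_sql(
        "SELECT app, COUNT(*) AS line_count "
        "FROM cos://us-geo/my-log-archive-bucket/2020/10/ STORED AS JSON "
        "GROUP BY app ORDER BY line_count DESC"
    )
    print(result_df.head())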

A frequent hurdle in this process is that log archive files can be very large, which makes processing and analytics with the SQL service inefficient and slow. The problem is amplified by the fact that log archives are often stored with gzip compression. Because gzip is a non-splittable compression codec, the Spark-based, scale-out serverless SQL service in IBM Cloud cannot read a large archive file in parallel and forfeits much of its performance benefit.

A new solution: IBM Cloud Code Engine

To overcome this hurdle, you can use another new serverless runtime in IBM Cloud: IBM Cloud Code Engine, which was recently launched and is currently available to everyone in open beta. It provides a flexible way to run any code in a serverless fashion, either directly from source or from Docker images that you have prepared.

We have just published a Docker image that splits and recompresses large log archives into a splittable compression codec. You can use it as is and run it as a serverless batch job in IBM Cloud Code Engine. The image source and a detailed description of how to deploy and run it with Code Engine can be found here.
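
To illustrate what such a job does, here is a simplified, local-file sketch in Python: it reads one large gzip archive line by line and rewrites it as several bzip2 parts, because bzip2 is a splittable codec that Spark can read in parallel. This is not the source of the published image; the file names and chunk size are arbitrary, and the real job reads from and writes to object storage rather than local files.

    # Simplified sketch: split a large gzip log archive into several
    # bzip2-compressed parts so a Spark-based engine can read them in parallel.
    import bz2
    import gzip

    LINES_PER_PART = 1_000_000  # arbitrary; tune so each part stays reasonably small

    def split_archive(input_path: str, output_prefix: str) -> None:
        part, lines, out = 0, 0, None
        with gzip.open(input_path, "rt", encoding="utf-8") as src:
            for line in src:
                # Roll over to a new bzip2 part when the current one is full
                if out is None or lines >= LINES_PER_PART:
                    if out is not None:
                        out.close()
                    out = bz2.open(f"{output_prefix}-part-{part:04d}.json.bz2", "wt", encoding="utf-8")
                    part, lines = part + 1, 0
                out.write(line)
                lines += 1
        if out is not None:
            out.close()

    # Example usage with hypothetical file names:
    split_archive("logs-2020-10-23.json.gz", "logs-2020-10-23")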

Step-by-step instructions

The basic steps are quite straightforward:

  1. Create a Code Engine project.
  2. Create a batch job definition referencing the Docker image.
  3. Set the object storage bucket, object name, and credentials for both the large input log archive and the split output (see the configuration sketch after these steps).
  4. Submit the job.
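
As a rough illustration of step 3, a Code Engine batch job receives its configuration through environment variables that you define on the job or override when you submit it. The variable names and the endpoint below are hypothetical placeholders for illustration, not the interface expected by the published image.

    # Hypothetical configuration handling inside the batch job container.
    # The environment variable names are placeholders, not the image's actual interface.
    import os

    config = {
        "input_bucket":  os.environ["INPUT_BUCKET"],
        "input_object":  os.environ["INPUT_OBJECT"],
        "output_bucket": os.environ["OUTPUT_BUCKET"],
        "cos_api_key":   os.environ["COS_API_KEY"],
        # Example of a private Cloud Object Storage endpoint (region-specific in practice)
        "cos_endpoint":  os.environ.get("COS_ENDPOINT", "s3.private.us.cloud-object-storage.appdomain.cloud"),
    }
    print(f"Splitting {config['input_object']} from {config['input_bucket']} into {config['output_bucket']}")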

Because the split job runs serverlessly inside the cloud, it is close to the data in object storage and can use private endpoints to read and write. This way, the entire process (read, decompress, split, compress, write) completes in close to a minute for a 1 GB compressed log archive.

This efficient, serverless splitting of log archives paves the way for a fully serverless log processing and analytics pipeline using SQL: archived logs land in object storage, a Code Engine batch job splits and recompresses them, and the serverless SQL service analyzes the split output in place.
