A critical factor for smart business decisions is learning from the behaviour of your applications and users.
The information that fuels such learning is available in the logs generated by your solution stack. All too often, however, it is still hard to consume these logs with analytic frameworks and algorithms.
In IBM Cloud, we have established a pattern of using serverless SQL jobs to process and analyze log data that has been archived to cloud object storage. IBM Log Analysis with LogDNA is the standard logging service of IBM Cloud, and it supports archiving to object storage out of the box. The auditing service, Cloud Activity Tracker, is built on top of LogDNA and lets you archive and analyze auditing records in the same way.
A frequent hurdle in this process is that log archive files are sometimes very large, which makes processing and analytics with the SQL service inefficient and slow. This is further amplified by the fact that log archives are often stored with gzip compression. The problem here is that gzip is a non-splittable compression codec, so the Spark-based scale-out serverless SQL service in IBM Cloud cannot read a large log archive file in parallel and forfeits much of its performance benefit.
A new solution: IBM Cloud Code Engine
To overcome this hurdle, you can use another brand-new serverless runtime in IBM Cloud: IBM Cloud Code Engine, which was recently launched and is currently available to everyone in open beta. It provides a very flexible way to run any code in a serverless fashion, either directly from source or from Docker images that you have prepared.
We have just published a Docker image that splits and recompresses large log archives into a splittable compression codec. You can use it as is and run it as a serverless batch job in IBM Cloud Code Engine. The image source and a detailed description of how to deploy and run it with Code Engine can be found here.
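To make the approach concrete, here is a minimal Python sketch of the core idea: stream the large gzip archive and rewrite it as several smaller, independently compressed parts that a Spark-based engine can then read in parallel. This is not the code of the published image, and the file names and part size are purely illustrative; in the actual batch job, the input and output objects live in IBM Cloud Object Storage rather than on the local file system.

```python
# Minimal sketch: split one large gzip log archive into several smaller,
# independently readable gzip parts. Each part is a complete gzip stream,
# so a scale-out engine can process the parts in parallel.
import gzip

INPUT = "logs-archive.json.gz"   # hypothetical input file name
LINES_PER_PART = 1_000_000       # tune so each part lands at a convenient size

def split_archive(input_path: str, lines_per_part: int) -> None:
    part, lines_written, out = 0, 0, None
    with gzip.open(input_path, "rt", encoding="utf-8") as src:
        for line in src:
            if out is None:
                # Start a new, independently compressed output part.
                out = gzip.open(f"part-{part:05d}.json.gz", "wt", encoding="utf-8")
            out.write(line)
            lines_written += 1
            if lines_written >= lines_per_part:
                out.close()
                part, lines_written, out = part + 1, 0, None
    if out is not None:
        out.close()

split_archive(INPUT, LINES_PER_PART)
```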
Step-by-step instructions
The basic steps are quite straightforward:
- Create a Code Engine project.
- Create a batch job definition referencing the Docker image.
- Set the object storage bucket, object name, and credentials for both the large input log archive and the split output (see the configuration sketch after this list).
- Submit the job.
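Inside the container, a batch job typically receives this configuration through environment variables set on the job definition. The following Python sketch shows how such a job could validate its settings at startup; the variable names are hypothetical and only for illustration, so use whatever names the published image documents.

```python
# Hypothetical sketch: read the job configuration from environment variables
# defined on the Code Engine job. Variable names are illustrative only.
import os
import sys

REQUIRED = [
    "INPUT_BUCKET",    # bucket holding the large gzip log archive
    "INPUT_OBJECT",    # object name of the archive to split
    "OUTPUT_BUCKET",   # bucket that receives the split parts
    "COS_API_KEY",     # credential used to read and write object storage
    "COS_ENDPOINT",    # regional (ideally private) object storage endpoint
]

config = {name: os.environ.get(name) for name in REQUIRED}
missing = [name for name, value in config.items() if not value]
if missing:
    sys.exit(f"Missing required environment variables: {', '.join(missing)}")

print(f"Splitting {config['INPUT_OBJECT']} from {config['INPUT_BUCKET']} "
      f"into {config['OUTPUT_BUCKET']} via {config['COS_ENDPOINT']}")
# ...from here, an object storage client would stream the archive,
# split it, and upload the recompressed parts.
```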
Because the split job runs serverless directly inside the cloud, it is close to the data in object storage and can use private endpoints to read and write. This way, the entire process (read, decompress, split, compress, write) completes in close to a minute for a 1 GB compressed log archive.
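For completeness, here is a small sketch of how the job could select a private rather than a public object storage endpoint. The host name pattern is an assumption based on the common regional endpoint scheme for IBM Cloud Object Storage, so verify it against the endpoints listed for your bucket.

```python
# Sketch: choose a private vs. public regional endpoint for IBM Cloud Object
# Storage. The host name pattern is assumed; confirm it against the endpoints
# shown for your bucket before relying on it.
def cos_endpoint(region: str, private: bool = True) -> str:
    host = (f"s3.private.{region}.cloud-object-storage.appdomain.cloud"
            if private
            else f"s3.{region}.cloud-object-storage.appdomain.cloud")
    return f"https://{host}"

print(cos_endpoint("us-south"))         # private endpoint, traffic stays off the public network
print(cos_endpoint("us-south", False))  # public endpoint
```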
This efficient, serverless splitting of log archives paves the way for a fully serverless log processing and analytics pipeline using SQL. The following illustrates the entire serverless log processing pipeline: