Analyze LogDNA Log Data on IBM Cloud Object Storage Using IBM Cloud SQL Query

4 min read

Learn how to revive your archives, allowing you to work on log data from IBM Log Analysis with LogDNA using IBM Cloud SQL Query.

IBM Cloud Log Analysis with LogDNA is an easy-to-use service for analyzing logs, alerting on them, or debugging applications. Typically, logging services keep data for only a relatively short amount of time—30 days or less—in an active state of interactive querying and ad-hoc exploration. That's because logs are constant streams of data—a firehouse of bits and bytes. Unless that data times out, you are going to have either operational issues or a sky-high bill. 

That is why Log Analysis with LogDNA provides an archival functionality, allowing users to have highly performant and active indexing of their log data, while enabling them, over time, to offload that data to IBM Cloud Object Storage to meet compliance/regulatory requirements.

But what if archived logs weren't just for evidence? What if they were easily accessible data streams that could help drive insights and improvements for application operations, performance, and resiliency? What if you could perform complex timeseries queries on these data streams? With IBM Cloud SQL Query, we can help you turn your archived logs into insights with simple, serverless SQL. 

Read more below on how to revive your archives, allowing you to work on log data from IBM Log Analysis with LogDNA using IBM Cloud SQL Query. 

Archive log data on IBM Cloud Object Storage

IBM Log Analysis with LogDNA keeps your log data for a maximum of 30 days, and if you like to keep your log data longer, LogDNA provides a feature to archive them on IBM Cloud Object Storage. To enable this feature, click on the settings icon within your LogDNA instance interface and fill out the required settings:

To enable this feature, click on the settings icon within your LogDNA instance interface and fill out the required settings:

Ensure that you have created an IBM Cloud Object Storage instance together with added service credentials beforehand. Within the service credentials, you find the required information, such as API Key, Endpoint, and the Instance ID. Add that information and the created bucket name into the LogDNA archiving settings fields and click save to enable archiving. 

After a maximum of two days, you should see an exported log file stored on IBM Cloud Object Storage as Gzip JSON:

After a maximum of two days, you should see an exported log file stored on IBM Cloud Object Storage as Gzip JSON:

Use IBM SQL Query to work on archived log data

A log file gets stored every day, including all data collected within the last 24 hours. In case your LogDNA instance retrieves a large amount of data, an hourly log data archiving could be enabled. The problem with that option is that a small amount of rows could be stored in more than one zip file. You can find more details in the LogDNA documentation

Use IBM SQL Query to work on this archived log data efficiently in the IBM Cloud. An easy way to start is to create an IBM Cloud SQL Query Lite instance. It limits you to the ability to run just one query at a time and to scan 30 GB maximum per day. When using a Standard plan, you can run up to five queries in parallel and just pay for the data scanned at $5 per TB/Scanned. 

We highly recommend first transforming the json.gz objects to partitioned JSON objects because zipped files cannot be accessed in parallel and, therefore, reading that kind of data is non-performant. You may wonder why compressed files like Parquet are not even better. This would be correct from a performance point of view, but as the schema data of log files is typically not stable, this could lead to problems reading data from different days. When using JSON, the underlying engine of SQL Query (Apache Spark) always analyzes the used schema and therefore avoids such consistency problems. 

In case you are quite sure that your log schema data is stable, Parquet as an output format would be more efficient both by increasing performance and lowering amount of data scanned. 

Here is an example of an SQL Query statement to transform your zipped logDNA data into partitioned JSON objects:

SELECT 
     *, 
     date_format(from_unixtime(_source._ts / 1000, 'yyyy-MM-dd HH:mm:ss'), 'yyyy') AS _year, 
     dayofyear(from_unixtime(_source._ts / 1000, 'yyyy-MM-dd HH:mm:ss')) AS _dayofyear,         
     date_format(from_unixtime(_source._ts / 1000, 'yyyy-MM-dd HH:mm:ss'), 'HH') AS _hour 
FROM 
     cos://us-south/sql-query-logdna-dev/a591844d24.2019-07-17.72.json.gz STORED AS JSON 
INTO 
     cos://us-south/result/logDNA STORED AS JSON PARTITIONED BY (_year, _dayofyear, _hour)

In the statement, the timestamp information—_ts—of each row is used to generate the year, day of year, and hour information. The result of the above query stored in IBM Cloud Object Storage reflects the specified partitioning as Hive style partitioning. For example, you could see _hour=00 in the object name, which means that all log data belonging to that hour is included in that object. This conversion step can be automated using cloud functions. The details can be found in our data-lake Github repository.

After a maximum of two days, you should see an exported log file stored on IBM Cloud Object Storage as Gzip JSON:

After the above preparation steps, you can start to run SQL on the log data. A simple example is to find all ERROR log rows in the timeframe between 7 am and 8 am on July 17th. 

Only the required partitions are scanned in the following statement:

SELECT _source.data
FROM cos://us-south/result/logDNA/ STORED AS JSON
where _dayofyear=198 and _hour=07 and _source.level = "ERROR"
LIMIT 50
INTO cos://us-south/result/errorextract STORED AS JSON

SQL Query also provides native timeseries constructs for performing complex timeseries functions like segmentation, temporal statistics, temporal joins, interpolation, and forecasting. A simple relevant query in this context is to identify hourly averages. This is illustrated in the below statement (that takes as input a timeseries view on the log data; you can find further details on this in the SQL Query documentation):

SELECT node, status,
     TS_SEG_AVG(
          TS_SEGMENT_BY_TIME(ts_index(ts, 1), 3600*1000, 3600*1000))
FROM da_ts_table

This statement could easily be changed to apply very different transformations. For instance, getting the average request time when a status code 500 occurs:

TS_SEG_AVG(
     TS_SEGMENT_BY_ANCHOR(ts, TS_ANCHOR_INDEX(TS_ANCHOR_GTE(500.0), 0), 1800*1000, 1800*1000))

Now you have the base foundation and pipeline for performing highly scalable analytics on your log data with simple SQL. If you want to take this example to the next level, feel free to explore our Jupyter Notebook

Learn more

 

Be the first to hear about news, product updates, and innovation from IBM Cloud