Learn how to revive your archives, allowing you to work on log data from IBM Log Analysis with LogDNA using IBM Cloud SQL Query.

IBM Log Analysis with LogDNA is an easy-to-use service for analyzing logs, alerting on them, and debugging applications. Typically, logging services keep data available for interactive querying and ad-hoc exploration for only a relatively short amount of time (30 days or less). That's because logs are a constant stream of data, a firehose of bits and bytes. Unless that data ages out, you are going to have either operational issues or a sky-high bill.

That is why Log Analysis with LogDNA provides archiving functionality, allowing users to keep highly performant, active indexing of their log data while, over time, offloading that data to IBM Cloud Object Storage to meet compliance and regulatory requirements.

But what if archived logs weren’t just for evidence? What if they were easily accessible data streams that could help drive insights and improvements for application operations, performance, and resiliency? What if you could perform complex timeseries queries on these data streams? With IBM Cloud SQL Query, we can help you turn your archived logs into insights with simple, serverless SQL. 

Read more below on how to revive your archives, allowing you to work on log data from IBM Log Analysis with LogDNA using IBM Cloud SQL Query. 

Archive log data on IBM Cloud Object Storage

IBM Log Analysis with LogDNA keeps your log data for a maximum of 30 days. If you would like to keep your log data longer, LogDNA provides a feature to archive it to IBM Cloud Object Storage. To enable this feature, click the settings icon within your LogDNA instance interface and fill out the required settings:


Ensure that you have created an IBM Cloud Object Storage instance with service credentials beforehand. Within the service credentials, you will find the required information, such as the API key, endpoint, and instance ID. Add that information, along with the name of the bucket you created, to the LogDNA archiving settings fields and click Save to enable archiving.

After a maximum of two days, you should see an exported log file stored in IBM Cloud Object Storage as gzipped JSON (json.gz):


Use IBM SQL Query to work on archived log data

A log file gets stored every day, containing all data collected within the last 24 hours. If your LogDNA instance receives a large amount of data, hourly log data archiving can be enabled instead. The drawback of that option is that a small number of rows could end up spread across more than one zipped file. You can find more details in the LogDNA documentation.
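Each archived object contains one JSON record per line. The exact fields depend on your log sources and LogDNA metadata, but based on the fields referenced by the queries later in this post (_ts, level, and data under _source), a single record looks roughly like the following illustrative sketch (the host, app, and message values are made up):

{"_source": {"_ts": 1563346800000, "level": "ERROR", "_host": "web-01", "_app": "checkout", "data": "Request failed with status 500"}}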

Use IBM SQL Query to work on this archived log data efficiently in the IBM Cloud. An easy way to start is to create an IBM Cloud SQL Query Lite instance, which limits you to one query at a time and a maximum of 30 GB scanned per day. With a Standard plan, you can run up to five queries in parallel and pay only for the data scanned, at $5 per TB scanned.

We highly recommend first transforming the json.gz objects into partitioned JSON objects because gzipped files cannot be read in parallel and, therefore, reading that kind of data is slow. You may wonder whether a columnar format like Parquet would not be even better. It would be, from a performance point of view, but because the schema of log data is typically not stable, it could lead to problems when reading data from different days. With JSON, the underlying engine of SQL Query (Apache Spark) always infers the schema on read and therefore avoids such consistency problems.

If you are sure that your log data schema is stable, Parquet as an output format would be more efficient, both increasing performance and lowering the amount of data scanned (a Parquet variant is shown after the example below).

Here is an example of an SQL Query statement to transform your zipped LogDNA data into partitioned JSON objects:

SELECT 
     *, 
     date_format(from_unixtime(_source._ts / 1000, 'yyyy-MM-dd HH:mm:ss'), 'yyyy') AS _year, 
     dayofyear(from_unixtime(_source._ts / 1000, 'yyyy-MM-dd HH:mm:ss')) AS _dayofyear,         
     date_format(from_unixtime(_source._ts / 1000, 'yyyy-MM-dd HH:mm:ss'), 'HH') AS _hour 
FROM 
     cos://us-south/sql-query-logdna-dev/a591844d24.2019-07-17.72.json.gz STORED AS JSON 
INTO 
     cos://us-south/result/logDNA STORED AS JSON PARTITIONED BY (_year, _dayofyear, _hour)

In the statement, the timestamp information (_ts) of each row is used to generate the year, day of year, and hour information. The result of the above query, stored in IBM Cloud Object Storage, reflects the specified partitioning as Hive-style partitioning. For example, you could see _hour=00 in the object name, which means that all log data belonging to that hour is included in that object. This conversion step can be automated using IBM Cloud Functions; the details can be found in our data-lake GitHub repository.
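If your log data schema is stable (as noted above), the same transformation can write Parquet simply by changing the target clause. A minimal variant of the statement above, using logDNA-parquet as a hypothetical target prefix:

INTO 
     cos://us-south/result/logDNA-parquet STORED AS PARQUET PARTITIONED BY (_year, _dayofyear, _hour)

Subsequent queries would then read from cos://us-south/result/logDNA-parquet/ STORED AS PARQUET instead of the JSON location.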


After the above preparation steps, you can start to run SQL on the log data. A simple example is to find all ERROR log rows in the timeframe between 7 am and 8 am on July 17th. 

Only the required partitions are scanned in the following statement:

SELECT _source.data
FROM cos://us-south/result/logDNA/ STORED AS JSON
WHERE _dayofyear = 198 AND _hour = 07 AND _source.level = "ERROR"
LIMIT 50
INTO cos://us-south/result/errorextract STORED AS JSON
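Because the data is partitioned by day and hour, aggregations over a whole day still scan only that day's partitions. Here is a minimal sketch that counts ERROR rows per hour for the same day (errorsperhour is a hypothetical target prefix):

SELECT _hour, COUNT(*) AS error_count
FROM cos://us-south/result/logDNA/ STORED AS JSON
WHERE _dayofyear = 198 AND _source.level = "ERROR"
GROUP BY _hour
ORDER BY _hour
INTO cos://us-south/result/errorsperhour STORED AS JSON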

SQL Query also provides native timeseries constructs for performing complex timeseries functions like segmentation, temporal statistics, temporal joins, interpolation, and forecasting. A simple, relevant query in this context is to identify hourly averages. This is illustrated in the statement below (which takes as input a timeseries view on the log data, da_ts_table; you can find further details on this in the SQL Query documentation):

SELECT node, status,
     TS_SEG_AVG(
          TS_SEGMENT_BY_TIME(ts_index(ts, 1), 3600*1000, 3600*1000))
FROM da_ts_table

This statement could easily be changed to apply very different transformations. For instance, getting the average request time when a status code 500 occurs:

TS_SEG_AVG(
     TS_SEGMENT_BY_ANCHOR(ts, TS_ANCHOR_INDEX(TS_ANCHOR_GTE(500.0), 0), 1800*1000, 1800*1000))

Now you have the foundation and pipeline for performing highly scalable analytics on your log data with simple SQL. If you want to take this example to the next level, feel free to explore our Jupyter Notebook.

