March 15, 2021 By Effi Ofer
Danny Harnik
Ronen Kat
3 min read

IBM recently released object storage traces that reflect cloud object storage workloads and contributed these traces to the Storage Networking Industry Association (SNIA).

The traces that IBM contributed include read and write requests made against objects in a cloud-based object store. These traces can help us understand the behavior of cloud workloads and drive new research and insight into enhancing the cloud. Today, there is a significant amount of academic interest in real data access traces that can be used to investigate various aspects of workloads. Although file system and block traces are easily available, until now there were no publicly available access traces for object storage. That’s why IBM decided to make these traces available to the community, and we’re looking forward to seeing the research insights that follow.

Using traces to explore cloud cache policies

For example, our team at IBM Research leveraged these traces to explore how classical FIFO and LRU cache policies apply for object storage. In fact, we published the summary in a paper in HotStorage 2020 under the title “It’s Time to Revisit LRU vs. FIFO.” The paper explores modern cache systems that can be deployed on scales undreamed-of just a few years ago.  

With the advent of big data and cloud computing, a cache can now hold terabytes of data or more. Using these traces, we were able to compare different methods for managing a large-scale cache. The traces enabled us to revisit the question of how effective the popular LRU cache eviction policy is compared with the FIFO heuristic, which attempts to offer LRU-like behavior.

Several past works have considered this question and commonly concluded that while FIFO is much easier to implement, the improved hit ratio of LRU outweighs this simplicity. We found that two main trends call for a reevaluation of this premise.

The first trend is that new caches, such as front ends to cloud storage, are very large, which makes managing cache metadata in RAM no longer feasible. The second trend is the emergence of new types of workloads. Using the insight gained from the traces, we were able to substantiate the need for this reevaluation and demonstrate cases where FIFO provides better performance characteristics than the commonly used LRU algorithm.
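To make the comparison concrete, here is a minimal sketch of how the two eviction policies differ when replayed against a sequence of object requests. It is an illustration only, not the simulator used in the paper: the function name `simulate`, the count-based capacity, and the toy request sequence are all assumptions, and object sizes and metadata costs (the factors that matter at cloud scale) are ignored.

```python
from collections import OrderedDict


def simulate(requests, capacity, policy):
    """Replay `requests` (an iterable of object IDs) through a cache that
    holds at most `capacity` objects and return the hit ratio.

    `policy` is "lru" or "fifo". Hypothetical helper for illustration:
    capacity is counted in objects, not bytes.
    """
    if policy not in ("lru", "fifo"):
        raise ValueError(f"unknown policy: {policy!r}")
    if capacity < 1:
        raise ValueError("capacity must be at least 1")

    cache = OrderedDict()  # front of the dict = next object to evict
    hits = 0
    total = 0
    for key in requests:
        total += 1
        if key in cache:
            hits += 1
            if policy == "lru":
                # LRU updates its ordering on every hit.
                cache.move_to_end(key)
            # FIFO does nothing on a hit: order is insertion order only.
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict from the front
            cache[key] = None
    return hits / total if total else 0.0


# Toy request sequence (made up for this example, not taken from the traces).
toy_requests = ["a", "b", "a", "c", "a", "d", "a"]
lru_ratio = simulate(toy_requests, capacity=2, policy="lru")
fifo_ratio = simulate(toy_requests, capacity=2, policy="fifo")
print(f"LRU hit ratio:  {lru_ratio:.3f}")
print(f"FIFO hit ratio: {fifo_ratio:.3f}")
```

Note that the only difference between the two policies is the reordering step on a hit. That step is what forces LRU to update metadata on every read, which becomes costly once the metadata no longer fits in RAM. A small in-memory toy like this one cannot show that cost, so it says nothing by itself about which policy performs better at scale.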

Insights that can optimize cache research

The object storage traces are a treasure trove of information for optimizing cloud workloads. They provide insight for cache research, in general, and particularly for large-scale hybrid cloud caches. While IBM has been using these traces to reevaluate eviction policies for large-scale caches, other groups have expressed interest in these traces for other uses. For example, an academic group at a leading university has been using the traces to study the effect of different cache policies on variable-sized data.

A closer look at what’s inside the object storage traces

The IBM object storage traces are a set of anonymized traces that IBM is making available to the broader research community. The trace data set is composed of 98 traces containing around 1.6 billion requests for 342 million unique objects. The complete trace data set is about 88 GB in size. Each trace contains the REST operations issued against a single bucket in IBM Cloud Object Storage during the same week in 2019. Each trace was selected based on a single criterion: that it contains some read (i.e., GET OBJECT) requests.

Each trace contains GET OBJECT, PUT OBJECT, HEAD OBJECT, and DELETE OBJECT requests taken over the week-long period, where each request includes a timestamp, the request type, the object ID and, optionally, the total object size and a starting and ending offset. Only successful requests (i.e., those that returned a return code of 200) are listed. Originally, this data set was intended to enable the study of cache behavior and, therefore, requests that were not served were of no interest.

Bucket names are omitted, and objects are represented as IDs generated through a one-way keyed hash function.  

The format of each trace record is <timestamp of request>, <request type>, <object ID>, <optional: size of object>, <optional: beginning offset>, <optional: ending offset>. As the examples below show, the fields within a record are separated by spaces. The timestamp is the number of milliseconds from the point where we began collecting the traces.

For example: 

  • 1219008 REST.PUT.OBJECT 8d4fcda3d675bac9 1056
  • 1221974 REST.HEAD.OBJECT 39d177fb735ac5df 528
  • 1232437 REST.HEAD.OBJECT 3b8255e0609a700d 1456
  • 1232488 REST.GET.OBJECT 95d363d3fbdc0b03 1168 0 1167
  • 1234545 REST.GET.OBJECT bfc07f9981aa6a5a 528 0 527
  • 1256364 REST.HEAD.OBJECT c27efddbeef2b638 12752
  • 1256491 REST.HEAD.OBJECT 13943e909692962f 9760
  • 1256556 REST.GET.OBJECT 884ba9b0c6d1fe97 23872 0 23871
  • 1256584 REST.HEAD.OBJECT d86b7bfefc63995d 12592
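Records like the ones above are straightforward to parse. The following is a minimal sketch, assuming whitespace-separated fields in the order described above and no bullet characters in the actual trace files. The names `TraceRecord` and `parse_record` are hypothetical and are not part of any IBM or SNIA tooling. Treating the ending offset as inclusive is also an assumption, inferred from the sample records, where an object of size 1168 is read with offsets 0 to 1167.

```python
from typing import NamedTuple, Optional


class TraceRecord(NamedTuple):
    """One request from a trace. The last three fields are optional."""
    timestamp_ms: int                   # milliseconds since trace collection began
    request_type: str                   # e.g. "REST.GET.OBJECT"
    object_id: str                      # anonymized object ID
    size: Optional[int] = None          # total object size
    start_offset: Optional[int] = None  # beginning offset of the request
    end_offset: Optional[int] = None    # ending offset (assumed inclusive)


def parse_record(line: str) -> TraceRecord:
    """Parse one whitespace-separated trace line into a TraceRecord."""
    fields = line.split()
    if not 3 <= len(fields) <= 6:
        raise ValueError(f"unexpected field count in: {line!r}")
    # Any fields after the object ID are integers: size, then offsets.
    numeric = [int(f) for f in fields[3:]]
    return TraceRecord(int(fields[0]), fields[1], fields[2], *numeric)


sample_lines = [
    "1219008 REST.PUT.OBJECT 8d4fcda3d675bac9 1056",
    "1232488 REST.GET.OBJECT 95d363d3fbdc0b03 1168 0 1167",
    "1234545 REST.GET.OBJECT bfc07f9981aa6a5a 528 0 527",
]
records = [parse_record(line) for line in sample_lines]

# Sum the bytes requested by GETs, treating the ending offset as inclusive.
bytes_read = sum(
    r.end_offset - r.start_offset + 1
    for r in records
    if r.request_type == "REST.GET.OBJECT" and r.end_offset is not None
)
print(bytes_read)
```

A parsed stream of `object_id` values from the GET requests is the kind of input a cache simulation would replay.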

Learn more

The IBM Cloud Object Storage traces are a set of object storage workload traces that can facilitate cache, object storage, and cloud research. The traces are now available on the SNIA site. We hope you will use them and find them as insightful as we have.
