IBM recently released object storage traces that reflect cloud object storage workloads and contributed these traces to the Storage Networking Industry Association (SNIA).

The traces that IBM contributed include read and write requests made against objects in a cloud-based object store. These traces can help us understand the behavior of cloud workloads and drive new research and insight into enhancing the cloud. Today, there is significant academic interest in real data access traces that can be used to investigate various aspects of workloads. Although file system and block traces are readily available, until now there were no publicly available access traces for object storage. That’s why IBM decided to make these traces available to the community, and we’re looking forward to seeing the research insights that follow.

Using traces to explore cloud cache policies

For example, our team at IBM Research used these traces to explore how the classical FIFO and LRU cache policies apply to object storage. We published our findings in a paper at HotStorage 2020 under the title “It’s Time to Revisit LRU vs. FIFO.” The paper explores modern cache systems that can be deployed at scales undreamed-of just a few years ago.

With the advent of big data and cloud computing, a cache can hold terabytes of data or more. Using these traces, we were able to contrast different methods for managing a large-scale cache. The traces enabled us to revisit the effectiveness of the popular LRU cache eviction policy versus the FIFO heuristic, which attempts to offer LRU-like behavior.

Several past works have considered this question and commonly concluded that while FIFO is much easier to implement, the improved hit ratio of LRU outweighs this ease of implementation. We found that two main trends call for a reevaluation of this premise.

The first trend is that new caches, such as front-ends to cloud storage, are very large-scale, which makes managing cache metadata in RAM no longer feasible. The second trend is the emergence of new types of workloads. Using the insight gained from the traces, we were able to substantiate the need for this reevaluation and demonstrate cases where FIFO provides better performance characteristics than the commonly used LRU algorithm.
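
To make the comparison concrete, here is a minimal Python sketch that replays a sequence of object IDs through a FIFO cache and an LRU cache and reports the hit ratio of each. It is an illustration only, not the methodology of the HotStorage paper: the function names are hypothetical, capacity is counted in objects rather than bytes, and object sizes and the placement of cache metadata are ignored.

```python
from collections import OrderedDict, deque


def fifo_hit_ratio(requests, capacity):
    """Hit ratio of a FIFO cache holding at most `capacity` objects."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    cached = set()
    order = deque()  # insertion order, oldest object on the left
    hits = 0
    for obj in requests:
        if obj in cached:
            hits += 1  # a hit leaves the eviction order untouched
            continue
        if len(cached) >= capacity:
            cached.discard(order.popleft())  # evict the oldest insertion
        cached.add(obj)
        order.append(obj)
    return hits / len(requests) if requests else 0.0


def lru_hit_ratio(requests, capacity):
    """Hit ratio of an LRU cache holding at most `capacity` objects."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    cache = OrderedDict()  # least recently used object first
    hits = 0
    for obj in requests:
        if obj in cache:
            hits += 1
            cache.move_to_end(obj)  # a hit updates the recency order
            continue
        if len(cache) >= capacity:
            cache.popitem(last=False)  # evict the least recently used
        cache[obj] = True
    return hits / len(requests) if requests else 0.0


if __name__ == "__main__":
    toy = ["a", "b", "a", "c", "a"]  # made-up object IDs, not trace data
    print("FIFO:", fifo_hit_ratio(toy, 2))
    print("LRU: ", lru_hit_ratio(toy, 2))
```

On this toy sequence with room for two objects, LRU scores two hits and FIFO one. Note that the LRU version modifies its ordering on every hit, while the FIFO version touches its ordering only on a miss; that difference in bookkeeping is the kind of cost that becomes relevant once metadata no longer fits in RAM.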

Insights that can optimize cache research

The object storage traces are a treasure trove of information for optimizing cloud workloads. They provide insight for cache research, in general, and particularly for large-scale hybrid cloud caches. While IBM has been using these traces to reevaluate eviction policies for large-scale caches, other groups have expressed interest in these traces for other uses. For example, an academic group at a leading university has been using the traces to study the effect of different cache policies on variable-sized data.

A closer look at what’s inside the object storage traces

The IBM object storage traces are a set of anonymized traces that IBM is making available to the broader research community. The data set comprises 98 traces containing around 1.6 billion requests for 342 million unique objects, and is about 88 GB in size. Each trace contains the REST operations issued against a single bucket in IBM Cloud Object Storage during the same single week in 2019. Traces were selected based on a single criterion: that they contain some read (i.e., GET OBJECT) requests.

Each trace contains GET OBJECT, PUT OBJECT, HEAD OBJECT and DELETE OBJECT requests. Each request includes a timestamp, the request type, the object ID and, where applicable, the total object size and a starting and ending offset. Only successful requests (i.e., those that returned a return code of 200) are listed. The data set was originally intended to enable the study of cache behavior, so requests that were not served were of no interest.

Bucket names are omitted, and objects are represented as IDs generated through a one-way keyed hash function.  

The format of each trace record is <time stamp of request>, <request type>, <object ID>, <optional: size of object>, <optional: beginning offset>, <optional: ending offset>. The timestamp is the number of milliseconds from the point where we began collecting the traces. In the sample records that follow, the fields are separated by spaces.

For example: 

  • 1219008 REST.PUT.OBJECT 8d4fcda3d675bac9 1056
  • 1221974 REST.HEAD.OBJECT 39d177fb735ac5df 528
  • 1232437 REST.HEAD.OBJECT 3b8255e0609a700d 1456
  • 1232488 REST.GET.OBJECT 95d363d3fbdc0b03 1168 0 1167
  • 1234545 REST.GET.OBJECT bfc07f9981aa6a5a 528 0 527
  • 1256364 REST.HEAD.OBJECT c27efddbeef2b638 12752
  • 1256491 REST.HEAD.OBJECT 13943e909692962f 9760
  • 1256556 REST.GET.OBJECT 884ba9b0c6d1fe97 23872 0 23871
  • 1256584 REST.HEAD.OBJECT d86b7bfefc63995d 12592
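
As an illustration of how such records might be read programmatically, the following Python sketch parses one line into named fields. It assumes whitespace-separated fields, as in the sample records above, with the bullet marker already removed; the `TraceRecord` type and the `parse_record` function are hypothetical names chosen for this example and are not part of the published data set.

```python
from typing import NamedTuple, Optional


class TraceRecord(NamedTuple):
    timestamp_ms: int                   # milliseconds since trace collection began
    request_type: str                   # e.g. "REST.GET.OBJECT"
    object_id: str                      # anonymized object ID
    size: Optional[int] = None          # optional: size of object
    start_offset: Optional[int] = None  # optional: beginning offset
    end_offset: Optional[int] = None    # optional: ending offset


def parse_record(line: str) -> TraceRecord:
    """Parse one whitespace-separated trace line (assumed layout)."""
    fields = line.split()
    if len(fields) < 3:
        raise ValueError(f"expected at least 3 fields, got {len(fields)}: {line!r}")
    timestamp_ms = int(fields[0])
    request_type, object_id = fields[1], fields[2]
    # Up to three optional integer fields; pad the missing ones with None.
    optional = [int(f) for f in fields[3:6]]
    optional += [None] * (3 - len(optional))
    return TraceRecord(timestamp_ms, request_type, object_id, *optional)


if __name__ == "__main__":
    print(parse_record("1232488 REST.GET.OBJECT 95d363d3fbdc0b03 1168 0 1167"))
    print(parse_record("1219008 REST.PUT.OBJECT 8d4fcda3d675bac9 1056"))
```

Keeping the three trailing fields optional lets the same function handle GET records, which carry offsets, and PUT or HEAD records, which in the samples carry only a size.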

Learn more

The IBM Cloud Object Storage traces are a set of object storage workload traces that can facilitate cache, object storage, and cloud research. The traces are now available on the SNIA site. We hope you will use them and find them as insightful as we have.
