Data warehouses are a critical component of any organization's technology ecosystem. They provide the backbone for a range of use cases such as business intelligence (BI) reporting, dashboarding, and machine learning (ML)-based predictive analytics that enable faster decision making and insights. The next generation of IBM Db2 Warehouse brings a host of new capabilities, adding cloud object storage support with advanced caching to deliver 4x faster query performance than the previous generation while cutting storage costs by 34x.[1]


The introduction of native support for cloud object storage (based on Amazon S3) for Db2 column-organized tables, coupled with our advanced caching technology, helps customers significantly reduce their storage costs and improve performance compared to the previous generation of the service. Adopting cloud object storage as the data persistence layer also enables users to move to a consumption-based model for storage, providing automatic and unlimited storage scaling.

This post highlights the new storage and caching capabilities, and the results we are seeing from our internal benchmarks, which quantify the price-performance improvements.

Cloud object storage support

The next generation of Db2 Warehouse introduces support for cloud object storage as a new medium within its storage hierarchy. It allows users to store Db2 column-organized tables in object storage in Db2's highly optimized native page format, all while maintaining full SQL compatibility and capability. Users can employ the existing high-performance cloud block storage alongside the new cloud object storage with advanced multi-tier NVMe caching, enabling a simple path towards adoption of the object storage medium for existing databases.

The following diagram provides a high-level overview of the Db2 Warehouse Gen3 storage architecture:

[Figure: Db2 Warehouse Gen3 storage architecture]

As shown above, in addition to the traditional network-attached block storage, there is a new multi-tier storage architecture that consists of two levels:

  1. Cloud object storage based on Amazon S3 — Objects associated with each Db2 partition are stored in a single pool of petabyte-scale object storage provided by the public cloud provider.
  2. Local NVMe cache — A new layer of local storage supported by high-performance NVMe disks that are directly attached to the compute node and provide significantly faster disk I/O performance than block or object storage.

In this new architecture, we have extended the existing buffer pool caching capabilities of Db2 Warehouse with a proprietary multi-tier cache. This cache extends the existing dynamic in-memory caching capabilities with a compute-local caching area backed by high-performance NVMe disks. This allows Db2 Warehouse to cache larger datasets within the combined cache, thereby improving both individual query performance and overall workload throughput.
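Db2's multi-tier cache is proprietary, but the read path described above can be sketched in miniature. The sketch below is illustrative only — the `TieredCache` name, the LRU policy, and the in-memory dictionaries standing in for the buffer pool, the NVMe tier, and object storage are all assumptions, not Db2 internals:

```python
from collections import OrderedDict

class TieredCache:
    """Illustrative two-tier read cache: a small in-memory tier (standing in
    for the buffer pool) backed by a larger tier (standing in for the NVMe
    cache). Misses in both tiers fall through to a slow fetch from object
    storage."""

    def __init__(self, mem_capacity, disk_capacity, fetch_from_object_storage):
        self.mem = OrderedDict()           # hot tier (LRU order)
        self.disk = OrderedDict()          # warm tier (LRU order)
        self.mem_capacity = mem_capacity
        self.disk_capacity = disk_capacity
        self.fetch = fetch_from_object_storage

    def get(self, page_id):
        if page_id in self.mem:            # buffer-pool hit: fastest path
            self.mem.move_to_end(page_id)
            return self.mem[page_id]
        if page_id in self.disk:           # NVMe-tier hit: promote to memory
            page = self.disk.pop(page_id)
        else:                              # miss: slow path to object storage
            page = self.fetch(page_id)
        self._put_mem(page_id, page)
        return page

    def _put_mem(self, page_id, page):
        self.mem[page_id] = page
        if len(self.mem) > self.mem_capacity:
            evicted_id, evicted = self.mem.popitem(last=False)
            self.disk[evicted_id] = evicted    # demote to the NVMe tier
            if len(self.disk) > self.disk_capacity:
                self.disk.popitem(last=False)  # fall off the cache entirely
```

A real implementation would persist the warm tier on the NVMe drives and size both tiers in bytes rather than entries, but the promotion/demotion flow captures why a warm NVMe tier avoids most high-latency object-storage round trips.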

Performance benchmarks

In this section, we show results from our internal benchmarking of Db2 Warehouse Gen3. The results demonstrate that we were able to achieve roughly 4x[1] faster query performance compared to the previous generation by using cloud object storage optimized by the new multi-tier caching layer instead of storing data on network-attached block storage. Additionally, moving the cloud storage from block to object storage results in a 34x reduction in cloud storage costs.

For these tests we set up two equivalent environments with 24 database partitions on two AWS EC2 nodes, each with 48 cores, 768 GB of memory and a 25 Gbps network interface. The Db2 Warehouse Gen3 environment adds 4 NVMe drives per node for a total of 3.6 TB per node, with 60% allocated to the on-disk cache (180 GB per database partition, or 2.16 TB per node).
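The cache sizing follows directly from the node layout. A quick check of the arithmetic, assuming the 3.6 TB of NVMe and the 60% cache share are per-node figures and using decimal units (1 TB = 1000 GB):

```python
partitions_per_node = 12        # 24 partitions spread across 2 nodes
nvme_per_node_tb = 3.6          # 4 NVMe drives per node
cache_fraction = 0.60           # share of NVMe given to the on-disk cache

cache_per_node_tb = nvme_per_node_tb * cache_fraction            # 2.16 TB
cache_per_partition_gb = cache_per_node_tb * 1000 / partitions_per_node
print(round(cache_per_partition_gb))   # 180 GB per database partition
```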

In the first set of tests, we ran our Big Data Insight (BDI) concurrent query workload on a 10TB database with 16 clients. The BDI workload is an IBM-defined workload that models a day in the life of a Business Intelligence application. The workload is based on a retail database with in-store, on-line, and catalog sales of merchandise. Three types of users are represented in the workload, running three types of queries:

  • Returns dashboard analysts generate queries that investigate the rates of return and impact on the business bottom line.
  • Sales report analysts generate sales reports to understand the profitability of the enterprise.
  • Deep-dive analysts (data scientists) run deep-dive analytics to answer questions identified by the returns dashboard and sales report analysts.

For this 16-client test, 1 client ran deep-dive analytic queries (5 complex queries), 5 clients ran sales report queries (50 intermediate-complexity queries) and 10 clients ran dashboard queries (140 simple queries). All runs were measured from a cold start (i.e., no cache warmup for either the in-memory buffer pool or the multi-tier NVMe cache). These runs show 4x faster query performance for the end-to-end execution time of the mixed workload (213 minutes elapsed for the previous generation versus only 51 minutes for the new generation).
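The headline speedup is simply the ratio of the two elapsed times, rounded down to a conservative whole number:

```python
prev_gen_minutes = 213   # end-to-end elapsed time, previous generation
gen3_minutes = 51        # end-to-end elapsed time, Db2 Warehouse Gen3

speedup = prev_gen_minutes / gen3_minutes
print(f"{speedup:.2f}x")   # 4.18x, reported conservatively as roughly 4x
```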

The significant difference in query performance is attributed to the efficiency gained through our multi-tier storage layer, which intelligently clusters the data into large blocks designed to minimize high-latency accesses to cloud object storage. This enables a very fast warmup of the NVMe cache, allowing us to capitalize on the significant performance gap between the NVMe disks and the network-attached block storage to deliver maximum performance. CPU and memory capacity were identical in both environments during these tests.

In the second set of tests, we ran a single-stream power test based on the 99 queries of the TPC-DS workload, also at the 10 TB scale. Here, the total speedup achieved with Db2 Warehouse Gen3 was 1.75x compared with the previous generation. Because only a single query executes at a time, the difference in performance is less pronounced: the network-attached block storage is able to maintain its best performance due to lower utilization than under concurrent workloads like BDI, and the warmup of our next-generation tiered cache takes longer under single-stream access. Even so, the new generation storage won handily. Once the NVMe cache is warm, a re-run of the 99 queries achieves a 4.5x average per-query speedup compared to the previous generation.

Cloud storage cost savings

The use of tiered object storage in Db2 Warehouse Gen3 not only achieves these impressive 4x query performance improvements, but also reduces cloud storage costs by a factor of 34x, resulting in a significant improvement in the price-performance ratio when compared to the previous generation using network-attached block storage.
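To make the 34x figure concrete, the sketch below works through a monthly bill for the 10 TB benchmark database. The per-GB prices here are hypothetical stand-ins chosen only to illustrate the ratio, not actual AWS list prices; real pricing varies by provider, region and storage tier:

```python
# Hypothetical per-GB-month prices, illustrative only (not AWS list prices).
object_price = 0.02          # $/GB-month, object storage (hypothetical)
block_price = 0.68           # $/GB-month, SSD block storage (hypothetical)

data_gb = 10_000             # e.g. the 10 TB benchmark database
monthly_object = data_gb * object_price   # ~= $200/month
monthly_block = data_gb * block_price     # ~= $6,800/month
print(round(monthly_block / monthly_object))   # 34
```

Because storage is billed per GB-month, the cost ratio is independent of database size: the 34x price gap between the two media carries through directly to the bill.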


Db2 Warehouse Gen3 delivers an enhanced approach to cloud data warehousing, especially for always-on, mission-critical analytics workloads. The results shared in this post show that our advanced multi-tier caching technology, together with the automatic and unlimited scaling of object storage, led not only to significant query performance improvements (4x faster) but also to massive cloud storage cost savings (34x cheaper). If you are looking for a highly reliable, high-performance cloud data warehouse with industry-leading price-performance, try Db2 Warehouse for free today.


1. Based on running the IBM Big Data Insights concurrent query benchmark on two equivalent Db2 Warehouse environments with 24 database partitions on two EC2 nodes, each with 48 cores, 768 GB of memory and a 25 Gbps network interface; one environment did not use the caching capability and served as the baseline. Result: a 4x increase in query speed using the new capability. The storage cost reduction is derived from the price of cloud object storage, which is 34x cheaper than SSD-based block storage.

