Performance capabilities

Db2® Big SQL provides superior SQL-on-Hadoop performance, optimizing both data ingestion and query execution for your enterprise. Db2 Big SQL is not only fast and efficient but, more importantly, can successfully execute the most demanding queries on big data. Db2 Big SQL provides a robust runtime environment and is compliant with SQL standards.

Db2 Big SQL has demonstrated its superior scalability and performance by executing all 99 Hadoop-DS queries (based on TPC-DS, the industry-standard relational benchmark) at the 100 TB scale factor, using only the minor query modifications that the benchmark rules allow. Furthermore, Db2 Big SQL can run these queries three times faster than Spark SQL, while using far fewer resources.

Db2 Big SQL delivers unmatched high-concurrency workload execution. In addition, Db2 Big SQL provides mature autonomics and does not require fine tuning based on the number of concurrent query streams.

Db2 Big SQL provides a Db2-compatible engine for Hadoop data and is based on decades of investment by IBM® research and development in RDBMS technology. Key components of Db2 Big SQL's world-class scalability and performance include:
  • An advanced cost-based query optimizer
  • A mature, massively parallel processing (MPP) architecture
  • A robust runtime environment for analytical workloads
  • A self-tuning memory manager (STMM)
  • Elastic boost technology
  • A sophisticated workload manager (WLM)
  • Materialized query tables (MQTs)
Db2 Big SQL extends each of these Db2 components to be Hadoop-aware. Because most of the components in Db2 Big SQL are based on existing IBM RDBMS technology, they have already been deployed in major enterprises across the globe—that is, they are already tried and tested.

Query optimizer

Db2 Big SQL has a world-class query optimizer and query re-write engine that are based on decades of IBM research and development. The cost-based optimizer uses statistics to dynamically select the optimal data access paths.

The query optimizer takes DML (data manipulation language) statements, such as SQL queries, and determines the optimal access path to resolve them. In Db2 Big SQL, IBM extended its relational database optimizer to process data that resides on HDFS. The optimizer considers various HDFS factors to select the best path. These factors include:
Cluster configuration
  • Number of Db2 Big SQL worker nodes in the cluster
  • Various configuration parameters (which are automatically tuned based on the resources available to Db2 Big SQL), such as sort space and buffer pool size
Database design
  • Storage format of the table (for example, Parquet, ORC)
  • Table partitioning
  • Table location (for example, local or remote HDFS)
  • Informational constraints
  • HBase indexes
Table statistics—automatically collected by the ANALYZE command (see the example after this list)
  • Number of rows in a table
  • Number of unique values in a column
  • Second-highest and second-lowest values in a column
  • Value distribution statistics
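
These statistics are gathered with the ANALYZE command. A minimal sketch, where the schema, table, and column names are illustrative:

  -- Collect table and column statistics for the optimizer
  ANALYZE TABLE myschema.sales COMPUTE STATISTICS FOR COLUMNS sale_date, store_id, amount;

  -- Or collect statistics for every column, sampling 10 percent of the data
  ANALYZE TABLE myschema.sales COMPUTE STATISTICS FOR ALL COLUMNS TABLESAMPLE SYSTEM (10);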

Db2 Big SQL uses a cost-based optimizer, meaning it considers several access plans and costs them based on the estimated runtime and resource consumption, then selects the plan with the lowest cost. Db2 Big SQL also features an extensive query compiler, which can automatically apply known query rewrite optimizations to your SQL. You get highly optimized SQL without the hours of tuning effort.
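
You can capture and inspect the access plan that the optimizer chooses by using the standard Db2 explain facility. A minimal sketch, assuming the explain tables already exist (they can be created with the SYSPROC.SYSINSTALLOBJECTS procedure) and using illustrative table names:

  -- Capture the access plan for a query in the explain tables
  EXPLAIN PLAN FOR
    SELECT s.store_name, SUM(f.amount) AS total_sales
    FROM sales f
    JOIN store s ON f.store_id = s.store_id
    GROUP BY s.store_name;

  -- Format the most recently captured plan with the db2exfmt tool, for example:
  --   db2exfmt -d BIGSQL -1 -o plan.txt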

Figure 1. A query and an access plan graph for the query with cost details shown for one of the nodes

Learn more: Db2 for Linux®, UNIX, and Windows documentation: The SQL and XQuery compiler process

Self-tuning memory manager (STMM)

Db2 Big SQL automatically calculates the memory that is required for sorting, caching, and various heaps, based on the memory available on the cluster and the percentage of that memory allocated to Db2 Big SQL. As workloads run, the Db2 Big SQL self-tuning memory manager (STMM) monitors memory consumption across these consumers and dynamically adjusts the memory allocation based on query concurrency and the workload being executed. There's no need to manually tune the Db2 Big SQL memory allocation—that's all done automatically by STMM. STMM solves two key problems with memory tuning that are evident in other SQL-on-Hadoop solutions:
  • Determining an optimal memory configuration is a difficult task.
  • Static memory configurations are sub-optimal for dynamic workloads.
STMM makes its decisions with the help of internal metrics that predict how additional memory will affect a given heap. These metrics, combined with STMM's advanced tuning algorithms, can in most cases tune a system from an out-of-the-box configuration to near-optimal memory usage in a short amount of time. This approach is well suited to typical Hadoop workloads, where memory requirements can change dramatically over time.
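
As a quick way to see what STMM is managing, you can query the database configuration through the SYSIBMADM.DBCFG administrative view; a minimal sketch (the parameter list is illustrative):

  -- Parameters under STMM control report AUTOMATIC in VALUE_FLAGS
  SELECT NAME, VALUE, VALUE_FLAGS
  FROM SYSIBMADM.DBCFG
  WHERE NAME IN ('self_tuning_mem', 'sheapthres_shr', 'sortheap', 'pckcachesz');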

Learn more: Db2 for Linux, UNIX, and Windows documentation: Self-tuning memory overview

Elastic boost technology (logical worker nodes)

Db2 Big SQL uses elastic boost technology to provide superior SQL-on-Hadoop performance. Elastic boost enables multiple Db2 Big SQL workers to run on each Hadoop host. Running multiple workers on each host allows Db2 Big SQL to make more efficient use of the memory and CPU resources available on each node, which in most cases improves performance. Db2 Big SQL is able to start and stop workers independently of each other, which allows resources to be freed up for other Hadoop components running on the same node.


Learn more: Hadoop Dev article: Logical Db2 Big SQL Workers Boost Db2 Big SQL Performance

Workload manager (WLM)

Db2 Big SQL comes with a sophisticated Workload Manager (WLM) that provides control over workload priorities and resources (CPU and memory).

The Workload Manager identifies and manages the work being executed by Db2 Big SQL. It provides a variety of configurable capabilities that enable Db2 Big SQL to efficiently handle the most demanding mixed workloads, including high-concurrency environments.

With Workload Manager, you can prioritize the resources available to different groups of users and define thresholds that determine how and when work is permitted to run. Examples of activity thresholds are how long a query is allowed to run, how many resources a query is allowed to use, and the volume of data that is returned. Possible responses when a threshold is exceeded are stopping execution, issuing a warning but letting the activity finish, queuing incoming activities until the threshold is no longer exceeded, or dynamically raising or lowering the priority and the amount of resources that the activity is allowed to consume.
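
As an illustration of this kind of control, the following sketch defines a threshold that stops any activity running longer than a chosen limit; the threshold name and time limit are illustrative and are not part of the default Db2 Big SQL configuration:

  -- Stop any activity that runs for more than 60 minutes
  CREATE THRESHOLD STOP_LONG_QUERIES
    FOR DATABASE ACTIVITIES ENFORCEMENT DATABASE
    WHEN ACTIVITYTOTALTIME > 60 MINUTES
    STOP EXECUTION;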

Another popular feature of Workload Manager is priority aging, where the priority of in-flight activities changes automatically over time. WLM is typically configured so that the longer an activity runs, the lower its priority becomes and the fewer resources it receives. By lowering the priority of long-running queries over time, you can improve the throughput of shorter-running queries.

WLM is enabled by default in Db2 Big SQL and its default configuration is designed to ensure consistent query response times based on the available resources. WLM ensures that resources are shared fairly and sensibly between all activities and is a key reason that Db2 Big SQL has proven scalability and can efficiently handle mixed workloads.
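
To see how work is being distributed across WLM service classes, you can query the standard Db2 monitoring table functions; a minimal sketch:

  -- Summarize completed activities per service subclass across all members
  SELECT SERVICE_SUPERCLASS_NAME,
         SERVICE_SUBCLASS_NAME,
         SUM(ACT_COMPLETED_TOTAL) AS ACTIVITIES_COMPLETED
  FROM TABLE(MON_GET_SERVICE_SUBCLASS(NULL, NULL, -2)) AS T
  GROUP BY SERVICE_SUPERCLASS_NAME, SERVICE_SUBCLASS_NAME;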

Learn more: Db2 for Linux, UNIX, and Windows documentation: Db2 workload manager

Materialized query tables (MQTs)

Db2 Big SQL supports materialized query tables (MQTs), also known as materialized views, which can significantly improve response times for complex, analytical queries.

MQTs are tables whose definition is based on the result of a query. The data in the MQT consists of precomputed results for the query that you specify when you define the MQT. Db2 Big SQL supports both Db2 MQTs (defined over both Db2 and Hadoop tables) and native Hadoop MQTs (defined over Hadoop tables only).
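
A minimal sketch of defining and populating a deferred-refresh MQT, with illustrative table and column names (the exact options available, such as the storage clause for a native Hadoop MQT, depend on your environment):

  -- Precompute an aggregate as a deferred-refresh MQT
  CREATE TABLE sales_by_region AS
    (SELECT region, SUM(amount) AS total_amount
     FROM sales
     GROUP BY region)
  DATA INITIALLY DEFERRED REFRESH DEFERRED;

  -- Populate (and later re-populate) the precomputed results
  REFRESH TABLE sales_by_region;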

MQTs provide the best results for queries that use one or more of the following types of data:
  • Relatively static data
  • Aggregate data over one or more dimensions
  • Joins and aggregate data over a group of tables
  • A commonly accessed subset of data

The larger the base tables, the more significant the potential improvement in response time when you use MQTs. MQTs eliminate overlapping work among queries because computations are performed only once when the MQT is built and once each time that it is refreshed. The content of the MQT can then be reused during the execution of many queries, saving execution costs.

MQTs enhance Db2 Big SQL's capacity to handle interactive business intelligence (BI) workloads, especially those with predictable BI queries. The original BI queries do not need to be changed to exploit the MQT—this is done automatically by the query optimizer.
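
For deferred-refresh MQTs, the optimizer considers rerouting a query only when the session's refresh-age setting allows it; a minimal sketch using the standard Db2 special register (whether the MQT is actually used still depends on the estimated cost):

  -- Allow the optimizer to consider REFRESH DEFERRED MQTs when rerouting queries
  SET CURRENT REFRESH AGE ANY;

  -- The original query is unchanged; the optimizer can now answer it from sales_by_region
  SELECT region, SUM(amount) AS total_amount
  FROM sales
  GROUP BY region;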

Powerful tuning features

Db2 Big SQL provides you with multiple methods to optimize performance, from adjusting CPU and memory resources, to tuning HBase performance, to optimizing data access, and more. These enterprise relational database features are essential for supporting production environments.

Learn more: Performance tuning