With Db2® Big SQL, your organization can derive
significant value from your enterprise data.
IBM Db2 Big SQL is a high-performance massively parallel processing (MPP) SQL engine for Hadoop that makes querying enterprise data from across the organization an easy and secure experience. A Db2 Big SQL query can quickly access a variety of data sources, including HDFS, GPFS, RDBMS, NoSQL databases, object stores, and WebHDFS, by using a single database connection or a single query, for best-in-class analytic capabilities.
Although you can create Db2 local tables in Db2 Big SQL, this is not a recommended practice, because these tables reside on the head node only (there is no parallel data access). Local tables can also cause memory and performance issues if the head node becomes a bottleneck.
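For data that should be processed in parallel, the usual alternative is a Hadoop table, whose data is stored in HDFS and read by the workers. The following is a minimal sketch; the table name, columns, and Parquet storage format are illustrative assumptions.

    -- A Db2 local table: data resides on the head node only (not recommended).
    CREATE TABLE app_events_local (
      event_ts  TIMESTAMP,
      user_id   BIGINT,
      action    VARCHAR(100)
    );

    -- A Hadoop table: data is stored in HDFS (here as Parquet) and is
    -- accessed in parallel by the Db2 Big SQL workers.
    CREATE HADOOP TABLE app_events (
      event_ts  TIMESTAMP,
      user_id   BIGINT,
      action    VARCHAR(100)
    )
    STORED AS PARQUET;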
Db2 Big SQL provides tools to help you manage your system
and your databases, and you can use popular analytic tools to visualize your data.
How Db2 Big SQL works
Db2 Big SQL's robust engine executes complex queries against remote relational data and Hadoop data. Db2 Big SQL provides an advanced SQL compiler and a cost-based optimizer for efficient query execution. Combining these with a massively parallel processing (MPP) engine helps distribute query execution across the nodes in a cluster.
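As a rough illustration of the compiler and optimizer at work, the standard Db2 explain facility can capture the access plan that the optimizer chooses for a query. This is a minimal sketch; the SALES table and its columns are assumptions, and the explain tables must exist before a plan can be captured.

    -- Create the explain tables if they do not already exist (a standard Db2 procedure).
    CALL SYSPROC.SYSINSTALLOBJECTS('EXPLAIN', 'C',
         CAST(NULL AS VARCHAR(128)), CAST(NULL AS VARCHAR(128)));

    -- Capture the optimizer's access plan for a query on a hypothetical table.
    EXPLAIN PLAN FOR
      SELECT region, SUM(amount)
      FROM sales
      WHERE sale_date >= '2021-01-01'
      GROUP BY region;

    -- The captured plan can then be formatted with the db2exfmt command-line tool.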
Db2 Big SQL architecture
Built on the world-class IBM common SQL database technology, Db2 Big SQL is a massively parallel processing (MPP) database
engine that has all the standard RDBMS features and is optimized to work with the Apache Hadoop
ecosystem.
For example, the following diagram shows how Db2 Big SQL fits within the overall Apache Hadoop architecture. The direction of the communication flow arrows indicates which component initiates the communication.
The Db2 Big SQL server or service consists of one Db2 Big SQL head (two heads in an HA configuration) that is
installed on a node called the head node, and multiple Db2 Big
SQL workers that are installed on nodes called worker nodes.
- Db2 Big SQL server
- A general term to describe the Db2 Big SQL software or the
Db2 Big SQL processes. Db2 Big SQL service is a
synonym for Db2 Big SQL server in the context of Db2 Big SQL as a service in the CDP stack.
- Db2 Big SQL head
- The set of Db2 Big SQL processes that accept SQL query
requests from applications and coordinate with Db2 Big SQL
workers to process data and compute the results.
- Db2 Big SQL head node
- The physical or virtual machine (node) on which the Db2
Big SQL head runs.
- Db2 Big SQL worker
- The set of Db2 Big SQL processes that communicate with the
Db2 Big SQL head to access data and compute query results. Db2 Big SQL workers are normally collocated with the HDFS
DataNodes to facilitate local disk access. Db2 Big SQL can
access and process HDFS data that conforms to the most common Hadoop formats, such as ORC, Parquet, and text (TEXTFILE). For more details about the supported data formats, see File formats that are supported by Db2 Big SQL.
The Db2 Big SQL head coordinates the processing of SQL queries
with the workers, which handle most of the HDFS data access and processing.
- Db2 Big SQL worker node
- The physical or virtual machine (node) on which the Db2
Big SQL worker runs.
- Db2 Big SQL scheduler
- A process that runs on the Db2 Big SQL head node. The
scheduler's function is to bridge the RDBMS domain and the Hadoop domain. The scheduler communicates
with the Hive metastore to determine Hive table properties and schemas, and with the HDFS NameNode to
determine the location of file blocks. The scheduler responds to Db2 Big SQL head and worker requests for information about Hadoop data, including HDFS, HBase,
object storage, and Hive metadata. For more information about the scheduler, see Db2 Big SQL scheduler.
- Db2 Big SQL metadata
- HDFS data properties such as name, location, format, and the desired relational schemas. This
metadata, gathered through the scheduler, is used to enable consistent and optimal SQL processing of
queries against HDFS data.
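As a small, hedged illustration of this metadata: Db2 Big SQL exposes Hive metastore information through catalog views in the SYSHADOOP schema, so you can inspect the table properties that the scheduler works with. The view name below is an assumption to verify against your release.

    -- Inspect Hive metastore information for tables known to Db2 Big SQL.
    -- SYSHADOOP.HCAT_TABLES is assumed here; check the catalog views in your release.
    SELECT *
    FROM SYSHADOOP.HCAT_TABLES
    FETCH FIRST 10 ROWS ONLY;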
The following steps represent a simple overview of how Db2 Big SQL processes HDFS data (a short sketch of this flow follows the list):
- Applications connect.
Applications connect to the Db2 Big SQL head on the head node.
- Queries are submitted.
Queries submitted to the Db2 Big SQL head are compiled into optimized parallel execution plans by using the IBM® common SQL engine's query optimizer.
- Plans are distributed.
The parallel execution plans are then distributed to Db2 Big SQL workers on the worker nodes.
- Data is read and written.
Workers have separate local processes, called native HDFS readers and writers, that read or write HDFS data in its stored format. A Db2 Big SQL reader consists of Db2 Big SQL processes that run on the Db2 Big SQL worker nodes and read data at the request of Db2 Big SQL workers. Similarly, a Db2 Big SQL writer consists of Db2 Big SQL processes that run on the worker nodes and write data at the request of the workers. The workers communicate with these native readers and writers to access HDFS data when they process the execution plans that they receive from the Db2 Big SQL head. For more information, see Db2 Big SQL readers and writers.
- Predicates are applied.
Native HDFS readers can apply predicates and project desired
columns to minimize the amount of data that is returned to workers.
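The following query is a minimal sketch of that flow; SALES is a hypothetical Hadoop table, and the comments map each part of the statement to the steps above.

    -- Submitted by an application over its connection to the Db2 Big SQL head,
    -- which compiles it into a parallel plan and distributes that plan to the workers.
    SELECT region,                        -- only the projected columns are returned by the readers
           SUM(amount) AS total_amount
    FROM sales                            -- a Hadoop table whose files reside in HDFS
    WHERE sale_date >= '2021-01-01'       -- a predicate that the native HDFS readers can apply
    GROUP BY region;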
Why Db2 Big SQL?
Db2 Big SQL provides world-class scalability and
performance, and supports the following key use cases:
- Enterprise Data Warehouse (EDW) offloading
- Db2 Big SQL understands commonly used SQL syntax from other vendors and producers. You can offload and consolidate old data more quickly and easily from existing Oracle, IBM Db2, and IBM Netezza® enterprise data warehouses or data marts while preserving most of the SQL from those platforms (see the offload sketch after this list).
- Federated access to relational data
- For data that can't be moved to Hadoop, Db2 Big SQL provides federated access to many relational database management system (RDBMS) sources outside of Hadoop by using IBM Fluid Query technology, and to NoSQL databases by using Spark connectors. You can use a single database connection to access data across Hadoop and dozens of relational and NoSQL database types, whether they are in the cloud, on local systems, or both. Wherever the data resides, Db2 Big SQL offers data virtualization and enables querying disparate sources in a single query.
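As a minimal sketch of federated access (and of EDW offloading over the same connection), the following statements register a remote Db2 data source, create a nickname for one of its tables, and then use it in queries. The wrapper, server, option values, and table names are illustrative assumptions, and the exact options depend on your federation configuration.

    -- Register a remote Db2 enterprise data warehouse as a federated data source.
    CREATE WRAPPER DRDA;
    CREATE SERVER edw_server TYPE DB2/UDB VERSION '11.5' WRAPPER DRDA
      OPTIONS (HOST 'edw.example.com', PORT '50000', DBNAME 'EDWDB');
    CREATE USER MAPPING FOR USER SERVER edw_server
      OPTIONS (REMOTE_AUTHID 'edwuser', REMOTE_PASSWORD 'edwpwd');

    -- A nickname makes the remote table look like a local table.
    CREATE NICKNAME edw_sales FOR edw_server.EDW.SALES;

    -- A single query can join remote EDW data with a Hadoop table (store_regions).
    SELECT r.region, SUM(s.amount) AS total_amount
    FROM edw_sales s
    JOIN store_regions r ON s.store_id = r.store_id
    GROUP BY r.region;

    -- Offloading: copy older EDW rows into a Hadoop table (sales_archive).
    INSERT INTO sales_archive
      SELECT * FROM edw_sales
      WHERE sale_date < '2015-01-01';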
Db2 Big SQL also boasts:
- Elastic boost technology to support more granular resource usage and increase performance
without increasing memory or CPU
- High-performance scans, inserts, updates, and deletes
- Deeper integration with Spark 2.1 than other SQL-on-Hadoop technologies. Deeper integration with Spark 2.4 requires the Db2 Big SQL 7.28 and later service on IBM Cloud Pak® for Data 4.5 and later.
- Machine learning or graph analytics with Spark, under a single security model
- Open Data Platform initiative (ODPi) compliance
- Advanced, ANSI-compliant SQL queries