Db2 Big SQL architecture

Built on the world-class IBM common SQL database technology, Big SQL is a massively parallel processing (MPP) database engine that provides standard RDBMS features and is optimized to work with the Apache Hadoop ecosystem.

The following diagram shows how Db2 Big SQL fits within the overall Apache Hadoop architecture. The direction of each communication flow arrow indicates which component initiates the communication.

Db2 Big SQL architecture

The Db2 Big SQL server or service consists of one Db2 Big SQL head (two heads in an HA configuration) that is installed on a node called the head node, and multiple Db2 Big SQL workers that are installed on nodes called worker nodes.

Definitions

Db2 Big SQL server
A general term to describe the Db2 Big SQL software or the Db2 Big SQL processes. Db2 Big SQL service is a synonym for Db2 Big SQL server in the context of Db2 Big SQL as a service in the HDP stack.
Db2 Big SQL head
The set of Db2 Big SQL processes that accept SQL query requests from applications and coordinate with Db2 Big SQL workers to process data and compute the results.
Db2 Big SQL head node
The physical or virtual machine (node) on which the Db2 Big SQL head runs.
Db2 Big SQL worker
The set of Db2 Big SQL processes that communicate with the Db2 Big SQL head to access data and compute query results. Db2 Big SQL workers are normally collocated with the HDFS DataNodes to facilitate local disk access. Db2 Big SQL can access and process HDFS data in most common Hadoop formats, such as Avro, Parquet, ORC, and Sequence. For more details about the supported data formats, see File formats supported by Big SQL. The Db2 Big SQL head coordinates the processing of SQL queries with the workers, which handle most of the HDFS data access and processing.
Db2 Big SQL worker node
The physical or virtual machine (node) on which the Db2 Big SQL worker runs.
Db2 Big SQL scheduler
A process that runs on the Db2 Big SQL head node. The scheduler's function is to bridge the RDBMS domain and the Hadoop domain. The scheduler communicates with the Hive metastore to determine Hive table properties and schemas, and with the HDFS NameNode to determine the locations of file blocks. The scheduler responds to Db2 Big SQL head and worker requests for information about Hadoop data, including HDFS, HBase, object storage, and Hive metadata. For more information about the scheduler, see Db2 Big SQL scheduler.
Db2 Big SQL metadata
HDFS data properties such as name, location, format, and the desired relational schemas. This metadata, gathered through the scheduler, is used to enable consistent and optimal SQL processing of queries against HDFS data.
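To make the scheduler's bridging role concrete, the following is a minimal, hypothetical Python sketch. The class and dictionary names (`HIVE_METASTORE`, `NAMENODE_BLOCKS`, `ScanAssignment`, `schedule_scan`) are illustrative assumptions, not the actual Db2 Big SQL internals: it combines table metadata from a mock Hive metastore with block locations from a mock NameNode, the two pieces of information the scheduler gathers for the head and workers.

```python
# Hypothetical sketch of the scheduler's bridging role. All names and
# structures here are illustrative, not the real Db2 Big SQL internals.
from dataclasses import dataclass

# Mock Hive metastore: table name -> storage format and HDFS directory.
HIVE_METASTORE = {
    "sales": {"format": "parquet", "location": "/warehouse/sales"},
}

# Mock HDFS NameNode: file path -> hosts that store a replica of the block.
NAMENODE_BLOCKS = {
    "/warehouse/sales/part-0": ["worker1", "worker2"],
    "/warehouse/sales/part-1": ["worker2", "worker3"],
}

@dataclass
class ScanAssignment:
    """Metadata a worker needs to scan one file: path, format, local host."""
    path: str
    file_format: str
    host: str

def schedule_scan(table: str) -> list[ScanAssignment]:
    """Combine Hive metadata and NameNode block locations into scan
    assignments, preferring a host that stores the block (data locality)."""
    meta = HIVE_METASTORE[table]
    assignments = []
    for path, hosts in NAMENODE_BLOCKS.items():
        if path.startswith(meta["location"]):
            assignments.append(ScanAssignment(path, meta["format"], hosts[0]))
    return assignments

print(schedule_scan("sales"))
```

In this toy model, collocating workers with DataNodes pays off because each file block can be assigned to a worker running on a host that already holds a replica.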

How Db2 Big SQL processes HDFS data

The following steps represent a simple overview of how Db2 Big SQL processes HDFS data:
  1. Applications connect.

    Applications connect to the Db2 Big SQL head on the head node.

  2. Queries are submitted.

    Queries submitted to the Db2 Big SQL head are compiled into optimized parallel execution plans by using the IBM common SQL engine's query optimizer.

  3. Plans are distributed.

    The parallel execution plans are then distributed to Db2 Big SQL workers on the worker nodes.

  4. Data is read and written.

    Workers have separate local processes called native HDFS readers and writers that read or write HDFS data in its stored format. A Db2 Big SQL reader consists of Db2 Big SQL processes that run on the worker nodes and read data at the request of Db2 Big SQL workers; similarly, a Db2 Big SQL writer consists of processes on the worker nodes that write data at the request of the workers. The workers communicate with these native readers or writers to access HDFS data when they process execution plans that they receive from the Db2 Big SQL head. For more information, see Db2 Big SQL readers and writers.

  5. Predicates are applied.

    Native HDFS readers can apply predicates and project desired columns to minimize the amount of data that is returned to workers.
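The read path in steps 4 and 5 can be sketched as a toy pipeline in Python. This is purely illustrative (real native readers operate on binary formats such as Parquet and ORC, and the function names here are invented): the point is that the predicate and the column projection are applied at the reader, so only the minimized rows ever reach the worker.

```python
# Illustrative-only sketch of predicate pushdown and column projection
# in a native HDFS reader; real readers work on formats such as Parquet/ORC.
from typing import Callable

# Rows as stored in a mock HDFS file.
STORED_ROWS = [
    {"id": 1, "region": "EU", "amount": 250, "note": "..."},
    {"id": 2, "region": "US", "amount": 900, "note": "..."},
    {"id": 3, "region": "EU", "amount": 120, "note": "..."},
]

def native_read(predicate: Callable[[dict], bool],
                columns: list[str]) -> list[dict]:
    """Apply the predicate and project the requested columns inside the
    reader, so only the minimized rows cross into the worker process."""
    return [
        {col: row[col] for col in columns}
        for row in STORED_ROWS
        if predicate(row)
    ]

# A worker's request, derived from the head's execution plan for a query like:
#   SELECT id, amount FROM sales WHERE region = 'EU'
rows = native_read(lambda r: r["region"] == "EU", ["id", "amount"])
print(rows)
```

Of the three stored rows, only the two matching rows are returned, and each carries only the two projected columns rather than the full record.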