With Db2® Big SQL, your organization can derive
significant value from your enterprise data.
IBM Db2 Big SQL is a high-performance massively parallel processing (MPP) SQL engine for Hadoop that makes querying enterprise data from across the organization an easy and secure experience. A Db2 Big SQL query can quickly access a variety of data sources, including HDFS, GPFS, RDBMS, NoSQL databases, object stores, and WebHDFS, by using a single database connection or a single query, for best-in-class analytic capabilities.
Although you can create Db2 local tables in Db2 Big SQL, this is not a recommended practice, because these tables reside on the head node only (there is no parallel data access). Local tables can also cause memory and performance issues if the head node becomes a bottleneck.
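For data that should be processed in parallel, the usual alternative is a Hadoop table, whose data is stored in HDFS and read by the workers. The following is a minimal sketch; the table name, columns, and Parquet storage format are illustrative assumptions.

    -- A Db2 local table: data resides on the head node only (not recommended).
    CREATE TABLE app_events_local (
      event_ts  TIMESTAMP,
      user_id   BIGINT,
      action    VARCHAR(100)
    );

    -- A Hadoop table: data is stored in HDFS (here as Parquet) and is
    -- accessed in parallel by the Db2 Big SQL workers.
    CREATE HADOOP TABLE app_events (
      event_ts  TIMESTAMP,
      user_id   BIGINT,
      action    VARCHAR(100)
    )
    STORED AS PARQUET;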
Db2 Big SQL provides tools to help you manage your system
and your databases, and you can use popular analytic tools to visualize your data.
How Db2 Big SQL works
Db2 Big SQL's robust engine executes complex queries against remote relational data and Hadoop data. Db2 Big SQL provides an advanced SQL compiler and a cost-based optimizer for efficient query execution. Combining these with a massively parallel processing (MPP) engine helps distribute query execution across the nodes in a cluster.
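As a rough illustration of the compiler and optimizer at work, the standard Db2 explain facility can capture the access plan that the optimizer chooses for a query. This is a minimal sketch; the SALES table and its columns are assumptions, and the explain tables must exist before a plan can be captured.

    -- Create the explain tables if they do not already exist (a standard Db2 procedure).
    CALL SYSPROC.SYSINSTALLOBJECTS('EXPLAIN', 'C',
         CAST(NULL AS VARCHAR(128)), CAST(NULL AS VARCHAR(128)));

    -- Capture the optimizer's access plan for a query on a hypothetical table.
    EXPLAIN PLAN FOR
      SELECT region, SUM(amount)
      FROM sales
      WHERE sale_date >= '2021-01-01'
      GROUP BY region;

    -- The captured plan can then be formatted with the db2exfmt command-line tool.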
Db2 Big SQL architecture
Built on the world-class IBM common SQL database technology, Db2 Big SQL is a massively parallel processing (MPP) database
engine that has all the standard RDBMS features and is optimized to work with the Apache Hadoop
ecosystem.
For example, the following diagram shows how Db2 Big SQL fits within the overall Apache Hadoop architecture. The direction of the communication flow arrows indicates which component initiates the communication.
The Db2 Big SQL server or service consists of one Db2 Big SQL head (two heads in an HA configuration) that is
installed on a node called the head node, and multiple Db2 Big
SQL workers that are installed on nodes called worker nodes.
- Db2 Big SQL server
- A general term to describe the Db2 Big SQL software or the
Db2 Big SQL processes. Db2 Big SQL service is a
synonym for Db2 Big SQL server in the context of Db2 Big SQL as a service in the CDP stack.
- Db2 Big SQL head
- The set of Db2 Big SQL processes that accept SQL query
requests from applications and coordinate with Db2 Big SQL
workers to process data and compute the results.
- Db2 Big SQL head node
- The physical or virtual machine (node) on which the Db2
Big SQL head runs.
- Db2 Big SQL worker
- The set of Db2 Big SQL processes that communicate with the
Db2 Big SQL head to access data and compute query results. Db2 Big SQL workers are normally collocated with the HDFS
DataNodes to facilitate local disk access. Db2 Big SQL can
access and process HDFS data that conforms to the most common Hadoop formats, such as ORC, Parquet, and text (TEXTFILE). For more details about the supported data formats, see File formats that are supported by Db2 Big SQL.
The Db2 Big SQL head coordinates the processing of SQL queries
with the workers, which handle most of the HDFS data access and processing.
- Db2 Big SQL worker node
- The physical or virtual machine (node) on which the Db2
Big SQL worker runs.
- Db2 Big SQL scheduler
- A process that runs on the Db2 Big SQL head node. The
scheduler's function is to bridge the RDBMS domain and the Hadoop domain. The scheduler communicates
with the Hive metastore to determine Hive table properties and schemas, and with the HDFS NameNode to
determine the location of file blocks. The scheduler responds to Db2 Big SQL head and worker requests for information about Hadoop data, including HDFS, HBase,
object storage, and Hive metadata. For more information about the scheduler, see Db2 Big SQL scheduler.
- Db2 Big SQL metadata
- HDFS data properties such as name, location, format, and the desired relational schemas. This
metadata, gathered through the scheduler, is used to enable consistent and optimal SQL processing of
queries against HDFS data.
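As a small, hedged illustration of this metadata: Db2 Big SQL exposes Hive metastore information through catalog views in the SYSHADOOP schema, so you can inspect the table properties that the scheduler works with. The view name below is an assumption to verify against your release.

    -- Inspect Hive metastore information for tables known to Db2 Big SQL.
    -- SYSHADOOP.HCAT_TABLES is assumed here; check the catalog views in your release.
    SELECT *
    FROM SYSHADOOP.HCAT_TABLES
    FETCH FIRST 10 ROWS ONLY;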
The following steps represent a simple overview of how Db2 Big SQL processes HDFS data (a short sketch of this flow follows the list):
- Applications connect.
Applications connect to the Db2 Big SQL head on the head node.
- Queries are submitted.
Queries submitted to the Db2 Big SQL head are compiled into optimized parallel execution plans by using the IBM® common SQL engine's query optimizer.
- Plans are distributed.
The parallel execution plans are then distributed to Db2 Big SQL workers on the worker nodes.
- Data is read and written.
Workers have separate local processes, called native HDFS readers and writers, that read or write HDFS data in its stored format. A Db2 Big SQL reader consists of Db2 Big SQL processes that run on the Db2 Big SQL worker nodes and read data at the request of Db2 Big SQL workers. Similarly, a Db2 Big SQL writer consists of Db2 Big SQL processes that run on the worker nodes and write data at the request of the workers. The workers communicate with these native readers and writers to access HDFS data when they process the execution plans that they receive from the Db2 Big SQL head. For more information, see Db2 Big SQL readers and writers.
- Predicates are applied.
Native HDFS readers can apply predicates and project desired
columns to minimize the amount of data that is returned to workers.
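The following query is a minimal sketch of that flow; SALES is a hypothetical Hadoop table, and the comments map each part of the statement to the steps above.

    -- Submitted by an application over its connection to the Db2 Big SQL head,
    -- which compiles it into a parallel plan and distributes that plan to the workers.
    SELECT region,                        -- only the projected columns are returned by the readers
           SUM(amount) AS total_amount
    FROM sales                            -- a Hadoop table whose files reside in HDFS
    WHERE sale_date >= '2021-01-01'       -- a predicate that the native HDFS readers can apply
    GROUP BY region;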
Why Db2 Big SQL?
Db2 Big SQL provides world-class scalability and
performance, and supports the following key use cases:
- Enterprise Data Warehouse (EDW) offloading
- Db2 Big SQL understands commonly used SQL syntax from other vendors and producers. You can offload and consolidate old data more quickly and easily from existing Oracle, IBM Db2, and IBM Netezza® enterprise data warehouses or data marts while preserving most of the SQL from those platforms (see the offload sketch after this list).
- Federated access to relational data
- For data that can't be moved to Hadoop, Db2 Big SQL provides federated access to many relational database management system (RDBMS) sources outside of Hadoop by using IBM Fluid Query technology, and to NoSQL databases by using Spark connectors. You can use a single database connection to access data across Hadoop and dozens of relational and NoSQL database types, whether they are in the cloud, on local systems, or both. Wherever the data resides, Db2 Big SQL offers data virtualization and enables querying disparate sources in a single query.
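As a minimal sketch of federated access (and of EDW offloading over the same connection), the following statements register a remote Db2 data source, create a nickname for one of its tables, and then use it in queries. The wrapper, server, option values, and table names are illustrative assumptions, and the exact options depend on your federation configuration.

    -- Register a remote Db2 enterprise data warehouse as a federated data source.
    CREATE WRAPPER DRDA;
    CREATE SERVER edw_server TYPE DB2/UDB VERSION '11.5' WRAPPER DRDA
      OPTIONS (HOST 'edw.example.com', PORT '50000', DBNAME 'EDWDB');
    CREATE USER MAPPING FOR USER SERVER edw_server
      OPTIONS (REMOTE_AUTHID 'edwuser', REMOTE_PASSWORD 'edwpwd');

    -- A nickname makes the remote table look like a local table.
    CREATE NICKNAME edw_sales FOR edw_server.EDW.SALES;

    -- A single query can join remote EDW data with a Hadoop table (store_regions).
    SELECT r.region, SUM(s.amount) AS total_amount
    FROM edw_sales s
    JOIN store_regions r ON s.store_id = r.store_id
    GROUP BY r.region;

    -- Offloading: copy older EDW rows into a Hadoop table (sales_archive).
    INSERT INTO sales_archive
      SELECT * FROM edw_sales
      WHERE sale_date < '2015-01-01';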
Db2 Big SQL also boasts:
- Elastic boost technology to support more granular resource usage and increase performance
without increasing memory or CPU
- High-performance scans, inserts, updates, and deletes
- Deeper integration with Spark 2.1 than other SQL-on-Hadoop technologies. Deeper integration with Spark 2.4 requires the Db2 Big SQL 7.28 and later service on IBM Cloud Pak® for Data 4.5 and later.
- Machine learning or graph analytics with Spark, under a single security model
- Open Data Platform initiative (ODPi) compliance
- Advanced, ANSI-compliant SQL queries