IBM BigInsights

IBM Big SQL

Big SQL for Version 4.0 and later, is a massively parallel processing (MPP) SQL engine that deploys directly on the physical Hadoop Distributed File System (HDFS) cluster.

This SQL engine pushes processing down to the same nodes that hold the data. Big SQL uses a low-latency parallel execution infrastructure that accesses Hadoop data natively for reading and writing.

Scenarios

A typical scenario for the use of Big SQL includes the following steps:
  1. Download and install the IBM® Open Platform with Apache Hadoop.
  2. Verify the configurations for the Big SQL component in your particular environment.
  3. Authorize users to access data sources.
  4. Monitor system performance with Big SQL monitoring.
  5. Complete the tutorial to learn some of the basic query methods and query environments. Create Hadoop tables, Create HBase tables, and Load data into those tables.
  6. Develop and administer additional applications using Big SQL. See the Big SQL reference for Big SQL syntax.
  7. Troubleshoot Big SQL applications.

Metadata

Big SQL uses the Hive database catalog (HCatalog) for table definitions, location, storage format and encoding of input files. This Big SQL catalog resides on the head node.

As long as the data is defined in the Hive Metastore and accessible in the Hadoop cluster, Big SQL can get to it. Big SQL stores some of the metadata from the Hive catalog locally for ease of access and to facilitate query execution.

Concurrency and workload management

Big SQL uses automatic memory tuning to use memory efficiently and to prevent out-of-memory exceptions. As part of the BigInsights installation, you can choose how much memory to make available to use for SQL processing. Big SQL optimizes the SQL processing within these limits. It manages concurrent workload automatically by balancing resources across running queries. Memory resources are distributed across shared buffer pools.

Big SQL monitors metrics on resource consumption, and reacts to changing conditions. Big SQL uses a self-tuning memory management infrastructure that tunes memory areas based on the size of the physical host on which the worker node (data partition) is running. You do not need to manage and tune each worker node for performance and concurrency.

The workload management tools provide admission control for various classes of concurrent queries. You can configure the system by defining a workload management class to limit the number of concurrent heavy analytical SQL requests. By configuring your system this way, you can avoid over-stressing the CPU, memory resources, and disk I/O.

System flexibility

Because Big SQL can maintain the status of the work nodes, and the scheduler service can drive scattered partitioning, the system can distribute query execution to only those nodes that are available. If you add or remove worker nodes, the scheduler detects this change and adjusts its assignments.

Common client drivers

Big SQL uses the IBM Data Server client drivers. This driver package uses the same standards-compliant JDBC, JCC, ODBC, CLI and .NET drivers that are used in other IBM software products, such as DB2 for Linux, UNIX and Windows, DB2 for z/OS, and IBM Informix® database software. Because the same driver is shared across these platforms means that other languages that already use these drivers (such as Ruby, Perl, Python and PHP) can interact with Big SQL with no additional custom configuration or drivers. Therefore, applications can interact between traditional database management systems or data warehouse systems and Big SQL.

Scheduler

The Big SQL scheduler is a service that acts as a liaison between the SQL processes and Hadoop. It provides a number of important services:
  • It interfaces with the Hive metastore to get basic information, such as column names and data types.
  • It eliminates partitions that are not relevant for a query because it reads the predicates that are pushed down from the compiler.
  • It schedules work by knowing where and how data is stored on Hadoop. This information ensures that work is processed efficiently, as close to the data as possible. The scheduler can take into account the worker node load and user other hints for query plan distribution.

SQL capabilities

The Big SQL query capabilities, which follow the specifications of the SQL:2011 language standard, include significant levels of SQL PL compatibility, including stored procedures, SQL-bodied functions, and a rich library of scalar, table and online analytical processing (OLAP) functions. It contains an SQL compatibility between database platforms and maintains interoperability with established BigInsights tools. The following SQL features are included in Big SQL:
  • The only limit to the size of the queries, groups, and sorting is the disk capacity of the cluster.

    Big SQL uses in-memory caching, and can spill large data sets to the local disk at each node that is processing a query.

  • You can use subqueries anywhere that an expression can be used. Subqueries can be correlated or uncorrelated.
  • You can use table expressions, such as common table expressions, built-in or user-defined table functions, VALUES expressions, and lateral joins.
  • You can perform all valid SQL standard join operations, group sets, and union operations.
  • You can perform all of the standard OLAP specifications, for windowing and analytic functions.
  • Big SQL supports scalar functions, table functions, and procedures.

Federation

Big SQL supports federation to many data sources, which allows users to send distributed requests to multiple data sources within a single SQL statement.

To end users and client applications, data sources appear as a single collective group in the Big SQL server. Users and applications interface with the Big SQL server to access the data. The Big SQL server processes SQL statements as if the data from all sources were ordinary tables or views within the Big SQL server.



Feedback