What's the big deal about Big SQL?
Introducing relational DBMS users to IBM's SQL technology for Hadoop
Big data: It's a big deal for IT professionals who analyze and manage information. But some professionals struggle to learn how to work with big data because one of the most popular big data platforms — Apache Hadoop — introduces a wealth of new technologies, including new query and scripting languages.
Enter Big SQL, IBM's SQL interface to its Hadoop-based platform, InfoSphere
BigInsights. Big SQL is designed to provide SQL developers with an easy
on-ramp for querying data managed by Hadoop. It enables data
administrators to create new tables for data stored in Hive, HBase, or
their BigInsights distributed file system. In addition, a
LOAD command enables administrators to populate Big SQL
tables with data from various sources. And Big SQL's JDBC and ODBC drivers
enable many existing tools to use Big SQL to query this distributed data.
Yet Big SQL doesn't turn Hadoop into one big distributed relational database. If you're curious about what Big SQL can do, we'll explore the basics here, try to clear up some common misconceptions, and answer many of the questions that relational DBMS users often have about this new technology.
The big picture about Big SQL
Big SQL is a software layer that enables IT professionals to create tables and query data in BigInsights using familiar SQL statements. To do so, programmers use standard SQL syntax and, in some cases, SQL extensions created by IBM to make it easy to exploit certain Hadoop-based technologies. You'll learn more about these topics later.
To give you an idea of what Big SQL is all about, Figure 1 illustrates its architecture and how it fits into the BigInsights Enterprise Edition 2.1 platform. (If you're not familiar with BigInsights, skim through the introductory article listed in the Related topics section.)
Figure 1. Big SQL architecture
As shown at the top, Big SQL supports JDBC and ODBC client access from
Linux® and Windows® platforms. In addition, the Big SQL
LOAD command can read data directly from several relational
DBMS systems (IBM PureData™ Systems for Analytics powered by
Netezza technology, DB2®, and Teradata) as well as from files stored
locally or within the BigInsights distributed file system. BigInsights EE
2.1 can be configured to support the Hadoop Distributed File System (HDFS)
or IBM's General Parallel File System with the File Placement Optimizer (GPFS-FPO).
The SQL query engine supports joins, unions, grouping, common table expressions, windowing functions, and other familiar SQL expressions. Furthermore, you can influence the data access strategy for your queries with optimization hints and configuration options. Depending on the nature of your query, your data volumes, and other factors, Big SQL can use Hadoop's MapReduce framework to process various query tasks in parallel or execute your query locally within the Big SQL server on a single node — whichever may be most appropriate for your query.
Organizations interested in Big SQL often have considerable SQL skills in-house, as well as a suite of SQL-based business intelligence applications and query/reporting tools. The idea of being able to leverage existing skills and tools — and perhaps reuse portions of existing applications — can be quite appealing to organizations new to Hadoop. Indeed, some companies with large data warehouses built on relational DBMS systems are looking to Hadoop-based platforms as a potential target for offloading "cold" or infrequently used data in a manner that still allows for query access. In other cases, organizations turn to Hadoop to analyze and filter non-traditional data (such as logs, sensor data, social media posts, etc.), ultimately feeding subsets or aggregations of this information to their relational warehouses to extend their view of products, customers, or services.
In such situations and others, Big SQL can play an important role. However,
it's inappropriate to think of Big SQL as a replacement for relational
DBMS technology. Big SQL was designed to complement and leverage the
Hadoop-based infrastructure in BigInsights. Certain features common to
relational DBMS systems aren't present in Big SQL, and certain Big SQL
features aren't present in most relational DBMS systems. For example, Big
SQL supports querying data, but does not support SQL
UPDATE or DELETE statements; INSERT statements are
supported only for HBase tables. Big SQL tables may contain columns of
complex data types such as array and struct
rather than simply "flat" rows. Furthermore, several underlying storage
mechanisms are supported, including:
- Delimited files (such as comma-separated files) stored in HDFS or GPFS-FPO
- Hive tables in sequence file format, RCFile format, etc. (Hive is Hadoop's data warehouse implementation)
- HBase tables (HBase is Hadoop's key-value or column-based data store)
Let's explore Big SQL in greater detail so you can get a better idea of its capabilities.
Working with Big SQL
BigInsights includes several tools and interfaces for Big SQL that are largely comparable to tools and interfaces you'll find with most relational DBMS systems. As mentioned, Big SQL offers JDBC and ODBC support to Java™ technology, C, and C++ application developers through a JDBC Type 4 driver and a 32- or 64-bit ODBC driver. These Big SQL drivers include support for popular capabilities, such as prepared statements, database metadata APIs, and cancellation of an actively running statement.
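To sketch what client access through these drivers looks like, here's a small Python example that assembles an ODBC connection string for a Big SQL server. The driver name, host, port, and credentials shown are illustrative assumptions, not values from this article; check your driver installation for the actual keywords it accepts.

```python
# Sketch: building an ODBC connection string for Big SQL from Python.
# The driver name "IBM Big SQL", the host, the port, and the credentials
# below are placeholder assumptions for illustration only.

def bigsql_odbc_conn_str(host, port, user, password, database="default"):
    """Assemble a key=value;... ODBC connection string."""
    parts = {
        "DRIVER": "{IBM Big SQL}",
        "HOSTNAME": host,
        "PORT": str(port),
        "DATABASE": database,
        "UID": user,
        "PWD": password,
    }
    return ";".join(f"{k}={v}" for k, v in parts.items())

conn_str = bigsql_odbc_conn_str("bigdata.example.com", 7052, "biadmin", "secret")
# With the pyodbc package and an ODBC driver installed, the string would
# then be used roughly like this:
#   import pyodbc
#   conn = pyodbc.connect(conn_str)
#   rows = conn.cursor().execute("SELECT ...").fetchall()
print(conn_str)
```

From here, prepared statements, metadata APIs, and statement cancellation are exposed through the driver's standard ODBC (or JDBC) interfaces rather than anything Big SQL-specific.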
In addition, the BigInsights Eclipse plug-in enables Java developers to create, test, and refine Big SQL queries and applications. Figure 2 illustrates some aspects of this plug-in, including a JDBC server connection for Big SQL (shown in the foreground) and the results of a Big SQL test run (shown in the lower-right pane).
Figure 2. The BigInsights Eclipse plug-in includes support for Big SQL development
For interactive invocation of Big SQL queries, BigInsights offers a command-line interface (the JSqsh shell) and a web-based interface (accessible through the BigInsights web console). Such tools can be useful for scripting activities as well as prototyping work.
A variety of IBM and non-IBM software that supports JDBC and ODBC data sources can also be configured to work with Big SQL. As an example, Cognos Business Intelligence uses Big SQL's JDBC interface to query data, generate reports, and perform other analytical functions.
Like many other components in BigInsights, Big SQL is a service that administrators launch (or stop) as needed through the web console or a command window.
Creating tables and loading data
As you might imagine, working with Big SQL requires someone to create
tables and populate them with data. And, as you might imagine, Big SQL
provides a CREATE TABLE statement and a LOAD
command to let you do just that. While the basic syntax of these will
probably look familiar to you, there are some aspects of table creation
and data loading that probably won't. That's because they're designed to
exploit certain Hadoop-based technologies.
Consider the example shown in Listing 1. (If you're familiar with the sample GOSALES data warehouse database provided with Cognos Business Intelligence 10.2, you'll recognize this example as a simplified version of one of its tables. See Related topics for a link to details on the Cognos database samples.)
Listing 1. Creating a Big SQL table and loading it with data from a local file
create table mygosales.product_brand_lookup
( product_brand_code int,
  product_brand_en varchar(180) )
row format delimited
fields terminated by '\t';

load hive data local inpath '/home/user1/data/product.tsv'
overwrite into table mygosales.product_brand_lookup;
The CREATE TABLE statement creates a Hive table with two
columns; the first captures a numeric code that serves as the identifier of a
product brand, while the second captures an English-language description of
the brand. The final line of this statement specifies the way in which the
table will store (and expect) input data: in a row format with
tab-delimited fields. (As an aside, Big SQL shares its catalog, and
therefore its table definitions, with Hive. So when we use the term "Hive
table," we also mean a Big SQL table. With a few exceptions, a table
created through Big SQL can be queried by Hive and vice versa.)
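Because the table expects tab-delimited rows, the file fed to LOAD is just plain text with one record per line. The following Python sketch writes and reads data in that shape; the sample brand values are invented for illustration, and an in-memory buffer stands in for the product.tsv file.

```python
import csv
import io

# Two-column records matching the product_brand_lookup definition:
# an integer brand code and an English brand name (values are made up).
brands = [(7001, "TrailChef"), (7002, "Canyon Mule")]

# Write tab-delimited rows, as "row format delimited fields terminated
# by '\t'" expects. The buffer stands in for /home/user1/data/product.tsv.
buf = io.StringIO()
writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
writer.writerows(brands)

# Reading the same bytes back recovers the two fields of each record.
rows = [tuple(r) for r in csv.reader(io.StringIO(buf.getvalue()), delimiter="\t")]
print(rows)  # [('7001', 'TrailChef'), ('7002', 'Canyon Mule')]
```

Any process that emits records in this layout, whether a script like this or an upstream application, produces data that the LOAD statement in Listing 1 can ingest directly.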
Listing 1 illustrates only one of several possible storage
options for Big SQL tables. For example, you could create an externally
managed Hive table in which the data would reside in your distributed file
system outside a Hive database. Consequently, dropping such a table from
Big SQL will only delete the metadata, not the actual data. Or you could
use HBase as your underlying storage manager, in which case your
CREATE TABLE statement would include syntax for mapping
between SQL columns and HBase columns, including column families and a row
key. If you're not familiar with Hive or HBase, see Related topics.
Let's turn briefly to the
LOAD statement in Listing 1. Here, we've provided the full path to a file in
the local file system we want loaded into the table. Given our table
definition, each record in this file must contain two fields (an integer
and a string) delimited by
\t (the tab character). The
OVERWRITE clause instructs Big SQL to replace the contents of
the table with the data contained in the file. Although not shown in Listing 1, you can eliminate the need to load data into your
table if the data already exists in your BigInsights distributed file
system. To do so, you would just create an external table with a
LOCATION clause that specifies where your existing data resides.
In addition, there are a number of options for creating tables and loading data that affect database design and application use. For example, Hadoop-based systems are often used to store various types of semi-structured data, such as JSON data. Developers working directly with Hive commonly rely on custom serializers/deserializers (SerDes) to read and write this data. Consequently, Big SQL also supports custom SerDes and enables you to specify the required SerDe class name for your table at creation time.
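To make the SerDe idea concrete, here's a small Python sketch of the deserialization half of the job: mapping each JSON record in a file to the columns of a table, tolerating missing attributes. The record layout and column names are invented for illustration; a real SerDe is a Java class plugged into the Hive/Big SQL read path, not a script like this.

```python
import json

# One JSON document per line, a common layout for semi-structured
# data in Hadoop. These sample records are illustrative only.
raw_lines = [
    '{"brand_code": 7001, "brand_name": "TrailChef", "tags": ["camping"]}',
    '{"brand_code": 7002, "brand_name": "Canyon Mule"}',
]

# The "deserialize" step a JSON SerDe performs: turn each record into a
# row whose fields line up with the table's columns, yielding NULL (None)
# for attributes a record doesn't carry.
columns = ("brand_code", "brand_name", "tags")
rows = [tuple(json.loads(line).get(col) for col in columns) for line in raw_lines]

print(rows[0])  # (7001, 'TrailChef', ['camping'])
print(rows[1])  # (7002, 'Canyon Mule', None)
```

Specifying the SerDe class name at CREATE TABLE time tells Big SQL to route reads and writes for that table through logic of exactly this kind.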
Many of the data types common to relational DBMS systems are supported by
Big SQL, including some data types not supported directly by Hive or
HBase. Big SQL data types include various forms of numbers (such as
integers, decimal, float, and double), string data
(CHAR(length) and VARCHAR(length)), and
TIMESTAMP. Other forms of date/time data should be stored as
TIMESTAMP values or a string type. Large objects and
VARGRAPHIC types aren't supported.
Views and user-defined constraints — both common in relational
databases — aren't supported in this release. Referential integrity
constraints and domain-specific constraints should be enforced at the
application level. Instead of using SQL GRANT and
REVOKE statements to restrict data access, administrators use
standard Hadoop commands to specify file system access privileges for Hive
data. Consequently, you should think of privileges at a table level rather
than a row or column level.
So what's this all mean to an experienced relational database administrator or designer? If you'd like to create a simple test database in Hadoop, you can do that pretty easily with Big SQL. Indeed, the Quick Start edition of BigInsights includes scripts and a step-by-step tutorial for creating a version of the GOSALES data warehouse database (see Related topics). But designing a production database in BigInsights means you'll need to understand certain Hadoop-based technologies so that you can implement your design in a way that's appropriate for the platform.
Once you have at least one Big SQL table defined, you can query it using
syntax that conforms to the SQL standard. Big SQL supports
SELECT statements that project, restrict, join, union, order,
and group data. Sub-queries and common table expressions (queries that
start with a
WITH clause) are also supported. Dozens of
built-in functions are provided, including some specifically for Hive
compatibility. Windowing functions are also supported. SQL programmers can
also limit the number of rows returned by a given query, if desired.
Listing 2 illustrates a query that joins data from four Big SQL tables, yielding a result set with three columns and limiting the number of returned rows to 50.
Listing 2. Sample Big SQL join query
SELECT pnumb.product_name, sales.quantity, meth.order_method_en
FROM gosalesdw.sls_sales_fact sales,
     gosalesdw.sls_product_dim prod,
     gosalesdw.sls_product_lookup pnumb,
     gosalesdw.sls_order_method_dim meth
WHERE pnumb.product_language='EN'
  AND sales.product_key=prod.product_key
  AND prod.product_number=pnumb.product_number
  AND meth.order_method_key=sales.order_method_key
LIMIT 50;
As mentioned, Big SQL supports complex data types, namely
array and struct. Perhaps you're curious about
how you'd query tables containing columns of such types. Let's use the
example in Listing 3 to explore this a bit.
To begin, Listing 3 creates an EMPLOYEES table with three
columns, two of which are based on complex data types. The PHONE column
stores an array of phone numbers, while the ADDRESS column stores a
structure containing four fields that comprise a USA mailing address,
including the street name, city name, state name, and ZIP code. A colon
(:) separates items in the PHONE and ADDRESS columns.
After loading some sample data, shown as a comment in the listing, Big SQL users can query this data using simple array notations for the PHONE data and path expressions for the ADDRESS data. The final statement in Listing 3 retrieves the name, first phone number, and ZIP code of all employees.
Listing 3. Working with complex data types
CREATE TABLE employees (
  name VARCHAR(100),
  phones ARRAY<VARCHAR(12)>,
  address STRUCT<street:VARCHAR(100), city:VARCHAR(100),
                 state:VARCHAR(2), zip:VARCHAR(5)>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY ':';

/* table data:
Uttam,408-111-1111:408-222-2222,555 Bailey Ave:San Jose:CA:95141
Cindy,408-333-3333:408-444-4444,555 Bailey Ave:San Jose:CA:95141
*/
LOAD ... ;

SELECT name, phones[0], address.zip FROM employees;
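The delimiters in the table definition determine exactly how each text record is split: commas separate top-level fields, and colons separate items inside a collection. This Python sketch mimics that two-level parsing for one of the sample employee records; it illustrates the file format only and isn't Big SQL's actual reader.

```python
def parse_employee(record):
    """Split a record the way the table definition dictates:
    ',' between columns, ':' between collection items."""
    name, phones, addr = record.split(",")
    street, city, state, zip_code = addr.split(":")
    return {
        "name": name,
        "phones": phones.split(":"),                   # ARRAY<VARCHAR(12)>
        "address": {"street": street, "city": city,    # STRUCT<...>
                    "state": state, "zip": zip_code},
    }

emp = parse_employee("Uttam,408-111-1111:408-222-2222,555 Bailey Ave:San Jose:CA:95141")
# The final query in Listing 3 corresponds to picking out these pieces:
print(emp["name"], emp["phones"][0], emp["address"]["zip"])
# Uttam 408-111-1111 95141
```

The array subscript and path expressions in the SQL query (phones[0], address.zip) are simply declarative ways of reaching into the same nested structure.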
As you might imagine, we can't cover all Big SQL query capabilities here. For details, consult the BigInsights Information Center (see Related topics). However, one query topic we'll discuss in a moment involves query optimization hints.
For now, let's step back and consider some other aspects of Big SQL. As we
mentioned earlier, SQL UPDATE and DELETE statements
aren't supported in BigInsights 2.1, and INSERT statements
are supported only for HBase tables. To populate Big SQL tables, you can
load data directly from supported relational DBMS systems or from files.
In addition, you can also create new tables based on Big SQL query result
sets by issuing a
CREATE TABLE AS ... SELECT ... FROM ... statement.
Traditional transaction management isn't part of the Hadoop ecosystem, so
Big SQL operates without transactions or lock management. This implies
that commit and rollback operations aren't supported, and certain
concurrent operations can result in an application or query error. (For
example, if someone issues a
LOAD . . . OVERWRITE command for
a table that's actively being queried, the query will receive an error.)
Since performance is frequently a concern when working with information management systems, we'll briefly discuss some performance considerations for Big SQL. Big SQL divides execution of a query into parts, such as join, group-by, etc. Depending on the query, the amount of data, configuration settings, and other factors, Big SQL can execute these parts serially or in parallel. Parallelism is achieved by leveraging Hadoop's MapReduce framework. As you might expect, a single query may spawn multiple MapReduce jobs. The MapReduce framework executes each job in parallel using multiple mappers or reducers (tasks). This can be quite beneficial for complex queries over large data sets.
However, launching a MapReduce job involves a certain amount of processing overhead. This overhead can outweigh the benefits of parallel processing for certain types of queries, such as those that operate over small data sets or retrieve data associated with a specific HBase row key. In such cases, it's better for the query to be executed serially on a single node. This is sometimes referred to as "local" query execution, and it's something that Big SQL supports in addition to MapReduce parallelism.
Commercial relational DBMS systems employ sophisticated cost-based
optimizers to select an efficient data access strategy for their queries,
consulting extensive statistics related to table size, data skew, etc. Big
SQL's query optimizer also consults certain statistics dynamically to
determine an efficient data access strategy. However, in some cases, Big
SQL may not have sufficient statistics available. For example, its
underlying data source may not provide such information. In such cases, it
can be helpful for Big SQL programmers to embed optimization hints into
their queries, as doing so can enable Big SQL to generate a better
execution plan. Hints can relate to the query execution mode (local or
parallel), join method, index usage, etc. In Big SQL, query hints take the
form /*+ name=value[, name=value ...] +*/.
Consider the query shown in Listing 4, which joins a fact
table (FACT) and a dimension table (DIM). Note that there is also a
restriction on the rows of interest from the dimension table. Given that
dimension tables are typically much smaller than fact tables and that our
query is interested in only a subset of rows in the dimension table, it's
prudent for us to supply a hint to Big SQL indicating that the dimension
table is "small." This hint appears immediately after the reference to the
table in the
FROM clause. Supplying this hint helps Big SQL
select an appropriate join methodology for this query in which the
qualifying rows will be brought into memory and joined with the data from
the fact table.
Listing 4. Embedding a hint in a query
select *
from fact, dim /*+ tablesize='small' +*/
where dim.dim_id = fact.dim_id
  and dim.id < 100;
Since we mentioned indices earlier, let's discuss them briefly. Big SQL
enables you to create secondary indices for HBase using a
CREATE INDEX statement. As you might imagine, these indices
can improve the runtime performance of queries that filter on indexed
columns. HBase indices can be based on a single or composite key, and
using Big SQL to insert data or load data from a file into an HBase table
will automatically update its indices. However, in BigInsights 2.1,
loading data from a remote relational database into an HBase table will
not automatically update its secondary indices. Instead, an administrator
needs to drop and re-create the necessary indices.
In Hadoop, programmers or administrators often modify job properties to tune runtime performance. While Big SQL tries to choose optimal properties, you can override these properties at the server or query level, if needed. For example, the first statement in Listing 5 modifies a particular MapReduce job property for a given query connection. It specifies that each subsequent query will use one reduce task per 100 MB of table data. Later in the listing, this setting is overridden, causing subsequent queries for the connection to use one reduce task per 200 MB of table data.
Listing 5. Setting a MapReduce job property
SET bigsql.reducers.byte.per.reducer = 104857600;
-- All the MapReduce jobs spawned by all future queries
-- in this connection will use 1 reducer per 100MB of table data
SELECT ... ;

SET bigsql.reducers.byte.per.reducer = 209715200;
-- Now all the MapReduce jobs spawned by all future queries
-- in this connection will use 1 reducer per 200MB of table data
SELECT ... ;
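The arithmetic behind that property is straightforward: the number of reduce tasks scales with the table size divided by the bytes-per-reducer setting. The Python sketch below illustrates that relationship; the ceiling rounding and one-task minimum are assumptions for illustration, not Big SQL's documented formula.

```python
import math

def reducers_for(table_bytes, bytes_per_reducer):
    """Estimate reduce tasks: one per bytes_per_reducer of input,
    and at least one task overall (rounding is an assumption)."""
    return max(1, math.ceil(table_bytes / bytes_per_reducer))

one_gb = 1024 ** 3
print(reducers_for(one_gb, 104857600))  # 100 MB per reducer -> 11 tasks for 1 GB
print(reducers_for(one_gb, 209715200))  # 200 MB per reducer -> 6 tasks for 1 GB
```

Doubling the bytes-per-reducer value roughly halves the task count, which is why raising it reduces per-task overhead at the cost of less parallelism.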
Big SQL brings an industry-standard query interface to IBM's Hadoop-based platform: InfoSphere BigInsights. While Big SQL doesn't turn BigInsights into a relational DBMS, it does provide experienced SQL users with a familiar on-ramp to an increasingly popular environment for analyzing and storing big data.
In this article, we introduced the basics of Big SQL and highlighted some areas in which BigInsights (with its Big SQL support) differs from a traditional relational DBMS. If you're eager to learn more, download the Quick Start edition of BigInsights and follow the "Developing Big SQL queries to analyze big data" tutorial provided with the product's Information Center, or see the tutorial in the InfoSphere BigInsights tutorial collection (see Related topics).
Thanks to the following people who contributed materials to this article or helped to review it, in alphabetical order: Bruce Brown, Seeling Cheung, Scott Gray, Mark Hager, Ellen Patterson, and Bert Van der Linden.
- Download a free native software installation copy of InfoSphere BigInsights 2.1 Quick Start Edition (sign-in required).
- Download a free VMware image of InfoSphere BigInsights 2.1 Quick Start Edition (sign-in required).
- Get a technical introduction to Big SQL on Slideshare.
- Learn more about BigInsights 2.1 by accessing the BigInsights Information Center.
- Read "Understanding InfoSphere BigInsights" to learn more about the product's architecture and underlying technologies.
- Watch BigInsights experts discuss the technology, give demos, and answer common questions on the IBM big data channel on YouTube.
- Enroll in free online courses at Big Data University.
- Learn more about HBase and Hive at Apache.org.
- Get familiar with the Cognos sample GOSALES databases by accessing the product's Information Center.