Big data: It's a big deal for IT professionals who analyze and manage information. But some professionals struggle to learn how to work with big data because one of the most popular big data platforms — Apache Hadoop — introduces a wealth of new technologies, including new query and scripting languages.
Enter Big SQL, IBM's SQL interface to its Hadoop-based platform, InfoSphere
BigInsights. Big SQL is designed to provide SQL developers with an easy
on-ramp for querying data managed by Hadoop. It enables data
administrators to create new tables for data stored in Hive, HBase, or
their BigInsights distributed file system. In addition, a
LOAD command enables administrators to populate
Big SQL tables with data from various sources. And Big SQL's JDBC and ODBC
drivers enable many existing tools to use Big SQL to query this
Yet Big SQL doesn't turn Hadoop into one big distributed relational database. If you're curious about what Big SQL can do, we'll explore the basics here, try to clear up some common misconceptions, and answer many of the questions that relational DBMS users often have about this new technology.
The big picture about Big SQL
Big SQL is a software layer that enables IT professionals to create tables and query data in BigInsights using familiar SQL statements. To do so, programmers use standard SQL syntax and, in some cases, SQL extensions created by IBM to make it easy to exploit certain Hadoop-based technologies. You'll learn more about these topics later.
To give you an idea of what Big SQL is all about, Figure 1 illustrates its architecture and how it fits into the BigInsights Enterprise Edition 2.1 platform. (If you're not familiar with BigInsights, skim through the introductory article listed in the Resources section.)
Figure 1. Big SQL architecture
As shown at the top, Big SQL supports JDBC and ODBC client access
from Linux® and Windows® platforms. In addition, the Big SQL
LOAD command can read data directly
from several relational DBMS systems (IBM PureData™ Systems for Analytics
powered by Netezza technology, DB2®, and Teradata) as well as from
files stored locally or within the BigInsights distributed file
system. BigInsights EE 2.1 can be configured to support the
Hadoop Distributed File System (HDFS) or IBM's General Parallel
File System with the File Placement Optimizer (GPFS-FPO).
The SQL query engine supports joins, unions, grouping, common table expressions, windowing functions, and other familiar SQL expressions. Furthermore, you can influence the data access strategy for your queries with optimization hints and configuration options. Depending on the nature of your query, your data volumes, and other factors, Big SQL can use Hadoop's MapReduce framework to process various query tasks in parallel or execute your query locally within the Big SQL server on a single node — whichever may be most appropriate for your query.
Organizations interested in Big SQL often have considerable SQL skills in-house, as well as a suite of SQL-based business intelligence applications and query/reporting tools. The idea of being able to leverage existing skills and tools — and perhaps reuse portions of existing applications — can be quite appealing to organizations new to Hadoop. Indeed, some companies with large data warehouses built on relational DBMS systems are looking to Hadoop-based platforms as a potential target for offloading "cold" or infrequently used data in a manner that still allows for query access. In other cases, organizations turn to Hadoop to analyze and filter non-traditional data (such as logs, sensor data, social media posts, etc.), ultimately feeding subsets or aggregations of this information to their relational warehouses to extend their view of products, customers, or services.
In such situations and others, Big SQL can play an important role.
However, it's inappropriate to think of Big SQL as a replacement for
relational DBMS technology. Big SQL was designed to complement and
leverage the Hadoop-based infrastructure in BigInsights. Certain features
common to relational DBMS systems aren't present in Big SQL, and certain Big SQL
features aren't present in most relational DBMS systems. For example, Big SQL
supports querying data, but does not support SQL
INSERT statements are supported
only for HBase tables. Big SQL tables may contain columns of complex data
types such as
array, rather than
simply "flat" rows. Furthermore, several underlying storage mechanisms are
- Delimited files (such as comma-separated files) stored in HDFS or GPFS-FPO
- Hive tables in sequence file format, RCFile format, etc. (Hive is Hadoop's data warehouse implementation)
- HBase tables (HBase is Hadoop's key-value or column-based data store)
Let's explore Big SQL in greater detail so you can get a better idea of its capabilities.
Working with Big SQL
BigInsights includes several tools and interfaces for Big SQL that are largely comparable to tools and interfaces you'll find with most relational DBMS systems. As mentioned, Big SQL offers JDBC and ODBC support to Java™ technology, C, and C++ application developers through a JDBC Type 4 driver and a 32- or 64-bit ODBC driver. These Big SQL drivers include support for popular capabilities, such as prepared statements, database metadata APIs, and cancellation of an actively running statement.
In addition, the BigInsights Eclipse plug-in enables Java developers to create, test, and refine Big SQL queries and applications. Figure 2 illustrates some aspects of this plug-in, including a JDBC server connection for Big SQL (shown in the foreground) and the results of a Big SQL test run (shown in the lower-right pane).
Figure 2. The BigInsights Eclipse plug-in includes support for Big SQL development
For interactive invocation of Big SQL queries, BigInsights offers a command-line interface (the JSqsh shell) and a web-based interface (accessible through the BigInsights web console). Such tools can be useful for scripting activities as well as prototyping work.
A variety of IBM and non-IBM software that supports JDBC and ODBC data sources can also be configured to work with Big SQL. As an example, Cognos Business Intelligence uses Big SQL's JDBC interface to query data, generate reports, and perform other analytical functions.
Like many other components in BigInsights, Big SQL is a service that administrators launch (or stop) as needed through the web console or a command window.
Creating tables and loading data
As you might imagine, working with Big SQL requires someone to create
tables and populate them with data. And, as you might imagine, Big SQL
CREATE TABLE statement and a
LOAD command to let you do just that. While the
basic syntax of these will probably look familiar to you, there are some
aspects of table creation and data loading that probably won't. That's
because they're designed to exploit certain Hadoop-based technologies.
Consider the example shown in Listing 1. (If you're familiar with the sample GOSALES data warehouse database provided with Cognos Business Intelligence 10.2, you'll recognize this example as a simplified version of one of its tables. See Resources for a link to details on the Cognos database samples.)
Listing 1. Creating a Big SQL table and loading it with data from a local file
create table mygosales.product_brand_lookup ( product_brand_code int, product_brand_en varchar(180) ) row format delimited fields terminated by '\t'; load hive data local inpath '/home/user1/data/product.tsv' overwrite into table mygosales.product_brand_lookup;
CREATE TABLE statement creates a Hive table
with two columns; the first captures a numeric code to serve as the
identifier of a product brand, while the second captures an English-based
description of the brand. The final line of this statement specifies the
way in which the data will store (and expect) input data: in a row format
with tab-delimited fields. (As an aside, Big SQL shares its catalog with
Hive and the table definitions. So when we use the term "Hive
table," we also mean a Big SQL table. With a few exceptions, a table
created through Big SQL can be queried by Hive and vice-versa.)
Listing 1 illustrates only one of several possible storage options for Big
SQL tables. For example, you could create an externally managed Hive table
in which the data would reside in your distributed file system outside
a Hive database. Consequently, dropping such a table from Big SQL will
only delete the metadata, not the actual data. Or you could use HBase as
your underlying storage manager, in which case your
CREATE TABLE statement would include syntax for
mapping between SQL columns and HBase columns, including column families
and a row key. If you're not familiar with Hive or HBase, see
Let's turn briefly to the
LOAD statement in
Listing 1. Here, we've provided the full path to a file in the local file
system we want loaded into the table. Given our table definition,
each record in this file must contain two fields (an integer and a string)
\t (the tab character). The
OVERWRITE clause instructs Big SQL to replace
the contents of the table with the data contained in the file. Although
not shown in Listing 1, you can eliminate the need to load data into your
table if the data already exists in your BigInsights distributed file
system. To do so, you would just create an external table with a
LOCATION clause that specifies where your
existing data resides.
In addition, there are a number of options for creating tables and loading data that affect database design and application use. For example, Hadoop-based systems are often used to store various types of semi-structured data, such as JSON data. Developers working directly with Hive commonly rely on custom serializers/deserializers (SerDes) to read and write this data. Consequently, Big SQL also supports custom SerDes and enables you to specify the required SerDe class name for your table at creation time.
Many of the data types common to relational DBMS systems are supported by Big SQL,
including some data types not supported directly by Hive or HBase. Big SQL
data types include various forms of numbers (such as integers, decimal,
float, and double), string data
CHAR(length), VARCHAR(length), and
TIMESTAMP. Other forms of date/time data
should be stored as
TIMESTAMP values or a
string type. Large objects and
Views and user-defined constraints — both common in relational databases —
aren't supported in this release. Referential integrity constraints and
domain-specific constraints should be enforced at the application level.
Instead of using
REVOKE statements to restrict data access,
administrators use standard Hadoop commands to specify file system access
privileges for Hive data. Consequently, you should think of privileges at
a table level rather than a row or column level.
So what's this all mean to an experienced relational database administrator or designer? If you'd like to create a simple test database in Hadoop, you can do that pretty easily with Big SQL. Indeed, the Quick Start edition of BigInsights includes scripts and a step-by-step tutorial for creating a version of the GOSALES data warehouse database (see Resources). But designing a production database in BigInsights means you'll need to understand certain Hadoop-based technologies so that you can implement your design in a way that's appropriate for the platform.
Once you have at least one Big SQL table defined, you can query it using
syntax that conforms to the SQL standard. Big SQL supports
SELECT statements that project, restrict, join,
union, order, and group data. Sub-queries and common table expression
(queries that start with a
WITH clause) are
also supported. Dozens of built-in functions are provided, including some
specifically for Hive compatibility. Windowing functions are also
supported. SQL programmers can also limit for the number of rows returned
by a given query, if desired.
Listing 2 illustrates a query that joins data from four Big SQL tables, yielding a result set with three columns and limiting the number of returned rows to 50.
Listing 2. Sample Big SQL join query
SELECT pnumb.product_name, sales.quantity, meth.order_method_en FROM gosalesdw.sls_sales_fact sales, gosalesdw.sls_product_dim prod, gosalesdw.sls_product_lookup pnumb, gosalesdw.sls_order_method_dim meth WHERE pnumb.product_language='EN' AND sales.product_key=prod.product_key AND prod.product_number=pnumb.product_number AND meth.order_method_key=sales.order_method_key LIMIT 50;
As mentioned, Big SQL supports complex data types, namely
Perhaps you're curious about how you'd query tables containing columns of
such types. Let's use the example in Listing 3 to explore this a bit.
To begin, Listing 3 creates an EMPLOYEES table with three columns, two of
which are based on complex data types. The PHONE column stores an array of
phone numbers, while the ADDRESS column stores a structure containing four
fields that comprise a USA mailing address, including the street name,
city name, state name, and zip code. A colon (
:) separates items in
the PHONE and ADDRESS columns.
After loading some sample data, shown as a comment in the listing, Big SQL users can query this data using simple array notations for the PHONE data and path expressions for the ADDRESS data. The final statement in Listing 3 retrieves the name, first phone number, and ZIP code of all employees.
Listing 3. Working with complex data types
CREATE TABLE employees ( name VARCHAR(100), phones ARRAY<VARCHAR(12)>, address STRUCT<street:VARCHAR(100), city:VARCHAR(100), state:VARCHAR(2), zip:VARCHAR(5)> ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' COLLECTION ITEMS TERMINATED BY ':'; /* table data : Uttam,408-111-1111:408-222-2222,555 Bailey Ave:San Jose:CA:95141 Cindy,408-333-3333:408-444-4444,555 Bailey Ave:San Jose:CA:95141 */ LOAD... ; SELECT name, phones, address.zip FROM employees;
As you might imagine, we can't cover all Big SQL query capabilities here. For details, consult the BigInsights Information Center (see Resources). However, one query topic we'll discuss in a moment involves query optimization hints.
For now, let's step back and consider some other aspects of Big
SQL. As we mentioned, SQL
DELETE statements aren't supported in
BigInsights 2.1, and
INSERT operations are
supported only for HBase tables. To populate Big SQL tables, you can load
data directly from supported relational DBMS systems or from files. In addition,
you can also create new tables based on Big SQL query result sets by
CREATE TABLE AS . . . . SELECT FROM . . .
Traditional transaction management isn't part of the Hadoop ecosystem, so
Big SQL operates without transactions or lock management. This implies
that commit and rollback operations aren't supported, and certain
concurrent operations can result in an application or query error. (For
example, if someone issues a
LOAD . . . OVERWRITE command for a table that's
actively being queried, the query will receive an exception.)
Since performance is frequently a concern when working with information management systems, we'll briefly discuss some performance considerations for Big SQL. Big SQL divides execution of a query into parts, such as join, group-by, etc. Depending on the query, the amount of data, configuration settings, and other factors, Big SQL can execute these parts serially or in parallel. Parallelism is achieved by leveraging Hadoop's MapReduce framework. As you might expect, a single query may spawn multiple MapReduce jobs. The MapReduce framework executes each job in parallel using multiple mappers or reducers (tasks). This can be quite beneficial for complex queries over large data sets.
However, launching a MapReduce job involves a certain amount of processing overhead. This overhead can outweigh the benefits of parallel processing for certain types of queries, such as those that operate over small data sets or retrieve data associated with a specific HBase row key. In such cases, it's better for the query to be executed serially on a single node. This is sometimes referred to as "local" query execution, and it's something that Big SQL supports in addition to MapReduce parallelism.
Commercial relational DBMS systems employ sophisticated cost-based optimizers to
select an efficient data access strategy for their queries, consulting
extensive statistics related to table size, data skew, etc. Big
SQL's query optimizer also consults certain statistics dynamically to
determine an efficient data access strategy. However, in some cases, Big
SQL may not have sufficient statistics available. For example, its
underlying data source may not provide such information. In such cases, it
can be helpful for Big SQL programmers to embed optimization hints into
their queries, as doing so can enable Big SQL to generate a better
execution plan. Hints can relate to the query execution mode (local or
parallel), join method, index usage, etc. In Big SQL, query hints
take the form of
/*+ name=value[, name=value ..] +*/.
Consider the query shown in Listing 4, which joins a fact table (FACT) and
a dimension table (DIM). Note that there is also a restriction on the rows
of interest from the dimension table. Given that dimension tables are
typically much smaller than fact tables and that our query only is
interested in a subset of rows in the dimension table, it's prudent for us
to supply a hint to Big SQL indicating that the dimension table is
"small." This hint appears immediately after we referenced the table in
FROM clause. Supplying this hint helps Big
SQL select an appropriate join methodology for this query in which the
qualifying rows will be brought into memory and joined with the data from
the fact table.
Listing 4. Embedding a hint in a query
select * from fact, dim /*+ tablesize='small' +*/ where dim.dim_id = fact.dim_id and dim.id < 100;
Since we mentioned indices earlier, let's discuss them briefly. Big SQL
enables you to create secondary indices for HBase using a
CREATE INDEX statement. As you might imagine,
these indices can improve the runtime performance of queries that filter
on indexed columns. HBase indices can be based on a single or composite
key, and using Big SQL to insert data or load data from a file into an
HBase table will automatically update its indices. However, in BigInsights
2.1, loading data from a remote relational database into an HBase table
will not automatically update its secondary indices. Instead, an
administrator needs to drop and re-create the necessary indices.
In Hadoop, programmers or administrators often modify job properties to tune runtime performance. While Big SQL tries to choose optimal properties, you can override these properties at the server or query level, if needed. For example, the first statement in Listing 5 modifies a particular MapReduce job property for a given query connection. It specifies that each subsequent query will use one reduce task per 100 MB of table data. Later in the listing, this setting is overridden, causing subsequent queries for the connection to use one reduce task per 200 MB of table data.
Listing 5. Setting a MapReduce job property
SET bigsql.reducers.byte.per.reducer = 104857600; -- All the MapReduce jobs spawned by all future queries -- in this connection will use 1 reducer per 100MB of table data SELECT …. ; SET bigsql.reducers.byte.per.reducer = 209715200; -- Now all the MapReduce jobs spawned by all future queries -- in this connection will use 1 reducer per 200MB of table data SELECT …. ;
Big SQL brings an industry-standard query interface to IBM's Hadoop-based platform: InfoSphere BigInsights. While Big SQL doesn't turn BigInsights into a relational DBMS, it does provide experience SQL users with a familiar on-ramp to an increasingly popular environment for analyzing and storing big data.
In this article, we introduced the basics of Big SQL and highlighted some areas in which BigInsights (with its Big SQL support) differs from a traditional relational DBMS. If you're eager to learn more, download the Quick Start edition of BigInsights and follow the "Developing Big SQL queries to analyze big data" tutorial provided with the product's Information Center, or see the tutorial in the InfoSphere BigInsights tutorial collection (see Resources).
Thanks to the following people who contributed materials to this article or helped to review it, in alphabetical order: Bruce Brown, Seeling Cheung, Scott Gray, Mark Hager, Ellen Patterson, and Bert Van der Linden.
- Learn more about BigInsights 2.1 by accessing the BigInsights Information Center..
- Dig deeper in the "Developing Big SQL queries to analyze big data" tutorial in the InfoSphere BigInsights tutorial collection (PDF).
- Read "Understanding InfoSphere BigInsights" to learn more about the product's architecture and underlying technologies.
- Watch BigInsights experts discuss the technology, give demos, and answer common questions on the IBM big data channel on YouTube.
- Enroll in free online courses at Big Data University.
- Learn more about JSqsh by visiting it at SourceForge.
- Learn more about HBase and Hive at Apache.org.
- Get familiar with the Cognos sample GOSALES databases by accessing the product's Information Center.
- Learn more about big data in the developerWorks big data content area. Find technical documentation, how-to articles, education, downloads, product information, and more.
- Stay current with developerWorks technical events and webcasts.
- Follow developerWorks on Twitter.
Get products and technologies
- Download a free native software installation copy of InfoSphere BigInsights 2.1 Quick Start Edition (sign-in required).
- Download a free VMware image of InfoSphere BigInsights 2.1 Quick Start Edition (sign-in required).
- Build your next development project with IBM trial software, available for download directly from developerWorks.
- Discuss Big SQL and connect with other BigInsights users through the BigInsights forum.
- Check out the developerWorks blogs and get involved in the developerWorks community.
Dig deeper into Big data on developerWorks
Experiment with new directions in software development.
Tools and technologies for building enterprise analytics solutions.
Software development in the cloud. Register today and get free private projects through 2014.
Evaluate IBM software and solutions, and transform challenges into opportunities.