What's the big deal about Big SQL?

Introducing relational DBMS users to IBM's SQL technology for Hadoop

If you specialize in relational database management technology, you've probably heard a lot about "big data" and the open source Apache Hadoop project. Perhaps you've also heard about IBM's new Big SQL technology, which enables InfoSphere® BigInsights™ users to query Hadoop data using industry-standard SQL. Curious? This article introduces you to Big SQL, answering many of the common questions that relational DBMS users have about this IBM technology.

Cynthia M. Saracco (saracco@us.ibm.com), Senior Software Engineer, IBM

photo: Cynthia SaraccoCynthia M. Saracco is a senior solutions architect at IBM's Silicon Valley Laboratory, specializing in emerging technologies and information management. She has more than 25 years of software industry experience, has written three books and more than 70 technical papers, and holds seven patents.


developerWorks Master author
        level

Uttam Jain, Software Architect, IBM

Uttam JainUttam Jain is a software architect at IBM's Silicon Valley Laboratory. Prior to co-architecting Big SQL, he worked in the DB2 for Linux, UNIX, and Windows compiler, runtime and storage/kernel areas for 12 years. His current interests are distributed systems and query optimization.



14 June 2013

Also available in Chinese Russian

Big data: It's a big deal for IT professionals who analyze and manage information. But some professionals struggle to learn how to work with big data because one of the most popular big data platforms — Apache Hadoop — introduces a wealth of new technologies, including new query and scripting languages.

Enter Big SQL, IBM's SQL interface to its Hadoop-based platform, InfoSphere BigInsights. Big SQL is designed to provide SQL developers with an easy on-ramp for querying data managed by Hadoop. It enables data administrators to create new tables for data stored in Hive, HBase, or their BigInsights distributed file system. In addition, a LOAD command enables administrators to populate Big SQL tables with data from various sources. And Big SQL's JDBC and ODBC drivers enable many existing tools to use Big SQL to query this distributed data.

Yet Big SQL doesn't turn Hadoop into one big distributed relational database. If you're curious about what Big SQL can do, we'll explore the basics here, try to clear up some common misconceptions, and answer many of the questions that relational DBMS users often have about this new technology.

The big picture about Big SQL

Big SQL is a software layer that enables IT professionals to create tables and query data in BigInsights using familiar SQL statements. To do so, programmers use standard SQL syntax and, in some cases, SQL extensions created by IBM to make it easy to exploit certain Hadoop-based technologies. You'll learn more about these topics later.

To give you an idea of what Big SQL is all about, Figure 1 illustrates its architecture and how it fits into the BigInsights Enterprise Edition 2.1 platform. (If you're not familiar with BigInsights, skim through the introductory article listed in the Resources section.)

Figure 1. Big SQL architecture
Image shows Big SQL architecture

As shown at the top, Big SQL supports JDBC and ODBC client access from Linux® and Windows® platforms. In addition, the Big SQL LOAD command can read data directly from several relational DBMS systems (IBM PureData™ Systems for Analytics powered by Netezza technology, DB2®, and Teradata) as well as from files stored locally or within the BigInsights distributed file system. BigInsights EE 2.1 can be configured to support the Hadoop Distributed File System (HDFS) or IBM's General Parallel File System with the File Placement Optimizer (GPFS-FPO).

The SQL query engine supports joins, unions, grouping, common table expressions, windowing functions, and other familiar SQL expressions. Furthermore, you can influence the data access strategy for your queries with optimization hints and configuration options. Depending on the nature of your query, your data volumes, and other factors, Big SQL can use Hadoop's MapReduce framework to process various query tasks in parallel or execute your query locally within the Big SQL server on a single node — whichever may be most appropriate for your query.

Organizations interested in Big SQL often have considerable SQL skills in-house, as well as a suite of SQL-based business intelligence applications and query/reporting tools. The idea of being able to leverage existing skills and tools — and perhaps reuse portions of existing applications — can be quite appealing to organizations new to Hadoop. Indeed, some companies with large data warehouses built on relational DBMS systems are looking to Hadoop-based platforms as a potential target for offloading "cold" or infrequently used data in a manner that still allows for query access. In other cases, organizations turn to Hadoop to analyze and filter non-traditional data (such as logs, sensor data, social media posts, etc.), ultimately feeding subsets or aggregations of this information to their relational warehouses to extend their view of products, customers, or services.

In such situations and others, Big SQL can play an important role. However, it's inappropriate to think of Big SQL as a replacement for relational DBMS technology. Big SQL was designed to complement and leverage the Hadoop-based infrastructure in BigInsights. Certain features common to relational DBMS systems aren't present in Big SQL, and certain Big SQL features aren't present in most relational DBMS systems. For example, Big SQL supports querying data, but does not support SQL UPDATE or DELETE statements. INSERT statements are supported only for HBase tables. Big SQL tables may contain columns of complex data types such as struct and array, rather than simply "flat" rows. Furthermore, several underlying storage mechanisms are supported, including:

  • Delimited files (such as comma-separated files) stored in HDFS or GPFS-FPO
  • Hive tables in sequence file format, RCFile format, etc. (Hive is Hadoop's data warehouse implementation)
  • HBase tables (HBase is Hadoop's key-value or column-based data store)

Let's explore Big SQL in greater detail so you can get a better idea of its capabilities.


Working with Big SQL

BigInsights includes several tools and interfaces for Big SQL that are largely comparable to tools and interfaces you'll find with most relational DBMS systems. As mentioned, Big SQL offers JDBC and ODBC support to Java™ technology, C, and C++ application developers through a JDBC Type 4 driver and a 32- or 64-bit ODBC driver. These Big SQL drivers include support for popular capabilities, such as prepared statements, database metadata APIs, and cancellation of an actively running statement.

In addition, the BigInsights Eclipse plug-in enables Java developers to create, test, and refine Big SQL queries and applications. Figure 2 illustrates some aspects of this plug-in, including a JDBC server connection for Big SQL (shown in the foreground) and the results of a Big SQL test run (shown in the lower-right pane).

Figure 2. The BigInsights Eclipse plug-in includes support for Big SQL development
Image shows that the BigInsights Eclipse plug-in includes support for Big SQL development

For interactive invocation of Big SQL queries, BigInsights offers a command-line interface (the JSqsh shell) and a web-based interface (accessible through the BigInsights web console). Such tools can be useful for scripting activities as well as prototyping work.

A variety of IBM and non-IBM software that supports JDBC and ODBC data sources can also be configured to work with Big SQL. As an example, Cognos Business Intelligence uses Big SQL's JDBC interface to query data, generate reports, and perform other analytical functions.

Like many other components in BigInsights, Big SQL is a service that administrators launch (or stop) as needed through the web console or a command window.


Creating tables and loading data

As you might imagine, working with Big SQL requires someone to create tables and populate them with data. And, as you might imagine, Big SQL supports a CREATE TABLE statement and a LOAD command to let you do just that. While the basic syntax of these will probably look familiar to you, there are some aspects of table creation and data loading that probably won't. That's because they're designed to exploit certain Hadoop-based technologies.

Consider the example shown in Listing 1. (If you're familiar with the sample GOSALES data warehouse database provided with Cognos Business Intelligence 10.2, you'll recognize this example as a simplified version of one of its tables. See Resources for a link to details on the Cognos database samples.)

Listing 1. Creating a Big SQL table and loading it with data from a local file
create table mygosales.product_brand_lookup (
product_brand_code int,
product_brand_en varchar(180)
)
row format delimited fields terminated by '\t';

load hive data local inpath '/home/user1/data/product.tsv'
overwrite into table mygosales.product_brand_lookup;

The CREATE TABLE statement creates a Hive table with two columns; the first captures a numeric code to serve as the identifier of a product brand, while the second captures an English-based description of the brand. The final line of this statement specifies the way in which the data will store (and expect) input data: in a row format with tab-delimited fields. (As an aside, Big SQL shares its catalog with Hive and the table definitions. So when we use the term "Hive table," we also mean a Big SQL table. With a few exceptions, a table created through Big SQL can be queried by Hive and vice-versa.)

Listing 1 illustrates only one of several possible storage options for Big SQL tables. For example, you could create an externally managed Hive table in which the data would reside in your distributed file system outside a Hive database. Consequently, dropping such a table from Big SQL will only delete the metadata, not the actual data. Or you could use HBase as your underlying storage manager, in which case your CREATE TABLE statement would include syntax for mapping between SQL columns and HBase columns, including column families and a row key. If you're not familiar with Hive or HBase, see Resources.

Let's turn briefly to the LOAD statement in Listing 1. Here, we've provided the full path to a file in the local file system we want loaded into the table. Given our table definition, each record in this file must contain two fields (an integer and a string) delimited by \t (the tab character). The OVERWRITE clause instructs Big SQL to replace the contents of the table with the data contained in the file. Although not shown in Listing 1, you can eliminate the need to load data into your table if the data already exists in your BigInsights distributed file system. To do so, you would just create an external table with a LOCATION clause that specifies where your existing data resides.

In addition, there are a number of options for creating tables and loading data that affect database design and application use. For example, Hadoop-based systems are often used to store various types of semi-structured data, such as JSON data. Developers working directly with Hive commonly rely on custom serializers/deserializers (SerDes) to read and write this data. Consequently, Big SQL also supports custom SerDes and enables you to specify the required SerDe class name for your table at creation time.

InfoSphere BigInsights Quick Start Edition

InfoSphere BigInsights is IBM's big data offering based on the open source Apache Hadoop project. It includes core Hadoop (Hadoop Distributed File System, MapReduce) and several other projects in the Hadoop ecosystem, such as Pig, Hive, HBase, and ZooKeeper. In addition, BigInsights includes a variety of IBM technologies that extend the value of this platform, including advanced analytical facilities, application accelerators, development tools, platform improvements, and enterprise software integration. Many of these capabilities are available with the Quick Start edition, which you can freely download for non-production use. Learn more and give it a try.

Many of the data types common to relational DBMS systems are supported by Big SQL, including some data types not supported directly by Hive or HBase. Big SQL data types include various forms of numbers (such as integers, decimal, float, and double), string data (CHAR(length), VARCHAR(length), and STRING), and TIMESTAMP. Other forms of date/time data should be stored as TIMESTAMP values or a string type. Large objects and VARGRAPHIC types aren't supported.

Views and user-defined constraints — both common in relational databases — aren't supported in this release. Referential integrity constraints and domain-specific constraints should be enforced at the application level. Instead of using GRANT and REVOKE statements to restrict data access, administrators use standard Hadoop commands to specify file system access privileges for Hive data. Consequently, you should think of privileges at a table level rather than a row or column level.

So what's this all mean to an experienced relational database administrator or designer? If you'd like to create a simple test database in Hadoop, you can do that pretty easily with Big SQL. Indeed, the Quick Start edition of BigInsights includes scripts and a step-by-step tutorial for creating a version of the GOSALES data warehouse database (see Resources). But designing a production database in BigInsights means you'll need to understand certain Hadoop-based technologies so that you can implement your design in a way that's appropriate for the platform.


Querying data

Once you have at least one Big SQL table defined, you can query it using syntax that conforms to the SQL standard. Big SQL supports SELECT statements that project, restrict, join, union, order, and group data. Sub-queries and common table expression (queries that start with a WITH clause) are also supported. Dozens of built-in functions are provided, including some specifically for Hive compatibility. Windowing functions are also supported. SQL programmers can also limit for the number of rows returned by a given query, if desired.

Listing 2 illustrates a query that joins data from four Big SQL tables, yielding a result set with three columns and limiting the number of returned rows to 50.

Listing 2. Sample Big SQL join query
SELECT pnumb.product_name, sales.quantity,
  meth.order_method_en
FROM
  gosalesdw.sls_sales_fact sales,
  gosalesdw.sls_product_dim prod,
  gosalesdw.sls_product_lookup pnumb,
  gosalesdw.sls_order_method_dim meth
WHERE
  pnumb.product_language='EN'
  AND sales.product_key=prod.product_key
  AND prod.product_number=pnumb.product_number
AND meth.order_method_key=sales.order_method_key
LIMIT 50;

As mentioned, Big SQL supports complex data types, namely array and struct. Perhaps you're curious about how you'd query tables containing columns of such types. Let's use the example in Listing 3 to explore this a bit.

To begin, Listing 3 creates an EMPLOYEES table with three columns, two of which are based on complex data types. The PHONE column stores an array of phone numbers, while the ADDRESS column stores a structure containing four fields that comprise a USA mailing address, including the street name, city name, state name, and zip code. A colon (:) separates items in the PHONE and ADDRESS columns.

After loading some sample data, shown as a comment in the listing, Big SQL users can query this data using simple array notations for the PHONE data and path expressions for the ADDRESS data. The final statement in Listing 3 retrieves the name, first phone number, and ZIP code of all employees.

Listing 3. Working with complex data types
CREATE TABLE employees
(
  name VARCHAR(100),
  phones ARRAY<VARCHAR(12)>,
  address STRUCT<street:VARCHAR(100), city:VARCHAR(100), state:VARCHAR(2), zip:VARCHAR(5)>
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' COLLECTION ITEMS TERMINATED BY ':';

/*
 table data :
Uttam,408-111-1111:408-222-2222,555 Bailey Ave:San Jose:CA:95141
Cindy,408-333-3333:408-444-4444,555 Bailey Ave:San Jose:CA:95141
*/

LOAD... ;

SELECT name, phones[1], address.zip FROM employees;

As you might imagine, we can't cover all Big SQL query capabilities here. For details, consult the BigInsights Information Center (see Resources). However, one query topic we'll discuss in a moment involves query optimization hints.

For now, let's step back and consider some other aspects of Big SQL. As we mentioned, SQL UPDATE and DELETE statements aren't supported in BigInsights 2.1, and INSERT operations are supported only for HBase tables. To populate Big SQL tables, you can load data directly from supported relational DBMS systems or from files. In addition, you can also create new tables based on Big SQL query result sets by issuing a CREATE TABLE AS . . . . SELECT FROM . . . statement.

Traditional transaction management isn't part of the Hadoop ecosystem, so Big SQL operates without transactions or lock management. This implies that commit and rollback operations aren't supported, and certain concurrent operations can result in an application or query error. (For example, if someone issues a LOAD . . . OVERWRITE command for a table that's actively being queried, the query will receive an exception.)


Performance considerations

Since performance is frequently a concern when working with information management systems, we'll briefly discuss some performance considerations for Big SQL. Big SQL divides execution of a query into parts, such as join, group-by, etc. Depending on the query, the amount of data, configuration settings, and other factors, Big SQL can execute these parts serially or in parallel. Parallelism is achieved by leveraging Hadoop's MapReduce framework. As you might expect, a single query may spawn multiple MapReduce jobs. The MapReduce framework executes each job in parallel using multiple mappers or reducers (tasks). This can be quite beneficial for complex queries over large data sets.

However, launching a MapReduce job involves a certain amount of processing overhead. This overhead can outweigh the benefits of parallel processing for certain types of queries, such as those that operate over small data sets or retrieve data associated with a specific HBase row key. In such cases, it's better for the query to be executed serially on a single node. This is sometimes referred to as "local" query execution, and it's something that Big SQL supports in addition to MapReduce parallelism.

Commercial relational DBMS systems employ sophisticated cost-based optimizers to select an efficient data access strategy for their queries, consulting extensive statistics related to table size, data skew, etc. Big SQL's query optimizer also consults certain statistics dynamically to determine an efficient data access strategy. However, in some cases, Big SQL may not have sufficient statistics available. For example, its underlying data source may not provide such information. In such cases, it can be helpful for Big SQL programmers to embed optimization hints into their queries, as doing so can enable Big SQL to generate a better execution plan. Hints can relate to the query execution mode (local or parallel), join method, index usage, etc. In Big SQL, query hints take the form of /*+ name=value[, name=value ..] +*/.

Consider the query shown in Listing 4, which joins a fact table (FACT) and a dimension table (DIM). Note that there is also a restriction on the rows of interest from the dimension table. Given that dimension tables are typically much smaller than fact tables and that our query only is interested in a subset of rows in the dimension table, it's prudent for us to supply a hint to Big SQL indicating that the dimension table is "small." This hint appears immediately after we referenced the table in the FROM clause. Supplying this hint helps Big SQL select an appropriate join methodology for this query in which the qualifying rows will be brought into memory and joined with the data from the fact table.

Listing 4. Embedding a hint in a query
select * from fact, dim /*+ tablesize='small' +*/ 
where dim.dim_id = fact.dim_id
and dim.id < 100;

Since we mentioned indices earlier, let's discuss them briefly. Big SQL enables you to create secondary indices for HBase using a CREATE INDEX statement. As you might imagine, these indices can improve the runtime performance of queries that filter on indexed columns. HBase indices can be based on a single or composite key, and using Big SQL to insert data or load data from a file into an HBase table will automatically update its indices. However, in BigInsights 2.1, loading data from a remote relational database into an HBase table will not automatically update its secondary indices. Instead, an administrator needs to drop and re-create the necessary indices.

In Hadoop, programmers or administrators often modify job properties to tune runtime performance. While Big SQL tries to choose optimal properties, you can override these properties at the server or query level, if needed. For example, the first statement in Listing 5 modifies a particular MapReduce job property for a given query connection. It specifies that each subsequent query will use one reduce task per 100 MB of table data. Later in the listing, this setting is overridden, causing subsequent queries for the connection to use one reduce task per 200 MB of table data.

Listing 5. Setting a MapReduce job property
SET bigsql.reducers.byte.per.reducer = 104857600;
-- All the MapReduce jobs spawned by all future queries
-- in this connection will use 1 reducer per 100MB of table data

SELECT …. ;

SET bigsql.reducers.byte.per.reducer = 209715200;
-- Now all the MapReduce jobs spawned by all future queries
-- in this connection will use 1 reducer per 200MB of table data

SELECT …. ;

Summary

Big SQL brings an industry-standard query interface to IBM's Hadoop-based platform: InfoSphere BigInsights. While Big SQL doesn't turn BigInsights into a relational DBMS, it does provide experience SQL users with a familiar on-ramp to an increasingly popular environment for analyzing and storing big data.

In this article, we introduced the basics of Big SQL and highlighted some areas in which BigInsights (with its Big SQL support) differs from a traditional relational DBMS. If you're eager to learn more, download the Quick Start edition of BigInsights and follow the "Developing Big SQL queries to analyze big data" tutorial provided with the product's Information Center, or see the tutorial in the InfoSphere BigInsights tutorial collection (see Resources).


Acknowledgments

Thanks to the following people who contributed materials to this article or helped to review it, in alphabetical order: Bruce Brown, Seeling Cheung, Scott Gray, Mark Hager, Ellen Patterson, and Bert Van der Linden.

Resources

Learn

Get products and technologies

Discuss

Comments

developerWorks: Sign in

Required fields are indicated with an asterisk (*).


Need an IBM ID?
Forgot your IBM ID?


Forgot your password?
Change your password

By clicking Submit, you agree to the developerWorks terms of use.

 


The first time you sign into developerWorks, a profile is created for you. Information in your profile (your name, country/region, and company name) is displayed to the public and will accompany any content you post, unless you opt to hide your company name. You may update your IBM account at any time.

All information submitted is secure.

Choose your display name



The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerWorks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

Required fields are indicated with an asterisk (*).

(Must be between 3 – 31 characters.)

By clicking Submit, you agree to the developerWorks terms of use.

 


All information submitted is secure.

Dig deeper into Big data and analytics on developerWorks


static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=Big data and analytics, Information Management
ArticleID=933737
ArticleTitle=What's the big deal about Big SQL?
publish-date=06142013