Big SQL is IBM's SQL interface to its Hadoop-based platform, InfoSphere BigInsights. Big SQL is designed to provide SQL developers with an easy on-ramp for querying data managed by Hadoop. It enables data administrators to create new tables for data stored in Apache Hive, HBase, or their InfoSphere BigInsights distributed file system.
Importing data for use in Big SQL
As with any big data processing task, the first step is to load the data so that you can start querying it. As in other areas of the Hadoop and InfoSphere BigInsights environment, loading data for use within Big SQL requires two steps:
- Create the logical table within Big SQL
- Connect the logical structure to the raw data or load the raw data into the logical table
One difference with the Big SQL environment is that the process can be performed in either direction: Data can be loaded through Hive and accessed through Big SQL, or tables can be created within Big SQL and then the data loaded by Big SQL into the tables.
Also, in keeping with most other such tools, the system is designed around batch-loading large blocks of data for querying. Although Big SQL is an SQL veneer over the power of Hadoop, it is still an append-only database, and UPDATE statements are not supported.
The data itself can be in any of the typical formats (CSV or tab-delimited, for example) and Big SQL can also make use of Hive and HBase tables directly, applying the SQL or translated MapReduce across this data. If these methods do not suit your needs, a custom serializer/deserializer can be used — when processing JSON content or serialized binary objects, for example.
To load the data when using InfoSphere BigInsights, you can use the web interface to load the data directly into your existing cluster. You can do this in a variety of ways, but the bigsql command-line client is the most convenient if you have local data files:
$ /opt/ibm/biginsights/bigsql/bin/bigsql client
After you have started the bigsql client, you must separately connect to the Big SQL server:
bigsql> connect jdbc:bigsql://localhost:7052/default:user=biadmin;password=biadmin
After you connect to the Big SQL database engine, you can use SHOW DATABASES, SHOW TABLES, and other commands familiar from most traditional SQL command-line interfaces. Creating a table uses the same DDL as Hive:
create table chicago_bus (timelog timestamp, region int, buscount int, logreads int, speed float) row format delimited fields terminated by ',' lines terminated by "\n"
If you are familiar with Hive, the basic DDL structure is identical, and you should have no trouble migrating your existing tables or import tools.
Keep in mind that there are some differences in the data types that Big SQL supports. The primary difference that often causes problems is the limitation of date storage. Within Big SQL, date storage is limited to a full time stamp (a combined date and time). This format is strictly enforced. Dates only, times only, 12-hour times, and time zones are not supported. However, a flexible range of built-in functions enables you to manipulate the time stamp into more familiar year and month values, and other time components, during processing. These functions are limited, however, and we'll return to this topic later when looking at query building. Try to avoid storing dates as integer values or in the string data type, because these cannot be converted internally, even with the functions within Big SQL.
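For example, the built-in date functions can extract the components you need at query time. A minimal sketch against the chicago_bus table created earlier, using the YEAR and MONTH functions that appear later in this article:

```sql
-- Extract year and month from the strictly formatted time stamp
-- at query time, rather than storing dates as integers or strings.
SELECT YEAR(timelog) AS year,
       MONTH(timelog) AS month,
       AVG(speed) AS avg_speed
FROM chicago_bus
GROUP BY YEAR(timelog), MONTH(timelog);
```

This keeps the stored data in the supported time stamp format while still letting you report by year and month.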
String data types and larger byte-based data are also limited, a limitation that is important to consider if you are processing BLOB data types or very long text types. VARCHAR types are limited to 32K each. If you need to process this type of data, it is probably more efficient to process that information within a text-processing framework and use the output from that process to build the tables you will use with Big SQL.
Most other fixed types, such as floating-point values and binary types, are supported and fundamentally untouched, but check the documentation for specifics.
Once the table has been created, the data can be loaded into it:
bigsql> load hive data local inpath '/home/biadmin/chicago-bigsql.csv' into table chicago_bus
The previous example specifies the location within the local file system of the CSV file from the data that is to be loaded. You can also access external, previously loaded data, for example from the output of a Hive job or existing HBase table. This option can be useful, as we'll see later, if you want to chain the processing of different components to feed into Big SQL.
When loading the data, Big SQL will not complain about any formatting issues; the data is only processed when queries are performed. It is always a good idea to perform a simple query, such as:
bigsql> SELECT * FROM chicago_bus LIMIT 5
Be sure to check for and confirm any formatting issues. If there are problems, update the format or structure of the source file (if possible) and overwrite the Big SQL table during the load by using the OVERWRITE keyword:
bigsql> load hive data local inpath '/home/biadmin/chicago-bigsql.csv' overwrite into table chicago_bus
Note that if you don't use OVERWRITE and you repeat the import, the new data is appended. Take care that you don't duplicate (or triplicate) information, or accidentally overwrite existing data.
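A simple safeguard, sketched here, is to compare row counts before and after each load so an accidental re-import is obvious:

```sql
-- Count before the load...
SELECT COUNT(*) FROM chicago_bus;
-- ...run the load (or re-load with OVERWRITE), then count again.
-- A doubled count indicates an accidental append.
SELECT COUNT(*) FROM chicago_bus;
```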
Because Big SQL is a layer sitting on top of the core Hadoop content, you can connect from within Big SQL to external databases and run queries to load specific table data for processing. For example, to load data from a DB2 table containing all of the Chicago data, use Listing 1.
Listing 1. Loading data from a DB2 table
bigsql> LOAD USING JDBC CONNECTION URL 'jdbc:db2://db2-server:50000/CHICAGO' WITH PARAMETERS (user = 'user',password='password') FROM TABLE CHICAGO INTO TABLE chicago
But the load specification can also be a query, with the addition of a WHERE clause, as shown in Listing 2.
Listing 2. Adding a WHERE clause
bigsql> LOAD USING JDBC CONNECTION URL 'jdbc:db2://db2-server:50000/CHICAGO' WITH PARAMETERS (user = 'user',password='password') FROM TABLE CHICAGO WHERE REGION > 6 INTO TABLE chicago
This type of data loading is rare. More likely, you will be loading data from some other part of the cluster — when processing the output from MapReduce jobs, for example.
Exporting internal processing jobs for use in Big SQL
As mentioned, it is tempting to use Big SQL as a standalone solution for processing queries. It really excels, however, at providing a querying interface over output generated by a separate Hadoop process.
For the majority of cases, Big SQL provides a benefit on top of the traditional distributed storage and processing. This enables you to process data directly with a Big SQL query, or indirectly by providing a veneer on top of the core distributed processing, which enables you to query the content.
One example of using Big SQL as a veneer is processing text with a standard MapReduce query that outputs the extracted data into a CSV format suitable for reading back into Big SQL for the processing.
Using the movie reviews data from Stanford, which provides data on movie reviews from Amazon, I ran a MapReduce process that looked for certain keywords, such as "love," "hate," "good," "best," and "worst," compared against the actual rating given to determine accuracy.
The MapReduce job processing of that information generates the parsed data back out into a file within Hadoop Distributed File System (HDFS), although I could have written directly out to a Hive table.
To import that data, you can load the file into a Hive table within Big SQL that you previously created:
bigsql> load hive data local inpath '/host:port/amazon/reviews-analysed.csv' into table reviews
Or, to access the file directly, use the code in Listing 3.
Listing 3. Accessing files directly
create external table azreviews (productid string, userid string, score float, confidence int) row format delimited fields terminated by ',' stored as textfile location '/amazon/reviews-analysed.csv'
This creates an implied link to the underlying CSV file as the source for the table data when queried. This saves a load stage and processing, and enables Big SQL to access the live table data, even at the same time that data is being processed.
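Because the external table reads the CSV in place, you can query it immediately. A sketch using the azreviews table just defined:

```sql
-- Average review score per product, read live from the CSV in HDFS;
-- no separate load step is required for an external table.
SELECT productid, AVG(score) AS avg_score
FROM azreviews
GROUP BY productid;
```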
Building Big SQL queries
Another option is to replace your MapReduce jobs with corresponding Big SQL queries and dispense with handwritten MapReduce code entirely. This option requires some basic knowledge of how the querying and processing is handled by Big SQL so you can get the best out of the implied processing that Big SQL performs. Ironically, most solutions have migrated off SQL, and Big SQL gives that power back, with some caveats.
The difference between Big SQL and traditional SQL queries is actually very small. Most of the basic structure and sequence you are familiar with in traditional SQL queries will work without much modification with the basic query structure. However, because of the way Big SQL works, there are some query types and conversions that require additional thought and preparation.
For the majority of simple queries, the processing is performed entirely internally by the Big SQL server. For larger processing tasks, particularly those that involve aggregation, Big SQL converts the query into a MapReduce job, much as Hive does.
There are some rules, however, for how to correctly build a statement that executes effectively.
Basic queries such as SELECT with and without WHERE clauses can be processed easily.
Listing 4. Basic queries
bigsql> select title,servings from recipe limit 10
+----------------------------------------------------------------+----------+
| title                                                          | servings |
+----------------------------------------------------------------+----------+
| varchar                                                        | int      |
+----------------------------------------------------------------+----------+
| Creamy egg and leek special                                    | 4        |
| Chakchouka                                                     | 4        |
| Mozzarella and olive pizza                                     | 4        |
| Savoury pancakes                                               | 8        |
| Seafood filling for pancakes                                   | 8        |
| Pesto                                                          | -        |
| Easter egg cakes                                               | 12       |
| Stilton walnut mousse                                          | 8        |
| Easy omelette                                                  | -        |
| Spanish omelette                                               | 6        |
+----------------------------------------------------------------+----------+
When using WHERE clauses, performance can be improved by placing the clause that restricts the most data first. As the reduction occurs, this removes more data from the rows buffered in memory and makes it easier to remove further rows.
For example, when restricting by a type and date, choose the item that would logically reduce the table data the most first, based upon the value.
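As a sketch, using a hypothetical log table and columns (logdate, logtype are illustrative names, not from the article's schema), the more selective date restriction is listed first:

```sql
-- The date restriction eliminates far more rows than the type check,
-- so it appears first in the WHERE clause.
SELECT *
FROM access_log
WHERE logdate > '2013-01-01 00:00:00'
  AND logtype = 4;
```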
Aggregates are the logical equivalent of the reduce step in most MapReduce queries, but care should be taken to ensure that you run the aggregation effectively on the data. Big SQL will convert it into a reduce function for you.
Aggregates such as STDDEV work the same as in traditional SQL but, in accordance with the processing tricks already discussed, performance can be improved if you build the query in such a way as to reduce the overall processing load at specific levels. For example, performing an aggregate such as AVG() on the Chicago bus data is as straightforward through a Big SQL statement as it is through traditional SQL:
SELECT region,AVG(speed) FROM chicago_bus GROUP BY region
As with normal MapReduce queries, you should try to optimize the data before applying the aggregates.
When dealing with any JOIN, the MapReduce nature of the processing can be improved by specifying the join that reduces the overall data set higher up in the query. This is true, for example, when processing logging information tagged with dates for specific log events and joined to the error condition. The error type table is a join for informational purposes, but the join on the log instance (that is, where the date information is located) acts as a reducer through the associated WHERE clause. In the query shown in Listing 5, Big SQL processes this first by joining the error description to the original table. With 100 million rows of data, that requires 100 million implied joins.
Listing 5. Optimizing joins
SELECT date,status,errordesc FROM access_log JOIN errortypes on (access_log.errortype = errortypes.id) JOIN access_instance on (access_log.instanceid = access_instance.id) WHERE access_instance.date > NOW()
The second join attaches the date information, and the overall data content is reduced through the WHERE clause on that date.
By reversing the join statements, as shown in Listing 6, the date selection is performed first and the data reduced before it is joined to the error information. The resulting query is much quicker.
Listing 6. Reversing the join statements
SELECT date,status,errordesc FROM access_log JOIN access_instance on (access_log.instanceid = access_instance.id) JOIN errortypes on (access_log.errortype = errortypes.id) WHERE access_instance.date > NOW()
Translating standard SQL statements to Big SQL
The majority of your SQL queries can be processed directly, but some queries can be modified and adapted within Big SQL to take advantage of the different environment.
Although not strictly related to the SQL statements you write, care should be taken with different data types and values. We've already seen the core differences within the built-in data types supported by Big SQL, and particularly for dates, the differences can cause problems.
Certain queries that rely on advanced date processing to work are much more difficult to process. Listing 7 shows that grouping information on dates can be inefficient.
Listing 7. Grouping information can be inefficient
SELECT diskusage, YEAR(statdate) AS year, MONTH(statdate) AS month FROM statmon WHERE statdate _YEARS_BETWEEN(statdate, _ADD_YEARS(statdate,2)) GROUP BY year,month
With Big SQL, the same query will work, but performance may be affected,
especially if you also include a
WHERE clause that restricts
the limit. Using a sub-select may be more efficient in this case.
Listing 8. Using a sub-select may be more efficient
SELECT diskusage, YEAR(statdate) AS year, MONTH(statdate) AS month FROM (SELECT diskusage, statdate FROM statmon WHERE statdate _YEARS_BETWEEN(statdate, _ADD_YEARS(statdate,2))) AS t GROUP BY year,month
This performs the sub-query to select the date, then performs the year and month extraction from the date.
Be careful when performing GROUP BY statements. Big SQL operates on the assumption that a column listed in the output of the query will also be in the GROUP BY clause. Consider the following query, which is perfectly valid (although meaningless, since the opdata will be randomly selected from the data set) in traditional SQL environments:
SELECT operation,opdata FROM access_log GROUP BY operation
Big SQL is not forgiving, and you must specify the fields explicitly:
SELECT operation,opdata FROM access_log GROUP BY operation,opdata
Queries that involve large, complicated joins may be more efficiently processed if they are converted into a series of discrete queries and processing jobs that produce intermediate tables. This method is similar to using views or temporary tables within a traditional SQL environment, but there is no explicit method for doing this in Big SQL.
Use of discrete queries is an advantage, especially if your queries rely on sub-queries, subselects, or multilevel aggregation over large volumes of data. This trick is counterintuitive for some tasks because of the nature of the MapReduce processing that usually allows for multiple levels of processing within a single pass. By chaining queries, you get the data points that can be reprocessed from that data.
For example, performing a query where we are collecting information about disk usage across a cluster of machines over a period of time, the core query is particularly complex.
Listing 9. Complex core query
SELECT d.statid,d.used,class,subclass,name FROM (select min(statid) as matchid from statmon_du where recorded >= (date(subdate(now(),28))) group by pathid) AS X INNER JOIN statmon_du as d on d.statid = x.matchid join (statmon_du_locs) on (d.pathid = statmon_du_locs.id) where statmon_du_locs. active = 1 ORDER BY name,subclass
This code contains a number of inner and outer joins that could be simplified by performing the inner select first and writing its output to an intermediate table.
Listing 10. Simplifying by performing the inner select first
CREATE TABLE month_statmon (statid) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' AS (SELECT min(statid) as statid from statmon_du where recorded >= (date(subdate(now(),28))) group by pathid)
You can then join from this table to the larger query as a direct join, rather than as an explicit inner join. Big SQL can use the joined table more efficiently in this instance, although other data sets may be different.
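As a sketch of the follow-up query (assuming the month_statmon intermediate table from Listing 10 exposes its key column as statid), the original complex query becomes a pair of direct joins:

```sql
-- Join directly against the pre-computed intermediate table,
-- replacing the inner select from Listing 9.
SELECT d.statid, d.used, class, subclass, name
FROM month_statmon AS x
JOIN statmon_du AS d ON d.statid = x.statid
JOIN statmon_du_locs ON d.pathid = statmon_du_locs.id
WHERE statmon_du_locs.active = 1
ORDER BY name, subclass;
```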
Helping Big SQL processing
Behind the scenes, for complicated queries, Big SQL converts the SQL into a MapReduce process that must then be run over the content to produce the desired output. In most cases, the translation works fine, but occasionally the format and processing can be improved. This is one area where Big SQL differs from typical SQL environments, whose parsers and optimization engines are highly optimized for processing the data.
Given these limitations, it's a good idea to provide hints to the SQL parser that will enable it to make better decisions about how the information should be processed or joined. These hints can have a significant effect on processing speed and sometimes on the content (and quality) of the processed output.
Starting at the top of the SQL parsing process, the simplest hint is one that tells Big SQL whether the data should be processed internally (that is, within the Big SQL memory space) or through a MapReduce process. For example, when aggregating by counts or sums, if there is only a small number of groups, it may be quicker to force local processing with accessmode='local', even for very large data sets. Conversely, you can force what might have been a simple query to use MapReduce with accessmode='mapreduce'. These hints will be honored only if you have the appropriate memory (for local processing) and resources (for MapReduce) available.
Regardless of the size of your data set, it is always a good idea to run queries forcing these hints and timing them. You may be surprised at the differences in timing; I've seen differences between just shy of 7 seconds and 4 minutes for the same statement with each type, sometimes in favor of either solution. But don't run the queries sequentially without restarting Big SQL because queries are cached.
It can be helpful to force local mode for the later levels of compound aggregate statements. This option forces MapReduce on the larger data, but uses in-memory aggregation when you are dealing with only a hundred or a thousand rows.
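A sketch of that pattern, using the chicago_bus table; note that the comment-style placement of the accessmode hint shown here is an assumption (verify the exact hint syntax against the Big SQL documentation):

```sql
-- Inner query runs over the full data set via MapReduce;
-- the outer aggregation sees only a handful of rows per region,
-- so it is forced to run locally in the Big SQL memory space.
-- The /*+ accessmode=... +*/ placement is an assumed syntax.
SELECT /*+ accessmode='local' +*/ region, AVG(day_speed) AS avg_speed
FROM (SELECT /*+ accessmode='mapreduce' +*/ region,
             AVG(speed) AS day_speed
      FROM chicago_bus
      GROUP BY region, YEAR(timelog), MONTH(timelog)) AS t
GROUP BY region;
```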
Big SQL tries to fill the gap of writing complex MapReduce processes on data by placing a wrapper around that process with a SQL-like interface. It's not perfect, and because it's translating these queries for you into internal processes or MapReduce jobs, you have to craft the SQL more carefully. Unlike with traditional SQL environments, Big SQL has no way of optimizing the query, making use of indices, or otherwise understanding the structure of the data to optimize the queries itself. In this article, you've seen classic examples of how small query modifications can make big differences to performance and how to change or write existing queries and MapReduce jobs to suit the Big SQL environment.
- Get a technical introduction to Big SQL on Slideshare.
- Read the "SQL to Hadoop and back again" series, which demonstrates the contents for the Chicago Traffic Tracker data set (Part 1: Basic data interchange techniques, Part 2: Leveraging HBase and Hive, Part 3: Direct Transfer and Live Data Exchange).
- Check out the Chicago Traffic Tracker.
- Read "What's the big deal about Big SQL?" to learn about IBM's SQL interface to its Hadoop-based platform, InfoSphere BigInsights. Big SQL is designed to provide SQL developers with an easy on-ramp for querying data managed by Hadoop.
- "Analyzing social media and structured data with InfoSphere BigInsights" teaches you the basics of using BigSheets to analyze social media and structured data collected through sample applications provided with InfoSphere BigInsights.
- Read "Understanding InfoSphere BigInsights" to learn more about the InfoSphere BigInsights architecture and underlying technologies.
- Watch the Big Data: Frequently Asked Questions for IBM InfoSphere BigInsights video to listen to Cindy Saracco discuss some of the frequently asked questions about IBM's Big Data platform and InfoSphere BigInsights.
- Watch Cindy Saracco demonstrate portions of the scenario described in this article in Big Data — Analyzing Social Media for Watson.
- Check out "Exploring your InfoSphere BigInsights cluster and sample applications" to learn more about InfoSphere BigInsights web console.
- Visit the BigInsights Technical Enablement wiki for technical materials, demos, training courses, news items, and more.
- Learn about the IBM Watson research project.
- Check out Big Data University for free courses on Hadoop and big data.
- Order a copy of Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data for details on two of IBM's key big data technologies.
- Learn more about Apache Hadoop, the Hadoop Distributed File System (HDFS) design (https://hadoop.apache.org/docs/r0.18.0/hdfs_design.pdf), and HadoopDB.
- Read the Hadoop MapReduce tutorial at Apache.org.
- Read "Using MapReduce and load balancing on the cloud" to learn how to implement the Hadoop MapReduce framework in a cloud environment and how to use virtual load balancing to improve the performance of both a single- and multiple-node system.
- For information about installing Hadoop using CDH4, see CDH4 Installation — Cloudera Support.
- Don't miss Big Data Glossary, By Pete Warden, O'Reilly Media, ISBN: 1449314597, 2011.
- Check out Hadoop: The Definitive Guide, by Tom White, O'Reilly Media, ISBN: 1449389732, 2010.
- "HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads" explores the feasibility of building a hybrid system that takes the best features from both technologies.
- "SQL/MapReduce: A practical approach to self-describing, polymorphic, and parallelizable user-defined functions" describes the motivation for this new approach to UDFs, as well as the implementation within AsterData Systems' nCluster database.
- Learn more by reading "MapReduce and parallel DBMSes: friends or foes?"
- "A Survey of Large Scale Data Management Approaches in Cloud Environments" gives a comprehensive survey of numerous approaches and mechanisms of deploying data-intensive applications in the cloud, which are gaining a lot of momentum in both research and industrial communities.
- Learn more about big data in the developerWorks big data content area. Find technical documentation, how-to articles, education, downloads, product information, and more.
- Find resources to help you get started with InfoSphere BigInsights, IBM's Hadoop-based offering that extends the value of open source Hadoop with features like Big SQL, text analytics, and BigSheets.
- Follow these self-paced tutorials (PDF) to learn how to manage your big data environment, import data for analysis, analyze data with BigSheets, develop your first big data application, develop Big SQL queries to analyze big data, and create an extractor to derive insights from text documents with InfoSphere BigInsights.
- Find resources to help you get started with InfoSphere Streams, IBM's high-performance computing platform that enables user-developed applications to rapidly ingest, analyze, and correlate information as it arrives from thousands of real-time sources.
Get products and technologies
- Get a data set of Amazon movie reviews.
- Refer to the IBM InfoSphere BigInsights Information Center for product documentation.
- Get Hadoop 0.20.1 from Apache.org.
- Get Hadoop MapReduce.
- Get Hadoop HDFS.
- Download InfoSphere BigInsights Quick Start Edition, available as a native software installation or as a VMware image.
- Download InfoSphere Streams, available as a native software installation or as a VMware image.
- Use InfoSphere Streams on IBM SmartCloud Enterprise.
- Build your next development project with IBM trial software, available for download directly from developerWorks.
- Ask questions and get answers in the InfoSphere BigInsights forum.
- Ask questions and get answers in the InfoSphere Streams forum.