Process big data with Big SQL in InfoSphere BigInsights

Run complex queries on non-tabular data with a SQL-like language

SQL is a practical querying language, but it has limitations. Big SQL enables you to use an SQL-like language to run complex queries on data that may be non-tabular and may not be stored in a typical SQL database structure at all. Using Big SQL, you can import and process large-volume data sets, including the processed output of other jobs within InfoSphere BigInsights™, turning that information into easily queryable data. In this article, we look at how you can replace your existing infrastructure and queries with Big SQL, and how to convert more complex queries to make use of your Big SQL environment.


Martin C. Brown, Director of Documentation

A professional writer for over 15 years, Martin (MC) Brown is the author of and a contributor to more than 26 books covering an array of topics, including the recently published Getting Started with CouchDB. His expertise spans myriad development languages and platforms: Perl, Python, Java, JavaScript, Basic, Pascal, Modula-2, C, C++, Rebol, Gawk, shell script, Windows, Solaris, Linux, BeOS, Microsoft WP, Mac OS, and more. He currently works as the director of documentation for Continuent.



03 December 2013


Big SQL is IBM's SQL interface to its Hadoop-based platform, InfoSphere BigInsights. Big SQL is designed to provide SQL developers with an easy on-ramp for querying data managed by Hadoop. It enables data administrators to create new tables for data stored in Apache Hive, HBase, or their InfoSphere BigInsights distributed file system.

Importing data for use in Big SQL

InfoSphere BigInsights Quick Start Edition

InfoSphere BigInsights Quick Start Edition is a complimentary, downloadable version of InfoSphere BigInsights, IBM's Hadoop-based offering. Using Quick Start Edition, you can try out the features that IBM has built to extend the value of open source Hadoop, like Big SQL, text analytics, and BigSheets. Guided learning is available to make your experience as smooth as possible, including step-by-step, self-paced tutorials and videos to help you start putting Hadoop to work for you. With no time or data limit, you can experiment on your own time with large amounts of data. Watch the videos, follow the tutorials (PDF), and download BigInsights Quick Start Edition now.

Like any big data processing task, the first step is to load the data so you can start querying it. As in other areas of the Hadoop and InfoSphere BigInsights environment, loading data for use within Big SQL requires two steps:

  1. Create the logical table within Big SQL
  2. Connect the logical structure to the raw data or load the raw data into the logical table

One difference with the Big SQL environment is that the process can be performed in either direction: Data can be loaded through Hive and accessed through Big SQL, or tables can be created within Big SQL and then the data loaded by Big SQL into the tables.
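
For example, a table created and loaded on the Hive side becomes visible to Big SQL with no further work. The following is a minimal sketch of the Hive-side route, reusing the chicago_bus structure and CSV file that appear later in this article:

hive> create table chicago_bus (timelog timestamp, region int, buscount int,
      logreads int, speed float)
      row format delimited fields terminated by ',' lines terminated by '\n';
hive> load data local inpath '/home/biadmin/chicago-bigsql.csv' into table chicago_bus;

The same table can then be queried directly from the bigsql client:

bigsql> select * from chicago_bus limit 5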

Also, in keeping with most other such tools, the system is designed around batch-loading large blocks of data that are then queried in place. Although Big SQL is an SQL veneer over the power of Hadoop, it is still an append-only store: INSERT and UPDATE statements are not supported.

The data itself can be in any of the typical formats (CSV or tab-delimited, for example), and Big SQL can also make use of Hive and HBase tables directly, applying the SQL, or the MapReduce jobs it is translated into, across this data. If these methods do not suit your needs, a custom serializer/deserializer (SerDe) can be used, for example when processing JSON content or serialized binary objects.
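
For example, a table over JSON content can be declared with a SerDe using the same Hive-style DDL used elsewhere in this article. This is only an illustrative sketch: the SerDe class name is a placeholder, and you would substitute whichever SerDe implementation is available on your cluster's classpath.

-- com.example.JsonSerDe is a placeholder class name, not a real SerDe
create table json_events (eventid string, payload string)
  row format serde 'com.example.JsonSerDe'
  stored as textfile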

To load the data when using InfoSphere BigInsights, you can use the web interface to load the data directly into your existing cluster, or use one of a variety of other methods; the command-line client is the most convenient if you have local data files:

$ /opt/ibm/biginsights/bigsql/bin/bigsql client

After you have started the bigsql client, you must separately connect to Big SQL:

bigsql> connect jdbc:bigsql://localhost:7052/default:user=biadmin;password=biadmin

After you connect to the Big SQL database engine, you can use SHOW DATABASES, SHOW TABLES, and other commands familiar from most traditional SQL command-line interfaces. For example, to create a table for the Chicago bus data used in this article:

bigsql> create table chicago_bus (timelog timestamp, region int, buscount int,
    logreads int, speed float)
    row format delimited fields terminated by ',' lines terminated by "\n"

If you are familiar with Hive, the basic DDL structure is identical, and you should have no trouble migrating your existing tables or import tools.

Keep in mind that there are some differences in the data types that Big SQL supports. The primary difference that often causes problems is the limited date storage: within Big SQL, dates are stored only as a time stamp of the form YYYY-MM-DD hh:mm:ss.nnnnn.

This format is strictly enforced. Dates only, times only, 12-hour times, and time zones are not supported. However, a range of built-in functions enables you to extract more familiar year, month, and other time values from the time stamp during processing. These functions are limited, however, and we'll return to this topic later when looking at query building. Avoid storing dates as integer epoch values or as strings, because these cannot be converted to time stamps internally, even with the functions within Big SQL.
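
For example, the YEAR() and MONTH() functions used later in this article can pull those values out of the time stamp at query time. As a simple sketch against the chicago_bus table created above:

SELECT YEAR(timelog) AS year, MONTH(timelog) AS month, AVG(speed) AS avgspeed
  FROM chicago_bus
  GROUP BY YEAR(timelog), MONTH(timelog)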

String data types and larger byte-based data are also limited, which is important to consider if you are processing BLOB data types or very long text values. The built-in string and varchar types are limited to 32K each. If you need to process this type of data, it is probably more efficient to process that information within a text-processing framework, and use the output from that process to build the tables you will use with Big SQL.

Most other fixed types, such as integers, floating point values, and binary types are supported and are fundamentally untouched, but check the documentation for specifics.

Once the table has been created, the data can be loaded into this table structure:

bigsql> load hive data local inpath '/home/biadmin/chicago-bigsql.csv' into table chicago_bus

The previous example specifies the location within the local file system of the CSV file that contains the data to be loaded. You can also access external, previously loaded data, for example the output of a Hive job or an existing HBase table. This option can be useful, as we'll see later, if you want to chain the processing of different components to feed into Big SQL.

When loading the data, Big SQL will not complain about any formatting issues; the data is only processed when queries are performed. It is therefore always a good idea to perform a simple query, such as:

bigsql> SELECT * from chicago_bus LIMIT 5

Check the output and confirm that the data has been parsed as expected. If there are problems, update the format or structure of the source file (if possible) and overwrite the Big SQL table during the load by using OVERWRITE INTO:

bigsql> load hive data local inpath '/home/biadmin/chicago-bigsql.csv' overwrite into table chicago_bus

Note that if you don't use OVERWRITE and you repeat the import, the new data is appended. Care should be taken that you don't duplicate (or triplicate) information, or accidentally overwrite existing data.

Because Big SQL is a layer sitting on top of the core Hadoop content, you can connect from within Big SQL to external databases and run queries to load specific table data for processing. For example, to load data from a DB2 table containing all of the Chicago data, use Listing 1.

Listing 1. Loading data from a DB2 table
bigsql> LOAD 
 USING JDBC CONNECTION URL 'jdbc:db2://db2-server:50000/CHICAGO' 
 WITH PARAMETERS (user = 'user',password='password')
FROM TABLE CHICAGO 
INTO TABLE chicago

The load specification can also be a query: add a WHERE clause to restrict the rows that are loaded, as shown in Listing 2.

Listing 2. Adding the WHERE clause
bigsql> LOAD 
 USING JDBC CONNECTION URL 'jdbc:db2://db2-server:50000/CHICAGO' 
 WITH PARAMETERS (user = 'user',password='password')
FROM TABLE CHICAGO 
WHERE REGION > 6
INTO TABLE chicago

This type of data loading is rare. More likely, you will be loading data from some other part of the cluster — when processing the output from MapReduce jobs, for example.


Exporting internal processing jobs for use in Big SQL

As mentioned, it is tempting to use Big SQL by itself as the solution for processing queries. Where it really excels, however, is in providing a query interface over data generated by a separate Hadoop process.

In the majority of cases, Big SQL adds value on top of the traditional distributed storage and processing: you can either process data directly with a Big SQL query, or use Big SQL as a veneer over the output of the core distributed processing so that you can query that content.

One example of using Big SQL as a veneer is processing text with a standard MapReduce job that writes the extracted data out in a CSV format suitable for reading back into Big SQL for querying.

Using the Stanford movie reviews data set, which contains movie reviews from Amazon, I ran a MapReduce process that looked for certain keywords, such as "love," "hate," "good," "best," and "worst," and compared them against the actual rating given to determine accuracy.

The MapReduce job writes the parsed data back out to a file within the Hadoop Distributed File System (HDFS), although I could have written it directly to a Hive table.

To import that data, you can load the file into a Hive table you previously created within Big SQL:

bigsql> load hive data local inpath '/host:port/amazon/reviews-analysed.csv' into table reviews

Or, to access the file directly, use the code in Listing 3.

Listing 3. Accessing files directly
create external table azreviews
(productid string, userid string, score float, confidence int)
row format delimited
fields terminated by ','
stored as textfile
location '/amazon/reviews-analysed.csv'

This creates an implied link to the underlying CSV file as the source for the table data when queried. This saves a load stage and processing, and enables Big SQL to access the live table data, even at the same time that data is being processed.
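
For example, a query against the azreviews table defined in Listing 3 reads the underlying CSV data directly; the following sketch simply summarizes the scores per product using the columns declared in that listing:

SELECT productid, COUNT(*) AS reviews, AVG(score) AS avgscore
  FROM azreviews
  GROUP BY productid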


Building Big SQL queries

Another option is to replace your MapReduce jobs entirely with corresponding Big SQL queries and dispense with handwritten MapReduce code. This option requires some basic knowledge of how querying and processing are handled by Big SQL so you can get the best out of the implied processing that Big SQL performs. Ironically, most solutions have migrated off SQL, and Big SQL gives that power back, with some caveats.

The difference between Big SQL and traditional SQL queries is actually very small. Most of the basic structure and sequence you are familiar with from traditional SQL queries will work without much modification. However, because of the way Big SQL works, some query types and conversions require additional thought and preparation.

For the majority of simple queries, the processing is performed entirely internally by the Big SQL server. For larger jobs, particularly those that involve aggregation, Big SQL converts the query into a MapReduce job, much as Hive does.

There are some rules, however, for how to correctly build a statement that executes effectively.

Basic queries

Basic queries such as SELECT with and without WHERE clauses can be processed easily.

Listing 4. Basic queries
bigsql> select title,servings from recipe limit 10
+------------------------------+----------+
|title                         |servings  |
+------------------------------+----------+
|varchar                       |int       |
+------------------------------+----------+
|Creamy egg and leek special   |4         |
|Chakchouka                    |4         |
|Mozzarella and olive pizza    |4         |
|Savoury pancakes              |8         |
|Seafood filling for pancakes  |8         |
|Pesto                         |-         |
|Easter egg cakes              |12        |
|Stilton walnut mousse         |8         |
|Easy omelette                 |-         |
|Spanish omelette              |6         |
+------------------------------+----------+

With WHERE clauses, performance can be improved by placing the condition that restricts the most data first. As the reduction occurs, this removes more data from the rows buffered in memory and makes it easier to eliminate further rows.

For example, when restricting by both a type and a date, put whichever condition you expect to reduce the table data the most first, based upon the values involved.
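
As a sketch against the chicago_bus table, assuming that the region filter discards far more rows than the speed filter in this particular data set (the literal values here are purely illustrative), the more selective condition is listed first:

SELECT timelog, region, speed
  FROM chicago_bus
  WHERE region = 3
    AND speed > 30.0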

Aggregates

Aggregates are the logical equivalent of the reduce step in most MapReduce jobs, but care should be taken to ensure that you run the aggregation effectively on the data; Big SQL converts it into a reduce function for you.

Aggregates such as COUNT, SUM, and STDDEV work the same as in traditional SQL but, in accordance with the processing tricks already discussed, performance can be improved if you build the query in such a way as to reduce the overall processing load at each level. For example, performing AVG() or SUM() on the Chicago bus data is as straightforward through a Big SQL statement as it is through traditional SQL:

SELECT region,AVG(speed) FROM chicago_bus GROUP BY region

As with normal MapReduce queries, you should try to optimize the data before applying the aggregates.
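
For example, restricting the rows before grouping keeps the aggregation stage smaller. The sketch below reuses the chicago_bus table; the logreads threshold is an illustrative value, not something taken from the original data set:

SELECT region, AVG(speed) AS avgspeed
  FROM chicago_bus
  WHERE logreads > 10
  GROUP BY region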


Optimizing joins

When dealing with any JOIN, performance can be improved by specifying the join that reduces the overall data set higher up in the query. This is true, for example, when processing logging information tagged with dates for specific log events and joined to the error condition. The error type table is joined purely for informational purposes, but the join on the log instance (that is, where the date information is located) acts as a reducer through the associated WHERE clause.

In the query shown in Listing 5, Big SQL processes this first by joining the error description to the original table. With 100 million rows of data, that requires 100 million implied joins.

Listing 5. Optimizing joins
SELECT date,status,errordesc FROM access_log 
    JOIN errortypes on (access_log.errortype = errortypes.id) 
    JOIN access_instance on (access_log.instanceid = access_instance.id)
    WHERE access_instance.date > NOW()

The second join attaches the date information through the join, and the overall data content is reduced through the WHERE clause.

By reversing the join statements, as shown in Listing 6, the date selection is performed first and the data reduced before it is joined to the error information. The resulting query is much quicker.

Listing 6. Reversing the join statements
SELECT date,status,errordesc FROM access_log 
    JOIN access_instance on (access_log.instanceid = access_instance.id)
    JOIN errortypes on (access_log.errortype = errortypes.id) 
    WHERE access_instance.date > NOW()

Translating standard SQL statements to Big SQL

The majority of your SQL queries can be processed directly, but some queries can be modified and adapted within Big SQL to take advantage of the different environment.

Date processing

Although not strictly related to the SQL statements you write, care should be taken with different data types and values. We've already seen the core differences within the built-in data types supported by Big SQL; particularly for dates, the differences can cause problems.

Certain queries that rely on advanced date processing are much more difficult to handle. For example, grouping information on dates, as in Listing 7, can be inefficient.

Listing 7. Grouping information can be inefficient
SELECT diskusage, YEAR(statdate) AS year, MONTH(statdate) AS month FROM statmon 
WHERE statdate _YEARS_BETWEEN(statdate, _ADD_YEARS(statdate,2)) 
GROUP BY year,month

With Big SQL, the same query will work, but performance may be affected, especially if you also include a WHERE clause that restricts the output. Using a sub-select may be more efficient in this case, as shown in Listing 8.

Listing 8. Using a sub-select may be more efficient
SELECT diskusage, YEAR(statdate) AS year, MONTH(statdate) AS month FROM
 (SELECT diskusage, statdate FROM statmon 
  WHERE statdate _YEARS_BETWEEN(statdate, _ADD_YEARS(statdate,2))) AS monthstats 
GROUP BY year,month

This performs the sub-query to select the date, then performs the year and month extraction from the date.

Groups

Be careful when performing GROUP BY statements. Big SQL operates on the assumption that any column listed in the SELECT clause of the query will also appear in the GROUP BY clause. Consider the following query, which is perfectly valid (although meaningless, since the opdata value will be picked arbitrarily from the data set) in traditional SQL environments:

SELECT operation,opdata FROM access_log GROUP BY operation

Big SQL is not forgiving, and you must specify the fields explicitly:

SELECT operation,opdata FROM access_log GROUP BY operation,opdata
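
If you only want one row per operation, an alternative (not one of the queries above, just a suggested workaround) is to aggregate the extra column instead of grouping on it, so that every selected column has an explicit rule:

SELECT operation, COUNT(*) AS requests, MIN(opdata) AS sample_opdata
  FROM access_log
  GROUP BY operation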

Chaining SQL

Queries that involve large, complicated joins may be more efficiently processed if they are converted into a series of discrete queries and processing jobs that produce intermediate tables. This method is similar to using views or temporary tables within a traditional SQL environment, but there is no explicit method for doing this in Big SQL.

Use of discrete queries is an advantage, especially if your queries rely on sub-queries, sub-selects, or multi-level aggregation over large volumes of data. This trick is counter-intuitive for some tasks because MapReduce processing usually allows for multiple levels of processing within a single pass. By chaining queries, you produce intermediate data points that can then be reprocessed.

For example, when collecting information about disk usage across a cluster of machines over a period of time, the core query, shown in Listing 9, is particularly complex.

Listing 9. Complex core query
SELECT d.statid,d.used,class,subclass,name 
  FROM (select min(statid) as matchid from statmon_du 
        where recorded >= (date(subdate(now(),28))) group by pathid) AS X 
  INNER JOIN statmon_du as d on d.statid = x.matchid 
  join (statmon_du_locs) on (d.pathid = statmon_du_locs.id) 
  where statmon_du_locs.active = 1 
  ORDER BY name,subclass

This code contains a number of inner and outer joins that could be simplified by performing the inner SELECT separately, as shown in Listing 10.

Listing 10. Simplifying by performing the inner SELECT
CREATE TABLE month_statmon (statid) ROW FORMAT DELIMITED FIELDS 
TERMINATED BY '\t' AS (SELECT min(statid) as matched from statmon_du 
where recorded >= (date(subdate(now(),28))) group by pathid)

You can then join from this table to the larger query as a direct join, rather than as an explicit inner join. Big SQL can use the joined table more efficiently in this instance, although other data sets may be different.
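
As a sketch, the follow-up query then joins directly against the month_statmon table created in Listing 10, reusing the table and column names from Listing 9 (adjust these to match your own schema):

SELECT d.statid, d.used, class, subclass, name
  FROM month_statmon m
  JOIN statmon_du AS d ON d.statid = m.statid
  JOIN statmon_du_locs ON d.pathid = statmon_du_locs.id
  WHERE statmon_du_locs.active = 1
  ORDER BY name, subclass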


Helping Big SQL processing

Behind the scenes, for complicated queries, Big SQL converts the SQL into a MapReduce process that must then be run over the content to produce the desired output. In most cases, the translation works fine, but occasionally the format and processing can be improved. This is one area where Big SQL differs from typical SQL environments, where the parser and optimization engine are highly optimized to process the data.

Given these limitations, it's a good idea to provide hints to the SQL parser that will enable it to make better decisions about how the information should be processed or joined. These hints can have a significant effect on processing speed and sometimes on the content (and quality) of the processed output.

Starting at the top of the SQL parsing process, the simplest hint is one that tells Big SQL whether the data should be processed internally (that is, within the Big SQL memory space) or through a MapReduce process: accessmode='local'. When aggregating by counts or sums, for example, if there is only a small number of groups, local processing may be quicker, even for very large data sets.

Conversely, you can force what might have been a simple query to use MapReduce: accessmode='mapreduce'. These hints are honored only if you have the appropriate memory (for local mode) and resources available.
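
The exact way these hints are attached to a statement depends on your Big SQL release; in some releases they are supplied as a comment-style hint embedded in the query. The following is only a hedged sketch of that form (verify the syntax against the documentation for your version):

-- assumed comment-hint placement; check your Big SQL release for the exact syntax
SELECT /*+ accessmode='local' +*/ region, AVG(speed) AS avgspeed
  FROM chicago_bus
  GROUP BY region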

Regardless of the size of your data set, it is always a good idea to run queries with each hint forced and time them. You may be surprised at the differences: I've seen the same statement take just shy of 7 seconds in one mode and 4 minutes in the other, sometimes in favor of either solution. But don't run the queries back to back without restarting Big SQL, because queries are cached.

It can be helpful to force local mode for the later levels of compound aggregate statements: MapReduce handles the larger data set in the earlier stages, while in-memory aggregation is used once you are dealing with only a few hundred or a thousand rows.


Summary

Big SQL tries to fill the gap of writing complex MapReduce processes over your data by placing a wrapper around that process with a SQL-like interface. It's not perfect, and because it translates these queries for you into internal processes or MapReduce jobs, you have to craft the SQL more carefully. Unlike traditional SQL environments, Big SQL has no way of optimizing the query, making use of indices, or otherwise understanding the structure of the data to optimize the queries itself. In this article, you've seen classic examples of how small query modifications can make big differences to performance, and how to change or rewrite existing queries and MapReduce jobs to suit the Big SQL environment.

