Using IBM Big SQL over HBase, Part 1: Creating tables and loading data

Information On Demand Session 1687

With IBM's Big SQL technology, you can use InfoSphere® BigInsights™ to query HBase using industry-standard SQL. This two-part series focuses on creating tables, data-loading methods, and query handling. Here in Part 1, learn fundamental usage of IBM's Big SQL technology for Hadoop over HBase by creating tables and examining ways to load data. Follow a basic storyline of migrating a relational table to HBase using Big SQL. Part 2 explores query handling, and how to connect to Big SQL via JDBC to run business intelligence and reporting tools, such as BIRT and Cognos®.

Piotr Pruski (piotr.pruski@ca.ibm.com), Partner Enablement Engineer, IBM

Piotr Pruski is a partner enablement engineer within the Information Management Business Partner Ecosystem team at IBM. His main focus is to help accelerate sales and partner success by reaching out to and engaging business partners, enabling them to work with products within the IM portfolio -- namely, IBM InfoSphere BigInsights. In previous roles, he worked with DB2, IBM Smart Analytics System 5710, and briefly with the IBM PureData System for Transactions.



Benjamin Leonhardi, Software Engineer, IBM

Benjamin Leonhardi is the team lead for the Big Data/Warehousing partner enablement team. Before that, he was a software developer for InfoSphere Warehouse at the IBM Research and Development Lab in Boeblingen, Germany, where he worked on the data mining, text mining, and mining reporting solutions.



Deepa Remesh (dremesh@us.ibm.com), IM Advanced Technology Engineer, IBM

Deepa Remesh is a software developer for IBM big data technologies. She currently works on Big SQL, a part of the InfoSphere BigInsights offering that provides an efficient and robust SQL layer over Hadoop. Her primary focus is SQL access to HBase. Deepa previously was involved in design and development of various information management products such as Cloudscape/Apache Derby and Mashup Center. Before joining IBM, she worked at Siemens Communications developing enterprise network management software.



Bruce Brown (brownb@us.ibm.com), Senior Sales Acceleration Architect, IBM

Bruce Brown is a senior sales acceleration architect and a member of IBM's Big Data technical enablement team. He has been working in the government IT industry for more than 20 years as a technical professional supporting systems and software sales. He is a former software developer, Java specialist, and was involved in the early design and go-to-market campaigns with various Java products and technologies.



18 February 2014


Introduction

InfoSphere BigInsights Quick Start Edition

InfoSphere BigInsights Quick Start Edition is a complimentary, downloadable version of InfoSphere BigInsights, IBM's Hadoop-based offering. Using Quick Start Edition, you can try out the features IBM has built to extend the value of open source Hadoop, like Big SQL, text analytics, and BigSheets. Guided learning is available to make your experience as smooth as possible, including step-by-step, self-paced tutorials and videos to help you start putting Hadoop to work for you. With no time or data limit, you can experiment on your own time with large amounts of data. Watch the videos, follow the tutorials (PDF), and download InfoSphere BigInsights Quick Start Edition now.

This series walks you through using IBM's Big SQL technology with InfoSphere BigInsights to query HBase using standard SQL. Here, you'll see how to migrate a table from a relational database to InfoSphere BigInsights using Big SQL over HBase. You'll also explore how HBase handles row keys and learn about some pitfalls you might encounter. We'll try some useful options, such as pre-creating regions to see how it can help with data loading and queries, and cover various ways to load data.

This series covers extensive ground, so we've omitted some fundamental information. At least a rudimentary understanding of InfoSphere BigInsights, HBase, and Jaql is assumed (see Resources for more information about these technologies). You can also download the sample data used in this series.


Background

This exercise uses one table from the Great Outdoors Sales Data Warehouse model (GOSALESDW): SLS_SALES_FACT. Figure 1 shows the details of the table and its primary key information.

Figure 1. SLS_SALES_FACT table
Image shows SLS_SALES_FACT Table used throughout this exercise

Assume there is an available instance of DB2® that contains the following table with data pre-loaded for our migration.

Table/View                      Schema          Type  Creation time             
------------------------------- --------------- ----- --------------------------
SLS_SALES_FACT_10P              DB2INST1        T     2013-08-22-14.54.01.622569

  1 record(s) selected.
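
The listing above is typical output from the DB2 command line processor. If you want to reproduce this check yourself, a minimal sketch is shown below, assuming you can connect to the GOSALES database (the same database referenced by the JDBC URL used later) as the instance owner db2inst1.

db2 connect to GOSALES
db2 list tables for schema DB2INST1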

Issue the select statement shown below to check how many rows are in the table, so that you can later verify that everything migrated properly.

db2 "SELECT COUNT(*) FROM sls_sales_fact_10p"

You should expect 44,603 rows in this table, as shown below.

1          
-----------
      44603

  1 record(s) selected.

Using the describe command below, examine all the columns and data types contained within this table.

db2 "DESCRIBE TABLE sls_sales_fact_10p"
Listing 1. Data types within this table
                                Data type                     Column
Column name                     schema    Data type name      Length     Scale Nulls
------------------------------- --------- ------------------- ---------- ----- ------
ORDER_DAY_KEY                   SYSIBM    INTEGER                      4     0 Yes   
ORGANIZATION_KEY                SYSIBM    INTEGER                      4     0 Yes   
EMPLOYEE_KEY                    SYSIBM    INTEGER                      4     0 Yes   
RETAILER_KEY                    SYSIBM    INTEGER                      4     0 Yes   
RETAILER_SITE_KEY               SYSIBM    INTEGER                      4     0 Yes   
PRODUCT_KEY                     SYSIBM    INTEGER                      4     0 Yes   
PROMOTION_KEY                   SYSIBM    INTEGER                      4     0 Yes   
ORDER_METHOD_KEY                SYSIBM    INTEGER                      4     0 Yes   
SALES_ORDER_KEY                 SYSIBM    INTEGER                      4     0 Yes   
SHIP_DAY_KEY                    SYSIBM    INTEGER                      4     0 Yes   
CLOSE_DAY_KEY                   SYSIBM    INTEGER                      4     0 Yes   
QUANTITY                        SYSIBM    INTEGER                      4     0 Yes   
UNIT_COST                       SYSIBM    DECIMAL                     19     2 Yes   
UNIT_PRICE                      SYSIBM    DECIMAL                     19     2 Yes   
UNIT_SALE_PRICE                 SYSIBM    DECIMAL                     19     2 Yes   
GROSS_MARGIN                    SYSIBM    DOUBLE                       8     0 Yes   
SALE_TOTAL                      SYSIBM    DECIMAL                     19     2 Yes   
GROSS_PROFIT                    SYSIBM    DECIMAL                     19     2 Yes   

  18 record(s) selected.

One-to-one mapping

In this section, we use Big SQL to do a one-to-one mapping of the columns in the relational DB2 table to an HBase table row key and columns. This is not a recommended approach. However, the goal of the exercise is to demonstrate the inefficiency and pitfalls that can occur with such a mapping.

Big SQL supports both one-to-one and many-to-one mappings.

In a one-to-one mapping, the HBase row key and each HBase column are mapped to a single SQL column. In Figure 2, the HBase row key is mapped to the SQL column id. Similarly, the cq_name column within the cf_data column family is mapped to the SQL column name, etc.

Figure 2. One-to-one mapping
Image shows one-to-one mapping example

To begin, you can optionally first create a schema to keep tables organized. Within the Big SQL (JSQSH) shell, use the create schema command to create a schema named gosalesdw, as shown below.

CREATE SCHEMA gosalesdw;

Issue the command shown below in the same Big SQL shell. This DDL statement creates the SQL table with a one-to-one mapping of the columns in our relational DB2 source. Notice that all the column names and data types are the same. The column mapping section requires a mapping for the row key, and HBase columns are identified using family:qualifier.

Listing 2. HBase columns identified using family:qualifier
CREATE HBASE TABLE GOSALESDW.SLS_SALES_FACT
( 
ORDER_DAY_KEY        int, 
ORGANIZATION_KEY     int, 
EMPLOYEE_KEY         int, 
RETAILER_KEY         int, 
RETAILER_SITE_KEY    int, 
PRODUCT_KEY          int, 
PROMOTION_KEY        int, 
ORDER_METHOD_KEY     int, 
SALES_ORDER_KEY      int, 
SHIP_DAY_KEY         int, 
CLOSE_DAY_KEY        int, 
QUANTITY             int, 
UNIT_COST            decimal(19,2), 
UNIT_PRICE           decimal(19,2), 
UNIT_SALE_PRICE      decimal(19,2), 
GROSS_MARGIN         double, 
SALE_TOTAL           decimal(19,2), 
GROSS_PROFIT         decimal(19,2) 
)
COLUMN MAPPING
(
key        mapped by (ORDER_DAY_KEY), 
cf_data:cq_ORGANIZATION_KEY     mapped by (ORGANIZATION_KEY), 
cf_data:cq_EMPLOYEE_KEY         mapped by (EMPLOYEE_KEY), 
cf_data:cq_RETAILER_KEY         mapped by (RETAILER_KEY), 
cf_data:cq_RETAILER_SITE_KEY    mapped by (RETAILER_SITE_KEY), 
cf_data:cq_PRODUCT_KEY          mapped by (PRODUCT_KEY), 
cf_data:cq_PROMOTION_KEY        mapped by (PROMOTION_KEY), 
cf_data:cq_ORDER_METHOD_KEY     mapped by (ORDER_METHOD_KEY), 
cf_data:cq_SALES_ORDER_KEY      mapped by (SALES_ORDER_KEY), 
cf_data:cq_SHIP_DAY_KEY         mapped by (SHIP_DAY_KEY), 
cf_data:cq_CLOSE_DAY_KEY        mapped by (CLOSE_DAY_KEY), 
cf_data:cq_QUANTITY             mapped by (QUANTITY), 
cf_data:cq_UNIT_COST            mapped by (UNIT_COST), 
cf_data:cq_UNIT_PRICE           mapped by (UNIT_PRICE), 
cf_data:cq_UNIT_SALE_PRICE      mapped by (UNIT_SALE_PRICE), 
cf_data:cq_GROSS_MARGIN         mapped by (GROSS_MARGIN), 
cf_data:cq_SALE_TOTAL           mapped by (SALE_TOTAL), 
cf_data:cq_GROSS_PROFIT         mapped by (GROSS_PROFIT) 
);

Big SQL supports a load from source command that can load data from warehouse sources, which we'll use first. Big SQL also supports loading data from delimited files using a load hbase command, which we'll use later.

Adding new JDBC drivers

The load from source command uses Sqoop internally to do the load. Therefore, before using the load command from the Big SQL shell, you need to add the JDBC driver for the source database to the Sqoop library directory and to the JSQSH terminal shared directory.

From a Linux® terminal, issue the following command (as the InfoSphere BigInsights administrator) to copy the JDBC driver JAR file for the database into the $SQOOP_HOME/lib directory.

cp /opt/ibm/db2/V10.5/java/db2jcc.jar $SQOOP_HOME/lib

From the Big SQL shell, you can examine the drivers loaded for the JSQSH terminal, as shown below.

\drivers

Copy the same DB2 driver to the JSQSH share directory with the following command.

cp /opt/ibm/db2/V10.5/java/db2jcc.jar $BIGINSIGHTS_HOME/bigsql/jsqsh/share/

After adding drivers, you must restart the Big SQL server. You can do this from the web console or by using the following command from the Linux terminal.

stop.sh bigsql && start.sh bigsql

You can verify that the driver was loaded into JSQSH by using the \drivers command, as shown above.

Now that the drivers have been set, the load can finally take place. The load from source statement extracts data from a source outside of an InfoSphere BigInsights cluster (DB2 in this case) and loads that data into an InfoSphere BigInsights HBase (or Hive) table.

Issue the following command to load the SLS_SALES_FACT_10P table from DB2 into the SLS_SALES_FACT table we have defined in Big SQL.

LOAD USING JDBC CONNECTION URL 'jdbc:db2://localhost:50000/GOSALES'
WITH PARAMETERS (user = 'db2inst1',password = 'password')
FROM TABLE SLS_SALES_FACT_10P SPLIT COLUMN ORDER_DAY_KEY
INTO HBASE TABLE  gosalesdw.sls_sales_fact APPEND;

You should expect to load 44,603 rows, which is the same number of rows the select count statement on the original DB2 table verified.

44603 rows affected (total: 1m37.74s)

Try to verify this in Big SQL with a select count statement, as shown below.

SELECT COUNT(*) FROM gosalesdw.sls_sales_fact;

Notice there is a discrepancy between the results from the load operation and the select count statement.

+----+
|    |
+----+
| 33 |
+----+
1 row in results(first row: 3.13s; total: 3.13s)

You should also verify from an HBase shell. Issue the count command, as shown below, to verify the number of rows.

count 'gosalesdw.sls_sales_fact'

The results from the Big SQL statement and the HBase command agree with one another.

33 row(s) in 0.7000 seconds

However, this doesn't yet explain why there is a mismatch between the number of loaded rows and the number of retrieved rows when you query the table.

The load (and insert, to be examined later) command behaves like upsert. If a row with the same row key exists, HBase will write the new value as a new version for that column or cell. When querying the table, only the latest value is returned by Big SQL.

This behavior can be confusing. In our case, we loaded data with repeating row key values from a DB2 table with 44,603 rows, and the load reported 44,603 rows affected; however, the select count(*) showed only 33 rows. No errors are thrown in such scenarios, so it is always recommended to cross-check the number of rows by querying the table, as in our example.

Now that you understand that the repeated rows are actually stored as versions in HBase, we can examine a way to retrieve all versions of a particular row.

First, from the Big SQL shell, issue the following select query with a predicate on the order day key. In the original table, there are most likely many tuples with the same order day key.

SELECT organization_key FROM gosalesdw.sls_sales_fact WHERE order_day_key = 20070720;

As expected, you only retrieve one row, which is the latest or newest version of the row inserted into HBase with the specified order day key.

+------------------+
| organization_key |
+------------------+
|            11171 |
+------------------+
1 row in results

Now, using the HBase shell, you can retrieve previous versions for a row key. Use the following get command to get the top four versions of the row with row key 20070720.

get 'gosalesdw.sls_sales_fact', '20070720',
        {COLUMN => 'cf_data:cq_ORGANIZATION_KEY', VERSIONS => 4}

Because the previous command specified only four versions (VERSIONS => 4), you only retrieve four rows in the output, as shown below.

COLUMN                                           CELL
 cf_data:cq_ORGANIZATION_KEY                     timestamp=1383365546430, value=11171
 cf_data:cq_ORGANIZATION_KEY                     timestamp=1383365546429, value=11171
 cf_data:cq_ORGANIZATION_KEY                     timestamp=1383365546428, value=11171
 cf_data:cq_ORGANIZATION_KEY                     timestamp=1383365546427, value=11171
4 row(s) in 0.0360 seconds

Optionally, try the same command again, specifying a larger version number (VERSIONS => 100, for example).
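
For instance:

get 'gosalesdw.sls_sales_fact', '20070720',
        {COLUMN => 'cf_data:cq_ORGANIZATION_KEY', VERSIONS => 100}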

Either way, this is most likely not the behavior users expect when performing such a migration. Users probably wanted to get all the data into the HBase table without versioned cells. There are a couple of solutions to this. One is to define the table with a composite row key to enforce uniqueness, which will be covered later. Another option, outlined in the next section, is to force each row key to be unique by appending a universally unique identifier (UUID).


One-to-one mapping with a unique clause

Another approach to the migration is to use the force key unique option when creating the table using Big SQL syntax. This option forces the load to append a UUID to the row key, which prevents versioning of cells. However, this method is quite inefficient, as it stores more data and also makes queries slower.

Issue the following command in the Big SQL shell. This statement will create the SQL table with the one-to-one mapping of what we have in our relational DB2 source. This DDL statement is almost identical to what you saw in the previous section (One-to-one mapping), with one exception: the force key unique clause is specified for the column mapping of the row key.

Listing 3. DDL statement
CREATE HBASE TABLE GOSALESDW.SLS_SALES_FACT_UNIQUE 
( 
ORDER_DAY_KEY        int, 
ORGANIZATION_KEY     int, 
EMPLOYEE_KEY         int, 
RETAILER_KEY         int, 
RETAILER_SITE_KEY    int, 
PRODUCT_KEY          int, 
PROMOTION_KEY        int, 
ORDER_METHOD_KEY     int, 
SALES_ORDER_KEY      int, 
SHIP_DAY_KEY         int, 
CLOSE_DAY_KEY        int, 
QUANTITY             int, 
UNIT_COST            decimal(19,2), 
UNIT_PRICE           decimal(19,2), 
UNIT_SALE_PRICE      decimal(19,2), 
GROSS_MARGIN         double, 
SALE_TOTAL           decimal(19,2), 
GROSS_PROFIT         decimal(19,2) 
) 
COLUMN MAPPING 
( 
key        mapped by (ORDER_DAY_KEY) force key unique, 
cf_data:cq_ORGANIZATION_KEY     mapped by (ORGANIZATION_KEY), 
cf_data:cq_EMPLOYEE_KEY         mapped by (EMPLOYEE_KEY), 
cf_data:cq_RETAILER_KEY         mapped by (RETAILER_KEY), 
cf_data:cq_RETAILER_SITE_KEY    mapped by (RETAILER_SITE_KEY), 
cf_data:cq_PRODUCT_KEY          mapped by (PRODUCT_KEY), 
cf_data:cq_PROMOTION_KEY        mapped by (PROMOTION_KEY), 
cf_data:cq_ORDER_METHOD_KEY     mapped by (ORDER_METHOD_KEY), 
cf_data:cq_SALES_ORDER_KEY      mapped by (SALES_ORDER_KEY), 
cf_data:cq_SHIP_DAY_KEY         mapped by (SHIP_DAY_KEY), 
cf_data:cq_CLOSE_DAY_KEY        mapped by (CLOSE_DAY_KEY), 
cf_data:cq_QUANTITY             mapped by (QUANTITY), 
cf_data:cq_UNIT_COST            mapped by (UNIT_COST), 
cf_data:cq_UNIT_PRICE           mapped by (UNIT_PRICE), 
cf_data:cq_UNIT_SALE_PRICE      mapped by (UNIT_SALE_PRICE), 
cf_data:cq_GROSS_MARGIN         mapped by (GROSS_MARGIN), 
cf_data:cq_SALE_TOTAL           mapped by (SALE_TOTAL), 
cf_data:cq_GROSS_PROFIT         mapped by (GROSS_PROFIT) 
);

In One-to-one mapping, you used the load from source command to get the data from the DB2 source table into HBase. This may not always be feasible, so we'll explore the load hbase statement instead. The load hbase command loads data into HBase from flat files, which might be an export of the data from the relational source.
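
If you need to produce such a flat file yourself, one possible approach is sketched below: export the table from DB2 as a tab-delimited file and copy it to the DFS path used by the load commands in this article. The file name and export options are illustrative only; adjust them for your environment.

db2 connect to GOSALES
db2 "EXPORT TO SLS_SALES_FACT.10p.txt OF DEL MODIFIED BY COLDEL0x09 SELECT * FROM sls_sales_fact_10p"
hadoop fs -put SLS_SALES_FACT.10p.txt /user/biadmin/gosalesdw/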

Issue the following statement to load data from a file into an InfoSphere BigInsights HBase table.

LOAD HBASE DATA INPATH '/user/biadmin/gosalesdw/SLS_SALES_FACT.10p.txt'
DELIMITED FIELDS TERMINATED BY '\t'
INTO TABLE gosalesdw.sls_sales_fact_unique;

Note that the load hbase command can take an optional list of columns. If no column list is specified, it uses the column ordering in the table definition. The input file can be on DFS or on the local file system where the Big SQL server is running.

Once again, you should expect to load 44,603 rows (the same number of rows that the select count statement on the original DB2 table verified).

44603 rows affected (total: 26.95s)

Verify the number of rows loaded with a select count statement, as shown below.

SELECT COUNT(*) FROM gosalesdw.sls_sales_fact_unique;

This time, there is no discrepancy between the results from the load operation and the select count statement.

+-------+
|       |
+-------+
| 44603 |
+-------+
1 row in results(first row: 1.61s; total: 1.61s)

Issue the same count from the HBase shell, as shown below, to be sure.

count 'gosalesdw.sls_sales_fact_unique'

The values are consistent across the load, select, and count operations.

...
44603 row(s) in 6.8490 seconds

As in the previous section, from the Big SQL shell, issue the following select query with a predicate on the order day key.

SELECT organization_key
FROM gosalesdw.sls_sales_fact_unique
WHERE order_day_key = 20070720;

In One-to-one mapping, only one row was returned for the specified date. This time, expect to see 1,405 rows since the rows are now forced to be unique due to our clause in the create statement and, therefore, no versioning should be applied.

1405 rows in results(first row: 0.47s; total: 0.58s)

Once again, you can check from the HBase shell if there are multiple versions of the cells. Issue the following get statement to try to retrieve the top four versions of the row with row key 20070720.

get 'gosalesdw.sls_sales_fact_unique',
	'20070720', {COLUMN => 'cf_data:cq_ORGANIZATION_KEY', VERSIONS => 4}

Zero rows are returned, as the row key of 20070720 doesn't exist. This is because we've appended the UUID to each row key (20070720 + UUID).

COLUMN                                                       CELL
0 row(s) in 0.0850 seconds

Therefore, issue the following HBase scan command instead of a get. It scans the table using the first part of the row key, with start and stop row values specified so that only the rows of interest are returned.

scan 'gosalesdw.sls_sales_fact_unique', {STARTROW => '20070720', STOPROW => '20070721'}

Notice there are no discrepancies between the results from Big SQL select and HBase scan.

1405 row(s) in 12.1350 seconds

Many-to-one mapping (composite keys and dense columns)

This section discusses the other option of trying to enforce uniqueness of the cells, which involves defining a table with a composite row key — also known as many-to-one mapping.

In a many-to-one mapping, multiple SQL columns are mapped to a single HBase entity (row key or a column). There are two terms that may be used frequently: composite key and dense column. A composite key is an HBase row key mapped to multiple SQL columns. A dense column is an HBase column mapped to multiple SQL columns.

In Figure 3, the row key contains two parts: userid and account number. Each part corresponds to an SQL column. Similarly, the HBase columns are mapped to multiple SQL columns. Note that you can mix these approaches: for example, a single table can have a composite key, a dense column, and a non-dense column.

Figure 3. Many-to-one mapping
Image shows many-to-one mapping example
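
To make Figure 3 concrete, the following sketch shows what such a mixed mapping could look like in Big SQL DDL. The table and column names (demo.users, userid, account_number, and so on) are hypothetical and exist only to illustrate a composite key, a dense column, and a non-dense column in one table.

CREATE HBASE TABLE demo.users
(
userid          varchar(20),
account_number  int,
name            varchar(40),
zip_code        varchar(10),
balance         decimal(10,2)
)
COLUMN MAPPING
(
-- composite row key: two SQL columns packed into the HBase row key
key                 mapped by (userid, account_number),
-- dense column: two SQL columns packed into one HBase column
cf_data:cq_contact  mapped by (name, zip_code),
-- non-dense column: a single SQL column
cf_data:cq_balance  mapped by (balance)
);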

Issue the following DDL statement from the Big SQL shell. It represents all entities from our relational table using a many-to-one mapping. Notice the column mapping section, where multiple SQL columns are mapped to a single family:qualifier.

Listing 4. DDL statement from Big SQL shell
CREATE HBASE TABLE GOSALESDW.SLS_SALES_FACT_DENSE
( 
ORDER_DAY_KEY        int,
ORGANIZATION_KEY     int,
EMPLOYEE_KEY         int,
RETAILER_KEY         int,
RETAILER_SITE_KEY    int,
PRODUCT_KEY          int,
PROMOTION_KEY        int,
ORDER_METHOD_KEY     int, 
SALES_ORDER_KEY      int,
SHIP_DAY_KEY         int,
CLOSE_DAY_KEY        int, 
QUANTITY             int, 
UNIT_COST            decimal(19,2),
UNIT_PRICE           decimal(19,2),
UNIT_SALE_PRICE      decimal(19,2),
GROSS_MARGIN         double,
SALE_TOTAL           decimal(19,2),
GROSS_PROFIT         decimal(19,2)
) 
COLUMN MAPPING 
( 
key        mapped by (ORDER_DAY_KEY, ORGANIZATION_KEY, EMPLOYEE_KEY, RETAILER_KEY,
                        RETAILER_SITE_KEY, PRODUCT_KEY, PROMOTION_KEY, ORDER_METHOD_KEY), 
cf_data:cq_OTHER_KEYS       mapped by (SALES_ORDER_KEY, SHIP_DAY_KEY, CLOSE_DAY_KEY), 
cf_data:cq_QUANTITY         mapped by (QUANTITY), 
cf_data:cq_DOLLAR_VALUES    mapped by (UNIT_COST, UNIT_PRICE, UNIT_SALE_PRICE,
                                        GROSS_MARGIN, SALE_TOTAL, GROSS_PROFIT) 
);

Why do we need many-to-one mapping?

HBase stores a lot of information for each value. For each value stored, a key consisting of the row key, column family name, column qualifier, and timestamp is also stored, so a lot of duplicate information is kept.

HBase is verbose and primarily intended for sparse data. In most cases, data in the relational world is not sparse. If you were to store each SQL column individually in HBase, as previously done in this article, the required storage space would grow dramatically. When querying that data back, the query also returns the entire key (row key, column family, and column qualifier) for each value. For illustration, after loading data into this table, we'll examine the storage space for each of the three tables created thus far.

Issue the following statement, which will load data from a file into the InfoSphere BigInsights HBase table.

LOAD HBASE DATA INPATH '/user/biadmin/gosalesdw/SLS_SALES_FACT.10p.txt'
DELIMITED FIELDS TERMINATED BY '\t'
INTO TABLE gosalesdw.sls_sales_fact_dense;

The number of rows loaded into a table with many-to-one mapping remains the same even though we're storing less data. The statement also executes much faster than the previous load for this exact reason.

44603 rows affected (total: 3.42s)

Issue the same statements and commands from the Big SQL and HBase shells, as in the previous two sections, to verify that the number of rows is the same as in the original dataset. All the results should be the same as before.

SELECT COUNT(*) FROM gosalesdw.sls_sales_fact_dense;
+-------+
|       |
+-------+
| 44603 |
+-------+
1 row in results(first row: 0.93s; total: 0.93s)

SELECT organization_key
FROM gosalesdw.sls_sales_fact_dense
WHERE order_day_key = 20070720;
1405 rows in results(first row: 0.65s; total: 0.68s)

scan 'gosalesdw.sls_sales_fact_dense', {STARTROW => '20070720', STOPROW => '20070721'}
1405 row(s) in 4.3830 seconds

As mentioned, one-to-one mapping uses far more storage space for the same data than composite keys or dense columns, where the HBase row key or HBase columns are made up of multiple relational table columns. HBase repeats the row key, column family name, column name, and timestamp for each column value. For relational data, which is usually dense, this causes an explosion in the required storage space.

Issue the following command as the InfoSphere BigInsights administrator from a Linux terminal to check the directory sizes for the three tables you created.

hadoop fs -du /hbase/
17731926  hdfs://bivm:9000/hbase/gosalesdw.sls_sales_fact
3188      hdfs://bivm:9000/hbase/gosalesdw.sls_sales_fact_dense
47906322  hdfs://bivm:9000/hbase/gosalesdw.sls_sales_fact_unique

Data collation

All data represented thus far has been stored as strings, which is the default encoding for HBase tables created by Big SQL. HBase orders row keys lexicographically (byte by byte), so numeric data stored as strings is not collated correctly, and some queries can return wrong results.
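
As a quick illustration of what lexicographic ordering means: as strings, '1000' sorts before '2', and '2' sorts before '20', because the comparison proceeds byte by byte from the left, whereas numerically 2 < 20 < 1000. The same effect appears later in this section, where the row key 200707201 sorts before 20070721 because at the eighth character a 0 sorts before a 1.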

The following scenario walks through a situation where data is not collated correctly.

Using the Big SQL insert into hbase statement, add the following row to the sls_sales_fact_dense table (previously defined with data loaded). The date specified as part of the ORDER_DAY_KEY column, which has data type int, is a larger numerical value and does not conform to any date standard because it contains an extra digit.

INSERT INTO gosalesdw.sls_sales_fact_dense (
    ORDER_DAY_KEY, ORGANIZATION_KEY, EMPLOYEE_KEY, RETAILER_KEY, RETAILER_SITE_KEY, 
    PRODUCT_KEY, PROMOTION_KEY, ORDER_METHOD_KEY
)
VALUES (200707201, 11171, 4428, 7109, 5588, 30265, 5501, 605);

Issue a scan on the table with the following start and stop criteria.

scan 'gosalesdw.sls_sales_fact_dense', {STARTROW => '20070720', STOPROW => '20070721'}

Notice the last three rows/cells returned from the output of the scan. The newly added row shows up in the scan even though its integer value is not between 20070720 and 20070721.

200707201\x0011171\x004428\x007109\x005588\x003 column=cf_data:cq_DOLLAR_VALUES, ...
 200707201\x0011171\x004428\x007109\x005588\x003 column=cf_data:cq_OTHER_KEYS, ...
 200707201\x0011171\x004428\x007109\x005588\x003 column=cf_data:cq_QUANTITY, ...
1406 row(s) in 4.2400 seconds

Insert another row into the table with the following command. This time, we're conforming to the date format of YYYYMMDD and incrementing the day by one from the last value returned in the table (20070721).

INSERT INTO gosalesdw.sls_sales_fact_dense (
    ORDER_DAY_KEY, ORGANIZATION_KEY, EMPLOYEE_KEY, RETAILER_KEY, RETAILER_SITE_KEY, 
    PRODUCT_KEY, PROMOTION_KEY, ORDER_METHOD_KEY
) 
VALUES (20070721, 11171, 4428, 7109, 5588, 30265, 5501, 605);

Issue another scan on the table. Remember to increase the stoprow criteria by one day.

scan 'gosalesdw.sls_sales_fact_dense', {STARTROW => '20070720', STOPROW => '20070722'}

The newly added row is included as part of the result set, and the row with ORDER_DAY_KEY of 200707201 is after the row with ORDER_DAY_KEY of 20070721. This is an example of numeric data that is not collated properly. The rows are not being stored in numerical order as you might expect but rather in byte lexicographical order.

200707201\x0011171\x004428\x007109\x005588\x003 column=cf_data:cq_DOLLAR_VALUES, ...
 200707201\x0011171\x004428\x007109\x005588\x003 column=cf_data:cq_OTHER_KEYS, ...
 200707201\x0011171\x004428\x007109\x005588\x003 column=cf_data:cq_QUANTITY, ...
 20070721\x0011171\x004428\x007109\x005588\x0030 column=cf_data:cq_DOLLAR_VALUES, ...
 20070721\x0011171\x004428\x007109\x005588\x0030 column=cf_data:cq_OTHER_KEYS, ...
 20070721\x0011171\x004428\x007109\x005588\x0030 column=cf_data:cq_QUANTITY, ...
1407 row(s) in 2.8840 seconds

Many-to-one mapping with binary encoding

Big SQL supports two types of data encodings: string and binary. Each HBase entity can also have its own encoding. For example, a row key can be encoded as a string, and one HBase column can be encoded as binary and another as string.

String is the default encoding used in Big SQL HBase tables. The value is converted to string and stored as UTF-8 bytes. When multiple parts are packed into one HBase entity, separators are used to delimit data. The default separator is the null byte. As it is the lowest byte, it maintains data collation and allows range queries and partial row scans to work correctly.

Binary encoding in Big SQL is sortable, so numeric data, including negative numbers, collates properly. It handles separators internally and avoids issues of separators existing within data by escaping them.

Issue the following DDL statement from the Big SQL shell to create a dense table, as you did in Many-to-one mapping (composite keys and dense columns), but this time you override the default encoding to binary.

Listing 5. Override default encoding to binary
CREATE HBASE TABLE GOSALESDW.SLS_SALES_FACT_DENSE_BINARY 
( 
ORDER_DAY_KEY        int,
ORGANIZATION_KEY     int,
EMPLOYEE_KEY         int,
RETAILER_KEY         int,
RETAILER_SITE_KEY    int,
PRODUCT_KEY          int,
PROMOTION_KEY        int,
ORDER_METHOD_KEY     int, 
SALES_ORDER_KEY      int,
SHIP_DAY_KEY         int,
CLOSE_DAY_KEY        int, 
QUANTITY             int, 
UNIT_COST            decimal(19,2),
UNIT_PRICE           decimal(19,2),
UNIT_SALE_PRICE      decimal(19,2),
GROSS_MARGIN         double,
SALE_TOTAL           decimal(19,2),
GROSS_PROFIT         decimal(19,2)
) 
COLUMN MAPPING 
( 
key        mapped by (ORDER_DAY_KEY, ORGANIZATION_KEY, EMPLOYEE_KEY, RETAILER_KEY,
                        RETAILER_SITE_KEY, PRODUCT_KEY, PROMOTION_KEY, ORDER_METHOD_KEY),
cf_data:cq_OTHER_KEYS           mapped by (SALES_ORDER_KEY, SHIP_DAY_KEY, CLOSE_DAY_KEY), 
cf_data:cq_QUANTITY             mapped by (QUANTITY),
cf_data:cq_DOLLAR_VALUES        mapped by (UNIT_COST, UNIT_PRICE, UNIT_SALE_PRICE,
                                            GROSS_MARGIN, SALE_TOTAL, GROSS_PROFIT)
)
default encoding binary;

Once again, use the load hbase data command to load the data into the table. This time, we're adding the DISABLE WAL clause. The option to disable the write-ahead log (WAL) can speed up writes into HBase. However, this is not a safe option. Turning off WAL can result in data loss if a region server crashes. Another option to speed up load is to increase the write buffer size.
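
As a side note on the write buffer: one common way to enlarge it, outside of the load statement itself, is the standard HBase client property hbase.client.write.buffer, set for example in hbase-site.xml. The 8 MB value below is only an illustration; tune it for your cluster, and keep in mind that a larger buffer consumes more memory on the client and on the region servers.

<property>
  <!-- client-side write buffer; the HBase default is 2 MB (2097152 bytes) -->
  <name>hbase.client.write.buffer</name>
  <value>8388608</value>
</property>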

LOAD HBASE DATA INPATH '/user/biadmin/gosalesdw/SLS_SALES_FACT.10p.txt'
DELIMITED FIELDS TERMINATED BY '\t'
INTO TABLE gosalesdw.sls_sales_fact_dense_binary DISABLE WAL;
44603 rows affected (total: 5.54s)

Issue a select statement on the newly created and loaded table with binary encoding, sls_sales_fact_dense_binary.

SELECT * FROM gosalesdw.sls_sales_fact_dense_binary
go -m discard;

Note that the go -m discard option is used so the results of the command will not be displayed in the terminal.

44603 rows in results(first row: 0.35s; total: 2.89s)

Issue another select statement on the previous table that has string encoding, sls_sales_fact_dense.

SELECT * FROM gosalesdw.sls_sales_fact_dense
go -m discard;
44605 rows in results(first row: 0.31s; total: 3.1s)

A key point here is that the query against the binary-encoded table returns faster. (Numeric types are also collated properly.)

You will probably not see much, if any, performance difference when working with small datasets.

There is no custom serialization/deserialization logic required for string encoding, making it portable if you want to use another application to read data in HBase tables. A primary use case for string encoding is when someone wants to map existing data. Delimited data is a common form of storing data, and it can be easily mapped using Big SQL string encoding. However, parsing strings is expensive and queries with data encoded as strings are slow. And numeric data is not collated correctly, as shown in the example.

Queries on data encoded as binary have faster response times. Numeric data, including negative numbers, are also collated correctly with binary encoding. The downside is that you get data encoded by Big SQL logic and it might not be portable as-is.


Many-to-one mapping with HBase pre-created regions and external tables

HBase automatically handles splitting regions when they reach a set limit. In some scenarios, like bulk loading, it is more efficient to pre-create regions so the load operation can take place in parallel. In the example, the data for sales is four months — April through July 2007. You can pre-create regions by specifying splits in the create table command.

In this section, we create a table within the HBase shell with pre-defined splits, without using any Big SQL features at first. Then we'll show how to map existing HBase data to Big SQL, which is a common practice. Creating external tables makes this possible.

Start by issuing the following statement in the HBase shell. The sls_sales_fact_dense_split table will be created with pre-defined region splits for April through July in 2007.

Listing 6. sls_sales_fact_dense_split table created with pre-defined region splits
create 'gosalesdw.sls_sales_fact_dense_split', {
	 NAME => 'cf_data', REPLICATION_SCOPE => '0', KEEP_DELETED_CELLS => 'false',
	 COMPRESSION => 'NONE', ENCODE_ON_DISK => 'true', BLOCKCACHE => 'true',
	 MIN_VERSIONS => '0', DATA_BLOCK_ENCODING => 'NONE', IN_MEMORY => 'false', 
	 BLOOMFILTER => 'NONE', TTL => '2147483647', VERSIONS => '2147483647',
	 BLOCKSIZE => '65536'}, 
	 {SPLITS => ['200704', '200705', '200706', '200707']}

Issue the following list command on the HBase shell to verify the newly created table.

list

If you were to list the tables from the Big SQL shell, you would not see this table because we haven't made any association yet to Big SQL.
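
For instance, a command such as \tables in the JSQSH shell (assuming your JSQSH build provides it) lists only the tables registered in the Big SQL catalog, so gosalesdw.sls_sales_fact_dense_split will not appear there until the external table is created below.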

Open a browser and point it to http://hostname:60010/ (the HBase master web interface). Scroll down and click the table you just defined in the HBase shell, gosalesdw.sls_sales_fact_dense_split, as shown in Figure 4.

Figure 4. Splits
Image shows splits

Figure 5 shows the pre-created regions that we defined when creating the table.

Figure 5. Pre-created regions
Splits


Execute the following create external hbase command to map the existing table you just created in HBase to Big SQL. With the create external hbase command:

  • The create table statement lets you specify a different name for SQL tables through the hbase table name clause. Using external tables, you can also create multiple views of the same HBase table. For example, one table can map to a few columns and another table to another set of columns, etc.
  • The column mapping section of the create table statement allows you to specify a different separator for each column and row key.
  • You can map tables created using the Hive HBase storage handler; these cannot be read directly using the Big SQL storage handler.
Listing 7. Create an external HBase table over the pre-split table
CREATE EXTERNAL HBASE TABLE GOSALESDW.EXTERNAL_SLS_SALES_FACT_DENSE_SPLIT
( 
ORDER_DAY_KEY        int,
ORGANIZATION_KEY     int,
EMPLOYEE_KEY         int,
RETAILER_KEY         int,
RETAILER_SITE_KEY    int,
PRODUCT_KEY          int,
PROMOTION_KEY        int,
ORDER_METHOD_KEY     int, 
SALES_ORDER_KEY      int,
SHIP_DAY_KEY         int,
CLOSE_DAY_KEY        int, 
QUANTITY             int, 
UNIT_COST            decimal(19,2),
UNIT_PRICE           decimal(19,2),
UNIT_SALE_PRICE      decimal(19,2),
GROSS_MARGIN         double,
SALE_TOTAL           decimal(19,2),
GROSS_PROFIT         decimal(19,2)
) 
COLUMN MAPPING 
( 
key        mapped by (ORDER_DAY_KEY, ORGANIZATION_KEY, EMPLOYEE_KEY, RETAILER_KEY,
                        RETAILER_SITE_KEY, PRODUCT_KEY, PROMOTION_KEY, ORDER_METHOD_KEY) 
                        SEPARATOR '-', 
cf_data:cq_OTHER_KEYS       mapped by (SALES_ORDER_KEY, SHIP_DAY_KEY, CLOSE_DAY_KEY) 
                                        SEPARATOR '/', 
cf_data:cq_QUANTITY         mapped by (QUANTITY), 
cf_data:cq_DOLLAR_VALUES    mapped by (UNIT_COST, UNIT_PRICE, UNIT_SALE_PRICE, 
                                        GROSS_MARGIN, SALE_TOTAL, GROSS_PROFIT) 
                                        SEPARATOR '|'
)
HBASE TABLE NAME 'gosalesdw.sls_sales_fact_dense_split';

The data in external tables is not validated at creation time. For example, if a column in the external table contains data whose separators do not match those defined, the query results will be unpredictable.

Note that external tables are not owned by Big SQL and, hence, cannot be dropped via Big SQL. Also, secondary indices cannot be created via Big SQL on external tables.
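
If you ever need to remove the underlying HBase table itself, you would do so from the HBase shell rather than from Big SQL. A minimal sketch (do not run it if you still need the data):

disable 'gosalesdw.sls_sales_fact_dense_split'
drop 'gosalesdw.sls_sales_fact_dense_split'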

Use the following command to load the external table we have defined.

LOAD HBASE DATA INPATH '/user/biadmin/gosalesdw/SLS_SALES_FACT.10p.txt'
DELIMITED FIELDS TERMINATED BY '\t'
INTO TABLE gosalesdw.external_sls_sales_fact_dense_split;
44603 rows affected (total: 1m57.2s)

Verify that the number of rows loaded is the same number of rows returned by querying the external SQL table.

SELECT COUNT(*) FROM gosalesdw.external_sls_sales_fact_dense_split;
+-------+
|       |
+-------+
| 44603 |
+-------+
1 row in results(first row: 6.44s; total: 6.46s)

Verify the same from the HBase shell directly on the underlying HBase table.

count 'gosalesdw.sls_sales_fact_dense_split'
...
44603 row(s) in 9.1620 seconds

Issue a get command from the HBase shell specifying the row key as follows. Notice the separator between each part of the row key is a hyphen (-), as we defined when originally creating the external table.

get 'gosalesdw.sls_sales_fact_dense_split', '20070720-11171-4428-7109-5588-30263-5501-605'

In the following output, you can also see the other separators we defined for the external table: | for cq_DOLLAR_VALUES and / for cq_OTHER_KEYS.

COLUMN                           CELL
 cf_data:cq_DOLLAR_VALUES        timestamp=1376690502630, 
                                     value=33.59|62.65|62.65|0.4638|1566.25|726.50
 cf_data:cq_OTHER_KEYS           timestamp=1376690502630, value=481896/20070723/20070723
 cf_data:cq_QUANTITY             timestamp=1376690502630, value=25
3 row(s) in 0.0610 seconds

Of course, in Big SQL, you don't need to specify the separators, such as -, when querying against the table, as with the command below.

SELECT * FROM gosalesdw.external_sls_sales_fact_dense_split
WHERE ORDER_DAY_KEY = 20070720 AND ORGANIZATION_KEY = 11171 AND EMPLOYEE_KEY = 4428
AND RETAILER_KEY = 7109 AND RETAILER_SITE_KEY = 5588 AND PRODUCT_KEY = 30263
AND PROMOTION_KEY = 5501 AND ORDER_METHOD_KEY = 605;

Handling errors with load data

How do you handle errors during the load operation? The load hbase command has an option to continue past errors. You can use the LOG ERROR ROWS IN FILE clause to specify a file name to log any rows that could not be loaded because of errors. A few common errors are invalid numeric types and a separator existing within the data for string encoding.

hadoop fs -cat /user/biadmin/gosalesdw/SLS_SALES_FACT_badload.txt
2007072a        11171   …    …    …
b0070720        11171    …    …    …
2007-07-20      11171    …    …    …
20070720        11-71    …    …    …
20070721        11171    …    …    …

The separator appearing within the data is an issue with string encoding.

Knowing there are errors with the input data, go ahead and issue the following load command, specifying a directory and file in which to put the bad rows.

LOAD HBASE DATA INPATH '/user/biadmin/gosalesdw/SLS_SALES_FACT_badload.txt'
DELIMITED FIELDS TERMINATED BY '\t'
INTO TABLE gosalesdw.external_sls_sales_fact_dense_split
LOG ERROR ROWS IN FILE '/tmp/SLS_SALES_FACT_load.err';

In this example, four rows were not loaded because of errors. The load command reports only the rows that were successfully loaded.

1 row affected (total: 2.74s)

Examine the specified file in the load command to view the rows that were not loaded.

hadoop fs -cat /tmp/SLS_SALES_FACT_load.err
"2007072a","11171","…","…","…",
"b0070720","11171","…","…","…",
"2007-07-20","11171","…","…","…",
"20070720","11-71","…","…","…",

Summary

The examples in this article have shown how to create tables and various ways to load data. We covered different types of one-to-one mapping and many-to-one mapping.

Part 2 of this series covers query handling and how to connect to Big SQL via JDBC to run business reports with tools such as BIRT or Cognos.

Acknowledgments

Thanks to Uttam Jain for his contributions to this series.


Downloads

Description                               Name                    Size
Data samples                              IBD-1687A_Data.zip      6KB
Pre-created BIRT Report                   Orders.rptdesign.zip    8KB
Presentation on Big SQL over HBase (1)    IBD-1687A.pdf           3MB

Note

  1. This article is derived from a presentation at Information On Demand Session 1687 — Adding Value to HBase with IBM InfoSphere BigInsights and Big SQL.

Resources

