Contents
- Introduction
- Background
- One-to-one mapping
- One-to-one mapping with a unique clause
- Many-to-one mapping (composite keys and dense columns)
- Many-to-one mapping with binary encoding
- Many-to-one mapping with HBase pre-created regions and external tables
- Handling errors with load data
- Summary
- Downloadable resources
- Related topics
Using IBM Big SQL over HBase, Part 1
Creating tables and loading data
Information On Demand — Session 1687
This series walks you through using IBM's Big SQL technology with InfoSphere BigInsights to query HBase using standard SQL. Here, you'll see how to migrate a table from a relational database to InfoSphere BigInsights using Big SQL over HBase. You'll also explore how HBase handles row keys, learn about some pitfalls you might encounter, and try useful options, such as pre-creating regions, to see how they can help with data loading and queries. We'll also cover various ways to load data.
This series covers extensive ground, so we've omitted some fundamental information. At least a rudimentary understanding of InfoSphere BigInsights, HBase, and Jaql is assumed (see Related topics for more information about these technologies). You can also download the sample data used in this series.
Background
This exercise uses one table from the Great Outdoors Sales Data Warehouse model (GOSALESDW): SLS_SALES_FACT. Figure 1 shows the details of the table and its primary key information.
Figure 1. SLS_SALES_FACT table

Assume there is an available instance of DB2® that contains the following table with data pre-loaded for our migration.
Table/View                      Schema          Type  Creation time
------------------------------- --------------- ----- --------------------------
SLS_SALES_FACT_10P              DB2INST1        T     2013-08-22-14.54.01.622569

  1 record(s) selected.
By issuing the select statement, as shown below, you can examine how many rows are in the table to ensure that everything will be migrated properly later.
db2 "SELECT COUNT(*) FROM sls_sales_fact_10p"
You should expect 44,603 rows in this table, as shown below.
1
-----------
      44603

  1 record(s) selected.
Using the describe command below, examine all the columns and data types contained within this table.
db2 "DESCRIBE TABLE sls_sales_fact_10p"
Listing 1. Data types within this table
                                Data type                     Column
Column name                     schema    Data type name      Length     Scale Nulls
------------------------------- --------- ------------------- ---------- ----- ------
ORDER_DAY_KEY                   SYSIBM    INTEGER                      4     0 Yes
ORGANIZATION_KEY                SYSIBM    INTEGER                      4     0 Yes
EMPLOYEE_KEY                    SYSIBM    INTEGER                      4     0 Yes
RETAILER_KEY                    SYSIBM    INTEGER                      4     0 Yes
RETAILER_SITE_KEY               SYSIBM    INTEGER                      4     0 Yes
PRODUCT_KEY                     SYSIBM    INTEGER                      4     0 Yes
PROMOTION_KEY                   SYSIBM    INTEGER                      4     0 Yes
ORDER_METHOD_KEY                SYSIBM    INTEGER                      4     0 Yes
SALES_ORDER_KEY                 SYSIBM    INTEGER                      4     0 Yes
SHIP_DAY_KEY                    SYSIBM    INTEGER                      4     0 Yes
CLOSE_DAY_KEY                   SYSIBM    INTEGER                      4     0 Yes
QUANTITY                        SYSIBM    INTEGER                      4     0 Yes
UNIT_COST                       SYSIBM    DECIMAL                     19     2 Yes
UNIT_PRICE                      SYSIBM    DECIMAL                     19     2 Yes
UNIT_SALE_PRICE                 SYSIBM    DECIMAL                     19     2 Yes
GROSS_MARGIN                    SYSIBM    DOUBLE                       8     0 Yes
SALE_TOTAL                      SYSIBM    DECIMAL                     19     2 Yes
GROSS_PROFIT                    SYSIBM    DECIMAL                     19     2 Yes

  18 record(s) selected.
One-to-one mapping
In this section, we use Big SQL to do a one-to-one mapping of the columns in the relational DB2 table to an HBase table row key and columns. This is not a recommended approach. However, the goal of the exercise is to demonstrate the inefficiency and pitfalls that can occur with such a mapping.
Big SQL supports both one-to-one and many-to-one mappings. In a one-to-one mapping, the HBase row key and each HBase column are mapped to a single SQL column. In Figure 2, the HBase row key is mapped to the SQL column id. Similarly, the cq_name column within the cf_data column family is mapped to the SQL column name, and so on.
Figure 2. One-to-one mapping

To begin, you can optionally create a schema to keep tables organized. Within the Big SQL (JSQSH) shell, use the create schema command to create a schema named gosalesdw, as shown below.
CREATE SCHEMA gosalesdw;
Issue the command shown below in the same Big SQL shell. This DDL statement will create the SQL table with the one-to-one mapping of what is in our relational DB2 source. Notice all the column names are the same, with the same data types. The column mapping section requires a mapping for the row key. HBase columns are identified using family:qualifier.
Listing 2. HBase columns identified using family:qualifier
CREATE HBASE TABLE GOSALESDW.SLS_SALES_FACT
(
    ORDER_DAY_KEY int,
    ORGANIZATION_KEY int,
    EMPLOYEE_KEY int,
    RETAILER_KEY int,
    RETAILER_SITE_KEY int,
    PRODUCT_KEY int,
    PROMOTION_KEY int,
    ORDER_METHOD_KEY int,
    SALES_ORDER_KEY int,
    SHIP_DAY_KEY int,
    CLOSE_DAY_KEY int,
    QUANTITY int,
    UNIT_COST decimal(19,2),
    UNIT_PRICE decimal(19,2),
    UNIT_SALE_PRICE decimal(19,2),
    GROSS_MARGIN double,
    SALE_TOTAL decimal(19,2),
    GROSS_PROFIT decimal(19,2)
)
COLUMN MAPPING
(
    key mapped by (ORDER_DAY_KEY),
    cf_data:cq_ORGANIZATION_KEY mapped by (ORGANIZATION_KEY),
    cf_data:cq_EMPLOYEE_KEY mapped by (EMPLOYEE_KEY),
    cf_data:cq_RETAILER_KEY mapped by (RETAILER_KEY),
    cf_data:cq_RETAILER_SITE_KEY mapped by (RETAILER_SITE_KEY),
    cf_data:cq_PRODUCT_KEY mapped by (PRODUCT_KEY),
    cf_data:cq_PROMOTION_KEY mapped by (PROMOTION_KEY),
    cf_data:cq_ORDER_METHOD_KEY mapped by (ORDER_METHOD_KEY),
    cf_data:cq_SALES_ORDER_KEY mapped by (SALES_ORDER_KEY),
    cf_data:cq_SHIP_DAY_KEY mapped by (SHIP_DAY_KEY),
    cf_data:cq_CLOSE_DAY_KEY mapped by (CLOSE_DAY_KEY),
    cf_data:cq_QUANTITY mapped by (QUANTITY),
    cf_data:cq_UNIT_COST mapped by (UNIT_COST),
    cf_data:cq_UNIT_PRICE mapped by (UNIT_PRICE),
    cf_data:cq_UNIT_SALE_PRICE mapped by (UNIT_SALE_PRICE),
    cf_data:cq_GROSS_MARGIN mapped by (GROSS_MARGIN),
    cf_data:cq_SALE_TOTAL mapped by (SALE_TOTAL),
    cf_data:cq_GROSS_PROFIT mapped by (GROSS_PROFIT)
);
Big SQL supports a load from source command that can load data from warehouse sources, which we'll use first. Big SQL also supports loading data from delimited files using a load hbase command, which we'll use later.
Adding new JDBC drivers
The load from source command uses Sqoop internally to do the load. Therefore, before using the load command from a Big SQL shell, you need to add the driver for the JDBC source to the Sqoop library directory, then to the JSQSH terminal shared directory.
From a Linux® terminal, issue the following command (as the InfoSphere BigInsights administrator) to copy the JDBC driver JAR file used to access the database into the $SQOOP_HOME/lib directory.
cp /opt/ibm/db2/V10.5/java/db2jcc.jar $SQOOP_HOME/lib
From the Big SQL shell, you can examine the drivers loaded for the JSQSH terminal, as shown below.
\drivers
Copy the same DB2 driver to the JSQSH share directory with the following command.
cp /opt/ibm/db2/V10.5/java/db2jcc.jar $BIGINSIGHTS_HOME/bigsql/jsqsh/share/
After new drivers are added, the Big SQL server must be restarted. You can do this from the web console or by using the following command from the Linux terminal.
stop.sh bigsql && start.sh bigsql
You can verify that the driver was loaded into JSQSH by using the \drivers command, as shown above.
Now that the drivers have been set up, the load can finally take place. The load from source statement extracts data from a source outside of an InfoSphere BigInsights cluster (DB2 in this case) and loads that data into an InfoSphere BigInsights HBase (or Hive) table.
Issue the following command to load the SLS_SALES_FACT_10P table from DB2 into the SLS_SALES_FACT table we have defined in Big SQL.
LOAD USING JDBC CONNECTION URL 'jdbc:db2://localhost:50000/GOSALES'
WITH PARAMETERS (user = 'db2inst1', password = 'password')
FROM TABLE SLS_SALES_FACT_10P SPLIT COLUMN ORDER_DAY_KEY
INTO HBASE TABLE gosalesdw.sls_sales_fact APPEND;
You should expect to load 44,603 rows, which is the same number of rows the select count statement on the original DB2 table verified.
44603 rows affected (total: 1m37.74s)
Try to verify this in Big SQL with a select count statement, as shown below.
SELECT COUNT(*) FROM gosalesdw.sls_sales_fact;
Notice there is a discrepancy between the results from the load operation and the select count statement.
+----+
|    |
+----+
| 33 |
+----+
1 row in results(first row: 3.13s; total: 3.13s)
You should also verify from an HBase shell. Issue the count command, as shown below, to verify the number of rows.
count 'gosalesdw.sls_sales_fact'
The results from the Big SQL statement and the HBase command agree with each other.
33 row(s) in 0.7000 seconds
However, this doesn't yet explain why there is a mismatch between the number of loaded rows and the number of retrieved rows when you query the table.
The load command (and the insert command, examined later) behaves like an upsert. If a row with the same row key already exists, HBase writes the new value as a new version of that column or cell. When querying the table, Big SQL returns only the latest version.
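If you want to see this versioning behavior in isolation, you can reproduce it in the HBase shell with a small scratch table. This is a minimal sketch: the demo_versions table and its values are hypothetical, and VERSIONS => 3 simply tells HBase to retain up to three versions per cell.
create 'demo_versions', {NAME => 'cf', VERSIONS => 3}
# Two puts to the same row key and column both succeed without error
put 'demo_versions', 'row1', 'cf:q', 'first'
put 'demo_versions', 'row1', 'cf:q', 'second'
# count reports a single row, yet get returns both stored versions of the cell
count 'demo_versions'
get 'demo_versions', 'row1', {COLUMN => 'cf:q', VERSIONS => 3}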
This behavior can be confusing. In our case, we loaded data with repeating values for the row key from a DB2 table with 44,603 rows, and the load reported 44,603 rows affected. However, the select count(*) showed far fewer rows (33). No errors are thrown in such scenarios, so it is always recommended to cross-check the number of rows by querying the table, as in our example.
Now that you understand that all the rows are actually versioned in HBase, we can examine a possible way to retrieve all versions of a particular row.
First, from the Big SQL shell, issue the following select query with a predicate on the order day key. In the original table, there are most likely many tuples with the same order day key.
SELECT organization_key FROM gosalesdw.sls_sales_fact WHERE order_day_key = 20070720;
As expected, you only retrieve one row, which is the latest or newest version of the row inserted into HBase with the specified order day key.
+------------------+
| organization_key |
+------------------+
| 11171            |
+------------------+
1 row in results
Now, using the HBase shell, you can retrieve previous versions for a row key. Use the following get command to get the top four versions of the row with row key 20070720.
get 'gosalesdw.sls_sales_fact', '20070720', {COLUMN => 'cf_data:cq_ORGANIZATION_KEY', VERSIONS => 4}
Because the previous command specified only four versions (VERSIONS => 4), you only retrieve four rows in the output, as shown below.
COLUMN                        CELL
 cf_data:cq_ORGANIZATION_KEY  timestamp=1383365546430, value=11171
 cf_data:cq_ORGANIZATION_KEY  timestamp=1383365546429, value=11171
 cf_data:cq_ORGANIZATION_KEY  timestamp=1383365546428, value=11171
 cf_data:cq_ORGANIZATION_KEY  timestamp=1383365546427, value=11171
4 row(s) in 0.0360 seconds
Optionally, try the same command again, specifying a larger number of versions (VERSIONS => 100, for example).
Either way, this is most likely not the behavior users expect when performing such a migration. Users probably wanted to get all the data into the HBase table without versioned cells. There are a couple of solutions to this. One is to define the table with a composite row key to enforce uniqueness, which will be covered later. Another option, outlined in the next section, is to force each row key to be unique by appending a universally unique identifier (UUID).
One-to-one mapping with a unique clause
Another approach to the migration is to use the force key unique option when creating the table using Big SQL syntax. This option forces the load to add a UUID to the row key, which prevents versioning of cells. However, this method is quite inefficient, as it stores more data and also makes queries slower.
Issue the following command in the Big SQL shell. This statement will create the SQL table with the one-to-one mapping of what we have in our relational DB2 source. This DDL statement is almost identical to what you saw in the previous section (One-to-one mapping), with one exception: the force key unique clause is specified for the column mapping of the row key.
Listing 3. DDL statement
CREATE HBASE TABLE GOSALESDW.SLS_SALES_FACT_UNIQUE
(
    ORDER_DAY_KEY int,
    ORGANIZATION_KEY int,
    EMPLOYEE_KEY int,
    RETAILER_KEY int,
    RETAILER_SITE_KEY int,
    PRODUCT_KEY int,
    PROMOTION_KEY int,
    ORDER_METHOD_KEY int,
    SALES_ORDER_KEY int,
    SHIP_DAY_KEY int,
    CLOSE_DAY_KEY int,
    QUANTITY int,
    UNIT_COST decimal(19,2),
    UNIT_PRICE decimal(19,2),
    UNIT_SALE_PRICE decimal(19,2),
    GROSS_MARGIN double,
    SALE_TOTAL decimal(19,2),
    GROSS_PROFIT decimal(19,2)
)
COLUMN MAPPING
(
    key mapped by (ORDER_DAY_KEY) force key unique,
    cf_data:cq_ORGANIZATION_KEY mapped by (ORGANIZATION_KEY),
    cf_data:cq_EMPLOYEE_KEY mapped by (EMPLOYEE_KEY),
    cf_data:cq_RETAILER_KEY mapped by (RETAILER_KEY),
    cf_data:cq_RETAILER_SITE_KEY mapped by (RETAILER_SITE_KEY),
    cf_data:cq_PRODUCT_KEY mapped by (PRODUCT_KEY),
    cf_data:cq_PROMOTION_KEY mapped by (PROMOTION_KEY),
    cf_data:cq_ORDER_METHOD_KEY mapped by (ORDER_METHOD_KEY),
    cf_data:cq_SALES_ORDER_KEY mapped by (SALES_ORDER_KEY),
    cf_data:cq_SHIP_DAY_KEY mapped by (SHIP_DAY_KEY),
    cf_data:cq_CLOSE_DAY_KEY mapped by (CLOSE_DAY_KEY),
    cf_data:cq_QUANTITY mapped by (QUANTITY),
    cf_data:cq_UNIT_COST mapped by (UNIT_COST),
    cf_data:cq_UNIT_PRICE mapped by (UNIT_PRICE),
    cf_data:cq_UNIT_SALE_PRICE mapped by (UNIT_SALE_PRICE),
    cf_data:cq_GROSS_MARGIN mapped by (GROSS_MARGIN),
    cf_data:cq_SALE_TOTAL mapped by (SALE_TOTAL),
    cf_data:cq_GROSS_PROFIT mapped by (GROSS_PROFIT)
);
In One-to-one mapping, you used the load from source command to get the data from the table in the DB2 source into HBase. This may not always be feasible, so we'll explore the load hbase statement. The load hbase command loads data into HBase from flat files, such as an export of the data from the relational source.
Issue the following statement to load data from a file into an InfoSphere BigInsights HBase table.
LOAD HBASE DATA INPATH '/user/biadmin/gosalesdw/SLS_SALES_FACT.10p.txt'
DELIMITED FIELDS TERMINATED BY '\t'
INTO TABLE gosalesdw.sls_sales_fact_unique;
Note that the load hbase command can take an optional list of columns. If no column list is specified, it uses the column ordering in the table definition. The input file can be on the DFS or on the local file system where the Big SQL server is running.
Once again, you should expect to load 44,603 rows (the same number of rows that the select count statement on the original DB2 table verified).
44603 rows affected (total: 26.95s)
Verify the number of rows loaded with a select count statement, as shown below.
SELECT COUNT(*) FROM gosalesdw.sls_sales_fact_unique;
This time, there is no discrepancy between the results from the load operation and the select count statement.
+-------+
|       |
+-------+
| 44603 |
+-------+
1 row in results(first row: 1.61s; total: 1.61s)
Issue the same count from the HBase shell, as shown below, to be sure.
count 'gosalesdw.sls_sales_fact_unique'
The values are consistent across load, select, and count.
... 44603 row(s) in 6.8490 seconds
As in the previous section, from the Big SQL shell, issue the following select query with a predicate on the order day key.
SELECT organization_key FROM gosalesdw.sls_sales_fact_unique WHERE order_day_key = 20070720;
In One-to-one mapping, only one row was returned for the specified date. This time, expect to see 1,405 rows since the rows are now forced to be unique due to our clause in the create statement and, therefore, no versioning should be applied.
1405 rows in results(first row: 0.47s; total: 0.58s)
Once again, you can check from the HBase shell whether there are multiple versions of the cells. Issue the following get statement to try to retrieve the top four versions of the row with row key 20070720.
get 'gosalesdw.sls_sales_fact_unique', '20070720', {COLUMN => 'cf_data:cq_ORGANIZATION_KEY', VERSIONS => 4}
Zero rows are returned, as the row key of 20070720 doesn't exist. This is because we've appended the UUID to each row key (20070720 + UUID).
COLUMN                        CELL
0 row(s) in 0.0850 seconds
Therefore, you should instead issue the following HBase command to do a scan rather than a get. It scans the table using the first part of the row key. We also specify STARTROW and STOPROW scanner values to return only the results we're interested in.
scan 'gosalesdw.sls_sales_fact_unique', {STARTROW => '20070720', STOPROW => '20070721'}
Notice there are no discrepancies between the results from the Big SQL select and the HBase scan.
1405 row(s) in 12.1350 seconds
Many-to-one mapping (composite keys and dense columns)
This section discusses the other option of trying to enforce uniqueness of the cells, which involves defining a table with a composite row key — also known as many-to-one mapping.
In a many-to-one mapping, multiple SQL columns are mapped to a single HBase entity (row key or a column). There are two terms that may be used frequently: composite key and dense column. A composite key is an HBase row key mapped to multiple SQL columns. A dense column is an HBase column mapped to multiple SQL columns.
In Figure 3, the row key contains two parts: user ID and account number. Each part corresponds to an SQL column. Similarly, the HBase columns are mapped to multiple SQL columns. Note that you can mix these: for example, a table can have a composite key, a dense column, and a non-dense column, as in the sketch following Figure 3.
Figure 3. Many-to-one mapping

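To make the mix concrete, the following minimal sketch uses the same Big SQL DDL syntax as the listings in this article; the DEMO.ACCOUNTS table and its columns are invented purely for illustration.
CREATE HBASE TABLE DEMO.ACCOUNTS
(
    USERID int,
    ACCT_NUM int,
    BALANCE decimal(19,2),
    CREDIT_LIMIT decimal(19,2),
    STATUS varchar(10)
)
COLUMN MAPPING
(
    key mapped by (USERID, ACCT_NUM),                      -- composite key
    cf_data:cq_AMOUNTS mapped by (BALANCE, CREDIT_LIMIT),  -- dense column
    cf_data:cq_STATUS mapped by (STATUS)                   -- non-dense column
);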
Issue the following DDL statement from the Big SQL shell. It represents all entities from our relational table using a many-to-one mapping. Notice the column mapping section, where multiple columns can be mapped to a single family:qualifier.
Listing 4. DDL statement from Big SQL shell
CREATE HBASE TABLE GOSALESDW.SLS_SALES_FACT_DENSE
(
    ORDER_DAY_KEY int,
    ORGANIZATION_KEY int,
    EMPLOYEE_KEY int,
    RETAILER_KEY int,
    RETAILER_SITE_KEY int,
    PRODUCT_KEY int,
    PROMOTION_KEY int,
    ORDER_METHOD_KEY int,
    SALES_ORDER_KEY int,
    SHIP_DAY_KEY int,
    CLOSE_DAY_KEY int,
    QUANTITY int,
    UNIT_COST decimal(19,2),
    UNIT_PRICE decimal(19,2),
    UNIT_SALE_PRICE decimal(19,2),
    GROSS_MARGIN double,
    SALE_TOTAL decimal(19,2),
    GROSS_PROFIT decimal(19,2)
)
COLUMN MAPPING
(
    key mapped by (ORDER_DAY_KEY, ORGANIZATION_KEY, EMPLOYEE_KEY, RETAILER_KEY,
                   RETAILER_SITE_KEY, PRODUCT_KEY, PROMOTION_KEY, ORDER_METHOD_KEY),
    cf_data:cq_OTHER_KEYS mapped by (SALES_ORDER_KEY, SHIP_DAY_KEY, CLOSE_DAY_KEY),
    cf_data:cq_QUANTITY mapped by (QUANTITY),
    cf_data:cq_DOLLAR_VALUES mapped by (UNIT_COST, UNIT_PRICE, UNIT_SALE_PRICE,
                                        GROSS_MARGIN, SALE_TOTAL, GROSS_PROFIT)
);
Why do we need many-to-one mapping?
HBase stores a lot of information for each value. For each value stored, a key consisting of the row key, column family name, column qualifier, and timestamp is also stored, so a lot of duplicate information is kept.
HBase is verbose and primarily intended for sparse data. In most cases, data in the relational world is not sparse. If you were to store each SQL column individually in HBase, as previously done in this article, the required storage space would grow dramatically. When querying that data back, the query also returns the entire key (row key, column family, and column qualifier) for each value. For illustration, after loading data into this table, we'll examine the storage space for each of the three tables created thus far.
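As a rough back-of-the-envelope comparison, consider the key overhead HBase stores per logical row under each mapping. The byte counts below are illustrative assumptions, not measurements, but they show why dense columns save so much space.
# Assumed sizes per stored cell: row key ~8 bytes, family "cf_data" 7 bytes,
# qualifier ~20 bytes (one-to-one) or ~13 bytes (dense), timestamp 8 bytes
echo $(( (8 + 7 + 20 + 8) * 17 ))   # one-to-one: 17 cells per row -> 731 bytes of key overhead
echo $(( (8 + 7 + 13 + 8) * 3 ))    # many-to-one: 3 cells per row -> 108 bytes of key overhead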
Issue the following statement, which will load data from a file into the InfoSphere BigInsights HBase table.
LOAD HBASE DATA INPATH '/user/biadmin/gosalesdw/SLS_SALES_FACT.10p.txt'
DELIMITED FIELDS TERMINATED BY '\t'
INTO TABLE gosalesdw.sls_sales_fact_dense;
The number of rows loaded into a table with many-to-one mapping remains the same even though we're storing less data. The statement also executes much faster than the previous load for this exact reason.
44603 rows affected (total: 3.42s)
Issue the same statements and commands from the Big SQL and HBase shells, as in the previous two sections, to verify that the number of rows is the same as in the original dataset. All the results should be the same as before.
SELECT COUNT(*) FROM gosalesdw.sls_sales_fact_dense;
+-------+
|       |
+-------+
| 44603 |
+-------+
1 row in results(first row: 0.93s; total: 0.93s)
SELECT organization_key FROM gosalesdw.sls_sales_fact_dense WHERE order_day_key = 20070720;
1405 rows in results(first row: 0.65s; total: 0.68s)
scan 'gosalesdw.sls_sales_fact_dense', {STARTROW => '20070720', STOPROW => '20070721'}
1405 row(s) in 4.3830 seconds
As mentioned, one-to-one mapping uses much more storage space than the same data mapped using composite keys or dense columns, where the HBase row key or HBase columns are made up of multiple relational table columns. HBase repeats the row key, column family name, column name, and timestamp for each column value. For relational data, which is usually dense, this causes an explosion in the required storage space.
Issue the following command as the InfoSphere BigInsights administrator from a Linux terminal to check the directory sizes for the three tables you created.
hadoop fs -du /hbase/
17731926   hdfs://bivm:9000/hbase/gosalesdw.sls_sales_fact
3188       hdfs://bivm:9000/hbase/gosalesdw.sls_sales_fact_dense
47906322   hdfs://bivm:9000/hbase/gosalesdw.sls_sales_fact_unique
Data collation
All data represented thus far has been stored as strings, which is the default encoding on HBase tables created by Big SQL. Therefore, numeric data is not collated correctly. HBase uses lexicographic ordering, so you might have cases where a query returns wrong results.
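You can see the effect of byte-lexicographic ordering with nothing more than a shell sort. This hypothetical illustration compares a plain string sort, which behaves like HBase row key ordering, with a numeric sort:
printf '9\n10\n200707201\n20070721\n' | sort
# lexicographic order: 10, 200707201, 20070721, 9
printf '9\n10\n200707201\n20070721\n' | sort -n
# numeric order: 9, 10, 20070721, 200707201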
The following scenario walks through a situation where data is not collated correctly.
Using the Big SQL insert statement, add the following row to the sls_sales_fact_dense table (previously defined, with data loaded). The date specified for the ORDER_DAY_KEY column, which has data type int, is a larger numerical value and does not conform to any date standard because it contains an extra digit.
INSERT INTO gosalesdw.sls_sales_fact_dense
( ORDER_DAY_KEY, ORGANIZATION_KEY, EMPLOYEE_KEY, RETAILER_KEY,
  RETAILER_SITE_KEY, PRODUCT_KEY, PROMOTION_KEY, ORDER_METHOD_KEY )
VALUES (200707201, 11171, 4428, 7109, 5588, 30265, 5501, 605);
Issue a scan on the table with the following start and stop criteria.
scan 'gosalesdw.sls_sales_fact_dense', {STARTROW => '20070720', STOPROW => '20070721'}
Notice the last three rows/cells returned from the output of the scan. The newly added row shows up in the scan even though its integer value is not between 20070720 and 20070721.
200707201\x0011171\x004428\x007109\x005588\x003  column=cf_data:cq_DOLLAR_VALUES, ...
200707201\x0011171\x004428\x007109\x005588\x003  column=cf_data:cq_OTHER_KEYS, ...
200707201\x0011171\x004428\x007109\x005588\x003  column=cf_data:cq_QUANTITY, ...
1406 row(s) in 4.2400 seconds
Insert another row into the table with the following command. This time, we're conforming to the date format of YYYYMMDD and incrementing the day by one from the last value returned in the table (20070721).
INSERT INTO gosalesdw.sls_sales_fact_dense
( ORDER_DAY_KEY, ORGANIZATION_KEY, EMPLOYEE_KEY, RETAILER_KEY,
  RETAILER_SITE_KEY, PRODUCT_KEY, PROMOTION_KEY, ORDER_METHOD_KEY )
VALUES (20070721, 11171, 4428, 7109, 5588, 30265, 5501, 605);
Issue another scan on the table. Remember to increase the stoprow criteria by one day.
scan 'gosalesdw.sls_sales_fact_dense', {STARTROW => '20070720', STOPROW => '20070722'}
The newly added row is included as part of the result set, and the row with ORDER_DAY_KEY of 200707201 comes after the row with ORDER_DAY_KEY of 20070721. This is an example of numeric data that is not collated properly. The rows are not stored in numerical order, as you might expect, but rather in byte-lexicographical order.
200707201\x0011171\x004428\x007109\x005588\x003  column=cf_data:cq_DOLLAR_VALUES, ...
200707201\x0011171\x004428\x007109\x005588\x003  column=cf_data:cq_OTHER_KEYS, ...
200707201\x0011171\x004428\x007109\x005588\x003  column=cf_data:cq_QUANTITY, ...
20070721\x0011171\x004428\x007109\x005588\x0030  column=cf_data:cq_DOLLAR_VALUES, ...
20070721\x0011171\x004428\x007109\x005588\x0030  column=cf_data:cq_OTHER_KEYS, ...
20070721\x0011171\x004428\x007109\x005588\x0030  column=cf_data:cq_QUANTITY, ...
1407 row(s) in 2.8840 seconds
Many-to-one mapping with binary encoding
Big SQL supports two types of data encodings: string and binary. Each HBase entity can also have its own encoding. For example, a row key can be encoded as a string, and one HBase column can be encoded as binary and another as string.
String is the default encoding used in Big SQL HBase tables. The value is converted to string and stored as UTF-8 bytes. When multiple parts are packed into one HBase entity, separators are used to delimit data. The default separator is the null byte. As it is the lowest byte, it maintains data collation and allows range queries and partial row scans to work correctly.
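For example, you can approximate how two parts of a composite key are packed with the null-byte separator from a bash shell; the values are hypothetical, and od -c simply dumps the resulting bytes.
# The \0 between the two key parts is the default Big SQL separator byte
printf '20070720\x0011171' | od -c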
Binary encoding in Big SQL is sortable, so numeric data, including negative numbers, collates properly. It handles separators internally and avoids the problem of separators appearing within the data by escaping them.
Issue the following DDL statement from the Big SQL shell to create a dense table, as you did in Many-to-one mapping (composite keys and dense columns), but this time you override the default encoding to binary.
Listing 5. Override default encoding to binary
CREATE HBASE TABLE GOSALESDW.SLS_SALES_FACT_DENSE_BINARY
(
    ORDER_DAY_KEY int,
    ORGANIZATION_KEY int,
    EMPLOYEE_KEY int,
    RETAILER_KEY int,
    RETAILER_SITE_KEY int,
    PRODUCT_KEY int,
    PROMOTION_KEY int,
    ORDER_METHOD_KEY int,
    SALES_ORDER_KEY int,
    SHIP_DAY_KEY int,
    CLOSE_DAY_KEY int,
    QUANTITY int,
    UNIT_COST decimal(19,2),
    UNIT_PRICE decimal(19,2),
    UNIT_SALE_PRICE decimal(19,2),
    GROSS_MARGIN double,
    SALE_TOTAL decimal(19,2),
    GROSS_PROFIT decimal(19,2)
)
COLUMN MAPPING
(
    key mapped by (ORDER_DAY_KEY, ORGANIZATION_KEY, EMPLOYEE_KEY, RETAILER_KEY,
                   RETAILER_SITE_KEY, PRODUCT_KEY, PROMOTION_KEY, ORDER_METHOD_KEY),
    cf_data:cq_OTHER_KEYS mapped by (SALES_ORDER_KEY, SHIP_DAY_KEY, CLOSE_DAY_KEY),
    cf_data:cq_QUANTITY mapped by (QUANTITY),
    cf_data:cq_DOLLAR_VALUES mapped by (UNIT_COST, UNIT_PRICE, UNIT_SALE_PRICE,
                                        GROSS_MARGIN, SALE_TOTAL, GROSS_PROFIT)
)
default encoding binary;
Once again, use the load hbase data command to load the data into the table. This time, we're adding the DISABLE WAL clause. Disabling the write-ahead log (WAL) can speed up writes into HBase. However, this is not a safe option: turning off the WAL can result in data loss if a region server crashes. Another option to speed up the load is to increase the write buffer size.
LOAD HBASE DATA INPATH '/user/biadmin/gosalesdw/SLS_SALES_FACT.10p.txt'
DELIMITED FIELDS TERMINATED BY '\t'
INTO TABLE gosalesdw.sls_sales_fact_dense_binary
DISABLE WAL;
44603 rows affected (total: 5.54s)
Issue a select statement on the newly created and loaded table with binary encoding, sls_sales_fact_dense_binary.
SELECT * FROM gosalesdw.sls_sales_fact_dense_binary go -m discard;
Note that the go -m discard option is used so the results of the command will not be displayed in the terminal.
44603 rows in results(first row: 0.35s; total: 2.89s)
Issue another select statement on the previous table that has string encoding, sls_sales_fact_dense.
SELECT * FROM gosalesdw.sls_sales_fact_dense go -m discard;
44605 rows in results(first row: 0.31s; total: 3.1s)
A key point here is that the query can return faster. (Numeric types are also collated properly.)
You will probably not see much, if any, performance difference when working with small datasets.
There is no custom serialization/deserialization logic required for string encoding, making it portable if you want to use another application to read data in HBase tables. A primary use case for string encoding is when someone wants to map existing data. Delimited data is a common form of storing data, and it can be easily mapped using Big SQL string encoding. However, parsing strings is expensive and queries with data encoded as strings are slow. And numeric data is not collated correctly, as shown in the example.
Queries on data encoded as binary have faster response times. Numeric data, including negative numbers, are also collated correctly with binary encoding. The downside is that you get data encoded by Big SQL logic and it might not be portable as-is.
Many-to-one mapping with HBase pre-created regions and external tables
HBase automatically splits regions when they reach a set size limit. In some scenarios, such as bulk loading, it is more efficient to pre-create regions so the load operation can take place in parallel. In our example, the sales data spans four months, April through July 2007. You can pre-create regions by specifying splits in the create table command.
In this section, we create a table within the HBase shell with pre-defined splits, initially without using any Big SQL features. Then we'll show how users can map existing data in HBase to Big SQL, which is a common practice. Creating external tables makes this possible.
Start by issuing the following statement in the HBase shell. The sls_sales_fact_dense_split table will be created with pre-defined region splits for April through July 2007.
Listing 6. sls_sales_fact_dense_split table created with pre-defined region splits
create 'gosalesdw.sls_sales_fact_dense_split',
{ NAME => 'cf_data', REPLICATION_SCOPE => '0', KEEP_DELETED_CELLS => 'false',
  COMPRESSION => 'NONE', ENCODE_ON_DISK => 'true', BLOCKCACHE => 'true',
  MIN_VERSIONS => '0', DATA_BLOCK_ENCODING => 'NONE', IN_MEMORY => 'false',
  BLOOMFILTER => 'NONE', TTL => '2147483647', VERSIONS => '2147483647',
  BLOCKSIZE => '65536'},
{SPLITS => ['200704', '200705', '200706', '200707']}
Issue the following list command in the HBase shell to verify the newly created table.
list
If you were to list the tables from the Big SQL shell, you would not see this table because we haven't made any association yet to Big SQL.
Open a browser and point it to http://hostname:60010/. Scroll down and click the table you just defined in the HBase shell, gosalesdw.sls_sales_fact_dense_split, as shown in Figure 4.
Figure 4. Splits

Figure 5 shows the pre-created regions that we defined when creating the table.
Figure 5. Pre-created regions

Execute the following create external hbase command to map the existing table you just created in HBase to Big SQL. With the create external hbase command:
- The create table statement lets you specify a different name for SQL tables through the hbase table name clause. Using external tables, you can also create multiple views of the same HBase table. For example, one table can map to a few columns and another table to another set of columns.
- The column mapping section of the create table statement allows you to specify a different separator for each column and the row key.
- You can map tables created using the Hive HBase storage handler. These cannot be directly read using the Big SQL storage handler.
Listing 7. create external hbase table statement
CREATE EXTERNAL HBASE TABLE GOSALESDW.EXTERNAL_SLS_SALES_FACT_DENSE_SPLIT
(
    ORDER_DAY_KEY int,
    ORGANIZATION_KEY int,
    EMPLOYEE_KEY int,
    RETAILER_KEY int,
    RETAILER_SITE_KEY int,
    PRODUCT_KEY int,
    PROMOTION_KEY int,
    ORDER_METHOD_KEY int,
    SALES_ORDER_KEY int,
    SHIP_DAY_KEY int,
    CLOSE_DAY_KEY int,
    QUANTITY int,
    UNIT_COST decimal(19,2),
    UNIT_PRICE decimal(19,2),
    UNIT_SALE_PRICE decimal(19,2),
    GROSS_MARGIN double,
    SALE_TOTAL decimal(19,2),
    GROSS_PROFIT decimal(19,2)
)
COLUMN MAPPING
(
    key mapped by (ORDER_DAY_KEY, ORGANIZATION_KEY, EMPLOYEE_KEY, RETAILER_KEY,
                   RETAILER_SITE_KEY, PRODUCT_KEY, PROMOTION_KEY, ORDER_METHOD_KEY)
        SEPARATOR '-',
    cf_data:cq_OTHER_KEYS mapped by (SALES_ORDER_KEY, SHIP_DAY_KEY, CLOSE_DAY_KEY)
        SEPARATOR '/',
    cf_data:cq_QUANTITY mapped by (QUANTITY),
    cf_data:cq_DOLLAR_VALUES mapped by (UNIT_COST, UNIT_PRICE, UNIT_SALE_PRICE,
                                        GROSS_MARGIN, SALE_TOTAL, GROSS_PROFIT)
        SEPARATOR '|'
)
HBASE TABLE NAME 'gosalesdw.sls_sales_fact_dense_split';
The data in external tables is not validated at creation time. For example, if a column in the external table contains data with separators incorrectly defined, the query results would be unpredictable.
Note that external tables are not owned by Big SQL and, hence, cannot be dropped via Big SQL. Also, secondary indices cannot be created via Big SQL on external tables.
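If you do need to remove such an underlying table, you can do so directly from the HBase shell with the standard disable and drop sequence, shown here against a hypothetical scratch table (don't run this against the table used in this exercise).
disable 'some_scratch_table'
drop 'some_scratch_table'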
Use the following command to load the external table we have defined.
LOAD HBASE DATA INPATH '/user/biadmin/gosalesdw/SLS_SALES_FACT.10p.txt'
DELIMITED FIELDS TERMINATED BY '\t'
INTO TABLE gosalesdw.external_sls_sales_fact_dense_split;
44603 rows affected (total: 1m57.2s)
Verify that the number of rows loaded is the same number of rows returned by querying the external SQL table.
SELECT COUNT(*) FROM gosalesdw.external_sls_sales_fact_dense_split;
+-------+
|       |
+-------+
| 44603 |
+-------+
1 row in results(first row: 6.44s; total: 6.46s)
Verify the same from the HBase shell directly on the underlying HBase table.
count 'gosalesdw.sls_sales_fact_dense_split'
... 44603 row(s) in 9.1620 seconds
Issue a get command from the HBase shell, specifying the row key as follows. Notice the separator between each part of the row key is a hyphen (-), as we defined when originally creating the external table.
get 'gosalesdw.sls_sales_fact_dense_split', '20070720-11171-4428-7109-5588-30263-5501-605'
In the following output, you can also see the other separators we defined for the external table: | for cq_DOLLAR_VALUES and / for cq_OTHER_KEYS.
COLUMN                    CELL
 cf_data:cq_DOLLAR_VALUES timestamp=1376690502630, value=33.59|62.65|62.65|0.4638|1566.25|726.50
 cf_data:cq_OTHER_KEYS    timestamp=1376690502630, value=481896/20070723/20070723
 cf_data:cq_QUANTITY      timestamp=1376690502630, value=25
3 row(s) in 0.0610 seconds
Of course, in Big SQL, you don't need to specify the separators, such as -, when querying the table, as with the command below.
SELECT * FROM gosalesdw.external_sls_sales_fact_dense_split
WHERE ORDER_DAY_KEY = 20070720 AND ORGANIZATION_KEY = 11171
  AND EMPLOYEE_KEY = 4428 AND RETAILER_KEY = 7109
  AND RETAILER_SITE_KEY = 5588 AND PRODUCT_KEY = 30263
  AND PROMOTION_KEY = 5501 AND ORDER_METHOD_KEY = 605;
Handling errors with load data
How do you handle errors during the load operation? The load hbase command has an option to continue past errors. You can use the LOG ERROR ROWS IN FILE clause to specify a file name to log any rows that could not be loaded because of errors. A few common errors are invalid numeric types and a separator appearing within the data when string encoding is used. Examine the following sample input file, which deliberately contains several bad rows.
hadoop fs -cat /user/biadmin/gosalesdw/SLS_SALES_FACT_badload.txt
2007072a    11171  … … …
b0070720    11171  … … …
2007-07-20  11171  … … …
20070720    11-71  … … …
20070721    11171  … … …
Note the hyphens in 2007-07-20 and 11-71: with string encoding, a separator character appearing within the data is itself an error.
Knowing there are errors with the input data, go ahead and issue the following load command, specifying a directory and file in which to put the bad rows.
LOAD HBASE DATA INPATH '/user/biadmin/gosalesdw/SLS_SALES_FACT_badload.txt'
DELIMITED FIELDS TERMINATED BY '\t'
INTO TABLE gosalesdw.external_sls_sales_fact_dense_split
LOG ERROR ROWS IN FILE '/tmp/SLS_SALES_FACT_load.err';
In this example, four rows did not get loaded because of errors. The load command reports only the rows that loaded successfully.
1 row affected (total: 2.74s)
Examine the file specified in the load command to view the rows that were not loaded.
hadoop fs -cat /tmp/SLS_SALES_FACT_load.err
"2007072a","11171","…","…","…", "b0070720","11171","…","…","…", "2007-07-20","11171","…","…","…", "20070720","11-71","…","…","…",
Summary
The examples in this article have shown how to create tables and various ways to load data. We covered different types of one-to-one mapping and many-to-one mapping.
Part 2 of this series covers query handling and how to connect to Big SQL via JDBC to run business reports with tools such as BIRT or Cognos.
Acknowledgments
Thanks to Uttam Jain for his contributions to this series.
Downloadable resources
- PDF of this content
- Data samples (IBD-1687A_Data.zip | 6KB)
- Pre-created BIRT Report (Orders.rptdesign.zip | 8KB)
- Presentation on Big SQL over HBase (IBD-1687A.pdf | 3MB): This article is derived from a presentation at Information On Demand Session 1687 — Adding Value to HBase with IBM InfoSphere BigInsights and Big SQL.
Related topics
- Learn more about BigInsights 2.1 from the BigInsights Information Center.
- Check out "What's the big deal about Big SQL?" for an introduction to Big SQL.
- Read "Understanding InfoSphere BigInsights" to learn more about the product's architecture and underlying technologies.
- Get a technical introduction to Big SQL on Slideshare.
- Learn more about HBase at Apache.org.
- Get familiar with the Cognos sample GOSALES databases by accessing the product's Information Center.
- Download a free native software installation copy of InfoSphere BigInsights 2.1 Quick Start Edition (sign-in required).