Working with Big SQL extended and complex data types

Process large volumes of data in distributed file systems (HDFS and GPFS)

Big SQL, a SQL interface introduced in InfoSphere® BigInsights™, offers many useful extended data types. In general, a data type defines the set of properties for values being represented, and these properties dictate how the values are treated. Big SQL supports a rich set of data types, including extended data types that are not supported by Apache Hive. With data types supported by Big SQL, it's easier to represent and process semi-structured data. Using the code samples and queries included, learn how to use Big SQL complex data types in simple and nested form and how to create and implement these types in an application. As an added bonus, see how to use the Serializer Deserializer (SerDe) to work with JSON data.

Neha Tomar (netomar2@in.ibm.com), Software Developer, IBM

Neha Tomar is a software developer on the Big SQL team at the IBM Software Laboratory in India. Before that, she worked with the DB2 Install QA team for more than two years. She holds a master's degree from IIIT, Allahabad, and has about four years of experience in database technologies.



24 September 2013


Introduction to Big SQL

InfoSphere BigInsights Quick Start Edition

InfoSphere BigInsights Quick Start Edition is a complimentary, downloadable version of InfoSphere BigInsights, IBM's Hadoop-based offering. Using Quick Start Edition, you can try out the features that IBM has built to extend the value of open source Hadoop, such as Big SQL, text analytics, and BigSheets. Guided learning is available to make your experience as smooth as possible, including step-by-step, self-paced tutorials and videos to help you start putting Hadoop to work for you. With no time or data limit, you can experiment at your own pace with large amounts of data. Watch the videos, follow the tutorials (PDF), and download BigInsights Quick Start Edition now.

The new SQL interface for InfoSphere BigInsights, Big SQL, enables SQL access to data stored in HDFS and Apache HBase. It provides the ability to create and populate tables across distributed environments and extract information from distributed data by executing SQL queries. Big SQL supports a number of data types.

Primitive data types:

  • Tinyint
  • Smallint
  • Int or Integer
  • Bigint
  • Float or Real
  • Double
  • Decimal(precision, scale) or Numeric(precision, scale)
  • String
  • Varchar(length) or Char(length)
  • Boolean
  • Timestamp
  • Binary(length)

Complex data types:

  • array
  • struct

Big SQL extended data types

In addition to primitive and complex data types, Big SQL supports extended data types, which are not available with Apache Hive. The next sections cover these extended data types.

Decimal or numeric data type

This data type can be used for decimal values with a given precision and scale. Values stored in data files are truncated to fit the specified precision and scale. In Apache Hive, such values are typically saved using the string data type. Use the decimal/numeric data type instead when you need to enforce a specific precision and scale.

Syntax: decimal(precision, scale)

The following example shows the column DISCOUNT of type decimal(10,4), where 10 and 4 are precision and scale, respectively.

Create a table with decimal data type

		[localhost][biadmin] 1> create table PRODUCT (PROD_ID int, DISCOUNT decimal(10,4))
[localhost][biadmin] 2> row format delimited
[localhost][biadmin] 3> fields terminated by ',';
0 rows affected (total: 0.20s)
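
Given the comma-delimited format above, a row in the data file might look like the following (a sketch; per the truncation rule described above, a DISCOUNT value with more than four decimal places would be stored truncated to scale 4, so 15.25756 becomes 15.2575):

		1001,15.25756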

Real data type

The real data type is a synonym for the float data type and can be used for single-precision, floating-point values; the two behave identically.

Syntax: real

Create a table with the real data type

		[localhost][biadmin] 1> create table PRODUCT(PROD_ID int, DISCOUNT real)
[localhost][biadmin] 2> row format delimited
[localhost][biadmin] 3> fields terminated by ',';
0 rows affected (total: 0.26s)

varchar/char data type

This is a data type for variable-length character strings. Values stored in data files are truncated to fit the specified length. In Apache Hive, these values are treated as string type values. When you need to restrict the content length, use the varchar/char data type instead, since it enables you to specify a size limit for the column's values.

Syntax: varchar(length) or char(length)

Create a table with varchar data type

		[localhost][biadmin] 1> create table product(prod_id int, prod_name varchar(15))
[localhost][biadmin] 2> row format delimited
[localhost][biadmin] 3> fields terminated by ',';
0 rows affected (total: 0.12s)
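
Because PROD_NAME is declared as varchar(15), a longer value in the data file would be truncated on load (a sketch based on the truncation rule described above; "SuperDeluxeWashingMachine" would be stored as its first 15 characters, "SuperDeluxeWash"):

		2002,SuperDeluxeWashingMachine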

Big SQL complex data types

Big SQL supports the complex data types array and struct. These types are useful for representing data with a complex structure. Although these data types are not documented in the InfoSphere BigInsights 2.1 Information Center, I'm including them here to provide insight into the complex data types supported by Big SQL.

  • The array data type is used to define a data structure for homogeneous objects. The Big SQL array data type can be used in simple and nested forms. Elements of array type columns are accessed using an index that starts at 1, per the SQL standard — phone[1], for example.
  • The struct data type is a built-in Big SQL data type that contains a sequence of attributes, each of which can be a primitive, complex, or nested data type. As of InfoSphere BigInsights 2.1, Big SQL supports struct as the type of a table column or as the subtype of an array column. Elements of a struct type column are accessed using dot (.) notation — address.city, for example. Both notations are illustrated in the sketch after this list.
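
To make the two access styles concrete, here is a minimal sketch against a hypothetical table T (not one of the example tables created later) with a phone column of type array<integer> and an address column of type struct<city:varchar(20)>:

		select phone[1], address.city from T;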

Understanding complex data types for Big SQL

Let's look at a detailed example to get more familiar with the Big SQL complex data types. To work with Big SQL, we need to configure and start the Big SQL server, which is a component of InfoSphere BigInsights. Check the InfoSphere BigInsights 2.1 Information Center (see Resources) for details about installing and configuring Big SQL. Once Big SQL is installed and configured, it can be started using the bigsql utility in the $BIGSQL_HOME/bin directory.

Starting the Big SQL server

		biadmin@bdvm177:~> cd $BIGSQL_HOME/bin
biadmin@bdvm177:/opt/ibm/biginsights/bigsql/bin> ./bigsql start
BigSQL running, pid 18598.

Connect to the Big SQL server using the command-line client tool JSqsh, which is installed with the server. JSqsh is a generic JDBC client tool that can be used against any data source for which a JDBC driver is available. The jsqsh script is in the $BIGSQL_HOME/bin directory. For more information, refer to the InfoSphere BigInsights 2.1 Information Center (see Resources).

Connecting to the Big SQL server using the JDBC client tool jsqsh

biadmin@bdvm177:/opt/ibm/biginsights/bigsql/bin> ./jsqsh -Ubiadmin -Ppassw0rd -dbigsql
Current schema context is 'default'
JSqsh Release 1.5-ibm, Copyright (C) 2007-2013, Scott C. Gray
Type \help for available help topics. Using JLine.
[localhost][biadmin] 1>

Tables with complex data type columns

For example, imagine the shopping cart of an online portal, www.xyzshopping.com. Create a database called SHOPPING with two tables: CUSTOMER_DETAILS, which contains details of customers who have an account on the portal, and ORDER_DETAILS, which holds information about the orders customers place. Refer to Table 1 and Table 2 for the structure of these tables.

Table 1. Structure of table CUSTOMER_DETAILS
Column name         Column data type
CUSTOMER_ID         varchar(15)
FIRST_NAME          varchar(15)
LAST_NAME           varchar(20)
ADDRESS             struct<LINE1:varchar(15), LINE2:varchar(15), POSTAL_CODE:integer, CITY:varchar(20), STATE:varchar(20), COUNTRY:varchar(20), EMAILID:varchar(25)>
CONTACT_NUMBER      array<integer>

Big SQL also supports nesting of the array and struct data types. The table ORDER_DETAILS consists of five columns, one of which, ITEMS, is nested: a struct data type within an array data type.

Table 2. Structure of table ORDER_DETAILS
Column name         Column data type
ORDER_ID            varchar(15)
SHIPPING_ADDRESS    struct<LINE1:varchar(15), LINE2:varchar(15), POSTAL_CODE:integer, CITY:varchar(20), STATE:varchar(20), COUNTRY:varchar(20), EMAILID:varchar(25), CONTACT_NUMBER:integer>
ITEMS               array<struct<ITEM_ID:varchar(20), QUANTITY:integer, PRICE:integer>>
ORDER_DATE          timestamp
CUSTOMER_ID         varchar(15)

Creating tables CUSTOMER_DETAILS and ORDER_DETAILS using Big SQL

Big SQL provides the create table command to create tables over the underlying storage mechanisms, HDFS and Apache HBase. For this example, HDFS files are used as the underlying storage. Delimiters for fields (table columns) and collections (values of complex data types) can be specified in the create table command. By default, ASCII control characters, represented with the caret symbol (^A, ^B, and so on), are used as delimiters to separate data values, and only the first three can be overridden with other symbols. Control A (^A) is the default delimiter for fields, and Control B (^B) is the default delimiter for the collection types array and struct. For nested data types, the delimiter is determined by the level of nesting.
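
For instance, if the CUSTOMER_DETAILS table below were created without any delimiter clauses, a raw data row would rely on those defaults, with ^A between fields and ^B between collection items (a sketch built from the sample data used later in this article):

		CUST001^ANeha^ATomar^ACMH Road^BIndira Nagar^B560008^BBangalore^BKarnataka^BIndia^Bneha@in.ibm.com^A998891234^B998000000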

Creating databases and tables

		[localhost][biadmin] 1> create database SHOPPING;
0 rows affected (total: 0.60s)
[localhost][biadmin] 1> use SHOPPING;
ok. (total: 0.37s)

[localhost][biadmin] 1> create table CUSTOMER_DETAILS
[localhost][biadmin] 2> (
[localhost][biadmin] 3> CUSTOMER_ID varchar(15),
[localhost][biadmin] 4> FIRST_NAME varchar(15),
[localhost][biadmin] 5> LAST_NAME varchar(20),
[localhost][biadmin] 6> ADDRESS struct<
[localhost][biadmin] 7>    LINE1:varchar(15), 
[localhost][biadmin] 8>    LINE2:varchar(15), 
[localhost][biadmin] 9>    POSTAL_CODE:integer, 
[localhost][biadmin] 10>   CITY:varchar(20), 
[localhost][biadmin] 11>   STATE:varchar(20), 
[localhost][biadmin] 12>   COUNTRY:varchar(20), 
[localhost][biadmin] 13>   EMAILID:varchar(25)>,
[localhost][biadmin] 14> CONTACT_NUMBER array<integer>
[localhost][biadmin] 15> )
[localhost][biadmin] 16> row format delimited
[localhost][biadmin] 17> fields terminated by ','
[localhost][biadmin] 18> collection items terminated by '|'
[localhost][biadmin] 19> lines terminated by '\n'
[localhost][biadmin] 20> ;
0 rows affected (total: 0.82s)

[localhost][biadmin] 1> create table ORDER_DETAILS
[localhost][biadmin] 2> (
[localhost][biadmin] 3> ORDER_ID varchar(15),
[localhost][biadmin] 4> SHIPPING_ADDRESS struct<
[localhost][biadmin] 5>    LINE1:varchar(15), 
[localhost][biadmin] 6>    LINE2:varchar(15), 
[localhost][biadmin] 7>    POSTAL_CODE:integer, 
[localhost][biadmin] 8>    CITY:varchar(20), 
[localhost][biadmin] 9>    STATE:varchar(20),
[localhost][biadmin] 10>   COUNTRY:varchar(20), 
[localhost][biadmin] 11>   EMAILID:varchar(25), 
[localhost][biadmin] 12>   CONTACT_NUMBER:integer>,
[localhost][biadmin] 13> ITEMS array<
[localhost][biadmin] 14>   struct<
[localhost][biadmin] 15>       ITEM_ID:varchar(20), 
[localhost][biadmin] 16>	QUANTITY:integer, 
[localhost][biadmin] 17>	PRICE:integer>>,
[localhost][biadmin] 18> ORDER_DATE timestamp,
[localhost][biadmin] 19> CUSTOMER_ID varchar(15)
[localhost][biadmin] 20> )
[localhost][biadmin] 21> row format delimited
[localhost][biadmin] 22> fields terminated by ','
[localhost][biadmin] 23> collection items terminated by '|'
[localhost][biadmin] 24> lines terminated by '\n'
[localhost][biadmin] 25> ;
0 rows affected (total: 2.79s)

Now that we have created two tables (CUSTOMER_DETAILS and ORDER_DETAILS under the database SHOPPING), we are ready to load data into these tables.

Listing tables in database SHOPPING

		[localhost][biadmin] 1> \show tables -s SHOPPING -e
+-----------+-------------+------------------+------------+
| TABLE_CAT | TABLE_SCHEM | TABLE_NAME       | TABLE_TYPE |
+-----------+-------------+------------------+------------+
| [NULL]    | shopping    | customer_details | TABLE      |
| [NULL]    | shopping    | order_details    | TABLE      |
+-----------+-------------+------------------+------------+

Populating tables with the struct data type column

Big SQL populates tables with its load utility, which is similar to the corresponding Apache Hive utility. The load utility does not perform any transformation on the data being loaded; data files are simply copied or moved to the corresponding table locations. For syntax details, refer to the InfoSphere BigInsights Information Center (see Resources).
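
The listings that follow use the local-file form of the statement, which in general looks like this (the path and table name are placeholders):

		load hive data local inpath '<path-to-data-file>' overwrite into table <table-name>;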

When the tables in our example were created, the pipe symbol (|) and the comma (,) were specified as the collection and field delimiters, respectively. Therefore, the values for the fields of a table row must be separated by commas, and if the table has any collection data type columns (array or struct), the collection items must be delimited by the pipe symbol. In our example, a value for the struct type column CUSTOMER_DETAILS.ADDRESS can be specified as below, where the pipe symbol separates the attributes of the struct:

CMH Road|Indira Nagar|560008|Bangalore|Karnataka|India|neha@in.ibm.com

Sample row for CUSTOMER_DETAILS table

		CUST001,Neha,Tomar,CMH Road|Indira Nagar|560008|Bangalore|Karnataka|India
|neha@in.ibm.com,998891234

Now the table can be populated using the load utility.

Populating table CUSTOMER_DETAILS

		[localhost][biadmin] 1> load hive data local inpath 
[localhost][biadmin] 2> '/home/neha/test_data/shopping/CUSTOMER_DETAILS.dat' 
[localhost][biadmin] 3> overwrite into table CUSTOMER_DETAILS;
ok. (total: 5.54s)

[localhost][biadmin] 1> select * from CUSTOMER_DETAILS;
+------------+------------+-----------+--------------------+----------------------------+
| customer_i | first_name | last_name | address            | contact_number             |
| d          |            |           |                    |                            |
+------------+------------+-----------+--------------------+----------------------------+
| CUST001    | Neha       | Tomar     | {CMH Road,.......} | [998891234, 998000000]     |
| CUST002    | Mrudula    | Madiraju  | {HSR Layout,.... } | [988881212, 804448008]     |
| CUST003    | Madhavi    | Shankar   | {Shanthi Colony,..}| [977777777]                |
| CUST004    | Ravindra   | Bajpai    | {80 Ft Road,...}   | [988822222, 922221111..]   |
+------------+------------+-----------+--------------------+----------------------------+
4 rows in results(first row: 0.11s; total: 0.12s)

The table ORDER_DETAILS contains the nested column ITEMS, a struct data type defined inside an array type column. The table fields are delimited by commas and the collection items by pipe symbols, so a third-level delimiter is needed to separate the struct attributes ITEM_ID, QUANTITY, and PRICE: the Control C character (^C).

Sample row for ORDER_DETAILS table

		ORD001,CMH Road|Indira Nagar|560008|Bangalore|Karnataka|India|neha@in.ibm.com
|998000000,ITEM001^C2^C110|ITEM010^C4^C500,2013-07-29 15:35:40.356700,CUST002

The table can be populated using the load utility.

Populating table ORDER_DETAILS

		[localhost][biadmin] 1> load hive data local inpath 
[localhost][biadmin] 2> '/home/neha/test_data/shopping/ORDER_DETAILS.dat' 
[localhost][biadmin] 3> overwrite into table ORDER_DETAILS;
ok. (total: 4.97s)

[localhost][biadmin] 1> select * from ORDER_DETAILS;
+----------+-----------------+---------------------+------------------------+------------+
| order_id | shipping_address| items               | order_date             | customer_id|
|          |                 |                     |                        |            |
+----------+-----------------+---------------------+------------------------+------------+
| ORD001   | {CMH Road....}  | [{ITEM001, 2, ..}..]| 2013-07-29 15:35:40.356| CUST002    |
| ORD002   | {80 Ft Road...} | [{ITEM111, 1, ..}..]| 2013-08-01 11:10:00.004| CUST001    |
+----------+-----------------+---------------------+------------------------+------------+
2 rows in results(first row: 8.27s; total: 8.28s)

Sample queries

Use these examples to create queries for your situation.

Query to fetch ORDER_ID and CITY of shipping

		[localhost][biadmin] 1> select ORDER_ID, SHIPPING_ADDRESS.CITY as SHIPPING_CITY 
[localhost][biadmin] 2> from ORDER_DETAILS;
+----------+---------------+
| order_id | SHIPPING_CITY |
+----------+---------------+
| ORD001   | Bangalore     |
| ORD002   | Bangalore     |
+----------+---------------+
2 rows in results(first row: 0.30s; total: 0.31s)
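
Struct attributes can also be used to filter rows. Assuming struct fields are allowed in a WHERE clause just as they are in a select list (a sketch, not output captured from the article's environment), a query like the following would restrict orders by shipping city:

		select ORDER_ID from ORDER_DETAILS where SHIPPING_ADDRESS.CITY = 'Bangalore';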

Query to fetch CUSTOMER_ID and first CONTACT_NUMBER

		[localhost][biadmin] 1> select CUSTOMER_ID,CONTACT_NUMBER[1] as CONTACT_NUMBER 
[localhost][biadmin] 2> from CUSTOMER_DETAILS;
+-------------+----------------+
| customer_id | CONTACT_NUMBER |
+-------------+----------------+
| CUST001     |      998891234 |
| CUST002     |      988881212 |
| CUST003     |      977777777 |
| CUST004     |      988822222 |
+-------------+----------------+
4 rows in results(first row: 0.8s; total: 0.8s)

Note: As of InfoSphere BigInsights 2.1, a nested struct (a struct within a struct) is not supported. Fetching a particular element of an array nested within a struct type column is also not supported.


Working with JSON SerDe for complex data types

Similar to Apache Hive, Big SQL supports the use of a custom SerDe (Serializer Deserializer) to handle different data formats. To create a table that uses a custom SerDe, specify the create table clause ROW FORMAT SERDE 'serde.class.name'. JSON makes it easy to represent complex data structures because its conventions are similar to those of programming languages. The JSON data format has several other advantages as well, including being lightweight and easy to parse.

Note that you can define a custom SerDe for specific requirements by implementing the Apache Hive SerDe interface, org.apache.hadoop.hive.serde2.SerDe. To work with Big SQL, the custom SerDe Java class must be included in the Big SQL classpath. Refer to the InfoSphere BigInsights Information Center for details (see Resources).

To demonstrate the use of the JSON SerDe, we'll create a table WISHLIST for our shopping cart scenario, containing a CUSTOMER_ID and an array of ITEM_ID values that holds the list of items a customer wants to keep on a wish list. The SerDe used in the example can be downloaded from GitHub (see Resources).

Table 3. Structure of table WISHLIST
Column name         Column data type
CUSTOMER_ID         varchar(15)
ITEM_LIST           array<varchar(15)>

Creating table WISHLIST using Big SQL

		[localhost][biadmin] 1> create table WISHLIST
[localhost][biadmin] 2> (
[localhost][biadmin] 3> CUSTOMER_ID varchar(15),
[localhost][biadmin] 4> ITEM_LIST array<varchar(15)>
[localhost][biadmin] 5> )
[localhost][biadmin] 6> row format serde 'org.openx.data.jsonserde.JsonSerDe'
[localhost][biadmin] 7> stored as textfile;
0 rows affected (total: 2.86s)

Sample row for WISHLIST table in JSON format

		  {"CUSTOMER_ID":"CUST001","ITEM_LIST":["ITEM001","ITEM008"]}

Populating table with JSON format data

		[localhost][biadmin] 1> load hive data local inpath 
[localhost][biadmin] 2> '/home/neha/test_data/shopping/wishlist.json' 
[localhost][biadmin] 3> overwrite into table WISHLIST;
ok. (total: 15.5s)

Sample queries

Use these examples to build your own custom queries.

Queries to fetch data from the table WISHLIST

		[localhost][biadmin] 1> select * from WISHLIST;
+-------------+-----------------------------+
| customer_id | item_list                   |
+-------------+-----------------------------+
| CUST001     | [ITEM001, ITEM008]          |
| CUST008     | [ITEM022, ITEM089, ITEM005] |
+-------------+-----------------------------+
2 rows in results(first row: 8.27s; total: 8.28s)

[localhost][biadmin] 1> select CUSTOMER_ID,ITEM_LIST[1] as FIRST_ITEM from WISHLIST;
+-------------+--------------+
| customer_id | FIRST_ITEM   |
+-------------+--------------+
| CUST001     | ITEM001      |
| CUST008     | ITEM022      |
+-------------+--------------+
2 rows in results(first row: 8.19s; total: 8.19s)
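
As a final sketch (not part of the original example), the wish list can be combined with customer data. Assuming ordinary join support in Big SQL, a query along these lines would list each wish-list customer by name along with the first wished-for item:

		select C.FIRST_NAME, W.ITEM_LIST[1] as FIRST_ITEM
		from WISHLIST W, CUSTOMER_DETAILS C
		where W.CUSTOMER_ID = C.CUSTOMER_ID;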

Summary

The InfoSphere BigInsights Big SQL interface enables you to process large volumes of data stored in a distributed file system such as HDFS or GPFS using extended data types. Use the sample code and sample queries in this article to practice using Big SQL data types with your own data.


Download

Description      Name                          Size
Sample code      ComplexDataTypeQueries.zip    2KB

Resources

