Parallel processing of unstructured data, Part 2: Use AWS S3 as an unstructured data repository

With DB2 for Linux, UNIX and Windows

This series explores how to process unstructured data in parallel fashion — within a machine and across a series of machines — using the power of IBM DB2® for Linux®, UNIX® and Windows® (LUW) and GPFS™ shared-nothing cluster (SNC) to provide efficient, scalable access to unstructured data through a standard SQL interface. In this article, learn to provide access to unstructured data stored on the cloud. Also see how to analyze the data in a highly parallel fashion using an SQL interface provided by DB2 LUW and a table function included in this article.

Steve Raspudic (stevera@ca.ibm.com ), PureData Provisioning Architect, IBM

Steven Raspudic leads the provisioning team for the PureData for Transactions appliance and has played a key technical role in the delivery of numerous DB2 technologies, including the DB2 high availability infrastructure and the DB2 High Availability and Disaster Recovery (HADR) technology. He holds several patents in the areas of relational databases and database availability.



Alexander Abrashkevich (aabrashk@ca.ibm.com), PureData Provisioning Development, IBM

Alexander Abrashkevich joined the IBM Toronto Lab in 2000. Since then, he has worked on various projects on the DB2 UNIX porting, DB2 RASPD (Reliability, Availability, Serviceability, and Problem Determination), DB2 Health Advisor, and DB2 DPS (Data Protection Services) teams. Since 2011, he has been working on IBM cloud solutions -- first as a member of the DBaaS (Database as a Service) team, then as a lead developer on the PureData System for Transactions provisioning team. He holds several patents and has numerous publications.



Toni Kunic (tonik@ca.ibm.com), PureData Provisioning Development, IBM

Toni Kunic joined the PureData team in May 2013. Since then, he has been writing core SAN provisioning code for the appliance.



13 March 2014

Also available in Russian

Overview

Part 1 of this series discussed SQL access to unstructured data residing in files locally accessible to the DB2 LUW engine, with the SQL interface provided via the DB2 Generic Table Function feature. This allowed the expressive power of SQL to be applied against large unstructured data sets and allowed that data to be joined, aggregated, and ingested into a DB2 data warehouse. The parallel and scalable parsing of the unstructured data was performed with a Java language user-defined function (UDF). UDFs are extensions or additions to the existing built-in functions of the SQL language. A benefit of the UDF architecture of DB2 LUW is that it is relatively easy to extend a UDF's functionality.


This article explains how to use the unstructured data reader to access data stored on cloud storage. (In Part 1, we used General Parallel File System (GPFS) storage to store the large unstructured data sets.) This article builds on the background material and fundamentals discussed in Part 1: parallel processing, unstructured data, serialization formats, the CSV data serialization format, the Jackson parser library, and information regarding the DB2 LUW Generic Table Function.

You can download the sample code used in this article.


Cloud storage

This section briefly discusses cloud storage and presents a high-level architecture of how unstructured data stored on the cloud is accessed and processed in parallel within the DB2 LUW SQL engine.

The cloud store for the examples in this article is provided via Amazon Simple Storage Service (S3). Files stored in this cloud store are called objects, and their names are also called keys. A container for objects is called a bucket, and each object belongs to exactly one bucket. Buckets have some important characteristics:

  • Bucket names must be unique amongst all users of S3.
  • Buckets cannot be nested into a deeper hierarchy.
  • The name of a bucket can only consist of basic alphanumeric characters plus dot (.) and dash (-).

Public data can be accessed without any access credentials via HTTP[S]. Access credentials, consisting of an Access Key ID and a Secret Access Key, are required to access non-public data.
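
For example, because a public object is reachable at a predictable HTTPS URL of the form https://<bucket>.s3.amazonaws.com/<key>, it can be read with nothing more than an HTTP GET. The following minimal Java sketch illustrates the idea; the bucket and object names are hypothetical.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

public class PublicObjectRead {
    public static void main(String[] args) throws Exception {
        // Hypothetical public bucket and object; public objects need no credentials
        URL url = new URL("https://some-public-bucket.s3.amazonaws.com/file1.csv");
        try (BufferedReader in =
                new BufferedReader(new InputStreamReader(url.openStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);   // each line is one CSV record
            }
        }
    }
}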

Bucket data is accessed much like local filesystem data: a bucket name and an object name together uniquely identify the file to be read, and access credentials can be supplied so that the appropriate permissions exist to read the data.


AwsCredentials.properties file

To supply the access credentials needed to read non-public data, the unstructured data reader looks for a file named AwsCredentials.properties. This file must be readable by the instance owner and must be in the following format.

% cat $HOME/AwsCredentials.properties

# Fill in your AWS Access Key ID and Secret Access Key
# http://aws.amazon.com/security-credentials
accessKey = putYourAccessKeyHere
secretKey = putYourSecretKeyHere

To prevent unauthorized access to the keys, ensure that the file is readable by the instance owner only. For example, the instance owner can protect the file with the following command.

chmod 600 $HOME/AwsCredentials.properties
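
To make the credential flow concrete, the following minimal Java sketch shows how a client such as the reader could use this file with the AWS SDK for Java (1.x), whose PropertiesCredentials class reads the accessKey/secretKey format shown above. The bucket and object names are placeholders; this is an illustration of the approach, not the code shipped in the accompanying archive.

import java.io.BufferedReader;
import java.io.File;
import java.io.InputStreamReader;

import com.amazonaws.auth.PropertiesCredentials;
import com.amazonaws.services.s3.AmazonS3Client;
import com.amazonaws.services.s3.model.S3Object;

public class CredentialedObjectRead {
    public static void main(String[] args) throws Exception {
        // Load accessKey/secretKey from the properties file described above
        PropertiesCredentials credentials = new PropertiesCredentials(
                new File(System.getProperty("user.home"), "AwsCredentials.properties"));

        AmazonS3Client s3 = new AmazonS3Client(credentials);

        // A bucket name plus an object name (key) uniquely identify the file to read
        S3Object object = s3.getObject("bucketName", "file1.csv");
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(object.getObjectContent()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);   // process one CSV record
            }
        }
    }
}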

Reader architecture overview

The unstructured data reader parses input that's supplied in JavaScript Object Notation (JSON) to determine the location of the data and obtain the needed access credentials. For example, the following JSON describes an object named file1.csv stored within a bucket named bucketName, with access credentials stored in a file named /home/db2inst1/AwsCredentials.properties:

{"filename":"file1.csv","s3bucket":"bucketName",
"awsCredentials":"/home/db2inst1/AwsCredentials.properties"}

The unstructured data reader parallelizes file access, reading, and parsing, as well as processing of the unstructured data. The degree of parallelism is determined by the number of partitions resident in the DB2 LUW instance in question. The parallelism happens automatically and transparently for any query that uses the unstructured data reader.

Figure 1 shows how the input file is partitioned. The object to be read is named file1.csv and it resides within the bucket bucketName.

The input file is partitioned into four non-overlapping blocks of data, where four is the number of DB2 LUW partitions in this instance. Recall that this partitioning occurs dynamically at the run time of the SQL query and is transparent to the query itself.

Figure 1. Access to unstructured data stored in the cloud
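
One straightforward way to implement this kind of partitioning, and a reasonable mental model for what the reader does, is to divide the object's byte range evenly across the partitions and have each partition issue an S3 ranged GET for its own slice (a partition then typically skips ahead to the first complete record in its slice and reads slightly past the end of the slice to finish its last record). The sketch below, again assuming the AWS SDK for Java 1.x, shows the byte-range arithmetic; it illustrates the idea and is not the shipped implementation.

import java.io.File;

import com.amazonaws.auth.PropertiesCredentials;
import com.amazonaws.services.s3.AmazonS3Client;
import com.amazonaws.services.s3.model.GetObjectRequest;
import com.amazonaws.services.s3.model.S3Object;

public class RangedSliceRead {
    // Fetch only the byte range of the object that belongs to this partition
    static S3Object fetchSlice(AmazonS3Client s3, String bucket, String key,
                               int partitionNum, int totalPartitions) {
        long size = s3.getObjectMetadata(bucket, key).getContentLength();
        long sliceSize = size / totalPartitions;
        long start = (long) partitionNum * sliceSize;
        // The last partition absorbs any remainder bytes
        long end = (partitionNum == totalPartitions - 1) ? size - 1 : start + sliceSize - 1;

        GetObjectRequest request = new GetObjectRequest(bucket, key);
        request.setRange(start, end);   // S3 ranged GET: only this slice is transferred
        return s3.getObject(request);
    }

    public static void main(String[] args) throws Exception {
        PropertiesCredentials credentials = new PropertiesCredentials(
                new File("/home/db2inst1/AwsCredentials.properties"));
        AmazonS3Client s3 = new AmazonS3Client(credentials);

        // For example, partition 2 of a 4-partition instance reads the third quarter of the object
        S3Object slice = fetchSlice(s3, "bucketName", "file1.csv", 2, 4);
        System.out.println("Slice length in bytes: "
                + slice.getObjectMetadata().getContentLength());
        slice.getObjectContent().close();
    }
}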

Uploading the data set to be analyzed to Amazon S3

In the following example, we'll analyze the same airports CSV data set (downloaded from OurAirports.com) used in Part 1 of this series. Assuming the data set is accessible on a local filesystem, upload it to cloud storage to prepare it for processing.

There are many tools and utilities for moving data into Amazon S3. The AWS Management Console is a simple graphical method you can use to create the required bucket and upload the data to it.

To follow along with the next example, create a bucket named sashybucket and upload the airports.csv file to that bucket as an object named airports.csv.
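
If you prefer to script the upload rather than use the console, a minimal sketch with the AWS SDK for Java 1.x might look like the following; the local path to airports.csv is an assumption, so substitute wherever you downloaded the file.

import java.io.File;

import com.amazonaws.auth.PropertiesCredentials;
import com.amazonaws.services.s3.AmazonS3Client;

public class UploadAirports {
    public static void main(String[] args) throws Exception {
        PropertiesCredentials credentials = new PropertiesCredentials(
                new File(System.getProperty("user.home"), "AwsCredentials.properties"));
        AmazonS3Client s3 = new AmazonS3Client(credentials);

        // Bucket names are globally unique, so sashybucket must not already be owned by another user
        if (!s3.doesBucketExist("sashybucket")) {
            s3.createBucket("sashybucket");
        }

        // Upload the local CSV file as the object airports.csv (hypothetical local path)
        s3.putObject("sashybucket", "airports.csv", new File("/home/db2inst1/airports.csv"));
    }
}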


Installing and configuring the reader

Installing and configuring the reader is the same as in Part 1 and is repeated here for convenience. Download the archive file accompanying this article and extract the contents into a location accessible to the instance owner. Take a moment to review the README file. To get up and running quickly, copy the executables into the instance's sqllib/function directory (assuming the instance owner is named db2inst1, the home directory of the instance owner is /home/db2inst1, and the archive has been extracted to ~/tmp).

cp /home/db2inst1/tmp/bin/* /home/db2inst1/sqllib/function/

Register the function in DB2 LUW by establishing a connection to the database that will host the table function and issuing the following commands.

db2 connect to <dbname>
db2 "call sqlj.install_jar('file:${HOME}/sqllib/function/db2sashyReader.jar','db2sashyReader')"
db2 "call sqlj.refresh_classes()"  

db2 "CREATE FUNCTION SASHYCSV(VARCHAR(255)) RETURNS GENERIC TABLE EXTERNAL NAME
'COM.ibm.db2.app.sashy.CsvTableFunction!retrieveRow' LANGUAGE Java SPECIFIC
sashyCSV PARAMETER STYLE DB2GENERAL VARIANT NOT FENCED THREADSAFE NOT NULL
CALL NO SQL NO EXTERNAL ACTION NO SCRATCHPAD FINAL CALL ALLOW PARALLEL
EXECUTE ON ALL DATABASE PARTITIONS RESULT TABLE DISTRIBUTED DBINFO"

Running queries

This section describes how to run two different queries.

Determine the table-reference and typed-correlation-clause

When constructing SQL that uses the unstructured data reader, there are two important elements: the table-reference specifier and the typed-correlation-clause.

The table-reference specifier is a JSON-format data structure that identifies the data to be read and the credentials required to read it. For example, the following table-reference specifier is appropriate for the sample environment (a bucket named sashybucket, a data file named airports.csv, and access credentials resident in the file /home/db2inst1/AwsCredentials.properties).

table(SASHYCSV('{"filename":"airports.csv","s3bucket":"sashybucket",
"awsCredentials":"/home/db2inst1/AwsCredentials.properties"}'))

The typed-correlation-clause specifies the columns and column types to which the input data set is mapped. Wherever possible, the data types should be mapped to the closest corresponding DB2 LUW SQL data type. Upon inspection of the airports data set, the following typed-correlation-clause is chosen.

as (id int, ident varchar(255), type varchar(255), name varchar(255),
latitude_deg double, longitude double, elevation_ft int,
continent varchar(255), iso_country varchar(255),
iso_region varchar(255), municipality varchar(255),
scheduled_service varchar(255), gps_code varchar(255),
iata_code varchar(200), local_code varchar(200),
home_link varchar(250), wikipedia_link varchar(250),
keywords varchar(250))

Insert into base DB2 LUW table select from unstructured cloud-resident data

Now that access to the input data has been established and the logical table definition has been constructed, for a first query let's write SQL to populate a table named airports with all rows from the cloud-resident source data.

If the target DB2 LUW table does not yet exist, create it with the following SQL.

create table airports (id int, ident varchar(255),
 type varchar(255), name varchar(255),latitude_deg double,longitude double, 
elevation_ft int, continent varchar(255), iso_country varchar(255), 
iso_region varchar(255), municipality varchar(255), 
scheduled_service varchar(255), gps_code varchar(255), 
iata_code varchar(200), local_code varchar(200), 
home_link varchar(250), wikipedia_link varchar(250), 
keywords varchar(250));

Populate the DB2 LUW table with the following code.

insert into airports select * from table(SASHYCSV('{"filename":"airports.csv","s3bucket":"sashybucket",
"awsCredentials":"/home/db2inst1/AwsCredentials.properties"}')) as 
(id int, ident varchar(255), type varchar(255), 
name varchar(255),latitude_deg double,longitude double, 
elevation_ft int, continent varchar(255), iso_country varchar(255), 
iso_region varchar(255), municipality varchar(255), 
scheduled_service varchar(255), gps_code varchar(255), 
iata_code varchar(200), local_code varchar(200), home_link varchar(250), 
wikipedia_link varchar(250), keywords varchar(250));

The table-reference specifier and the typed-correlation-clause are used within the query. This SQL is a type of "insert from select" and illustrates one way to use the unstructured data reader. We're streaming unstructured data directly from the cloud and landing it in an internal DB2 LUW table.
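
Because the entire load is a single SQL statement, it can just as easily be driven from an application. The following JDBC sketch runs the same insert from select; the database name, host, port, and password are placeholders for your own environment.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class LoadAirports {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details; adjust for your instance
        String url = "jdbc:db2://localhost:50000/testdb";
        try (Connection con = DriverManager.getConnection(url, "db2inst1", "password");
             Statement stmt = con.createStatement()) {

            int rows = stmt.executeUpdate(
                "insert into airports select * from table(SASHYCSV("
              + "'{\"filename\":\"airports.csv\",\"s3bucket\":\"sashybucket\","
              + "\"awsCredentials\":\"/home/db2inst1/AwsCredentials.properties\"}')) as "
              + "(id int, ident varchar(255), type varchar(255), name varchar(255), "
              + "latitude_deg double, longitude double, elevation_ft int, "
              + "continent varchar(255), iso_country varchar(255), iso_region varchar(255), "
              + "municipality varchar(255), scheduled_service varchar(255), "
              + "gps_code varchar(255), iata_code varchar(200), local_code varchar(200), "
              + "home_link varchar(250), wikipedia_link varchar(250), keywords varchar(250))");

            System.out.println("Rows inserted: " + rows);
        }
    }
}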


Scaleout

Owing to the architecture of the reader, the more partitions that make up a DB2 DPF instance, the greater the degree of parallelism achieved. The parallelism provides additional aggregate bandwidth to the cloud and access to additional compute resources. In general, these properties reduce the elapsed time and increase the throughput of queries as the number of partitions in a particular DB2 LUW instance is increased.

Let's start with a single partition example, as shown below.

% cat ~/sqllib/db2nodes.cfg
0 host1 0

Execute a simple query. With the given configuration of the database partitions, the query runs as follows.

select count(*) from table(SASHYCSV('{"filename":"l2_100k.csv","s3bucket":"sashybucket",
"awsCredentials":"/home/hotellnx99/db2inst1/AwsCredentials.properties",
"logFileName":"/tmp/annoying","debug":"true"}')) as
TX(logDate DATE, logTime TIME, encryptedUserIP VARCHAR(256),
institutionResolverID BIGINT, routerRedirectIdentifier VARCHAR(1024),
aulast VARCHAR(1024), aufirst VARCHAR(1024), auinit VARCHAR(1024),
auinit1 VARCHAR(1024), auinitm VARCHAR(1024), au VARCHAR(1024),
aucorp VARCHAR(1024), atitle VARCHAR(1024), title VARCHAR(1024),
jtitle VARCHAR(1024), stitle VARCHAR(1024), date VARCHAR(1024),
ssn VARCHAR(1024), quarter VARCHAR(1024), volume VARCHAR(1024),
part VARCHAR(1024), issue VARCHAR(1024), spage VARCHAR(1024),
epage VARCHAR(1024), pages VARCHAR(1024), artnum VARCHAR(1024),
issn VARCHAR(1024), eissn VARCHAR(1024), isbn VARCHAR(1024),
coden VARCHAR(1024), sici VARCHAR(1024), genre VARCHAR(1024),
btitle VARCHAR(1024), place VARCHAR(1024), pub VARCHAR(1024),
edition VARCHAR(1024), tpages INTEGER, series VARCHAR(1024))

1          
-----------
      100000

Add three partitions by running the following commands as the instance owner.

db2 START DBM DBPARTITIONNUM 1 ADD DBPARTITIONNUM HOSTNAME host1 PORT 1
db2 START DBM DBPARTITIONNUM 2 ADD DBPARTITIONNUM HOSTNAME host1 PORT 2
db2 START DBM DBPARTITIONNUM 3 ADD DBPARTITIONNUM HOSTNAME host1 PORT 3

Confirm that the instance now has four partitions.

% cat $HOME/sqllib/db2nodes.cfg

0 host1 0
1 host1 1
2 host1 2
3 host1 3

Issue the same query.

select count(*) from table(SASHYCSV('{"filename":"l2_100k.csv","s3bucket":"sashybucket",
"awsCredentials":"/home/hotellnx99/db2inst1/AwsCredentials.properties","logFileName":"/tmp/annoying",
"debug":"true"}')) as TX(logDate DATE, logTime TIME, encryptedUserIP VARCHAR(256),
institutionResolverID BIGINT, routerRedirectIdentifier VARCHAR(1024), aulast VARCHAR(1024),
aufirst VARCHAR(1024), auinit VARCHAR(1024), auinit1 VARCHAR(1024), auinitm VARCHAR(1024),
au VARCHAR(1024), aucorp VARCHAR(1024), atitle VARCHAR(1024), title VARCHAR(1024),
jtitle VARCHAR(1024), stitle VARCHAR(1024), date VARCHAR(1024), ssn VARCHAR(1024),
quarter VARCHAR(1024), volume VARCHAR(1024), part VARCHAR(1024), issue VARCHAR(1024),
spage VARCHAR(1024), epage VARCHAR(1024), pages VARCHAR(1024), artnum VARCHAR(1024),
issn VARCHAR(1024), eissn VARCHAR(1024), isbn VARCHAR(1024), coden VARCHAR(1024),
sici VARCHAR(1024), genre VARCHAR(1024), btitle VARCHAR(1024), place VARCHAR(1024),
pub VARCHAR(1024), edition VARCHAR(1024), tpages INTEGER, series VARCHAR(1024))

1          
-----------
      100000

The only difference should be a reduction in elapsed time. In this case, the elapsed time was reduced by a factor of three compared to the first run.


Conclusion

This article demonstrated how to store unstructured data on the cloud. It also showed how the data can be analyzed in a highly parallel fashion using an SQL interface provided by DB2 LUW and the table function included in this article.

In the next and final article in this series, see how the framework can be extended to analyze other unstructured data formats. This will allow scalable, SQL-driven access to a variety of unstructured data.


Download

Description    Name                     Size
Sample code    sashyReader-1.0.1.tgz    13000KB

