Parallel processing of unstructured data, Part 1: With DB2 LUW and GPFS SNC

Introduction

Learn how unstructured data can be processed in parallel fashion, both within a machine and across a series of machines, by leveraging IBM DB2® for Linux®, UNIX®, and Windows® and the GPFS™ shared-nothing cluster (SNC) to provide efficient, highly scalable access to unstructured data, all through a standard SQL interface. This capability can be realized with clusters of commodity hardware, suitable for provisioning in the cloud or directly on bare metal. Scalability is achieved within the framework via the principle of computation locality: computation is performed local to the host that has direct data access, thus minimizing or eliminating network bandwidth requirements and eliminating the need for any shared compute resource.

Steve Raspudic (stevera@ca.ibm.com ), PureData Provisioning Architect, IBM

Steven Raspudic leads the provisioning team for the PureData for Transactions appliance and has played a key technical role in the delivery of numerous DB2 technologies, including the DB2 high availability infrastructure and the DB2 High Availability and Disaster Recovery (HADR) technology. He holds several patents in the areas of relational databases and database availability.



Alexander Abrashkevich (aabrashk@ca.ibm.com), PureData Provisioning Development, IBM

Alexander Abrashkevich joined the IBM Toronto Lab in 2000. Since then, he has worked on various projects in the DB2 UNIX porting team, the DB2 RASPD (Reliability, Availability, Serviceability, and Problem Determination) team, the DB2 Health Advisor team, and the DB2 DPS (Data Protection Services) team. Since 2011, he has been working on IBM cloud solutions, first as a member of the DBaaS (Database as a Service) team, then as a lead developer on the PureData System for Transactions provisioning team. He holds several patents and has numerous publications.



30 January 2014


This article demonstrates a Java™ technology-based framework that leverages the architectural features available in IBM DB2 for Linux, UNIX, and Windows (DB2 LUW) and the General Parallel File System (GPFS). The framework provides for the parallel and scalable processing of unstructured data via an SQL interface. Such a capability is useful for data cleansing, data aggregation, and other tasks requiring the scanning, processing, and aggregation of large unstructured data sets. Furthermore, this capability can be realized with clusters of commodity hardware, suitable for provisioning in the cloud or directly on bare metal. In particular, each compute host requires only local direct-attached storage (DAS), local compute, and commodity TCP/IP network connectivity. Scalability is achieved within the framework via the principle of computation locality: computation is performed local to the host that has direct data access, thus minimizing or eliminating network bandwidth requirements and eliminating the need for any shared compute resource. Such an approach allows for effective scaling because, as compute resources are added to the cluster, shorter elapsed times for given queries can be achieved, and the reduction in elapsed time can be proportional to the net compute resources added to the cluster.

Overview

The field of SQL-based frameworks for the scalable processing of large unstructured data sets has attracted considerable interest in recent years. This article focuses on SQL access to unstructured data sets residing in files accessible to the DB2 LUW engine, with the SQL interface provided via DB2 LUW generic table functions. Examples will demonstrate the applicability of the work in various environments commonly used to host DB2 LUW data warehouses, in single- and multi-compute-node environments.

Knowledge of DB2 LUW, including the partitioned variants (DB2 LUW DPF), as well as some knowledge of clustered file systems and data serialization formats, is assumed, as is knowledge of the Java programming language and its use within the context of DB2 LUW Java generic table functions.

We will provide a brief overview of these topics; see the Resources section for more detailed background information if required.


Parallel processing

Parallel processing is the key to effective scaling. Scaling itself refers to the concept that, as compute resources are added, execution times are reduced in proportion to the additional compute resources added to the cluster. Scaling can occur in two primary dimensions: an individual compute resource (a single operating system instance) can be made larger, or more compute resources (additional operating system instances) can be added to the cluster. The terms vertical, scale-up, and SMP scaling describe the former case. The terms horizontal, scale-out, and MPP scaling describe the latter case.

The DB2 LUW engine can effectively scale in both dimensions, and that capability will be demonstrated in the examples to follow.


DB2 LUW generic table functions

A table function is defined here as a logical database object that returns a table object (consisting of zero or more rows).

A generic table function is a function where the output table is not specified when the function is defined. Instead, the output table is specified when the table function is referenced. Different output tables are possible for the same generic table function, depending on different input arguments. Consider that a relational table is being logically projected over the data set.

This capability allows a single table function to handle an entire class of input formats; the input formats differ only in details such as number of columns and the definitions of those columns.
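To make this concrete, here is a sketch of one generic table function (the SASHYCSV function installed later in this article) projecting two different relational tables over two different files. The file names and column lists here are hypothetical:

select * from table(SASHYCSV('{"filename": "/tmp/points.csv"}'))\
as T(x double, y double);

select * from table(SASHYCSV('{"filename": "/tmp/people.csv"}'))\
as T(name varchar(64), age int, city varchar(64));

Only the typed-correlation-clause (the as T(...) portion) differs between the two references; the function definition itself is unchanged.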


Unstructured data

Unstructured data is a general term that has arisen to describe any and all information that is not structured; that is, not stored within a highly structured, schema-based relational database management system (RDBMS). Unstructured data can refer to textual or non-textual data. We will concentrate on textual unstructured data.


Serialization formats

Unstructured data is in some sense a misnomer: data must be structured in some way to be rendered intelligible. In the context of both data storage and data transmission, serialization is the process of translating data structures or objects into a format that can be stored (in a file or transmitted across a network connection, for example) and read back later in the same or another computer environment.

There are many serialization formats in use today. Comma-separated values (CSV) is one of the most common and will be the specific serialization format discussed here. Other widely used serialization formats typically used to store unstructured, semi-structured, or structured data include XML and JSON, and processing those via SQL will be the topic of future articles in this series.


CSV

A CSV file stores tabular data (numbers and text) in plain-text form. Plain text means that the file is a sequence of characters, with no data that must be interpreted as binary. A CSV file consists of any number of records, separated by line breaks of some kind; each record consists of fields, separated by some other character or string, most commonly a literal comma or tab. For more information about the CSV format, please consult RFC 4180.
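For illustration, a small hypothetical comma-separated file with a header record and two data records might look as follows:

id,name,elevation_ft
1,"Example Field",5280
2,"High Altitude Strip",14200

Each line is one record; within a record, commas separate the fields, and quotes protect fields that might themselves contain the separator.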


The Jackson parser library

To process unstructured data, a parser must exist that can interpret the serialization format of the unstructured data in question (and, ultimately, convert the unstructured data into rows and columns of various SQL data types). Data parsing is a relatively mature and well-understood field, and numerous CSV parser libraries are available. The Jackson parser library is one such example: a general-purpose parser library written in the Java language and available for easy integration in various Java-based environments.

Future articles will discuss how to leverage other third-party parsers to allow for the processing of data stored in other serialization formats — such as XML, JSON, and others.


DB2 LUW single host with local file system

Local file systems are typically utilized when the DB2 LUW query engine does not span more than one operating system instance. Any file system supported by DB2 LUW can be used to store the unstructured data. For a full list of supported file systems, consult the section "Recommended file systems" in the DB2 10.5 for Linux, UNIX, and Windows Information Center.

Figure 1. Single host

DB2 LUW DPF cluster with IBM GPFS

The DB2 LUW DPF cluster allows a group of machines to act as a single database engine and is typically used in support of large data warehouses. The database engine typically runs in conjunction with the GPFS file system. GPFS allows a group of computers concurrent access to shared data, and it provides a single cluster-wide namespace with standard POSIX file system semantics, deployed in one of three typical configuration topologies:

  • Cluster nodes with back-end storage controllers (Network Shared Disk (NSD))
  • Cluster nodes with local private storage (Shared-Nothing Cluster (SNC))
  • Cluster nodes connected to a shared Storage Area Network (SAN)

We'll concentrate on the latter two topologies.

Figure 2. Multiple hosts with shared file system — Direct-attached disks
Figure 3. Multiple hosts with shared file system — SAN-attached disks

Sample environments overview

The use of the sample code will be discussed via a set of three examples. For each example configuration, it is assumed that the code is placed in the home directory of the instance owner and the README has been followed to properly install the Java table function. The examples are chosen to be broadly representative of popular DB2 LUW topologies and to illustrate working with large, publicly available CSV data sets in such environments.

The first sample focuses on a DB2 LUW 10.5 instance hosted in a single-instance Linux environment. Then, by extending the DB2 instance environment and adding more partitions (processing engines), we demonstrate the ability to parallelize in the box (that is, scale up) without any changes to the data repository (the file containing the CSV data). SQL statements are issued against the unstructured CSV-format data to illustrate the reduction in elapsed time as partitions are added.

Next, we discuss a multiple-machine DB2 LUW 10.5 DPF Linux cluster. Each node (machine) in the cluster hosts direct-attached disk only and has a commodity network connection to all other machines. A different and larger publicly available data set is used, and we execute SQL statements to aggregate the data set.

The final sample focuses on a large multi-node AIX data warehouse, with GPFS file system data stored on a large central SAN. In this last example, DB2 LUW 10.1 is used. Another large, publicly available data set is used to illustrate the range of SQL-based queries that can be executed.

In all examples, to demonstrate real-world applicability, public domain and publicly available data sets are used. The URLs for the data sets are included so the reader can replicate the examples. However, any data set can be utilized.


Sample datasets overview

For the first sample, airport data is used (downloaded from OurAirports.com). The site hosts a CSV-formatted data dump of all known airports, countries, and regions and is updated nightly. Please refer to the URL for more background information regarding the data set. In the subsequent examples, the term airports data will refer to data obtained from the above-referenced website.

In the second example, data from the OpenURL Router Data project is used. The OpenURL Router Data is data describing traffic flowing through the U.K. OpenURL Router; this data is sometimes referred to as activity data. Note that the data are made available under the Open Data Commons (ODC) Public Domain Dedication and License, and the ODC Attribution Sharealike Community Norms. Please read the terms of the PDDL and the Attribution Sharealike Community Norms before using the data. In the subsequent examples, the term OpenURL data will refer to data downloaded from the above source.

Finally, for the third example, data from the Wikimedia page view dumps is used. The data is publicly available from Wikimedia.org. The particular data set used describes hourly page views per article for approximately 30 million articles. See the referenced URL for more background information.


Configuring and installing the reader

Download the archive file accompanying this article. Extract the contents into a location accessible to the instance owner. Take a moment to review the README file. To get up and running quickly, do the following (assuming the instance owner is named db2inst1, the home directory of the instance owner is /home/db2inst1, and the directory where the archive has been extracted is ~/tmp). First, copy the executable code to the relevant directory:

cp /home/db2inst1/tmp/bin/* /home/db2inst1/sqllib/function/

Now register the function in DB2 LUW by establishing a connection to the database that will host the table function and issuing the following commands:

db2 connect to <dbname>
db2 "call sqlj.install_jar('file:${HOME}/sqllib/function/db2sashyReader.jar','db2sashyReader')"
db2 "call sqlj.refresh_classes()"  

db2 "CREATE FUNCTION SASHYCSV(VARCHAR(255)) RETURNS GENERIC TABLE EXTERNAL NAME \
'COM.ibm.db2.app.sashy.CsvTableFunction!retrieveRow' LANGUAGE JAVA SPECIFIC \
sashyCSV PARAMETER STYLE DB2GENERAL VARIANT NOT FENCED THREADSAFE NOT NULL \
CALL NO SQL NO EXTERNAL ACTION NO SCRATCHPAD FINAL CALL ALLOW PARALLEL \
EXECUTE ON ALL DATABASE PARTITIONS RESULT TABLE DISTRIBUTED DBINFO"

You can now use the new Java function to analyze your CSV-formatted data.
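As a quick smoke test (a sketch; the file name and column list here are hypothetical), project a simple two-column table over any small CSV file and fetch a few rows:

db2 "select * from table(SASHYCSV('{\"filename\": \"/tmp/sample.csv\"}')) \
as T(c1 varchar(255), c2 varchar(255)) fetch first 5 rows only"

If rows come back, the function is registered and working.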


Sample environment 1: Linux single system

This example will use a single partition on Linux in a DB2 LUW 10.5 ESE environment.

First, download the data from OurAirports.com. Place the data file on a file system accessible to the DB2 LUW instance. In this example, the data will reside in /localfs/airports.csv. The data set (at the time of this writing) has the following characteristics in terms of file size and number of rows, respectively:

cd /localfs;du -h airports.csv;wc -l airports.csv

6.6M    airports.csv
45409   airports.csv

Next, examine the file and map the attribute values of the data set to the corresponding DB2 LUW SQL data types. Consider that a relational table is being logically projected over the CSV data set, and map the columns and column types appropriately for the data set in question. Consult the documentation for supported DB2 LUW SQL data types.
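A simple way to examine the file is to inspect its first few records with standard tools; for example (the exact header and values will vary with the nightly data dump):

head -3 /localfs/airports.csv

If the first record is a header naming the fields, it can guide the column names chosen when defining the logical table below.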

When constructing SQL that utilizes the DB2 LUW generic table function, there are two important elements: the table-specifier and the typed-correlation-clause. For full details on constructing SQL queries, please consult the DB2 LUW Information Center. Here, we'll present a small subset to illustrate the features of using the supplied table function.

The table-specifier is responsible for specifying the name of the table function (in this case, SASHYCSV) and the name of the input file:

table(SASHYCSV('{"filename": "/localfs/airports.csv"}'))

The typed-correlation-clause specifies the columns and column types to map the input data set to. Note that wherever possible, the data types are mapped to the closest corresponding DB2 LUW SQL data type:

as (id int, ident varchar(255), type varchar(255), name varchar(255),\
latitude_deg double,longitude double, elevation_ft int, continent varchar(255),\
iso_country varchar(255), iso_region varchar(255), municipality varchar(255),\
scheduled_service varchar(255), gps_code varchar(255), iata_code varchar(200),\
local_code varchar(200), home_link varchar(250), wikipedia_link varchar(250),\
keywords varchar(250))

Now that the input file name and the logical table definition have been constructed, let us identify only those airports whose elevation is greater than 14,000 feet. Putting all this together, the following SQL is constructed:

select name from table(SASHYCSV('{"filename": "/localfs/airports.csv"}'))\
as (id int, ident varchar(255), type varchar(255), name varchar(255),\
latitude_deg double,longitude double, elevation_ft int, continent varchar(255),\
iso_country varchar(255), iso_region varchar(255), municipality varchar(255),\
scheduled_service varchar(255), gps_code varchar(255), iata_code varchar(200),\
local_code varchar(200), home_link varchar(250), wikipedia_link varchar(250),\
keywords varchar(250))\
where elevation_ft > 14000;

The query produces the following output, which identifies all airports with elevation greater than 14,000 feet from the CSV data set:

NAME                                                                               
----------------------------------
Siachen Glacier AFS Airport                                                        
Daulat Beg Oldi Advanced Landing Ground                                            
Ulla Ulla Airport                                                                  
Fausa Airport                                                                      
Laguna Choclococha Airport                                                         
San Rafael Airport                                                                 
Ngari Gunsa Airport                                                                
Qamdo Bangda Airport                                                               

  8 record(s) selected.
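The same projected table supports richer analysis. As a further sketch, the following counts airports per country and lists the ten countries with the most airports, where <table-specifier> and <typed-correlation-clause> stand for the clauses given above:

select iso_country, count(*) as num_airports from <table-specifier> \
<typed-correlation-clause> group by iso_country \
order by num_airports desc fetch first 10 rows only;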

Now let's demonstrate the ability to add processing engines in an online manner without repartitioning the input data set. For complete background on the process of adding partitions to a DB2 LUW instance in an online fashion, please refer to the Information Center. We will grow the instance from a single-partition instance to a four-partition instance.

The db2nodes.cfg for the instance initially looks as follows (the host on which the DB2 LUW engine resides is named hotellnx98):

0 hotellnx98 0

Let's add three partitions by executing the following commands as the instance owner:

db2 START DBM DBPARTITIONNUM 1 ADD DBPARTITIONNUM HOSTNAME hotellnx98 PORT 1;
db2 START DBM DBPARTITIONNUM 2 ADD DBPARTITIONNUM HOSTNAME hotellnx98 PORT 2;
db2 START DBM DBPARTITIONNUM 3 ADD DBPARTITIONNUM HOSTNAME hotellnx98 PORT 3;
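After these commands complete, db2nodes.cfg should contain four logical partitions on the same host:

0 hotellnx98 0
1 hotellnx98 1
2 hotellnx98 2
3 hotellnx98 3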

Now, without any other changes, reissue the SQL query given previously. The result set will be identical to that seen previously, and a reduction in elapsed time should be observed, especially in OS environments hosting multiple CPU cores.

Figure 4. Sample environment, single Linux host

Sample environment 2: Linux multiple systems using direct attached storage

This example is based on a multiple-partition DB2 LUW 10.5 instance on Linux: a 64-partition instance spread across a cluster of four commodity Linux machines. Data storage is provided by the GPFS SNC file system, with the storage hosted locally within each compute node (direct-attached storage only).

The configuration of the DB2 LUW and GPFS SNC clusters will not be considered in detail here; follow best practices when configuring and provisioning such a cluster.

Download the data for analysis to the GPFS file system accessible to the DB2 LUW instance. In the example, the data will be downloaded directly to the file /gpfs1/pc-2013-10.csv. The data set (at the time of this writing) has the following characteristics in terms of file size and number of rows, respectively:

du -m pc-2013-10.csv;wc -l pc-2013-10.csv

5346M    pc-2013-10.csv
10496370 pc-2013-10.csv

Next, map the attribute values of the data set to the corresponding DB2 LUW SQL data types. Consider that a relational table is being projected over the data set, such that the defined relational table maps the values appropriately.

Examine the data set and determine the structure of the CSV format; that is, determine the delimiter character used, as well as any non-standard usage of quotes or escape characters. The table-specifier below specifies the name of the table function (in this case, SASHYCSV) and the name of the input file (in this case, /gpfs1/pc-2013-10.csv). Note that this data set uses the tab character for column separation; that information is also passed in via the table-specifier:

table(SASHYCSV('{"filename":"/gpfs1/pc-2013-10.csv","separator":"\t"}'))

Note that this table is particularly wide; that is, it has many columns when considered in the relational sense. However, determining the column names and types is a straightforward process that proceeds via manual inspection of a subset of the data.

The typed-correlation-clause is as follows and specifies the columns and column types to map the input data set to. Note that wherever possible, the data types are mapped to the closest corresponding DB2 LUW SQL data type.

as TX(logDate DATE, logTime TIME,encryptedUserIP VARCHAR(256),\
institutionResolverID BIGINT, routerRedirectIdentifier VARCHAR(1024),\
aulast  VARCHAR(1024), aufirst VARCHAR(1024),\
auinit  VARCHAR(1024), auinit1 VARCHAR(1024),\
auinitm VARCHAR(1024), au  VARCHAR(1024),\
aucorp  VARCHAR(1024), atitle  VARCHAR(1024),title VARCHAR(1024),\
jtitle  VARCHAR(1024), stitle  VARCHAR(1024), date VARCHAR(1024),\
ssn VARCHAR(1024), quarter VARCHAR(1024), volume VARCHAR(1024),\
part    VARCHAR(1024),issue VARCHAR(1024),spage VARCHAR(1024),\
epage VARCHAR(1024),pages VARCHAR(1024),artnum  VARCHAR(1024),\
issn VARCHAR(1024),eissn VARCHAR(1024),isbn VARCHAR(1024),\
coden VARCHAR(1024), sici VARCHAR(1024), genre VARCHAR(1024),\
btitle VARCHAR(1024), place VARCHAR(1024), pub VARCHAR(1024),\
edition VARCHAR(1024), tpages  INTEGER, series  VARCHAR(1024),\
doi VARCHAR(1024),sid VARCHAR(1024), user_agent VARCHAR(1024),\
acc_lan VARCHAR(2048), acc_enc VARCHAR(1024),\
acc_char    VARCHAR(1024), x_forw  VARCHAR(1024),\
via VARCHAR(1024), hash_of_six VARCHAR(256))

With this data, we wish to do a simple analysis that counts the number of records present in the file. The appropriate SQL statement follows (for brevity, we'll not repeat the table-specifier nor the typed-correlation-clause; they are as stated above):

select count(*) from <table-specifier> <typed-correlation-clause>;

The query produces the following output:

1
-----------
   10496370
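Beyond a simple count, the projected table supports ordinary SQL aggregation. As a sketch, the following counts OpenURL requests per resolving institution and lists the ten busiest, again with <table-specifier> and <typed-correlation-clause> standing for the clauses given above:

select institutionResolverID, count(*) as num_requests from <table-specifier> \
<typed-correlation-clause> group by institutionResolverID \
order by num_requests desc fetch first 10 rows only;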
Figure 5. Linux multiple systems using direct attached storage

Sample environment 3: AIX multi-system shared SAN

This example utilizes a multiple-partition DB2 LUW 10.1 instance on AIX: a 256-partition instance spread across a cluster of 32 AIX LPARs. Data storage is provided by the GPFS file system, with the storage hosted on a storage area network (SAN).

The configuration of the DB2 LUW and GPFS clusters will not be considered in detail here; follow best practices when configuring and provisioning such a cluster.

Download the data to analyze, with the download location being the GPFS file system accessible to the DB2 LUW instance. In this example, the data is downloaded directly to the file /gpfs1/pagecounts-2013-10.csv.

The table-specifier is as follows and is responsible for specifying the name of the table function (in this case, SASHYCSV) and the name of the input file (in this case, /gpfs1/pagecounts-2013-10.csv), as well as the column delimiter, which we determined by inspection to be the space character:

table(SASHYCSV('{"filename":"/gpfs1/pagecounts-2013-10.csv","separator":" "}'))

The typed-correlation-clause is as follows and specifies the columns and column types to map the input data set to. Note that wherever possible, the data types are mapped to the closest corresponding DB2 LUW SQL data type.

as TX(project varchar(32), page varchar(1024), \
monthly_total bigint, hourly_counts varchar(1024))

Let's count how many records are in the file:

select count(*) from <table-specifier> <typed-correlation-clause>;

1
-----------
  125374279


  1 record(s) selected

So, for the data in question, more than 125 million records exist, with each record being an entry of the form: project, page, monthly total, and hourly counts.
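Per-project rankings are equally straightforward. As a sketch, the following lists the ten most-viewed pages of the English Wikipedia project (en.z), using the same table-specifier and typed-correlation-clause:

select page, monthly_total from <table-specifier> \
<typed-correlation-clause> where project = 'en.z' \
order by monthly_total desc fetch first 10 rows only;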

Let's analyze further to understand the average and maximum page views for each project. We'll also restrict the query to projects with more than 2 million page entries for the month.

select project, count(page) as num_pages, avg(monthly_total) as avg_monthly, \
max(monthly_total) as max_monthly from <table-specifier> \
<typed-correlation-clause> \
group by project having count(page) > 2000000 order by project


PROJECT  NUM_PAGES   AVG_MONTHLY          MAX_MONTHLY
-------- ----------- -------------------- --------------------
commons.    11255529                   48            110406900
de.z         5397692                  212            181587923
en.z        36588775                  287           2425951790
es.z         4761666                  453            601658207
fr.z         5295556                  202            290705208
it.z         3249094                  205            164751928
ja.z         4591607                  261            164365682
nl.z         2535436                   94             38656060
pl.z         2852999                  207            115338727
pt.z         2775444                  266            218554258
ru.z         4554482                  275            320189885
zh.z         2572854                  191            114571504

  12 record(s) selected
Figure 6. Environment for sample 3: AIX multiple systems using SAN-attached storage

Conclusion and next steps

We have demonstrated how unstructured data can be processed in massively parallel fashion, both within a machine and across a series of machines, by leveraging the features of DB2 LUW and GPFS SNC to provide efficient, highly scalable access to unstructured data, all through a standard SQL interface. This allows the expressive power of SQL to be applied against large unstructured data sets and allows that data to be joined, aggregated, and ingested into a DB2 data warehouse. We've further demonstrated that scale-up can be achieved online via the addition of DB2 partitions and that the architecture is suitable for clusters of commodity hardware.

In subsequent articles in this series, we'll consider applications of this approach using other parsers, other repositories, Hadoop applications, and more.


Download

Description    Name                     Size
Sample code    sashyReader-1.0.1.tgz    13000KB

Resources
