Integrate DB2 for z/OS with InfoSphere BigInsights, Part 1: Set up the InfoSphere BigInsights connector for DB2 for z/OS

Learn how to set up integration between IBM DB2 11 for z/OS and IBM InfoSphere BigInsights. Enable access to structured and unstructured data that is stored in the Hadoop Distributed File System and send the results back to DB2, where the data can be integrated with online transactional data. Using a scenario that is common to all DB2 for z/OS users, learn to create a big data solution that uses the user-defined functions JAQL_SUBMIT and HDFS_READ to run jobs on InfoSphere BigInsights and retrieve the results with SQL.

17 June 2014 - In Introduction, Summary, and Resources, added links to Part 2 of the series.


Jane Man (janeman@us.ibm.com), Senior Software Engineer, IBM

Jane Man is a senior software engineer and one of the original members of the XML development team in DB2 for z/OS. She has worked on various features of DB2 for z/OS. In addition to her development work, she is the enablement focal point in the team and is involved in many XML enablement activities, such as creating XML sample applications, demos, and hands-on labs, and presenting at conferences and XML boot camps. Before she joined the XML team, Jane was a developer in IBM Content Manager. Jane is an IBM Certified System Administrator for WebSphere Application Server, IBM Certified Database Administrator for DB2 Universal Database for z/OS, Linux, UNIX, and Windows, IBM Certified Solution Designer for DB2 Content Manager, IBM Certified Deployment Professional for Tivoli Storage Manager, and IBM Certified Application Developer for DB2 Universal Database Family.



Jason Cu (jasoncu@us.ibm.com), Senior Software Engineer, IBM

Jason Cu is a senior software engineer who works on the development of DB2 for z/OS. He worked on adding many important features to DB2, including Unicode and XML, and now works on big data integration and SQL architecture. Jason also spent part of his career doing development for InfoSphere BigInsights, where he worked on Jaql, BigSQL, and open source.



Rick Chang (crong@us.ibm.com), Advisory Software Engineer, IBM

Rick Chang, a software engineer for over 20 years, has worked on projects in the areas of storage management, e-commerce, content management, and most recently DB2. Rick joined the DB2 for z/OS team in 2005 and worked on the XML publishing functions and the XPath and XQuery features for DB2 for z/OS.



17 June 2014 (First published 20 May 2014)


Introduction

This article is the first of a two-part series that describes how to create a big data solution by integrating IBM DB2 for z/OS and IBM InfoSphere BigInsights. In Part 1, learn how to configure the InfoSphere BigInsights connector in DB2 for z/OS. Part 2 demonstrates how to use these user-defined functions with Jaql and SQL to analyze machine logs.

Apache Hadoop is an open source software framework that supports data-intensive distributed applications. It runs on large clusters of commodity hardware with high throughput.

InfoSphere BigInsights Quick Start Edition

InfoSphere BigInsights Quick Start Edition is a complimentary, downloadable version of InfoSphere BigInsights, IBM’s Hadoop-based offering. Using Quick Start Edition, you can try out the features that IBM built to extend the value of open source Hadoop, such as Big SQL, text analytics, and BigSheets. Guided learning is available to make your experience as smooth as possible, including step-by-step, self-paced tutorials and videos to help you start putting Hadoop to work for you. With no time or data limit, you can experiment on your own time with large amounts of data. Watch the videos, follow the tutorials (PDF), and download BigInsights Quick Start Edition now.

InfoSphere BigInsights makes it easier to use Hadoop to build big data applications. It enables enterprises to manage and analyze the massive volumes of data that businesses generate every day.

Applications such as machine data analysis or fraud detection that don't require the transactional support of a relational database are well-suited to take advantage of the massive parallelism that Hadoop offers. Such applications can be deployed in InfoSphere BigInsights, and results can flow back to DB2, where the data can be integrated with online transactional data.

DB2 11 for z/OS provides the connectors and the capability for DB2 applications to easily and efficiently access data in Hadoop through InfoSphere BigInsights. Figure 1 shows how DB2 11 for z/OS interacts with InfoSphere BigInsights.

Figure 1. DB2 11 for z/OS interacts with InfoSphere BigInsights
Diagram with arrows showing data flows between DB2 11 for z/OS and the InfoSphere BigInsights Hadoop cluster

InfoSphere BigInsights connector for DB2 for z/OS

The InfoSphere BigInsights connector for DB2 for z/OS consists of two user-defined functions: JAQL_SUBMIT and HDFS_READ.

  • JAQL_SUBMIT: A scalar user-defined function that runs a JSON Query Language (Jaql) script on the InfoSphere BigInsights cluster and writes the result to a Hadoop Distributed File System (HDFS) file. Jaql, a high-level query language, is a component of InfoSphere BigInsights that is used to analyze semi-structured data and perform parallel JSON data processing. JAQL_SUBMIT sends Jaql scripts to InfoSphere BigInsights by using the ad hoc Jaql application.
  • HDFS_READ: A user-defined table function that reads an HDFS file and produces a result in table format. The HDFS_READ function can be used to read the results of JAQL_SUBMIT for consumption in DB2. The HDFS_READ function accesses HDFS through the HttpFS REST API.

With these two user-defined functions, SQL users can easily write scripts in Jaql to run in the InfoSphere BigInsights environment and retrieve analytic results from InfoSphere BigInsights.


DSN8JAQL and DSN8HDFS load modules

The load modules that contain the executable programs that implement the JAQL_SUBMIT and HDFS_READ functions are included in DB2 11 for z/OS. The load module DSN8JAQL implements the JAQL_SUBMIT function and the load module DSN8HDFS implements the HDFS_READ function. Both are in the DB2 SDSNLOD2 data set.


JAQL_SUBMIT and HDFS_READ functions

The modules for these functions are included with DB2 11, but they are not enabled in a default DB2 installation or migration. To make these functions available for use in SQL, a DB2 administrator must issue CREATE FUNCTION statements to define these functions to DB2, as shown in Listing 1. The JAQL_SUBMIT and HDFS_READ functions are formally documented in an IBM technote.

Listing 1. Sample definition of JAQL_SUBMIT and HDFS_READ
CREATE FUNCTION SYSFUN.JAQL_SUBMIT
(SCRIPT VARCHAR(8000),
 PARMS VARCHAR(512),
 URL VARCHAR(512),
 OPTIONS VARCHAR(256))
RETURNS VARCHAR(512)
LANGUAGE C
EXTERNAL NAME DSN8JAQL
PARAMETER STYLE DB2SQL
PARAMETER CCSID UNICODE
PARAMETER VARCHAR NULTERM
FENCED
NOT DETERMINISTIC
EXTERNAL ACTION
DISALLOW PARALLEL
WLM ENVIRONMENT WLMENV3
STAY RESIDENT YES
RUN OPTIONS 'POSIX(ON),XPLINK(ON)'
;

CREATE FUNCTION SYSFUN.HDFS_READ
(URL VARCHAR(256),
 OPTIONS VARCHAR(256))
RETURNS GENERIC TABLE
LANGUAGE C
EXTERNAL NAME DSN8HDFS
PARAMETER STYLE DB2SQL
PARAMETER CCSID UNICODE
PARAMETER VARCHAR NULTERM
FINAL CALL
FENCED
NOT DETERMINISTIC
EXTERNAL ACTION
DISALLOW PARALLEL
SCRATCHPAD 200
WLM ENVIRONMENT WLMENV3
STAY RESIDENT YES
RUN OPTIONS 'POSIX(ON),XPLINK(ON)'
CARDINALITY 100000
;

For your environment, customize the script to specify a valid workload management (WLM) environment.

Hint: WLM environments that are used for these functions must specify the DB2 SDSNLOD2 data set in the STEPLIB.
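
For reference, here is a minimal sketch of a WLM address space procedure for such an environment. The procedure and environment name (WLMENV3), subsystem ID (DSN1), NUMTCB value, and data set qualifier (DSN1110) are assumptions that you replace with your site's values; the essential detail is the SDSNLOD2 data set in the STEPLIB concatenation.

//*------------------------------------------------------------------
//* Hypothetical WLM application environment proc for the connector
//* user-defined functions. Adjust all names to your installation.
//*------------------------------------------------------------------
//WLMENV3  PROC RGN=0K,APPLENV=WLMENV3,DB2SSN=DSN1,NUMTCB=1
//IEFPROC  EXEC PGM=DSNX9WLM,REGION=&RGN,TIME=NOLIMIT,
//         PARM='&DB2SSN,&NUMTCB,&APPLENV'
//STEPLIB  DD DISP=SHR,DSN=DSN1110.SDSNEXIT
//         DD DISP=SHR,DSN=DSN1110.SDSNLOAD
//         DD DISP=SHR,DSN=DSN1110.SDSNLOD2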

Notice the RETURNS GENERIC TABLE clause of the HDFS_READ function definition. The clause indicates that the function is an external table function, but the table that this function returns is not defined by the CREATE FUNCTION statement. Instead, the columns and the data types of the table that is returned by the HDFS_READ function are specified in the SQL statement in which the HDFS_READ function is referenced. With this clause, a single table function can return a table of any shape.
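
Because the output shape is bound at reference time, the same HDFS_READ function can be used against differently structured files. The following sketch is a hypothetical illustration that follows the pattern of Listing 2; the file name parts.csv and the column definitions are assumptions, not part of the sample environment:

SELECT T.PARTNO, T.PARTNAME
FROM TABLE
 (HDFS_READ
  ('http://biginsights.demos.ibm.com:14000/webhdfs/v1/'
   ||'idz1470/iod00s/parts.csv?op=OPEN',
   ''
  )
 )
AS T(PARTNO VARCHAR(20), PARTNAME VARCHAR(100));

The typed correlation clause AS T(PARTNO VARCHAR(20), PARTNAME VARCHAR(100)) supplies the column names and data types for this particular query; a different query against a different file could specify entirely different columns.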


Test query for the JAQL_SUBMIT and HDFS_READ functions

After the functions are created, they can be used in SQL. A test query that uses the JAQL_SUBMIT function and the HDFS_READ function can be issued from SPUFI, DSNTEP2, QMF, or any application that executes SQL statements.
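
Optionally, before you run a test query, you can confirm that the functions are registered in the DB2 catalog. A minimal check, assuming the SYSFUN schema that is used in Listing 1:

SELECT SCHEMA, NAME, LANGUAGE, WLM_ENVIRONMENT
FROM SYSIBM.SYSROUTINES
WHERE SCHEMA = 'SYSFUN'
AND NAME IN ('JAQL_SUBMIT', 'HDFS_READ');

The WLM_ENVIRONMENT column should show the environment that you specified in the CREATE FUNCTION statements.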

In Listing 2, InfoSphere BigInsights is accessed by the JAQL_SUBMIT function. The HDFS_READ function then retrieves the result from InfoSphere BigInsights and returns the output in table format.

Listing 2. JAQL_SUBMIT and HDFS_READ functions used to access InfoSphere BigInsights
SELECT BIGINSIGHTS.HELLO FROM
TABLE
(HDFS_READ
 (JAQL_SUBMIT
  ('["Hello world from DB2 for z/OS"]->
     write(del(location="hdfs:///idz1470/iod00s/lab2e1.csv"));',

   'http://biginsights.demos.ibm.com:14000/webhdfs/v1/idz1470/'
   ||'iod00s/lab2e1.csv?op=OPEN',

   'http://biginsights.demos.ibm.com:8080',
   'timeout=1000'
  ),
  ''
 )
)
AS BIGINSIGHTS(HELLO VARCHAR(50));

This SQL script is a SELECT statement that reads from the table function HDFS_READ. The HDFS_READ function reads data from HDFS on the InfoSphere BigInsights cluster and returns it in tabular format to be processed by SQL. The HDFS_READ function requires two arguments. The first argument is a URL that specifies where to read the data. The second argument is an options string; in this example, the empty string ('') means that no options are specified.

The first argument for HDFS_READ is a function call to JAQL_SUBMIT. JAQL_SUBMIT runs a Jaql script on InfoSphere BigInsights and returns a string that contains the URL that describes the HDFS file that holds the script's output. In this example, that URL becomes the input to HDFS_READ; therefore, HDFS_READ is reading the output file that is generated through JAQL_SUBMIT.
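
To see that returned URL directly, JAQL_SUBMIT can also be invoked on its own. The following sketch is hypothetical (the ping.csv file name is an assumption) and follows the same argument pattern as Listing 2:

SELECT SYSFUN.JAQL_SUBMIT
 ('["ping from DB2 for z/OS"]->
    write(del(location="hdfs:///idz1470/iod00s/ping.csv"));',
  'http://biginsights.demos.ibm.com:14000/webhdfs/v1/idz1470/'
  ||'iod00s/ping.csv?op=OPEN',
  'http://biginsights.demos.ibm.com:8080',
  'timeout=1000'
 )
FROM SYSIBM.SYSDUMMY1;

The single-row, single-column result is the URL string that would otherwise be passed to HDFS_READ.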

JAQL_SUBMIT requires four arguments. The first argument is a Jaql script. In Listing 3, the Jaql script writes the string Hello world from DB2 for z/OS into an HDFS file hdfs:///idz1470/iod00s/lab2e1.csv.

Listing 3. JAQL script for Hello world
'["Hello world from DB2 for z/OS"]->
     write(del(location="hdfs:///idz1470/iod00s/lab2e1.csv"));'

The second argument is a return string, which must identify the output of the script that is specified in the first argument; JAQL_SUBMIT returns this string to the caller upon successful completion. In this example, the width of the SPUFI data set is restricted to 80 characters; therefore, the long URL is broken into two lines that are concatenated together.

In Listing 4, the first part of this URL, http://biginsights.demos.ibm.com:14000/webhdfs/v1 identifies the server from which the file is to be read. Port 14000 is the HttpFS server port. The rest of the URL, /idz1470/iod00s/lab2e1.csv, identifies the HDFS file to be read. This same file is written to by the Jaql script.

Listing 4. Long URL broken into two lines
'http://biginsights.demos.ibm.com:14000/webhdfs/v1/idz1470/'
   ||'iod00s/lab2e1.csv?op=OPEN'
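
Because HDFS_READ uses standard HTTP, the same HttpFS URL can be exercised outside of DB2 to verify connectivity to the cluster. A hypothetical check with curl from any machine that can reach the cluster (the user.name value is an assumed HDFS user with read access):

curl -i -L 'http://biginsights.demos.ibm.com:14000/webhdfs/v1/idz1470/iod00s/lab2e1.csv?op=OPEN&user.name=iod00s'

If the file exists, the response body contains the file contents; an HTTP error code here points to a server-side or security problem rather than a DB2 problem.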

As shown in Listing 5, the third argument is the URL of the InfoSphere BigInsights cluster, including port 8080, which is the InfoSphere BigInsights console port. This cluster is where the Jaql script is to be run.

Listing 5. InfoSphere BigInsights cluster in URL
'http://biginsights.demos.ibm.com:8080'

The fourth argument is an options string, as shown in Listing 6. In this example, the options string sets the maximum allowable run time for the Jaql script to 1,000 seconds.

Listing 6. Timeout options string
'timeout=1000'

The table expression that is returned by HDFS_READ is a generic table expression. The HDFS_READ function is created with the RETURNS GENERIC TABLE clause, which indicates that the SQL statement determines the names and types of the columns that are returned by the function. In this exercise, HDFS_READ returns the table BIGINSIGHTS with one column, HELLO, of type VARCHAR(50).


InfoSphere BigInsights web console to review the job

InfoSphere BigInsights can be accessed by using the web console application, which provides information similar to a system console that is delivered through a web browser. Use the InfoSphere BigInsights web console application to review the job or check the status while it is running.

To review the job that runs the Jaql script in Listing 3, follow these steps:

  1. In the web console, click the Files tab to open the file browser, as shown in Figure 2.
    Figure 2. Files tab in file browser
    Screen capture of BigInsights web console shows Files tab
  2. As shown in Figure 3, using the directory tree outline on the left, expand idz1470 and iod00s to find the file that was created by the previous job. Look for the file lab2e1.csv, which contains the output of the Jaql script.
    Figure 3. HDFS directory tree
    Screen capture of list of folders in HDFS
  3. Navigate to the Ad hoc Jaql query:
    1. Click the Applications tab (immediately to the right of the Files tab).
    2. Click Ad hoc Jaql query, as shown in Figure 4.
      Figure 4. Application history for ad hoc Jaql query
      Screen capture of settings for ad hoc Jaql query
    3. The Application History window contains a log of jobs that were submitted to the InfoSphere BigInsights cluster. You can see the jobs that are running and the jobs that are completed. Look for your job.

      Hint: Although the jobs have only a machine-generated job name that is not easily correlated to a call from JAQL_SUBMIT, click the job's details to see a start time and an end time (if the job completed). The times can help narrow down which job is yours on a busy system.


Troubleshooting

If the user-defined functions cannot access InfoSphere BigInsights, determine whether the problem is a network issue or a server error that is returned over HTTP. Turn on the diagnostic trace, which can capture the network traffic that is sent and received:

  1. As shown in Listing 7, add the following DD card into the WLM proc:
    Listing 7. DD card added to the WLM proc
    //WSERROR  DD    PATH='/tmp/wsc.err',
    //         PATHOPTS=(ORDWR,OCREAT,OAPPEND),
    //         PATHMODE=(SIRUSR,SIWUSR,SIRGRP,SIWGRP,SIROTH,SIWOTH)
  2. Alter the user-defined function and add ENVAR("DSNTRACE=ON") to the RUN OPTIONS as shown in Listing 8.
    Listing 8. Run options for turning on tracing
    ALTER FUNCTION SYSFUN.JAQL_SUBMIT
    (SCRIPT  VARCHAR(8000),
     PARMS   VARCHAR(512),  
     URL     VARCHAR(512),  
     OPTIONS VARCHAR(256))  
    RUN OPTIONS 'POSIX(ON),XPLINK(ON),ENVAR("DSNTRACE=ON")';
    
    ALTER FUNCTION SYSFUN.HDFS_READ
    (URL     VARCHAR(256),      
     OPTIONS VARCHAR(256))  
    RUN OPTIONS 'POSIX(ON),XPLINK(ON),ENVAR("DSNTRACE=ON")';
  3. Refresh the WLM environment.

After the WLM environment is refreshed, the UNIX System Services file /tmp/wsc.err (specified in the DD card in Step 1) is created. When the user-defined functions are called, trace data is written into the /tmp/wsc.err file. The trace data is written in ASCII character encoding. As shown in Listing 9, the iconv command can be used in UNIX System Services to convert it to EBCDIC.

Listing 9. iconv command to convert to EBCDIC
iconv -f UTF-8 -t IBM-1047 < wsc.err > wsc_ebcdic.txt
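
Tracing adds overhead, so turn it off after the problem is diagnosed. A sketch that simply mirrors Listing 8, restoring the original run options from Listing 1 (refresh the WLM environment again afterward):

ALTER FUNCTION SYSFUN.JAQL_SUBMIT
(SCRIPT  VARCHAR(8000),
 PARMS   VARCHAR(512),
 URL     VARCHAR(512),
 OPTIONS VARCHAR(256))
RUN OPTIONS 'POSIX(ON),XPLINK(ON)';

ALTER FUNCTION SYSFUN.HDFS_READ
(URL     VARCHAR(256),
 OPTIONS VARCHAR(256))
RUN OPTIONS 'POSIX(ON),XPLINK(ON)';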

Summary

This article describes how to enable the DB2 11 for z/OS server to integrate with InfoSphere BigInsights by using the InfoSphere BigInsights connector. After you create the functions JAQL_SUBMIT and HDFS_READ on the server, they can be used by SQL to run analysis jobs on InfoSphere BigInsights and to bring the results back to DB2. Jobs can be monitored on InfoSphere BigInsights by using the web console.

In Part 2 of this series, learn how to build an analysis of MVS log data by using Jaql and integrate those results in DB2.


Acknowledgments

Thanks to Joachim Limburg for his comments and assistance with this article.
