This article is the first of a two-part series that describes how to create a big data solution by integrating IBM DB2 for z/OS and IBM InfoSphere BigInsights. In Part 1, learn how to configure the InfoSphere BigInsights connector in DB2 for z/OS. Part 2 demonstrates how to use the user-defined functions Jaql and SQL to analyze machine logs.
Apache Hadoop is an open source software framework that supports data-intensive distributed applications. It runs on large clusters of commodity hardware with high throughput.
InfoSphere BigInsights make it easier to use Hadoop to build big data applications. It enables enterprises to manage and analyze massive volumes of data that businesses generate every day.
Applications such as machine data analysis or fraud detection that don't require the transactional support of a relational database are well-suited to take advantage of the massive parallelism that Hadoop offers. Such applications can be deployed in InfoSphere BigInsights, and results can flow back to DB2, where the data can be integrated with online transactional data.
DB 11 for z/OS provides the connectors and the capability for DB2 applications to easily and efficiently access data in Hadoop through InfoSphere BigInsights. Figure 1 shows how DB2 11 for z/OS interacts with InfoSphere BigInsights.
Figure 1. DB2 11 for z/OS interacts with InfoSphere BigInsights
InfoSphere BigInsights connector for DB2 for z/OS
The InfoSphere BigInsights connector for DB2 for z/OS consists of two
JAQL_SUBMIT: A scalar, user-defined function that runs a JSON Query Language (Jaql) script on the InfoSphere BigInsights cluster and writes the result to a Hadoop Distributed File System (HDFS) file. Jaql, a high-level query language, is a component of InfoSphere BigInsights, which is used to analyze semi-structured data and perform parallel JSON data processing.
JAQL_SUBMITsends Jaql scripts to InfoSphere BigInsights by using the ad doc Jaql application.
HDFS_READ: A user-defined table that reads an HDFS file and produces a result in a table format. The
HDFS_READfunction can be used to read the results of
JAQL_SUBMITfor consumption in DB2. The
HDFS_READfunction accesses HDFS through the HttpFS REST API.
With these two user-defined functions, SQL users can easily write scripts in Jaql to run in the InfoSphere BigInsights environment and retrieve analytic results from InfoSphere BigInsights.
DSN8JAQL and DSN8HDFS load modules
The load modules that contain the executable programs that implement the
HDFS_READ functions are included
in DB2 11 for z/OS. The load module
DSN8JAQL implements the
JAQL_SUBMIT function and the load module
DSN8HDFS implements the
HDFS_READ function. Both
are in the DB2
SDSNLOD2 data set.
JAQL_SUBMIT and HDFS_READ functions
The modules for these functions are included with DB2 11, but they are not
enabled in a default DB2 installation or migration. To make these
functions available for use in SQL, a DB2 administrator must issue
FUNCTION statements to define these
functions to DB2, as shown in Listing 1. The
HDFS_READ are formally documented in an IBM
Listing 1. Sample definition of JAQL_SUBMIT and HDFS_READ
CREATE FUNCTION SYSFUN.JAQL_SUBMIT (SCRIPT VARCHAR(8000), PARMS VARCHAR(512), URL VARCHAR(512), OPTIONS VARCHAR(256)) RETURNS VARCHAR(512) LANGUAGE C EXTERNAL NAME DSN8JAQL PARAMETER STYLE DB2SQL PARAMETER CCSID UNICODE PARAMETER VARCHAR NULTERM FENCED NOT DETERMINISTIC EXTERNAL ACTION DISALLOW PARALLEL WLM ENVIRONMENT WLMENV3 STAY RESIDENT YES RUN OPTIONS 'POSIX(ON),XPLINK(ON)' ; CREATE FUNCTION SYSFUN.HDFS_READ (URL VARCHAR(256), OPTIONS VARCHAR(256)) RETURNS GENERIC TABLE LANGUAGE C EXTERNAL NAME DSN8HDFS PARAMETER STYLE DB2SQL PARAMETER CCSID UNICODE PARAMETER VARCHAR NULTERM FINAL CALL FENCED NOT DETERMINISTIC EXTERNAL ACTION DISALLOW PARALLEL SCRATCHPAD 200 WLM ENVIRONMENT WLMENV3 STAY RESIDENT YES RUN OPTIONS 'POSIX(ON),XPLINK(ON)' CARDINALITY 100000 ;
For your environment, customize the script to specify a valid workload management (WLM) environment.
Hint: WLM environments that are used for these functions
must specify the DB2
SDSNLOD2 data set in the
RETURNS GENERIC TABLE clause of the
HDFS_READ function definition. The clause indicates that the
function is an external table function, but the table that this function
returns is not defined by the
CREATE FUNCTION statement.
Instead, the columns and the data types of the table that is returned by
HDFS_READ function are specified in the SQL statement in
HDFS_READ function is referenced. With this clause,
a single table function can return a table of any shape.
Test query for the JAQL_SUBMIT and HDFS_READ functions
After the functions are created, they can be used in SQL. A test query that
JAQL_SUBMIT function and the
function can be issued from
QMF, or any application that executes SQL statements.
In Listing 2, InfoSphere BigInsights is accessed by the
JAQL_SUBMIT function. The
retrieves the result from InfoSphere BigInsights and saves the output in a
Listing 2. JAQL_SUBMIT and HDFS_READ functions used to access InfoSphere BigInsights
SELECT BIGINSIGHTS.HELLO FROM TABLE (HDFS_READ (JAQL_SUBMIT ('["Hello world from DB2 for z/OS"]-> write(del(location="hdfs:///idz1470/iod00s/lab2e1.csv"));', 'http://biginsights.demos.ibm.com:14000/webhdfs/v1/idz1470/' ||'iod00s/lab2e1.csv?op=OPEN', 'http://biginsights.demos.ibm.com:8080', 'timeout=1000' ), '' ) ) AS BIGINSIGHTS(HELLO VARCHAR(50));
This SQL script is a
SELECT statement that reads from a table
HDFS_READ function reads
data from HDFS on the InfoSphere BigInsights cluster and returns it in
tabular format to be processed by SQL. The
requires two arguments. The first argument is a URL specifying where to
read the data. The second argument is an option string, and in this
example, the empty string (
'') means that no options are
The first argument for
HDFS_READ is a function call to
JAQL_SUBMIT runs a Jaql script on
InfoSphere BigInsights and returns a string that contains the URL that
describes the HDFS file that holds the script's output. In this example,
that URL becomes the input to
HDFS_READ is reading the output file that is generated
JAQL_SUBMIT requires four arguments. The first argument is a
Jaql script. In Listing 3, the Jaql script writes the string
world from DB2 for z/OS into an HDFS
Listing 3. JAQL script for Hello world
'["Hello world from DB2 for z/OS"]-> write(del(location="hdfs:///idz1470/iod00s/lab2e1.csv"));'
The second argument is a return-string, which must contain the output of the script that is specified in the first argument. In this example, the width of the SPUFI data set is restricted to 80 characters; therefore, the long URI is broken into two lines that are concatenated together.
In Listing 4, the first part of this URL, http://biginsights.demos.ibm.com:14000/webhdfs/v1 identifies the server from which the file is to be read. Port 14000 is the HttpFS server port. The rest of the URL, /idz1470/iod00s/lab2e1.csv, identifies the HDFS file to be read. This same file is written to by the Jaql script.
Listing 4. Long URI broken into two lines
As shown in Listing 5, the third argument is the URL of the InfoSphere BigInsights cluster, including port 8080, which is the InfoSphere BigInsights console port. This cluster is where the Jaql script is to be run.
Listing 5. InfoSphere BigInsights cluster in URL
The fourth argument is an options string, as shown in Listing 6. In this example, set the maximum allowable run time for the Jaql string to be 1,000 seconds.
Listing 6. Timeout options string
The table expression that is returned by
HDFS_READ is a
generic table expression. The
HDFS_READ function is created
GENERIC TABLE. This
clause indicates that the SQL statement determines the columns with names
and types that are to be returned by the function. In this exercise,
HDFS_READ returns the table
BIGINSIGHTS with one
HELLO with type
InfoSphere BigInsights web console to review the job
InfoSphere BigInsights can be accessed by using the web console application, which provides information similar to a system console that is delivered through a web browser. Use the InfoSphere BigInsights web console application to review the job or check the status while it is running.
To review the job that runs the Jaql script in Listing 3, follow these steps:
- In the web console, click the Files tab to open the
file browser, as shown in Figure 2.
Figure 2. Files tab in file browser
- As shown in Figure 3, using the directory tree outline on the left,
expand idz1470 and iod00s to find
the file that was created by the previous job. Look for the file
lab2e1.csv, which contains the output of the Jaql script.
Figure 3. HDFS directory tree
- Navigate to the Ad hoc Jaql query:
- Click the Applications tab (immediately to the right of the Files tab.
- Click Ad hoc Jaql query, as shown in Figure
Figure 4. Application history for ad hoc Jaql query
The Application History window contains a log of jobs that were submitted to the InfoSphere BigInsights cluster. You can see the jobs that are running and the jobs that are completed. Look for your job.
Hint: Although the jobs have only a machine-generated job name that is not easily correlated to a call from
JAQL_SUBMIT, click the job's details to see a start time and an end time (if the job completed). The times can help narrow down which job is yours on a busy system.
If the user-defined functions cannot access InfoSphere BigInsights, determine whether the problem is with the network or the problem is an server error that is returned via HTTP. Turn on the diagnostic trace, which can capture the network traffic that is sent and received:
- As shown in Listing 7, add the following
DDcard into the WLM proc:
Listing 7. DD card added to the WLM proc
//WSERROR DD PATH='/tmp/wsc.err', // PATHOPTS=(ORDWR,OCREAT,OAPPEND), // PATHMODE=(SIRUSR,SIWUSR,SIRGRP,SIWGRP,SIROTH,SIWOTH)
- Alter the user-defined function and add
RUN OPTIONSas shown in Listing 8.
Listing 8. Run options for turning on tracing
ALTER FUNCTION SYSFUN.JAQL_SUBMIT (SCRIPT VARCHAR(8000), PARMS VARCHAR(512), URL VARCHAR(512), OPTIONS VARCHAR(256)) RUN OPTIONS 'POSIX(ON),XPLINK(ON),ENVAR("DSNTRACE=ON")'; ALTER FUNCTION SYSFUN.HDFS_READ (URL VARCHAR(256), OPTIONS VARCHAR(256)) RUN OPTIONS 'POSIX(ON),XPLINK(ON),ENVAR("DSNTRACE=ON")';
- Refresh the WLM environment.
After the WLM environment is refreshed, the UNIX System Services file
/tmp/wsc.err (specified in the DD card in Step 1) is
created. When the user-defined functions are called, trace data is written
into the /tmp/wsc.err file. The trace data is written in ASCII character
encoding. As shown in Listing 9, the
iconv command can be
used in UNIX System Services to convert it to EBCDIC.
Listing 9. iconv command to convert to EBCDIC
iconv -f UTF-8 -t IBM-1047 < wsc.err > wsc_ebcdic.txt
This article describes how to enable the DB2 11 for z/OS server to
integrate with InfoSphere BigInsights by using the InfoSphere BigInsights
connector. After you create the functions
HDFS_READ on the server, they can be used by SQL to run
analysis jobs on InfoSphere BigInsights and to bring the results back to
DB2. Jobs can be monitored on InfoSphere BigInsights by using the web
In Part 2 of this series, learn how to build an analysis of MVS log data by using Jaql and integrate those results in DB2.
Thanks to Joachim Limburg for his comments and assistance with this article.
- Integrate DB2 for z/OS with InfoSphere BigInsights, Part 2: Use the InfoSphere BigInsights connector to perform analysis using Jaql and SQL (developerWorks, June 2013): Set up integration between IBM DB2 11 for z/OS and IBM InfoSphere BigInsights so BigInsights can access structured and non-structured data stored in the HDFS, return the results to DB2, and integrate the data with online transactional data.
- See the InfoSphere BigInsights product documentation.
- See to the DB2 for z/OS product documentation.
- Read Using InfoSphere BigInsights user-defined functions for DB2 for z/OS for information about DB2 functions for integrating with Jaql and HDFS.
- Follow these self-paced tutorials (PDF) to learn how to manage your big data environment, import data for analysis, analyze data with BigSheets, develop your first big data application, develop Big SQL queries to analyze big data, and create an extractor to derive insights from text documents with InfoSphere BigInsights.
- Get a technical introduction to InfoSphere BigInsights on Slideshare.
- Read "Understanding InfoSphere BigInsights" to learn more about the product's architecture and underlying technologies.
Get products and technologies
- Download InfoSphere BigInsights Quick Start Edition, available as a native software installation or as a VMware image.
- Ask questions and get answers in the InfoSphere BigInsights forum.
Dig deeper into Big data and analytics on developerWorks
Get samples, articles, product docs, and community resources to help build, deploy, and manage your cloud apps.
Experiment with new directions in software development.
Software development in the cloud. Register today to create a project.
Evaluate IBM software and solutions, and transform challenges into opportunities.