Level: Intermediate Li Chen (chenlill@cn.ibm.com), Software Engineer, IBM Xiong Xiong (xiongx@cn.ibm.com), Software Engineer, IBM, Software Group DaWei Zhang (zhangdaw@cn.ibm.com), Staff Software Engineer, IBM
29 Mar 2007 IBM® WebSphere® DataStage™ XML and Web services
packs are components of DataStage that help to deliver fast data integration solutions
when XML and Web services are involved. Explore the main functions and
operations of the DataStage Web Services and XML packs. Learn how to transform,
integrate data, and achieve Google search Web services using these two packs. This
article includes four samples, with a simple overview, detailed steps, instructions, and figures for each sample.
Introduction
DataStage XML and Web services packs overview
Ascential DataStage, acquired by IBM in 2005, is now known as IBM WebSphere
DataStage. It is a convenient, GUI-based extract, transform, and load (ETL) tool that
you can use to integrate organizational data of different structures and formats and effectively support customer relationship management (CRM) analysis, decision support systems (DSS), and e-business.
DataStage is based on a client-server infrastructure and includes designer, manager,
director, and administrator components, providing robust functionality. DataStage can dramatically simplify tasks from development to maintenance.
The XML pack and Web Services pack of DataStage are the focal points of this article. With these two packs, XML and Web services can be easily integrated into DataStage solutions. The advantages of utilizing these two packs with DataStage are:
- Reduced development time
- Fast user response
- User-friendly development, management, and maintenance GUI environment
DataStage XML (XML pack of DataStage) is composed of XML digesting, XML publishing, and transforming functions. The XML section of this article introduces one of the major conversions provided by the XML pack: transformation between XML documents and tabular data.
The Web Services pack enables DataStage to call remote services and treat those
services as simple data sources, data targets, or interactive data interface. The Web Services Transformer, which provides the interactive data interface, is discussed in the Web services part of this article.
This article include four samples of developing data solutions with DataStage (Version. 7.5.1A, with XML and Web services packs installed), along with analysis and notes for those samples. They are based on the following typical scenarios:
- Publishing XML document from tabular data
- Parsing XML document into tabular data
- Accessing a Web service with input and output data
Prerequisites
- WebSphere DataStage V7.5.1A with DB2® pack, XML pack, and Web services pack should be installed.
- IBM® DB2 for Linux®, UNIX®, and Windows® V8.2 (or higher) should be installed.
- The audience is assumed to have some practical experience of XML, Web services, and DB2, as well as basic knowledge of DataStage.
Part I. Publishing XML documents from table data
Publishing XML documents based on the existing table data is a common scenario.
Sometimes, relational tables or sequential files are required to be transformed to XML hierarchical structures, such as XML documents or XML chunks. In this case, the XML output stage can be used to generate XML output. It uses XPath expressions to map input table fields to certain positions in the output documents.
Sample 1. Generate
XML files based on two tables using XML output stage
Figure 1. Job diagram of XML publishing
Overview of sample 1
In sample 1, customer data and contact data has been extracted from two
corresponding DB2 tables respectively, as shown in Figure 1. The
transformer is used to replace the complex SQL, integrate the data, and feed the
joined data to the XML output stage through DSLink6. XML output stage then generates the XML results and saves them to the file system. Figure 1 briefly describes the whole application demo.
The general steps for deployment are:
-
Define and deploy DB2 tables
-
Prepare XML structure
-
Import DB2 table and XML table definitions
-
Set up DB2 stages with a transformer to provide joined data
-
Set up XML output stage to generate XML document
-
Compile and run
Let's examine these steps in detail:
Step 1. Define and deploy DB2 tables
Deploy these tables into the sample database using the DDL shown in Listing 1 on
the local DB2 server, and load some sample data, as shown in figures 2.1 and 2.2:
Listing 1. DDL for customer and contact table
--customer table
CREATE TABLE S_CUST
(
CUST_NUM CHARACTER(10) PRIMARY KEY,
CUST_NAME VARCHAR(100) NOT NULL
);
--contact table
CREATE TABLE S_CONTACT
(
CNT_NUM INTEGER
GENERATED ALWAYS AS IDENTITY( START WITH 1, INCREMENT BY 1 )
PRIMARY KEY,
CUST_NUM CHARACTER(10) NOT NULL,
F_NAME VARCHAR(50) NOT NULL,
L_NAME VARCHAR(50) NOT NULL,
EMAIL VARCHAR(100) NOT NULL
);
|
Figure 2.1. Sample data of customer table
Figure 2.2. Sample data of contact table
Step 2. Prepare XML structure
The target XML structure can be defined either through an XSD schema file or XML
sample document. Listing 2 is an XML sample document to be imported into DataStage:
Listing 2. target.xml - sample XML file
<CustomerInfo xmlns="http://www.ibm.com/schemas/SimpleXMLDemo"
xmlns:cnt="http://www.ibm.com/schemas/ContactDemo"
xmlns:xsd="http://www.w3.org/2001/XMLSchema"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<CustomerID>0000000001</CustomerID>
<CompanyName>MindLight Corp.</CompanyName>
<Contacts>
<cnt:Contact>
<cnt:FirstName>Lee</cnt:FirstName>
<cnt:LastName>Chen</cnt:LastName>
<cnt:Email>LeeChen@gmail.com</cnt:Email>
</cnt:Contact>L
</Contacts>
</CustomerInfo>
|
As shown Listing 2, there are five main fields in the XML file: CustomerID,
CompanyName, FirstName, LastName, and Email. Each of them has a counterpart column in the database tables.
Step 3. Import DB2 table and XML table definitions
To enter the data source information, you need to either input the table definitions manually or import the definitions directly from source data in DataStage. DataStage table definitions include metadata of many kinds of data sources in tabular forms (refer to Figure 3), including the name and location of the tables or files holding the data and a definition of the columns they contain. Information is stored in table definitions in the DataStage repository.
In this sample, two database tables are imported from DB2, and the XML structure is extracted from the sample file.
Note: DB2 and XML plug-in modules should be installed to enable these features.
Figure 3. Table definition structure
Import DB2 table definitions
Select the Plug-in Meta Data Definitions (in red circle, as shown in Figure 4) to import a DB2 table.
Figure 4. Plug-in metadata definitions
Select DB2 plug-in from the pop up window, and click OK. On the first page of
the "Import DSDB2 Meta Data" window (refer to Figure 5), select the SAMPLE
database with proper username and password as the source to import the metadata. Make
sure you select the Tables and Fully Qualified Table Names check boxes
so the table names display with schema name in the next window. Click the Next button.
Figure 5. Import DSDB2 metadata 1
DB2 table definitions can be stored in category "PlugIn\sample," as shown in Figure
6:
Figure 6. Import DSDB2 metadata 2
Import XML table definitions
Right-click on Table definitions in the repository view, and choose Import
> XML Table Definition (similar to the action shown in Figure
4). After doing so, the XML Meta Data Importer will be activated (refer to Figure 7).
Figure 7. XML Meta Data importer
The XML Meta Data Importer can import XML structure from either XML sample or XSD
files. Open the target.xml file in the XML Meta Data
Importer. Once opened, the XML file is mapped into tabular structure. As depicted in
Figure 7, select the TEXT fields of the five data elements (marked in red
circle) as the columns of table definition to extract text content of the XML
elements. The "Description" field (marked in red box of Figure 7) contains the XPath,
which indicates the corresponding position of a column in the XML document. In this
sample, it is assumed that every contact record has a unique e-mail address that is
used as the "Key" column. Save the table definition in the "PlugIn\sample" category, named as "target_xml".
As of now, three table definitions (in red circle of Figure 8) have been imported into the category successfully.
Figure 8. Imported table definitions
Step 4. Set up DB2 stages with a transformer to provide joined
data
With the table definitions ready in repository, it is time to set up the data source
for XML output. There are two options to extract the data:
- Use the SQL, as shown in Listing 3, in one DB2 stage
- Use the transformer to join the data from multiple stages
This sample follows the second option:
Listing 3. Query for the data extraction
SELECT cnt.CUST_NUM,c.CUST_NAME,cnt.F_NAME,cnt.L_NAME,cnt.EMAIL
FROM S_CONTACT cnt
JOIN S_CUST c ON cnt.CUST_NUM=c.CUST_NUM ;
|
Figure 9. Results of query from Listing 3
To achieve the same output shown in Figure 9 in DataStage, two DB2/UDB API stages (in Database category of the Palette) and a Transformer stage (in Processing category of the Palette) are adopted. These stages are added to a new job diagram and named as follows:
- The two DB2 stages: "CONTACT" and "CUST"
- The Transformer: "JOIN_CUST_CNT"
Connect "CONTACT" and "CUST" to the "JOIN_CUST_CNT" using DSLink3 and DSLink4
respectively. (See Figure 10.)
Figure 10. Job diagram
As a database developer, you usually join the customer table to the contact table
and treat the customer as the main table. But in this sample, the contact table is
the main one, indicated by the solid arrow DSLink3. The transformer carries out a
lookup action in non-primary tables for records of the primary table. For a single
record of the primary table, only one record will be retrieved from the non-primary tables.
Note: Only ODBC and UniVerse stages support multi-row lookup.
"CONTACT" stage configuration
Double-click on the "CONTACT" stage to configure its properties. On the "Stage"
page, fill in the DB2 connection information: database name, username, and password.
Switch to the "Output" page, and click on the columns tab. The column
information includes column name, derivation, type, and so on. Columns can be edited
directly or loaded from an existing table definition. Here, use the imported
CHENLI.S_CONTACT table definition to load the columns for this stage. Click the
Load button, select CHENLI.S_CONTACT in the pop-up table definition
window, press OK, then select all the columns of the table definition in the
successive column selection window. When the columns are loaded, the result looks
like Figure 11:
Figure 11. Loaded columns
Set up the SQL in the "CONTACT" stage to extract the data from DB2. Select the
SQL tab, and click SQL Builder. The SQL Builder, a GUI tool, is
activated by default to assist you in completing the SQL query (refer to Figure 12).
Figure 12. SQL builder
Click OK when the SQL is ready. The "SQL" tab should look like Figure 13. You
can now select the "View Data" button to check the "CONTACT" table contents.
Figure 13. SQL tab
As of now, the configuration of the "CONTACT" stage is complete. You can follow the
same steps to set up the "CUST" stage. Afterwards, two tables' icons will be displayed on both links.
"JOIN_CUST_CNT" stage configuration
Now you can configure the transformer "JOIN_CUST_CNT" to aggregate the two input DB2
stages together and produce the joined data. Drag CUST_NUM from DSLink 3 to the
CUST_NUM of DSLink4, as shown in Figure 14:
Figure 14. Transformer stage
Step 5. Setting up the XML output stage to generate XML
document
Add the XML output stage to the diagram from the "Real Time" category of "Palette,"
named "Gen_Cust_XML," and add the link from the "JOIN_CUST_CNT" transformer stage to
it, as shown in Figure 15:
Figure 15. Add XML Output to the job diagram
Double-click on the Gen_Cust_XML stage to set its properties. On the "Stage"
page, you only need to set the "Namespace Declarations" in the "Document Settings"
tab (see Figure 16) and the file path of the output XML file in the "Options" tab
(see Figure. 17) for current solution. Loading namespaces declaration is similar to
loading columns from a table definition -- it loads namespaces from the XML table definition.
Figure 16. Load namespaces declaration
Figure 17. Set output file path
On the "Input" page, you need fill out the columns tab. The process is similar to
loading output columns for DB2 stages. After loading the XML table definition
"target_xml," the XML output stage is shown as illustrated Figure 18.
Figure 18. Load columns from XML table definition
Note: Each customer can own multiple contacts in the target XML structure, so
you need to aggregate the contact information for each customer to reduce redundancy.
For this purpose, set the "transformation settings" as shown in Figure 19:
Figure 19. Transformation settings
Using the trigger column "CustomerID," the output will generate a new record
containing a single XML document only when the "CustomerID" field is changed to a new
value. The data from Figure 9 will be aggregated into a single XML document.
Go back to the "JOIN_CUST_CNT" stage and link the columns on the left side of input
to the output on the right side by dragging those fields. (Refer to Figure 20.)
Figure 20. Transformer stage revisited
Click OK. The XML publishing solution is completed so far.
Step 6. Compile and run
Compile and run this job in the designer. If there is no error, you can add it to
the job schedule as a routine task. When running the job in the designer, the links
with relevant statistics info will turn green if no errors are found (see Figure 21).
If an error occurs, the links will be red. You have to check the logs for debugging. In Figure 21, this job has successfully generated a target XML file: SampleOutput1.xml.
Figure 21. Running the job of sample 1
Part II. Parsing XML document into tabular data
This scenario is a reverse process of publishing XML data -- XML data comes in,
you parse it into tabular records, then insert the records into database tables. Two
samples will be demonstrated to show how to parse XML document into tabular data.
sample 2 is the reverse process of the former XML publishing sample, loading XML
documents to the customer table and the contact table. Sample 3 contains a more
complex XML input as well as db2 outputs (three tables). Different deployment
approaches will be used in these two cases.
Sample 2. Extract
customer and contact information from XML document using XML input stage
Overview of sample 2
In sample 2, use the table definitions from sample 1 (refer to Figure 8) and prepare input date using an XML file with a new company and
two contacts(see input_cust_cnt.xml). In this case,
you use a folder stage to read the XML file, pass the content to the XML input stage
to parse out tabular records, and then use the transformer to make the columns
diverge into different tables. The job diagram is shown in Figure 22:
Figure 22. Sample 2 job diagram
The expected outputs of "XML_CUST_CNT" stage will be similar to the rows shown in Figure 9.
Note: There will be duplicate info on columns CUST_NUM and CUST_NAME, which
will be passed to the customer table and will cause primary key collision. In order
to handle this issue, you can use an aggregator to consolidate such duplicate records
into one record for each customer.
General steps to develop this sample solution include:
-
Set up folder stage to read XML file(s)
-
Set up XML input stage
-
Set up DB2 stages to receive data
-
Configure the aggregator
-
Set up the transformer stage for distributing data
-
Compile and run
Let's examine these steps in detail.
For this sample, you can add all the stages and links shown in Figure 22 to the new job first, then work on the settings of each stage
step by step after.
Step 1. Set up folder stage to read XML file(s)
Double-click the XML_FOLDER stag. Set the folder pathname property in the
"Stage" page. Fill the wild card property (filename wildcard*.txt, for example; we
use *.xml for this case) in the "Output" page.
Then you can load columns for the "XML_FOLDER" stage, adopting the build-in table
definition "Folder," as shown in Figure 23. The "Folder" definition contains two
fields: "FileName" and "Record." The "Record" column will contain the contents of the input files from the source folder.
Figure 23. Load columns from "Folder" table definition
Step 2. Set up XML input stage
Double-click XML_CUST_CNT to open its properties window. Specify "Record" as
the XML source column, from which to get input XML documents. Then set "Column content" to "XML document" accordingly.
Figure 24. Set up XML source
On the "Output" page, as shown in Figure 25, fill in the
"Transformation Settings" tab and the "Columns" tab to enable the "XML_CUST_CNT"
stage to parse the input XML documents and generate output records. First, select
Repetition element required, as the input documents contain multiple contact
elements. Second, load namespaces from the XML table definition (similar to that
shown in Figure 16).
Note: Namespaces declaration is a must when the incoming document contains multiple namespaces and prefixes.
Figure 25. Transformation settings of "XML_CUST_CNT" stage
Generally there is no need to load column definitions for the XML input stage manually because loading namespaces will trigger loading columns of the XML table definition into the "Columns" tab at the same time (refer to Figure 26).
Note: In the XML input Stage, namespaces and columns are loaded at the same
time, unlike the XML output Stage where they have to be loaded separately.
Figure 26. Output columns of "XML_CUST_CNT" stage
Each column definition has an XPath expression in its description field, mapping a
certain element in the XML document to a corresponding output data column. These
XPaths are the most critical settings for parsing XML. In addition, the key field
specifies repetition elements. Here, the e-mail element is chosen to be the key since it is unique for a contact.
Now setup for "XML_CUST_CNT" stage is complete.
Step 3. Set up DB2 stages to receive data
Server information for the two DB2 stages "CONTACT" and "CUST" is configured first.
But unlike previous DB2 stages sample, you now work on the "Input" page of DB2
stage, as input links are used instead of output links. Select the target table (see
Figure 27), and load columns for each DB2 stages.
Note: Auto-generated columns should not be included.
Figure 27. Select target table
For contact table, column "CNT_NUM" is not selected because it is an auto
incremental id column and its value will be generated automatically when the new
record is inserted. On the other hand, all the columns of the customer table are
selected, as they are all from the XML document.
Step 4. Configure the aggregator
Because the XML input stage will generate multiple rows for the same customer, an
aggregator stage is needed to consolidate redundant information of customer for the "CUST" stage.
The input data received by the aggregator should have the same structure as the
output link, which is defined by the "CUST" stage. (Essentially, it is defined by the
DB2 table "CHENLI.S_CUST".) So load the table definition "CHENLI.S_CUST" as input of this aggregator.
On the outputs page of the aggregator, the output columns are "CUST_NUM" and
"CUST_NAME". You need to define the derivation of the two output columns to eliminate duplicate records, as shown in Figure 28.
Figure 28. Aggregator stage
Double-click on the cell that's highlighted in Figure 28 under the derivation
column. A "Derivation" dialog box will appear for configuration.
Figure 29 depicts a detailed configuration of derivation with the aggregate function
"grouped."
Figure 29. Derivation dialog box
Besides "grouped," you can use functions like Max, Min, Count, Average, Sum, and
more for other cases in the Aggregate function selection box.
Step 5. Set up the transformer stage for distributing data
Link the source to the targets in the transformer properties window (see Figure 30).
Figure 30. Distributing data in the transformer
Press OK when finished. Sample 2 is ready to be run.
Step 6. Compile and run
If everything is ok, a view like that shown in Figure 31 can be seen in the designer when the loading is done.
Figure 31. Successfully load a XML document
Check the tables in DB2 using the SQL in Listing 4, and the result is shown in
Figure 32:
Listing 4. Check the tables in DB2
SELECT * FROM S_CUST cust WHERE cust.CUST_NUM='0000000002';
SELECT * FROM S_CONTACT cnt WHERE cnt.CUST_NUM='0000000002';
|
Figure 32. Results of SQLs in Listing 4
Sample 3. Extract
data from the XML document into three tables using a bunch of XML input stages
Overview of sample 3
In sample 2, the XML document (input_cust_cnt.xml) represents a typical scenario that
contains data distributed into only two tables, one referencing the other. This is
a more
complicated case: The XML document has multiple bunches of repetition elements (see
Listing 5) or nested repetition elements (see
Listing 6), which distribute data
into more than two tables. The corresponding relational models can be described as follows:
- Three or more tables are reference to a header table
- Three or more tables nested into a tree structure
- A mixture of 1 and 2
Although this sort of documents can hardly be handled by a single XML input, they
are the most common cases since various related data can be encapsulated into a single document at the same time.
Listing 5. XML document has multiple bunches of repetition elements
<A>
<A_DATA/>
<B>
<B_DATA/>
<B_DATA/>
<B_DATA/>
</B>
<C>
<C_DATA/>
<C_DATA/>
</C>
</A>
|
Listing 6. XML document has multiple bunches of nested repetition elements
<A>
<B>
<B_DATA/>
<C/>
<C/>
<C/>
</B>
<B>
<B_DATA/>
<C/>
<C/>
<C/>
</B>
</A>
|
In order to handle XML documents as those in Listings 5 and 6, a simple solution is
to use a bunch of XML input stages linked by transformer stages to receive and handle
different branches of the document separately. In sample 3, the element "Orders" is
added to the XML structure from sample 2 (see input_cust_cnt.xml) to build a new document, input_cust_cnt_ord.xml (also shown in Listing 7 below).
Per the structure in Listing 5, a new table "S_ORDER" (see Listing
8) is created accordingly.
Listing 7. Input_cust_cnt_ord.xml
<?xml version="1.0" encoding="UTF-8"?>
<CustomerInfo xmlns:cnt="http://www.ibm.com/schemas/ContactDemo"
xmlns="http://www.ibm.com/schemas/SimpleXMLDemo"
xmlns:xsd="http://www.w3.org/2001/XMLSchema"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<CustomerID>0000000003</CustomerID>
<CompanyName>Eastway software Inc.</CompanyName>
<Orders>
<OrderItem>
<OrderNum>00001</OrderNum>
<OrderQty>1.00</OrderQty>
<PartId>A01</PartId>
</OrderItem>
<OrderItem>
<OrderNum>00002</OrderNum>
<OrderQty>2.00</OrderQty>
<PartId>A02</PartId>
</OrderItem>
<OrderItem>
<OrderNum>00003</OrderNum>
<OrderQty>4.00</OrderQty>
<PartId>A03</PartId>
</OrderItem>
</Orders>
<Contacts>
<cnt:Contact>
<cnt:FirstName>Andy</cnt:FirstName>
<cnt:LastName>Lin</cnt:LastName>
<cnt:Email>AndyLin@eastway.com</cnt:Email>
</cnt:Contact>
<cnt:Contact>
<cnt:FirstName>Tom</cnt:FirstName>
<cnt:LastName>Zheng</cnt:LastName>
<cnt:Email>TomZ@eastway.com</cnt:Email>
</cnt:Contact>
</Contacts>
</CustomerInfo>
|
Listing 8. DDL of order table
CREATE TABLE S_ORDER
(
ORDER_NUM CHARACTER(5) PRIMARY KEY,
CUST_NUM CHARACTER(10) NOT NULL,
PART_ID CHARACTER(3) NOT NULL,
QTY DECIMAL(10,3)
);
|
In this solution, two layers of XML input stages are used to process the XML
document in two phases. In the first phase, an XML input stage is used to parse the
customer information and get the XML chunks of contacts and orders. You then pass the
two chunks to two separate XML input stages of the next phase to extract data for the
table "S_CONTACT" and "S_ORDER". The job diagram is shown in Figure 33:
Figure 33. Sample 3 job diagram
General steps to develop this sample solution include:
- Set up the folder stage
- Configure the three XML input stages
- Set up the transformer to link related stages
- Set up DB2 output stage
- Compile and run
Let's examine these steps in detail.
The methods of setting up each stage are already covered in the previous section.
Therefore, this section omits these details and focuses on the tricky parts: the
"Columns" settings of these XML input stages. To start, import the XML table
definitions through the XML Meta Data Importer (refer to Figure 7). Secondly, modify the column lists and XPaths of the definitions when applying them.
The first XML input stage is named "XML_Cust." Set up its output link columns to
extract the fields CustomerID and CompanyName for table "S_CUST" and XML chunks for
<Orders/> and <Contacts/>, as shown in Figure 34:
Figure 34. Output Columns tab of "XML_Cust"
Note: In this particular case, in order to extract XML chunks,/text() should not be appended to the end of the XPath expression,
"/ns1:CustomerInfo/ns1:Orders," for example. Accordingly, the column will return the whole XML structure of that element and can easily be further processed by other stages.
In order to receive XML chunks and produce records with references to the customer
table, you have to set the other two XML input stages manually to get CUST_NUM field
and XML chunk field in the input columns, like Figure 35.
Figure 35. Input Columns tab of "XML_Contact"
When the input columns of "XML_Contact" stage and "XML_Order" stage are ready, use
the transfomer "Div_branches" to connect them.
Figure 36. Distributing data in "Div_branches" stage
Then the output columns of "XML_Contact" and "XML_Order" stages are configured respectively.
Note: There is no transformer between the XML input stage and the DB2 stage to
map the output fields to DB2 columns. So you have to make sure the names of the
output columns in the XML input stage ("XML_Contact" and "XML_Order") are exactly the
same with the names in the database tables. Once they are all set, the input columns
in the DB2 stages will be loaded automatically. DO NOT override the input columns in
DB2 stages by reloading table definitions, otherwise the XPath in the foregoing XML input stage will be lost.
Figure 37. Output columns of "XML_Contact" stage
Figure 38. Output columns of "XML_Order" stage
Note: "CUST_NUM" field exists in both input and output columns of the two XML
Input stages. This field is a "pass-through" field that is directly passed from the input link to the output link.
Run the job after setting up all the stages, and you should be able to see a successful view like Figure 39.
Figure 39. Successfully load an XML
file into three tables
Check the data in DB2.
Listing 9. SQL for checking table data
SELECT * FROM S_CUST cust WHERE cust.CUST_NUM='0000000003';
SELECT * FROM S_CONTACT cnt WHERE cnt.CUST_NUM='0000000003';
SELECT * FROM S_ORDER ord WHERE ord.CUST_NUM='0000000003';
|
Figure 40. Results of SQLs in Listing 9
Part III. Accessing
Web services with input and output data
With Web services pack, DataStage can easily access Web service operations within a
DataStage server job. The Web service pack will add two real-time stages to
DataStage: Web services client stage and Web services transformer stage. When the Web
service is supposed to act as either a data source or a data target, the Web services
client stage is used. In this case, only one link, input or output is required in a
single Web service operation. However, the Web services transformer stage has both
input and output links and can interact with remote services in a sense of data
transaction. The Web services pack will also enable DataStage to support Web service
routines, which are used as transformer functions in transformer stages. This part of
the article only uses and discusses Web services transformer in the sample, as it is
more typical and flexible for adopting Web services in DataStage.
To integrate a Web services transformer into a solution, you need to import the WSDL
Web services description file, add a Web services transformer to your solution, and
choose the target service in this Web services transformer. Then feed data to it and handle the output in proper ways, like extracting and saving data to files or databases. The following sample is intended to demonstrate this feature in a practical case.
Sample 4. Query
Google with keywords from a database table and store the responses in another table
Overview of sample 4
GoogleSearch service is selected as the target service since it is a typical and
common scenario: a query is submitted, then records are returned. Besides, the Web
services is already available and free to be used. GoogleSearch service is in the Google SOAP Search API
project (see Resources),
including three operations: doGoogleSearch, doGetCachedPage and doSpellingSuggestion.
In this example, the doGoogleSearch operation is used. Before accessing the service,
you have to sign up for a free account (a Web page address) in order to get a license
key that is sent in the requested messages to identify the requester. In the
transaction of doGoogleSearch operation, a query is sent to Google with the license
key, and a response message returns containing several hits in a repetition structure.
Note: There is a limitation that Web services transformer in DataStage 7.5.1A
cannot parse repetition elements in the response into multiple rows of records. Each
input row can only generate one row of output data. Fortunately, Web services
transformer uses XPath to extract data from the response document and supports XML
chunks output. Thus, you can extract the XML chunks containing the repetition
structure from an original message and then pass them to following XML input stage(s). In this way, the limitation can be bypassed and even more complex structure can be processed.
The target solution of sample 4 will extract query keywords from a DB2 table and
feed them to the Web services transformer. Web services transformer will invoke the
Web services and output the response messages directly to a following XML input
stage. Finally, you put the results into another DB2 table. The job diagram is shown
in Figure 41:
Figure 41. The expected job diagram
of sample 4
In this sample, several steps will be taken to set up the job, which include:
-
Import the target services from WSDL file
-
Set up Web services transformer
-
Set up the XML Input stage
-
Create DB2 tables and set up database stages for input and
output
-
Compile and run
Let's examine these steps in detail.
Step 1. Import the target service from WSDL file
Right-click on Table Definition in the repository view, and select Import
> Web Services WSDL Definitions (refer to Figure
4).The Web Services Meta Data Importer is activated (see Figure 42). Use it to load
the WSDL file. Select the target operations, then click the Import button to save the definition.
Figure 42. Web Services Meta Data Importer
Several table definitions will be generated for an operation imported.
doGoogleSearch is not an except; the following table definitions are supposed to be created automatically:
- Info_WS: Web Services Meta Data like URI, port name and etc.
- doGoogleSearch_IN: input parameters
- doGoogleSearch_MSGIN: input message meta data
- doGoogleSearch_MSGOUT: output message meta data
- doGoogleSearch_OP: operation meta data
- doGoogleSearch_OUT: output parameters
Step 2. Set up Web services transformer
Web services transformer is the core component of the solution since data structure
of input and output depends on it fully. Therefore, Web services transformer is to be
settled with high priority. In the stage page of Web services transformer, only the
Web services operation needs to be configured to access an open public service like GoogleSearch with basic function, which is discussed in this sample.
Note: Moreover, the stage page provides many more options such as setting
proxy, HTTP AUTH, HTTPs, Web services parameters, and so on for more advanced application.
To configure the target operation for Web services transformer, click Select Web
Service Operation in the "Stage" page, and double-click the target operation in
the Web services browser, as shown in Figure 43:
Figure 43. Select Web Service Operation
Switch to the "Input Message" tab on the "Input" page(see Figure 44).
Figure 44. Load input message metadata
Click Load
Message Information to load metadata, namespaces, and columns automatically from
the selected Web services operation.
Figure 45. Input columns are also loaded when metadata is loaded
The columns shown in Figure 45 are the actual parameters that need to be fed to
Google in the doGoogleSearch operation, such as license key (key), query keyword (q), and many other configuration parameters.
The settings for output link is similar to the input link -- you load metadata in
its "Output Message" tab of the Output page. As mentioned in step 1 of sample 4, an
XML input stage is used to handle the repetition element in responses. The output
link of this Web services transformer has to be customized to output the whole SOAP
document of the response. In Figure 46, the column "RS_XML" contains the response
in the form of a valid XML document:
Figure 46. customized output columns of "GoogleWST" stage
Step 3. Setting up the XML input stage
Configuring the XML input stage is already covered by previous samples (refer to
step 2 of sample 2). The only difference here is the table definition for output
link, which is not from an imported XML table definition but from the Web services
output table definition, named "doGoogleSearch_OUT". There are lots of columns in
this table definition, representing every data element defined in the response
message. This sample only uses a portion of them: URL, snippet, title, cached size,
related information present, host name, and the query keyword itself for test purpose.
Figure 47. Output columns of the
"RS_parsing" XML input stage
Note: The URL column is set to be the "Key" column because it can distinguish each search hit returned.
Step 4. Create DB2 tables and set up database stages for input
and output
Create a simple query table "S_GOOGLE_Q" to contain only id and keyword field in it
and a result table named "S_GOOGLE_RS" covering all the response columns picked out previously:
Listing 10. DDLs for query table and response table
CREATE TABLE S_GOOGLE_Q
(
Q_ID INTEGER
GENERATED ALWAYS AS IDENTITY( START WITH 1, INCREMENT BY 1)
PRIMARY KEY,
KEYWORD VARCHAR(255)
);
CREATE TABLE S_GOOGLE_RS
(
R_ID INTEGER
GENERATED ALWAYS AS IDENTITY( START WITH 1, INCREMENT BY 1)
PRIMARY KEY,
URL VARCHAR(255) NOT NULL,
CACHEDSIZE VARCHAR(255) NOT NULL,
HOSTNAME VARCHAR(255) NOT NULL,
RELATEDINFORMATIONPRESENT VARCHAR(255) NOT NULL,
SNIPPET VARCHAR(255) NOT NULL,
TITLE VARCHAR(255) NOT NULL,
SEARCHQUERY VARCHAR(255) NOT NULL
);
|
Set up two DB2 stages, "REQUESTS" and "RESULTS," for the two DB2 tables. In the
"REQUESTS" DB2 stage, columns should include all the fields required by the
doGoogleSearch operation. The table definition is also loaded from the Web services table definition "doGoogleSearch_IN".
Note: If SQL type column is "unknown," as shown in Figure
45, you have to assign a regular type like "VarChar" to them.
Figure 48. Output columns of "REQUEST" DB2 stage
Select Enter custom SQL statement in the SQL tab, and fill in a query
to extract keyword from "S_GOOGLE_Q" table, including constant columns for the other
input parameters of the Web services operation (see Figure 49).
Figure 49. Use custom SQL statements to prepare data
A transformer has to be used to connect the "REQUEST" stage to "GoogleWST" stage.
Otherwise the job won't start even if the column names exactly match in the two
stages. This is possibly caused by the "UnKnown" SQL type in the Web services table definition.
Figure 50. "COL_mapping" transformer setting
Now configure the "RESULTS" DB2 stage. You only need to configure the server info
and select the target table. The column names in the table definition of the XML
input stage, "RS_parsing" (refer to Figure 41), is exactly the
same with the names in the database table. DB2 stage can insert these columns into
the DB2 table automatically since it's using input column names to create insert statements.
Step 5. Compile and run
The job is available to be compiled and run, as the above settings are complete. But
before accessing the online service, some keywords should be prepared in the query
table. For instances, three keywords, "Beijing," "space shuttle," and "iguana," are
used. (See Figure 51.)
Figure 51. Done google search successfully
From Figure 51, you can see that 30 records have been inserted into the result table
-- 10 for each keyword. Check the "S_GOOGLE_RS" table to examine the data.
Listing 11. SQL to check results returned from google
SELECT searchquery, url, hostname FROM S_GOOGLE_RS;
|
Figure 52. Results of SQL in Listing 11
Additional: Use user-defined message to invoke Web services
Sometimes, the input message is very complex, or the message body is already available. In such cases, using table data as input may not be the best solution.
The Web services transformer supports user-defined messages and accepts XML
documents as message body. This feature provides additional flexibility. You can
choose the appropriate method to prepare the request message, such as manually typing
out a message document, or using a third-party program to generate complex XML documents from the database.
To show the usage of user-defined message, you can slightly change the job in sample
4, using a folder stage to feed an XML file (googleReq.xml) to the Web services transformer. The new job is
shown in Figure 53:
Figure 53. Web services job using
user-defined message
Figure 54. User-defined message
settings of Web services transformer stage
Conclusion
The XML and Web services packs are powerful and convenient toolkits. This article
discussed how to publish an XML document from DB2 tables, how to extract data from
the XML document, and how to access the Web services using four samples, including key procedures as well as analysis.
DataStage provides many other useful components and intuitive ways of combining them to build very flexible integration solutions.
In your own practice, you are sure to find more interesting applications of
DataStage's XML and Web services functionalities and of other DataStage components as well.
Download | Description | Name | Size | Download method |
|---|
| Exported DataStage Job Samples for this article | Samples.zip | 22KB | HTTP |
|---|
Resources Learn
Get products and technologies
-
Build your next development project with
IBM
trial software, available for download directly from developerWorks.
Discuss
About the authors  | 
|  | Li Long Chen is a software engineer in the IBM China Development Lab. He works on technical support for the data layer of e-business applications. His technical focus is on Web applications, Web services, and Java technologies.
|
 | 
|  |
Xiong Xiong is a software engineer in the IBM China Development Lab. He works on technical support for data layer of e-business applications. |
 | 
|  | DaWei Zhang is a staff software engineer in the IBM China Development Lab. He works on technical support for data layer of e-business applications. His interests include DB2, UNIX, Web services, and Java technologies. |
Rate this page
|