Transform and integrate data using WebSphere DataStage XML and Web services packs
DataStage XML and Web services packs overview
Ascential DataStage, acquired by IBM in 2005, is now known as IBM WebSphere DataStage. It is a convenient, GUI-based extract, transform, and load (ETL) tool that you can use to integrate organizational data of different structures and formats and effectively support customer relationship management (CRM) analysis, decision support systems (DSS), and e-business.
DataStage is based on a client-server infrastructure and includes designer, manager, director, and administrator components, providing robust functionality. DataStage can dramatically simplify tasks from development to maintenance.
The XML pack and Web Services pack of DataStage are the focal points of this article. With these two packs, XML and Web services can be easily integrated into DataStage solutions. The advantages of utilizing these two packs with DataStage are:
- Reduced development time
- Fast user response
- User-friendly development, management, and maintenance GUI environment
DataStage XML (XML pack of DataStage) is composed of XML digesting, XML publishing, and transforming functions. The XML section of this article introduces one of the major conversions provided by the XML pack: transformation between XML documents and tabular data.
The Web Services pack enables DataStage to call remote services and treat those services as simple data sources, data targets, or interactive data interface. The Web Services Transformer, which provides the interactive data interface, is discussed in the Web services part of this article.
This article include four samples of developing data solutions with DataStage (Version. 7.5.1A, with XML and Web services packs installed), along with analysis and notes for those samples. They are based on the following typical scenarios:
- Publishing XML document from tabular data
- Parsing XML document into tabular data
- Accessing a Web service with input and output data
- WebSphere DataStage V7.5.1A with DB2® pack, XML pack, and Web services pack should be installed.
- IBM® DB2 for Linux®, UNIX®, and Windows® V8.2 (or higher) should be installed.
- The audience is assumed to have some practical experience of XML, Web services, and DB2, as well as basic knowledge of DataStage.
Part I. Publishing XML documents from table data
Publishing XML documents based on the existing table data is a common scenario. Sometimes, relational tables or sequential files are required to be transformed to XML hierarchical structures, such as XML documents or XML chunks. In this case, the XML output stage can be used to generate XML output. It uses XPath expressions to map input table fields to certain positions in the output documents.
Sample 1. Generate XML files based on two tables using XML output stage
Figure 1. Job diagram of XML publishing
Overview of sample 1
In sample 1, customer data and contact data has been extracted from two corresponding DB2 tables respectively, as shown in Figure 1. The transformer is used to replace the complex SQL, integrate the data, and feed the joined data to the XML output stage through DSLink6. XML output stage then generates the XML results and saves them to the file system. Figure 1 briefly describes the whole application demo.
The general steps for deployment are:
- Define and deploy DB2 tables
- Prepare XML structure
- Import DB2 table and XML table definitions
- Set up DB2 stages with a transformer to provide joined data
- Set up XML output stage to generate XML document
- Compile and run
Let's examine these steps in detail:
Step 1. Define and deploy DB2 tables
Listing 1. DDL for customer and contact table
--customer table CREATE TABLE S_CUST ( CUST_NUM CHARACTER(10) PRIMARY KEY, CUST_NAME VARCHAR(100) NOT NULL ); --contact table CREATE TABLE S_CONTACT ( CNT_NUM INTEGER GENERATED ALWAYS AS IDENTITY( START WITH 1, INCREMENT BY 1 ) PRIMARY KEY, CUST_NUM CHARACTER(10) NOT NULL, F_NAME VARCHAR(50) NOT NULL, L_NAME VARCHAR(50) NOT NULL, EMAIL VARCHAR(100) NOT NULL );
Figure 2.1. Sample data of customer table
Figure 2.2. Sample data of contact table
Step 2. Prepare XML structure
The target XML structure can be defined either through an XSD schema file or XML sample document. Listing 2 is an XML sample document to be imported into DataStage:
Listing 2. target.xml - sample XML file
<CustomerInfo xmlns="http://www.ibm.com/schemas/SimpleXMLDemo" xmlns:cnt="http://www.ibm.com/schemas/ContactDemo" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"> <CustomerID>0000000001</CustomerID> <CompanyName>MindLight Corp.</CompanyName> <Contacts> <cnt:Contact> <cnt:FirstName>Lee</cnt:FirstName> <cnt:LastName>Chen</cnt:LastName> <cnt:Email>LeeChen@gmail.com</cnt:Email> </cnt:Contact>L </Contacts> </CustomerInfo>
As shown Listing 2, there are five main fields in the XML file: CustomerID, CompanyName, FirstName, LastName, and Email. Each of them has a counterpart column in the database tables.
Step 3. Import DB2 table and XML table definitions
To enter the data source information, you need to either input the table definitions manually or import the definitions directly from source data in DataStage. DataStage table definitions include metadata of many kinds of data sources in tabular forms (refer to Figure 3), including the name and location of the tables or files holding the data and a definition of the columns they contain. Information is stored in table definitions in the DataStage repository.
In this sample, two database tables are imported from DB2, and the XML structure is extracted from the sample file.
Note: DB2 and XML plug-in modules should be installed to enable these features.
Figure 3. Table definition structure
Import DB2 table definitions
Select the Plug-in Meta Data Definitions (in red circle, as shown in Figure 4) to import a DB2 table.
Figure 4. Plug-in metadata definitions
Select DB2 plug-in from the pop up window, and click OK. On the first page of the "Import DSDB2 Meta Data" window (refer to Figure 5), select the SAMPLE database with proper username and password as the source to import the metadata. Make sure you select the Tables and Fully Qualified Table Names check boxes so the table names display with schema name in the next window. Click the Next button.
Figure 5. Import DSDB2 metadata 1
DB2 table definitions can be stored in category "PlugIn\sample," as shown in Figure 6:
Figure 6. Import DSDB2 metadata 2
Import XML table definitions
Right-click on Table definitions in the repository view, and choose Import > XML Table Definition (similar to the action shown in Figure 4). After doing so, the XML Meta Data Importer will be activated (refer to Figure 7).
Figure 7. XML Meta Data importer
The XML Meta Data Importer can import XML structure from either XML sample or XSD files. Open the target.xml file in the XML Meta Data Importer. Once opened, the XML file is mapped into tabular structure. As depicted in Figure 7, select the TEXT fields of the five data elements (marked in red circle) as the columns of table definition to extract text content of the XML elements. The "Description" field (marked in red box of Figure 7) contains the XPath, which indicates the corresponding position of a column in the XML document. In this sample, it is assumed that every contact record has a unique e-mail address that is used as the "Key" column. Save the table definition in the "PlugIn\sample" category, named as "target_xml".
As of now, three table definitions (in red circle of Figure 8) have been imported into the category successfully.
Figure 8. Imported table definitions
Step 4. Set up DB2 stages with a transformer to provide joined data
With the table definitions ready in repository, it is time to set up the data source for XML output. There are two options to extract the data:
- Use the SQL, as shown in Listing 3, in one DB2 stage
- Use the transformer to join the data from multiple stages
This sample follows the second option:
Listing 3. Query for the data extraction
SELECT cnt.CUST_NUM,c.CUST_NAME,cnt.F_NAME,cnt.L_NAME,cnt.EMAIL FROM S_CONTACT cnt JOIN S_CUST c ON cnt.CUST_NUM=c.CUST_NUM ;
Figure 9. Results of query from Listing 3
To achieve the same output shown in Figure 9 in DataStage, two DB2/UDB API stages (in Database category of the Palette) and a Transformer stage (in Processing category of the Palette) are adopted. These stages are added to a new job diagram and named as follows:
- The two DB2 stages: "CONTACT" and "CUST"
- The Transformer: "JOIN_CUST_CNT"
Connect "CONTACT" and "CUST" to the "JOIN_CUST_CNT" using DSLink3 and DSLink4 respectively. (See Figure 10.)
Figure 10. Job diagram
As a database developer, you usually join the customer table to the contact table and treat the customer as the main table. But in this sample, the contact table is the main one, indicated by the solid arrow DSLink3. The transformer carries out a lookup action in non-primary tables for records of the primary table. For a single record of the primary table, only one record will be retrieved from the non-primary tables.
Note: Only ODBC and UniVerse stages support multi-row lookup.
"CONTACT" stage configuration
Double-click on the "CONTACT" stage to configure its properties. On the "Stage" page, fill in the DB2 connection information: database name, username, and password. Switch to the "Output" page, and click on the columns tab. The column information includes column name, derivation, type, and so on. Columns can be edited directly or loaded from an existing table definition. Here, use the imported CHENLI.S_CONTACT table definition to load the columns for this stage. Click the Load button, select CHENLI.S_CONTACT in the pop-up table definition window, press OK, then select all the columns of the table definition in the successive column selection window. When the columns are loaded, the result looks like Figure 11:
Figure 11. Loaded columns
Set up the SQL in the "CONTACT" stage to extract the data from DB2. Select the SQL tab, and click SQL Builder. The SQL Builder, a GUI tool, is activated by default to assist you in completing the SQL query (refer to Figure 12).
Figure 12. SQL builder
Click OK when the SQL is ready. The "SQL" tab should look like Figure 13. You can now select the "View Data" button to check the "CONTACT" table contents.
Figure 13. SQL tab
As of now, the configuration of the "CONTACT" stage is complete. You can follow the same steps to set up the "CUST" stage. Afterwards, two tables' icons will be displayed on both links.
"JOIN_CUST_CNT" stage configuration
Now you can configure the transformer "JOIN_CUST_CNT" to aggregate the two input DB2 stages together and produce the joined data. Drag CUST_NUM from DSLink 3 to the CUST_NUM of DSLink4, as shown in Figure 14:
Figure 14. Transformer stage
Step 5. Setting up the XML output stage to generate XML document
Add the XML output stage to the diagram from the "Real Time" category of "Palette," named "Gen_Cust_XML," and add the link from the "JOIN_CUST_CNT" transformer stage to it, as shown in Figure 15:
Figure 15. Add XML Output to the job diagram
Double-click on the Gen_Cust_XML stage to set its properties. On the "Stage" page, you only need to set the "Namespace Declarations" in the "Document Settings" tab (see Figure 16) and the file path of the output XML file in the "Options" tab (see Figure. 17) for current solution. Loading namespaces declaration is similar to loading columns from a table definition -- it loads namespaces from the XML table definition.
Figure 16. Load namespaces declaration
Figure 17. Set output file path
On the "Input" page, you need fill out the columns tab. The process is similar to loading output columns for DB2 stages. After loading the XML table definition "target_xml," the XML output stage is shown as illustrated Figure 18.
Figure 18. Load columns from XML table definition
Note: Each customer can own multiple contacts in the target XML structure, so you need to aggregate the contact information for each customer to reduce redundancy. For this purpose, set the "transformation settings" as shown in Figure 19:
Figure 19. Transformation settings
Using the trigger column "CustomerID," the output will generate a new record containing a single XML document only when the "CustomerID" field is changed to a new value. The data from Figure 9 will be aggregated into a single XML document.
Go back to the "JOIN_CUST_CNT" stage and link the columns on the left side of input to the output on the right side by dragging those fields. (Refer to Figure 20.)
Figure 20. Transformer stage revisited
Click OK. The XML publishing solution is completed so far.
Step 6. Compile and run
Compile and run this job in the designer. If there is no error, you can add it to the job schedule as a routine task. When running the job in the designer, the links with relevant statistics info will turn green if no errors are found (see Figure 21). If an error occurs, the links will be red. You have to check the logs for debugging. In Figure 21, this job has successfully generated a target XML file: SampleOutput1.xml.
Figure 21. Running the job of sample 1
Part II. Parsing XML document into tabular data
This scenario is a reverse process of publishing XML data -- XML data comes in, you parse it into tabular records, then insert the records into database tables. Two samples will be demonstrated to show how to parse XML document into tabular data. sample 2 is the reverse process of the former XML publishing sample, loading XML documents to the customer table and the contact table. Sample 3 contains a more complex XML input as well as db2 outputs (three tables). Different deployment approaches will be used in these two cases.
Sample 2. Extract customer and contact information from XML document using XML input stage
Overview of sample 2
In sample 2, use the table definitions from sample 1 (refer to Figure 8) and prepare input date using an XML file with a new company and two contacts(see input_cust_cnt.xml). In this case, you use a folder stage to read the XML file, pass the content to the XML input stage to parse out tabular records, and then use the transformer to make the columns diverge into different tables. The job diagram is shown in Figure 22:
Figure 22. Sample 2 job diagram
The expected outputs of "XML_CUST_CNT" stage will be similar to the rows shown in Figure 9.
Note: There will be duplicate info on columns CUST_NUM and CUST_NAME, which will be passed to the customer table and will cause primary key collision. In order to handle this issue, you can use an aggregator to consolidate such duplicate records into one record for each customer.
General steps to develop this sample solution include:
- Set up folder stage to read XML file(s)
- Set up XML input stage
- Set up DB2 stages to receive data
- Configure the aggregator
- Set up the transformer stage for distributing data
- Compile and run
Let's examine these steps in detail.
For this sample, you can add all the stages and links shown in Figure 22 to the new job first, then work on the settings of each stage step by step after.
Step 1. Set up folder stage to read XML file(s)
Double-click the XML_FOLDER stag. Set the folder pathname property in the "Stage" page. Fill the wild card property (filename wildcard*.txt, for example; we use *.xml for this case) in the "Output" page.
Then you can load columns for the "XML_FOLDER" stage, adopting the build-in table definition "Folder," as shown in Figure 23. The "Folder" definition contains two fields: "FileName" and "Record." The "Record" column will contain the contents of the input files from the source folder.
Figure 23. Load columns from "Folder" table definition
Step 2. Set up XML input stage
Double-click XML_CUST_CNT to open its properties window. Specify "Record" as the XML source column, from which to get input XML documents. Then set "Column content" to "XML document" accordingly.
Figure 24. Set up XML source
On the "Output" page, as shown in Figure 25, fill in the "Transformation Settings" tab and the "Columns" tab to enable the "XML_CUST_CNT" stage to parse the input XML documents and generate output records. First, select Repetition element required, as the input documents contain multiple contact elements. Second, load namespaces from the XML table definition (similar to that shown in Figure 16).
Note: Namespaces declaration is a must when the incoming document contains multiple namespaces and prefixes.
Figure 25. Transformation settings of "XML_CUST_CNT" stage
Generally there is no need to load column definitions for the XML input stage manually because loading namespaces will trigger loading columns of the XML table definition into the "Columns" tab at the same time (refer to Figure 26).
Note: In the XML input Stage, namespaces and columns are loaded at the same time, unlike the XML output Stage where they have to be loaded separately.
Figure 26. Output columns of "XML_CUST_CNT" stage
Each column definition has an XPath expression in its description field, mapping a certain element in the XML document to a corresponding output data column. These XPaths are the most critical settings for parsing XML. In addition, the key field specifies repetition elements. Here, the e-mail element is chosen to be the key since it is unique for a contact.
Now setup for "XML_CUST_CNT" stage is complete.
Step 3. Set up DB2 stages to receive data
Server information for the two DB2 stages "CONTACT" and "CUST" is configured first.
But unlike previous DB2 stages sample, you now work on the "Input" page of DB2 stage, as input links are used instead of output links. Select the target table (see Figure 27), and load columns for each DB2 stages.
Note: Auto-generated columns should not be included.
Figure 27. Select target table
For contact table, column "CNT_NUM" is not selected because it is an auto incremental id column and its value will be generated automatically when the new record is inserted. On the other hand, all the columns of the customer table are selected, as they are all from the XML document.
Step 4. Configure the aggregator
Because the XML input stage will generate multiple rows for the same customer, an aggregator stage is needed to consolidate redundant information of customer for the "CUST" stage.
The input data received by the aggregator should have the same structure as the output link, which is defined by the "CUST" stage. (Essentially, it is defined by the DB2 table "CHENLI.S_CUST".) So load the table definition "CHENLI.S_CUST" as input of this aggregator.
On the outputs page of the aggregator, the output columns are "CUST_NUM" and "CUST_NAME". You need to define the derivation of the two output columns to eliminate duplicate records, as shown in Figure 28.
Figure 28. Aggregator stage
Double-click on the cell that's highlighted in Figure 28 under the derivation column. A "Derivation" dialog box will appear for configuration. Figure 29 depicts a detailed configuration of derivation with the aggregate function "grouped."
Figure 29. Derivation dialog box
Besides "grouped," you can use functions like Max, Min, Count, Average, Sum, and more for other cases in the Aggregate function selection box.
Step 5. Set up the transformer stage for distributing data
Link the source to the targets in the transformer properties window (see Figure 30).
Figure 30. Distributing data in the transformer
Press OK when finished. Sample 2 is ready to be run.
Step 6. Compile and run
If everything is ok, a view like that shown in Figure 31 can be seen in the designer when the loading is done.
Figure 31. Successfully load a XML document
Check the tables in DB2 using the SQL in Listing 4, and the result is shown in Figure 32:
Listing 4. Check the tables in DB2
SELECT * FROM S_CUST cust WHERE cust.CUST_NUM='0000000002'; SELECT * FROM S_CONTACT cnt WHERE cnt.CUST_NUM='0000000002';
Figure 32. Results of SQLs in Listing 4
Sample 3. Extract data from the XML document into three tables using a bunch of XML input stages
Overview of sample 3
In sample 2, the XML document (input_cust_cnt.xml) represents a typical scenario that contains data distributed into only two tables, one referencing the other. This is a more complicated case: The XML document has multiple bunches of repetition elements (see Listing 5) or nested repetition elements (see Listing 6), which distribute data into more than two tables. The corresponding relational models can be described as follows:
- Three or more tables are reference to a header table
- Three or more tables nested into a tree structure
- A mixture of 1 and 2
Although this sort of documents can hardly be handled by a single XML input, they are the most common cases since various related data can be encapsulated into a single document at the same time.
Listing 5. XML document has multiple bunches of repetition elements
<A> <A_DATA/> <B> <B_DATA/> <B_DATA/> <B_DATA/> </B> <C> <C_DATA/> <C_DATA/> </C> </A>
Listing 6. XML document has multiple bunches of nested repetition elements
<A> <B> <B_DATA/> <C/> <C/> <C/> </B> <B> <B_DATA/> <C/> <C/> <C/> </B> </A>
In order to handle XML documents as those in Listings 5 and 6, a simple solution is to use a bunch of XML input stages linked by transformer stages to receive and handle different branches of the document separately. In sample 3, the element "Orders" is added to the XML structure from sample 2 (see input_cust_cnt.xml) to build a new document, input_cust_cnt_ord.xml (also shown in Listing 7 below). Per the structure in Listing 5, a new table "S_ORDER" (see Listing 8) is created accordingly.
Listing 7. Input_cust_cnt_ord.xml
<?xml version="1.0" encoding="UTF-8"?> <CustomerInfo xmlns:cnt="http://www.ibm.com/schemas/ContactDemo" xmlns="http://www.ibm.com/schemas/SimpleXMLDemo" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"> <CustomerID>0000000003</CustomerID> <CompanyName>Eastway software Inc.</CompanyName> <Orders> <OrderItem> <OrderNum>00001</OrderNum> <OrderQty>1.00</OrderQty> <PartId>A01</PartId> </OrderItem> <OrderItem> <OrderNum>00002</OrderNum> <OrderQty>2.00</OrderQty> <PartId>A02</PartId> </OrderItem> <OrderItem> <OrderNum>00003</OrderNum> <OrderQty>4.00</OrderQty> <PartId>A03</PartId> </OrderItem> </Orders> <Contacts> <cnt:Contact> <cnt:FirstName>Andy</cnt:FirstName> <cnt:LastName>Lin</cnt:LastName> <cnt:Email>AndyLin@eastway.com</cnt:Email> </cnt:Contact> <cnt:Contact> <cnt:FirstName>Tom</cnt:FirstName> <cnt:LastName>Zheng</cnt:LastName> <cnt:Email>TomZ@eastway.com</cnt:Email> </cnt:Contact> </Contacts> </CustomerInfo>
Listing 8. DDL of order table
CREATE TABLE S_ORDER ( ORDER_NUM CHARACTER(5) PRIMARY KEY, CUST_NUM CHARACTER(10) NOT NULL, PART_ID CHARACTER(3) NOT NULL, QTY DECIMAL(10,3) );
In this solution, two layers of XML input stages are used to process the XML document in two phases. In the first phase, an XML input stage is used to parse the customer information and get the XML chunks of contacts and orders. You then pass the two chunks to two separate XML input stages of the next phase to extract data for the table "S_CONTACT" and "S_ORDER". The job diagram is shown in Figure 33:
Figure 33. Sample 3 job diagram
General steps to develop this sample solution include:
- Set up the folder stage
- Configure the three XML input stages
- Set up the transformer to link related stages
- Set up DB2 output stage
- Compile and run
Let's examine these steps in detail.
The methods of setting up each stage are already covered in the previous section. Therefore, this section omits these details and focuses on the tricky parts: the "Columns" settings of these XML input stages. To start, import the XML table definitions through the XML Meta Data Importer (refer to Figure 7). Secondly, modify the column lists and XPaths of the definitions when applying them.
The first XML input stage is named "XML_Cust." Set up its output link columns to extract the fields CustomerID and CompanyName for table "S_CUST" and XML chunks for <Orders/> and <Contacts/>, as shown in Figure 34:
Figure 34. Output Columns tab of "XML_Cust"
Note: In this particular case, in order to extract XML chunks,
/text() should not be appended to the end of the XPath expression,
"/ns1:CustomerInfo/ns1:Orders," for example. Accordingly, the column will return the whole XML structure of that element and can easily be further processed by other stages.
In order to receive XML chunks and produce records with references to the customer table, you have to set the other two XML input stages manually to get CUST_NUM field and XML chunk field in the input columns, like Figure 35.
Figure 35. Input Columns tab of "XML_Contact"
When the input columns of "XML_Contact" stage and "XML_Order" stage are ready, use the transfomer "Div_branches" to connect them.
Figure 36. Distributing data in "Div_branches" stage
Then the output columns of "XML_Contact" and "XML_Order" stages are configured respectively.
Note: There is no transformer between the XML input stage and the DB2 stage to map the output fields to DB2 columns. So you have to make sure the names of the output columns in the XML input stage ("XML_Contact" and "XML_Order") are exactly the same with the names in the database tables. Once they are all set, the input columns in the DB2 stages will be loaded automatically. DO NOT override the input columns in DB2 stages by reloading table definitions, otherwise the XPath in the foregoing XML input stage will be lost.
Figure 37. Output columns of "XML_Contact" stage
Figure 38. Output columns of "XML_Order" stage
Note: "CUST_NUM" field exists in both input and output columns of the two XML Input stages. This field is a "pass-through" field that is directly passed from the input link to the output link.
Run the job after setting up all the stages, and you should be able to see a successful view like Figure 39.
Figure 39. Successfully load an XML file into three tables
Check the data in DB2.
Listing 9. SQL for checking table data
SELECT * FROM S_CUST cust WHERE cust.CUST_NUM='0000000003'; SELECT * FROM S_CONTACT cnt WHERE cnt.CUST_NUM='0000000003'; SELECT * FROM S_ORDER ord WHERE ord.CUST_NUM='0000000003';
Figure 40. Results of SQLs in Listing 9
Part III. Accessing Web services with input and output data
With Web services pack, DataStage can easily access Web service operations within a DataStage server job. The Web service pack will add two real-time stages to DataStage: Web services client stage and Web services transformer stage. When the Web service is supposed to act as either a data source or a data target, the Web services client stage is used. In this case, only one link, input or output is required in a single Web service operation. However, the Web services transformer stage has both input and output links and can interact with remote services in a sense of data transaction. The Web services pack will also enable DataStage to support Web service routines, which are used as transformer functions in transformer stages. This part of the article only uses and discusses Web services transformer in the sample, as it is more typical and flexible for adopting Web services in DataStage.
To integrate a Web services transformer into a solution, you need to import the WSDL Web services description file, add a Web services transformer to your solution, and choose the target service in this Web services transformer. Then feed data to it and handle the output in proper ways, like extracting and saving data to files or databases. The following sample is intended to demonstrate this feature in a practical case.
Sample 4. Query Google with keywords from a database table and store the responses in another table
Overview of sample 4
GoogleSearch service is selected as the target service since it is a typical and common scenario: a query is submitted, then records are returned. Besides, the Web services is already available and free to be used. GoogleSearch service is in the Google SOAP Search API project (see Related topics), including three operations: doGoogleSearch, doGetCachedPage and doSpellingSuggestion. In this example, the doGoogleSearch operation is used. Before accessing the service, you have to sign up for a free account (a Web page address) in order to get a license key that is sent in the requested messages to identify the requester. In the transaction of doGoogleSearch operation, a query is sent to Google with the license key, and a response message returns containing several hits in a repetition structure.
Note: There is a limitation that Web services transformer in DataStage 7.5.1A cannot parse repetition elements in the response into multiple rows of records. Each input row can only generate one row of output data. Fortunately, Web services transformer uses XPath to extract data from the response document and supports XML chunks output. Thus, you can extract the XML chunks containing the repetition structure from an original message and then pass them to following XML input stage(s). In this way, the limitation can be bypassed and even more complex structure can be processed.
The target solution of sample 4 will extract query keywords from a DB2 table and feed them to the Web services transformer. Web services transformer will invoke the Web services and output the response messages directly to a following XML input stage. Finally, you put the results into another DB2 table. The job diagram is shown in Figure 41:
Figure 41. The expected job diagram of sample 4
In this sample, several steps will be taken to set up the job, which include:
- Import the target services from WSDL file
- Set up Web services transformer
- Set up the XML Input stage
- Create DB2 tables and set up database stages for input and output
- Compile and run
Let's examine these steps in detail.
Step 1. Import the target service from WSDL file
Right-click on Table Definition in the repository view, and select Import > Web Services WSDL Definitions (refer to Figure 4).The Web Services Meta Data Importer is activated (see Figure 42). Use it to load the WSDL file. Select the target operations, then click the Import button to save the definition.
Figure 42. Web Services Meta Data Importer
Several table definitions will be generated for an operation imported. doGoogleSearch is not an except; the following table definitions are supposed to be created automatically:
- Info_WS: Web Services Meta Data like URI, port name and etc.
- doGoogleSearch_IN: input parameters
- doGoogleSearch_MSGIN: input message meta data
- doGoogleSearch_MSGOUT: output message meta data
- doGoogleSearch_OP: operation meta data
- doGoogleSearch_OUT: output parameters
Step 2. Set up Web services transformer
Web services transformer is the core component of the solution since data structure of input and output depends on it fully. Therefore, Web services transformer is to be settled with high priority. In the stage page of Web services transformer, only the Web services operation needs to be configured to access an open public service like GoogleSearch with basic function, which is discussed in this sample.
Note: Moreover, the stage page provides many more options such as setting proxy, HTTP AUTH, HTTPs, Web services parameters, and so on for more advanced application.
To configure the target operation for Web services transformer, click Select Web Service Operation in the "Stage" page, and double-click the target operation in the Web services browser, as shown in Figure 43:
Figure 43. Select Web Service Operation
Switch to the "Input Message" tab on the "Input" page(see Figure 44).
Figure 44. Load input message metadata
Click Load Message Information to load metadata, namespaces, and columns automatically from the selected Web services operation.
Figure 45. Input columns are also loaded when metadata is loaded
The columns shown in Figure 45 are the actual parameters that need to be fed to Google in the doGoogleSearch operation, such as license key (key), query keyword (q), and many other configuration parameters.
The settings for output link is similar to the input link -- you load metadata in its "Output Message" tab of the Output page. As mentioned in step 1 of sample 4, an XML input stage is used to handle the repetition element in responses. The output link of this Web services transformer has to be customized to output the whole SOAP document of the response. In Figure 46, the column "RS_XML" contains the response in the form of a valid XML document:
Figure 46. customized output columns of "GoogleWST" stage
Step 3. Setting up the XML input stage
Configuring the XML input stage is already covered by previous samples (refer to step 2 of sample 2). The only difference here is the table definition for output link, which is not from an imported XML table definition but from the Web services output table definition, named "doGoogleSearch_OUT". There are lots of columns in this table definition, representing every data element defined in the response message. This sample only uses a portion of them: URL, snippet, title, cached size, related information present, host name, and the query keyword itself for test purpose.
Figure 47. Output columns of the "RS_parsing" XML input stage
Note: The URL column is set to be the "Key" column because it can distinguish each search hit returned.
Step 4. Create DB2 tables and set up database stages for input and output
Create a simple query table "S_GOOGLE_Q" to contain only id and keyword field in it and a result table named "S_GOOGLE_RS" covering all the response columns picked out previously:
Listing 10. DDLs for query table and response table
CREATE TABLE S_GOOGLE_Q ( Q_ID INTEGER GENERATED ALWAYS AS IDENTITY( START WITH 1, INCREMENT BY 1) PRIMARY KEY, KEYWORD VARCHAR(255) ); CREATE TABLE S_GOOGLE_RS ( R_ID INTEGER GENERATED ALWAYS AS IDENTITY( START WITH 1, INCREMENT BY 1) PRIMARY KEY, URL VARCHAR(255) NOT NULL, CACHEDSIZE VARCHAR(255) NOT NULL, HOSTNAME VARCHAR(255) NOT NULL, RELATEDINFORMATIONPRESENT VARCHAR(255) NOT NULL, SNIPPET VARCHAR(255) NOT NULL, TITLE VARCHAR(255) NOT NULL, SEARCHQUERY VARCHAR(255) NOT NULL );
Set up two DB2 stages, "REQUESTS" and "RESULTS," for the two DB2 tables. In the "REQUESTS" DB2 stage, columns should include all the fields required by the doGoogleSearch operation. The table definition is also loaded from the Web services table definition "doGoogleSearch_IN".
Note: If SQL type column is "unknown," as shown in Figure 45, you have to assign a regular type like "VarChar" to them.
Figure 48. Output columns of "REQUEST" DB2 stage
Select Enter custom SQL statement in the SQL tab, and fill in a query to extract keyword from "S_GOOGLE_Q" table, including constant columns for the other input parameters of the Web services operation (see Figure 49).
Figure 49. Use custom SQL statements to prepare data
A transformer has to be used to connect the "REQUEST" stage to "GoogleWST" stage. Otherwise the job won't start even if the column names exactly match in the two stages. This is possibly caused by the "UnKnown" SQL type in the Web services table definition.
Figure 50. "COL_mapping" transformer setting
Now configure the "RESULTS" DB2 stage. You only need to configure the server info and select the target table. The column names in the table definition of the XML input stage, "RS_parsing" (refer to Figure 41), is exactly the same with the names in the database table. DB2 stage can insert these columns into the DB2 table automatically since it's using input column names to create insert statements.
Step 5. Compile and run
The job is available to be compiled and run, as the above settings are complete. But before accessing the online service, some keywords should be prepared in the query table. For instances, three keywords, "Beijing," "space shuttle," and "iguana," are used. (See Figure 51.)
Figure 51. Done google search successfully
From Figure 51, you can see that 30 records have been inserted into the result table -- 10 for each keyword. Check the "S_GOOGLE_RS" table to examine the data.
Listing 11. SQL to check results returned from google
SELECT searchquery, url, hostname FROM S_GOOGLE_RS;
Figure 52. Results of SQL in Listing 11
Additional: Use user-defined message to invoke Web services
Sometimes, the input message is very complex, or the message body is already available. In such cases, using table data as input may not be the best solution.
The Web services transformer supports user-defined messages and accepts XML documents as message body. This feature provides additional flexibility. You can choose the appropriate method to prepare the request message, such as manually typing out a message document, or using a third-party program to generate complex XML documents from the database.
To show the usage of user-defined message, you can slightly change the job in sample 4, using a folder stage to feed an XML file (googleReq.xml) to the Web services transformer. The new job is shown in Figure 53:
Figure 53. Web services job using user-defined message
Figure 54. User-defined message settings of Web services transformer stage
The XML and Web services packs are powerful and convenient toolkits. This article discussed how to publish an XML document from DB2 tables, how to extract data from the XML document, and how to access the Web services using four samples, including key procedures as well as analysis.
DataStage provides many other useful components and intuitive ways of combining them to build very flexible integration solutions.
In your own practice, you are sure to find more interesting applications of DataStage's XML and Web services functionalities and of other DataStage components as well.
- developerWorks Information Integration resource page: Read articles and tutorials and connect to other resources to expand your IBM Information Server skills.
- "Google Soap Search API" project: Learn more about Google Search Web services.
- Build your next development project with IBM trial software, available for download directly from developerWorks.