Integrate the Information Governance Catalog and IBM InfoSphere DataStage using REST

IBM® InfoSphere® DataStage® is a data integration tool that lets you move and transform data between operational, transactional, and analytical target systems. In this article, learn to use InfoSphere DataStage to integrate with the REST resources of the Information Governance Catalog Glossary. A sample use case shows how to design a DataStage job that uses the Hierarchical Data stage to access and author Information Governance Catalog Glossary content. The REST step is a new capability of the Hierarchical Data stage (previously called XML Connector in InfoSphere Information Server) that invokes REST web services, with support for different authentication mechanisms and Secure Sockets Layer (SSL). Learn to parse the response body of the REST call and apply transformations to the data.

Deepa Ramamurthy Yarangatta (dyarrang@in.ibm.com), QA Lead, IBM

Deepa Y.R. is a QA lead for the Hierarchical Data stage of the Information Server at IBM India Software Lab, Bangalore. She has been working with IBM on Information Server for the last four years.



03 July 2014

Overview

With the IBM InfoSphere Information Governance Catalog Glossary, you can create, manage, and share an enterprise vocabulary and classification system. InfoSphere Information Governance Catalog Glossary exposes rich functionality through an API that uses a Representational State Transfer (REST)-based service.

IBM InfoSphere DataStage is a data integration tool that lets you move and transform data between operational, transactional, and analytical target systems. You can use the Hierarchical Data stage, which is a stage component within the DataStage Designer, for parsing and composing XML or JSON data, transforming data of hierarchical format, and consuming REST services.


In this article, learn how to integrate the Information Governance Catalog Glossary with the DataStage job flow using the Hierarchical Data stage and other stages. There are many use cases for the Information Governance Catalog Glossary. For example, a business analyst could find and correct incorrect terms, or administrators of business glossaries can search for all the terms under their stewardship. The example in this article guides customers and consultants in solving typical integration issues.

You can download the sample job described in this article.

Prerequisites

This article assumes you have a basic understanding of Information Governance Catalog Glossary REST APIs and Hierarchical Data stage, which was called XML stage prior to Information Server V11.3. The Hierarchical Data stage and the new REST step are available in InfoSphere DataStage V11.3.


InfoSphere Information Governance Catalog Glossary REST API

The Information Governance Catalog Glossary REST API allows client applications to read and write business glossary content. The API exposes glossary content as resources. These resources are addressed by using URIs. The resource representations are formatted as XML documents that are described by XML schemas residing on the server. Different operations on Information Governance Catalog Glossary content require different HTTP methods in the request.


Invoking REST services from DataStage

The Hierarchical Data stage is available on the Real Time palette of the DataStage Designer client. It transforms hierarchical data and provides the REST step, shown in Figure 1, to invoke REST web services. The REST step can consume services that use different authentication mechanisms, such as BASIC, DIGEST, LTPA, and OAUTH, as well as services enabled with Secure Sockets Layer (SSL).

Figure 1. Hierarchical Data stage palette with REST step

The Information Governance Catalog Glossary REST API does not provide an operation for finding terms by a particular custom attribute. You can instead design a job that fetches the list of terms by using the capabilities of the Hierarchical Data stage and other stages available with DataStage.


Scenario

In our example scenario, a custom attribute is used in term definitions. John, an Information Governance Catalog Glossary administrator, wants to delete the custom attribute and create a new one. Before deleting the custom attribute, he wants to find the list of terms for which that custom attribute is defined.


Job design

In the Information Governance Catalog Glossary, many terms are available and custom attributes are defined for terms. To identify the terms that have a particular custom attribute defined, you must perform the following steps:

  1. Query all the terms.
  2. Identify the number of pages that the terms span.
  3. Traverse all the pages to get the URI for all the terms.
  4. Get the details of each term.
  5. Search for the terms that contain the custom attribute for which the Information Governance Catalog Glossary administrator is looking.
  6. Store the terms in a file.

By using DataStage Designer capabilities, you can design the previous steps as a job that you run to produce the desired result. In the example, a job with four stages is created on the parallel canvas.

Figure 2 shows the Hierarchical Data stage (GetTerms) with the link (AllTermsInfo) to the Transformer stage (GeneratePageNos). The link PageNos is created from the GeneratePageNos stage to the Hierarchical Data stage (Identify_Terms_Having_CustomAttribute). The link ListOfTerms is created from the Identify_Terms_Having_CustomAttribute stage to the Sequential File stage (TermsList).

Figure 2. Job design, integration of Business Glossary with DataStage

The job design achieves the goals in the previous steps. Step 1 is achieved with the GetTerms stage; Step 2 is achieved with the GeneratePageNos stage; Steps 3, 4, and 5 are achieved with Identify_Terms_Having_CustomAttribute; and Step 6 is achieved with the TermsList stage.
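
For readers who find code easier to follow than a canvas, the six steps can also be sketched outside DataStage. The following is a minimal Python sketch, not the job itself: fetch_page and fetch_term are hypothetical callables that stand in for the REST calls and XML parsing, and their return shapes are assumptions made for illustration.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical callables standing in for the REST calls and XML parsing:
#   fetch_page(page_no)  -> (number_of_pages, [term_uri, ...])
#   fetch_term(term_uri) -> (term_name, [custom_attribute_name, ...])
FetchPage = Callable[[int], Tuple[int, List[str]]]
FetchTerm = Callable[[str], Tuple[str, List[str]]]


def find_terms_with_attribute(fetch_page: FetchPage,
                              fetch_term: FetchTerm,
                              custom_attribute: str) -> List[str]:
    # Steps 1 and 2: query the terms once to learn how many pages exist.
    number_of_pages, _ = fetch_page(1)

    matches: List[str] = []
    # Step 3: traverse every page to collect the term URIs.
    for page_no in range(1, number_of_pages + 1):
        _, term_uris = fetch_page(page_no)
        for uri in term_uris:
            # Step 4: get the details of each term.
            name, attributes = fetch_term(uri)
            # Step 5: keep only terms that carry the custom attribute.
            if custom_attribute in attributes:
                matches.append(name)
    return matches


def write_terms(path: str, terms: Iterable[str]) -> None:
    # Step 6: store the matching terms in a file, one term per line.
    with open(path, "w", encoding="utf-8") as handle:
        for term in terms:
            handle.write(term + "\n")
```

Passing the two fetch functions in as arguments keeps the control flow separate from the HTTP and XML details, which mirrors how the job separates stages from the steps inside each assembly.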

Open the Assembly editor of the Hierarchical Data stage by clicking Edit Assembly in the Stage Properties dialog of the GetTerms stage. The REST step named GetTerms and the XML Parser step named GetPageNos are available in the assembly, as shown in Figure 3.

Figure 3. Assembly design of the GetTerms stage

To query the terms, the API exposed by Business Glossary is:

https://#BGServer#:#BGPort#/ibm/iis/bgrestapi/v4/search?pattern=*&queriedClasses=term

where BGServer and BGPort are job parameters. You need to provide the values of these job parameters at run time. The Search API accepts two query parameters: pattern and queriedClasses. The queriedClasses parameter is set to term, and because the search must cover all the terms, pattern is specified as *. In the GetTerms step, the HTTP method is set to GET and the URL is set to:

https://#BGServer#:#BGPort#/ibm/iis/bgrestapi/v4/search?pattern=*&queriedClasses=term
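
As an illustration of how the URL is assembled from the two query parameters, here is a small Python sketch. The function name build_search_url and the host and port in the usage note are hypothetical; only the path and the parameter names come from the URL above.

```python
from urllib.parse import urlencode


def build_search_url(server: str, port: int, pattern: str = "*",
                     queried_classes: str = "term") -> str:
    # The two query parameters that the Search API accepts.
    query = {"pattern": pattern, "queriedClasses": queried_classes}
    # safe="*" keeps the wildcard readable instead of encoding it as %2A.
    return (f"https://{server}:{port}/ibm/iis/bgrestapi/v4/search?"
            + urlencode(query, safe="*"))
```

Calling build_search_url("igc.example.com", 9443) with those placeholder values produces the same URL shape that the GetTerms step uses once BGServer and BGPort are resolved.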

The General tab is configured as in Figure 4.

Figure 4. General tab details of GetTerms step

The Information Governance Catalog Glossary REST web service uses basic authentication and is configured with SSL, so credentials and a truststore are required to access the resources. Provide this information in the Security tab of the REST step. As shown in Figure 5, select:

  • Authentication: Basic
  • Username: username that has permission to access the services
  • Password: password for user
  • Truststore type: PKCS12 (Information Server uses the PKCS12 key type.)
  • Truststore file: iis-client-truststore.p12 is required to communicate with the server. The certificate file iis-client-truststore.p12 is available in the path <Information Server Install path>/ASBNode/conf. The certificate path information and the password are provided.

Figure 5. Security configuration to access Information Governance Catalog Glossary services
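
To make the two security settings concrete, the sketch below shows what they amount to for a plain HTTP client written in Python. It is an illustration under stated assumptions, not what the REST step does internally: the function names are hypothetical, and because Python's ssl module reads PEM files rather than PKCS12 files, tls_context assumes the trusted certificate has already been exported from iis-client-truststore.p12 to a PEM file.

```python
import base64
import ssl


def basic_auth_header(username: str, password: str) -> str:
    # HTTP Basic authentication sends base64("username:password").
    token = base64.b64encode(f"{username}:{password}".encode("utf-8"))
    return "Basic " + token.decode("ascii")


def tls_context(ca_pem_path: str) -> ssl.SSLContext:
    # Assumption: ca_pem_path points to a PEM export of the certificate
    # held in the PKCS12 truststore; the stdlib cannot load .p12 directly.
    return ssl.create_default_context(cafile=ca_pem_path)
```

The header value goes into the Authorization request header, and the context is what a client such as urllib.request.urlopen accepts through its context argument.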

Information Governance Catalog Glossary APIs accept input and send responses in XML format. Because the server sends the response to the Search API request in application/xml format, the content type application/xml is selected in the Response tab of the GetTerms step. Completing the configuration of the GetTerms step generates the output schema. The response sent by the Information Governance Catalog Glossary server is captured in the body element of the output schema, which is the input to the XML Parser step GetPageNos.

The Search API returns terms in pages of 15 entries each. To query all the terms, the Search API needs to be accessed using the pageNumber query parameter. The number of terms in the repository changes over time because terms keep being added in the organization, so the page count cannot be hard-coded. You must determine the number of pages that the terms span and send a Search API request with pageNumber as a query parameter in the URI for each page to get the URIs of all the terms. The XML Parser step named GetPageNos parses the response body of the REST call, with SearchResource.xsd chosen as the document root, to get the value of the element NumberOfPages, which is used as input to the next stage in the canvas.
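
The parsing that GetPageNos performs can be approximated in a few lines of Python. This is a sketch only: the SAMPLE document is hypothetical and does not reproduce the real search schema, and the code assumes the page count is carried as element text. The lookup ignores namespaces and letter case because the exact spelling of the element is defined by the schema on the server.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    # ElementTree renders namespaced tags as "{namespace}name".
    return tag.rsplit("}", 1)[-1]


def number_of_pages(response_body: str) -> int:
    root = ET.fromstring(response_body)
    for element in root.iter():
        if local_name(element.tag).lower() == "numberofpages" and element.text:
            return int(element.text.strip())
    raise ValueError("numberOfPages element not found")


# Hypothetical response body; the real layout is defined by the search schema.
SAMPLE = "<searchResult><numberOfPages>5</numberOfPages></searchResult>"
```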

To invoke the GET request with pageNumber as a query parameter for the Search API, you must pass the page number value in the URI. A Transformer stage named GeneratePageNos creates a column that holds the page number values. The PageNo output column is mapped to the derivation @ITERATION, as in Figure 6. Because a sequence of numbers is required, the loop condition @ITERATION<=AllTermsInfo.numberOfPages is added in the Transformer stage, which generates numbers for as long as the condition holds. For example, if the value of NumberOfPages is 5, the GeneratePageNos stage creates the sequence 1, 2, 3, 4, 5 for the column. This column data is passed as input to the next stage: Identify_Terms_Having_CustomAttribute.

Figure 6. Properties dialog of GeneratePageNos stage
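
The Transformer loop can be read as an ordinary while loop. The Python sketch below mimics it, with a local variable named iteration playing the role of @ITERATION; the function name is hypothetical.

```python
from typing import Iterator


def generate_page_numbers(number_of_pages: int) -> Iterator[int]:
    iteration = 1  # plays the role of @ITERATION, which starts at 1
    while iteration <= number_of_pages:  # the Transformer loop condition
        yield iteration  # one output row (PageNo) per loop iteration
        iteration += 1
```

With number_of_pages set to 5, list(generate_page_numbers(5)) yields [1, 2, 3, 4, 5], matching the example in the text.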

The Identify_Terms_Having_CustomAttribute stage is designed with a REST step, XML Parser step, REST step, XML Parser step, and Switch step in a sequence, as in Figure 7. This design fetches the terms that have the custom attribute defined.

Figure 7. Assembly design of the Identify_Terms_Having_CustomAttribute stage

The REST step named GetAllTerms in the assembly design fetches all the terms that are available in the repository. The URL https://#BGServer#:#BGPort#/ibm/iis/bgrestapi/v4/search?pattern=*&queriedClasses=term&pageNumber=<pageNo> takes one more query parameter called pageNumber. Based on the pageNumber value, it fetches the glossary terms on that page. The REST step has the option to create a local parameter that can be mapped to a column coming from the previous step. The URI query parameter pageNumber takes its value from a local parameter called PageNo, which is mapped to the column of the GeneratePageNos stage in the Mappings tab, as in Figure 8. The request is sent to the server with different values of pageNumber and the output is captured in XML format as a response body. In the Response tab, the content type application/xml is selected.

Figure 8. Mappings details of GetAllTerms step
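
Mapping the local parameter to an input column means that each incoming row produces its own request URL. A rough Python equivalent, with a hypothetical function name and placeholder host and port, looks like this:

```python
from typing import Iterable, List

# Same URL as the GetAllTerms step, with named fields in place of the
# job parameters and the <pageNo> local parameter.
URL_TEMPLATE = ("https://{server}:{port}/ibm/iis/bgrestapi/v4/search"
                "?pattern=*&queriedClasses=term&pageNumber={page_no}")


def page_urls(server: str, port: int, page_numbers: Iterable[int]) -> List[str]:
    # Each input row (one page number) yields one request URL.
    return [URL_TEMPLATE.format(server=server, port=port, page_no=number)
            for number in page_numbers]
```

For example, page_urls("igc.example.com", 9443, [1, 2]) returns two URLs that differ only in the pageNumber value.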

The response body, which is in XML format, contains the URI details for all the terms available in the repository. To fetch the URI details, the XML data needs to be parsed. The response body is passed to the XML parser step named GetTermURI. The response body is parsed using the schema searchResource.xsd provided by Information Governance Catalog Glossary to extract the term URI, which is required to fetch the term details.

The term URI is available from the parser step GetTermURI. The REST step named GetTermDetails fetches the details of all the terms. As shown in Figure 9, the URL (https://#BGServer#:#BGPort#/<URI>) uses the local parameter URI that is mapped to the element coming from the previous Parser step. The response from the server is expected in XML format. The content type selected in the Response tab is application/xml.

Figure 9. General details of GetTermDetails step

The output of the GetTermDetails step is passed to the XML Parser step named ParseTermDetails and is parsed using the termResource schema provided by Information Governance Catalog Glossary. The term details have the information about the custom attributes defined. The parsed content is used in the Switch step to find the terms.

The extracted custom attribute information is passed to the Switch step named FetchTermsHavingCustomAttribute, as shown in Figure 10. The Switch step routes terms to a separate list when the condition is met. A target is created in the Switch step with the condition that the element containing the custom attribute name in the response body equals the custom attribute that the Information Governance Catalog Glossary administrator wants to delete. The records that meet the criteria are written to the output. The job parameter customAttribute is used in the target condition checkForCustomAttribute and the value of the custom attribute is provided at run time.

Figure 10. Details of FetchTermsHavingCustomAttribute step
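
Conceptually, the Switch step partitions records by a condition. The following Python sketch shows that idea under assumptions: the record layout (a dictionary with name and customAttributes keys) is hypothetical, and the second list for non-matching records is included only to make the partitioning visible.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical parsed record: {"name": "...", "customAttributes": ["...", ...]}
Term = Dict[str, object]


def switch_on_custom_attribute(terms: Iterable[Term],
                               custom_attribute: str) -> Tuple[List[Term], List[Term]]:
    matched: List[Term] = []  # plays the role of the checkForCustomAttribute target
    others: List[Term] = []   # everything that does not meet the condition
    for term in terms:
        # Condition: one of the term's custom attribute names equals the
        # value supplied through the customAttribute job parameter.
        if custom_attribute in term["customAttributes"]:
            matched.append(term)
        else:
            others.append(term)
    return matched, others
```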

The elements of the checkForCustomAttribute target list are mapped to the output, as shown in Figure 11.

Figure 11. Details of Output step

The job design is complete after you've performed the previous tasks. Now you can compile and run the job. It prompts you to provide the values for the following job parameters:

  • customAttribute: The custom attribute name that the Information Governance Catalog Glossary administrator wants to delete.
  • BGServer: The server name or IP address of the machine where Information Governance Catalog Glossary services are running.
  • BGPort: The port number on which the Information Governance Catalog Glossary services are running.
  • CertificatePath: The path where the iis-client-truststore.p12 file is stored.
  • outputDir: The path on the server where the output file should be created.
  • username: The user who has access to the services.
  • password: The password of the user to access the services.
  • certificatePassword: The password to access the certificate iis-client-truststore.p12 file.
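
Job parameters appear in the URLs as names wrapped in # characters, and the values supplied at run time replace them. The Python sketch below imitates that substitution for illustration only; the function name and the sample values are hypothetical.

```python
import re
from typing import Mapping

# A job parameter reference looks like #Name# in the URL field.
PLACEHOLDER = re.compile(r"#(\w+)#")


def resolve_job_parameters(template: str, values: Mapping[str, str]) -> str:
    def substitute(match):
        name = match.group(1)
        if name not in values:
            # Fail loudly rather than send a request with an unresolved name.
            raise KeyError(f"no value supplied for job parameter {name}")
        return values[name]

    return PLACEHOLDER.sub(substitute, template)
```

For example, resolving "https://#BGServer#:#BGPort#/ibm/iis/bgrestapi/v4/search" with BGServer set to igc.example.com and BGPort set to 9443 gives a URL that starts with https://igc.example.com:9443/.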

Conclusion

In this article, you learned how the DataStage Designer can help you design a job to search for terms to which a particular custom attribute is assigned. You can now apply this knowledge to different scenarios based on your needs.


Acknowledgements

The author would like to thank Jeff J. Li and Poonam Sharma for their feedback and review of this article.


Download

Description   Name                                Size
Sample job    GetTermsHavingCustomAttribute.dsx   347KB
