Integrate the Information Governance Catalog and IBM InfoSphere DataStage using REST
With the IBM InfoSphere Information Governance Catalog Glossary, you can create, manage, and share an enterprise vocabulary and classification system. InfoSphere Information Governance Catalog Glossary exposes rich functionality through an API that uses a Representational State Transfer (REST)-based service.
IBM InfoSphere DataStage is a data integration tool that lets you move and transform data between operational, transactional, and analytical target systems. You can use the Hierarchical Data stage, which is a stage component within the DataStage Designer, for parsing and composing XML or JSON data, transforming data of hierarchical format, and consuming REST services.
In this article, learn how to integrate the Information Governance Catalog Glossary with the DataStage job flow using the Hierarchical Data stage and other stages. There are many use cases for the Information Governance Catalog Glossary. For example, a business analyst could find and correct incorrect terms, or administrators of business glossaries can search for all the terms under their stewardship. The example in this article guides customers and consultants in solving typical integration issues.
You can download the sample job described in this article.
This article assumes you have a basic understanding of Information Governance Catalog Glossary REST APIs and Hierarchical Data stage, which was called XML stage prior to Information Server V11.3. The Hierarchical Data stage and the new REST step are available in InfoSphere DataStage V11.3.
InfoSphere Information Governance Catalog Glossary REST API
The Information Governance Catalog Glossary REST API allows client applications to read and write business glossary content. The API exposes glossary content as resources. These resources are addressed by using URIs. The resource representations are formatted as XML documents that are described by XML schemas residing on the server. Different operations on Information Governance Catalog Glossary content require different HTTP methods in the request.
Invoking REST services from DataStage
The Hierarchical Data stage is available on the Real Time palette of the DataStage Designer client. It transforms hierarchical data and provides the REST step, shown in Figure 1, to invoke REST web services. It can consume REST services deployed with different authentication mechanisms, such as BASIC, DIGEST, LTPA, and OAUTH, as well as services enabled with Secure Sockets Layer (SSL).
Figure 1. Hierarchical Data stage palette with REST step
The Information Governance Catalog Glossary REST API does not provide an operation for finding terms by a particular custom attribute. However, you can design a job that fetches such a list of terms by using the capabilities of the Hierarchical Data stage and other stages available with DataStage.
In our example scenario, a custom attribute is used in term definitions. John, an Information Governance Catalog Glossary administrator, wants to delete that custom attribute and create a new one. Before deleting it, he wants to find the list of terms for which the custom attribute is defined.
The Information Governance Catalog Glossary contains many terms, and custom attributes are defined for terms. To identify the terms that have a particular custom attribute defined, you must perform the following steps:
- Query all the terms.
- Identify the number of pages that the terms span.
- Traverse all the pages to get the URI for all the terms.
- Get the details of each term.
- Search for the terms that contain the custom attribute for which the Information Governance Catalog Glossary administrator is looking.
- Store the terms in a file.
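The six steps above can be sketched outside DataStage as a small Python routine. This is only an illustration of the control flow, not the job itself: the three callables (count_pages, list_term_uris, get_attribute_names) are hypothetical stand-ins for the REST calls and XML parsing that the job performs.

```python
from typing import Callable, Iterable, List


def find_terms_with_attribute(
    count_pages: Callable[[], int],
    list_term_uris: Callable[[int], Iterable[str]],
    get_attribute_names: Callable[[str], Iterable[str]],
    wanted_attribute: str,
) -> List[str]:
    """Return the URIs of terms that define wanted_attribute."""
    matches = []
    # Steps 1-2: query the terms once to learn how many pages they span.
    total_pages = count_pages()
    # Step 3: walk every page (numbered from 1) to collect the term URIs.
    for page in range(1, total_pages + 1):
        for uri in list_term_uris(page):
            # Steps 4-5: get the term details and check its custom attributes.
            if wanted_attribute in get_attribute_names(uri):
                matches.append(uri)
    return matches


def write_terms(path: str, uris: Iterable[str]) -> None:
    # Step 6: store the matching terms in a file, one URI per line.
    with open(path, "w", encoding="utf-8") as handle:
        for uri in uris:
            handle.write(uri + "\n")
```

Passing the network-facing pieces in as callables keeps the loop easy to try out with in-memory data before any server is involved.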
By using DataStage Designer capabilities, you can design the previous steps as a job that yields the desired result. In the example, a job with four stages is created on the parallel canvas.
Figure 2 shows the Hierarchical Data stage (GetTerms) with the link (AllTermsInfo) to the Transformer stage (GeneratePageNos). The link PageNos is created from the GeneratePageNos stage to the Hierarchical Data stage (Identify_Terms_Having_CustomAttribute). The link ListOfTerms is created from the Identify_Terms_Having_CustomAttribute stage to the Sequential File stage (TermsList).
Figure 2. Job design, integration of Business Glossary with DataStage
The job design achieves the goals in the previous steps. Step 1 is achieved with the GetTerms stage; Step 2 is achieved with the GeneratePageNos stage; Steps 3, 4, and 5 are achieved with Identify_Terms_Having_CustomAttribute; and Step 6 is achieved with the TermsList stage.
Open the Assembly editor of the Hierarchical Data stage by clicking Edit Assembly in the Stage Properties dialog of the GetTerms stage. The REST step named GetTerms and the XML Parser step named GetPageNos are available in the assembly, as shown in Figure 3.
Figure 3. Assembly design of the GetTerms stage
To query the terms, the GetTerms step calls the Search API exposed by Business Glossary. The server and port in the URL are supplied by the job parameters BGServer and BGPort, whose values you provide at run time. The Search API accepts two query parameters: the first accepts the value terms, and the second, pattern, accepts the search pattern. Because the search must cover all the terms, pattern is specified as *. In the GetTerms step, the HTTP method is set to GET and the URL is set to the Search API address with these query parameters.
The General tab is configured as in Figure 4.
Figure 4. General tab details of GetTerms step
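As a rough illustration of how such a search URL is assembled from the job parameters, the following Python sketch builds it with the standard library. The host, port, path, and the name of the first query parameter (types) are placeholders assumed for this example; take the real values from the Search API documentation for your release.

```python
from urllib.parse import urlencode


def build_search_url(server, port, search_path, query):
    """Assemble an HTTPS search URL from server, port, path, and query values."""
    # safe="*" keeps the wildcard readable instead of encoding it as %2A.
    return f"https://{server}:{port}{search_path}?{urlencode(query, safe='*')}"


# Hypothetical values: the path and the "types" parameter name are assumptions.
url = build_search_url(
    "bg.example.com", 9443, "/hypothetical/search",
    {"types": "terms", "pattern": "*"},
)
print(url)
```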
The Information Governance Catalog Glossary REST web service uses basic authentication and is configured with SSL, so credentials and truststore details are required to access the resources. Provide this information in the Security tab of the REST step. As shown in Figure 5, select:
- Authentication: Basic
- Username: username that has permission to access the services
- Password: password for user
- Truststore type: PKCS12 (Information Server uses the PKCS12 key type.)
- Truststore file: iis-client-truststore.p12, which is required to communicate with the server. This certificate file is available in the path <Information Server Install path>/ASBNode/conf. Provide the path to the file and the truststore password.
Figure 5. Security configuration to access Information Governance Catalog Glossary services
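For readers who want to see what the Basic setting amounts to on the wire, here is a minimal Python sketch of the Authorization header that HTTP basic authentication sends. The user name and password are made-up values; the REST step builds this header for you from the Security tab.

```python
import base64


def basic_auth_header(username, password):
    """Build the HTTP Authorization header used by basic authentication."""
    # The credentials are joined with a colon and Base64-encoded (not encrypted),
    # which is one reason the service is exposed over SSL.
    token = base64.b64encode(f"{username}:{password}".encode("utf-8"))
    return {"Authorization": "Basic " + token.decode("ascii")}


headers = basic_auth_header("user", "pass")
print(headers)
```

A client written outside DataStage would also need to trust the server certificate. Python's standard ssl module reads PEM or DER certificates rather than a PKCS12 file, so the truststore contents would have to be exported first.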
Information Governance Catalog Glossary APIs accept input and send responses in XML format. Because the server responds to the Search API request in application/xml format, the content type application/xml is selected in the Response tab of the GetTerms step. Completing the configuration of the GetTerms step generates the output schema. The response sent by the Information Governance Catalog Glossary server is captured in the body element of the output schema, which serves as input to the XML Parser step GetPageNos.
The Search API returns terms in pages of 15 entries. To retrieve all the terms, the Search API must be called with the pageNumber query parameter for each page. The number of terms in the repository changes over time because the organization keeps adding terms, so the number of pages is not fixed. You must therefore find out how many pages the terms span and then send a Search API request for each page, using pageNumber as a query parameter in the URI, to get the URIs of all the terms. The XML Parser step named GetPageNos parses the response body of the REST call, with SearchResource.xsd chosen as the document root, to get the value of NumberOfPages, which is used as input to the next stage on the canvas.
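The following Python sketch shows the kind of extraction that the GetPageNos parser performs. The sample XML is invented for illustration and assumes the page count arrives as an element named numberOfPages; the real layout is defined by the search schema on the server.

```python
import xml.etree.ElementTree as ET


def read_number_of_pages(response_body):
    """Return the page count found in a search response body."""
    root = ET.fromstring(response_body)
    for element in root.iter():
        # Drop any "{namespace}" prefix so the match works with or without one.
        if element.tag.rsplit("}", 1)[-1] == "numberOfPages":
            return int(element.text)
    raise ValueError("numberOfPages not found in response")


# Invented sample document; element names and namespace are assumptions.
sample = (
    '<searchResult xmlns="urn:example">'
    "<numberOfPages>5</numberOfPages>"
    "</searchResult>"
)
print(read_number_of_pages(sample))
```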
To invoke the GET request with pageNumber as a query parameter for the Search API, you must pass the page number value in the URI. A Transformer stage named GeneratePageNos creates a column that supplies the page number values. The PageNo output column is mapped to the derivation @ITERATION, as in Figure 6. Because a sequence of numbers is required, the loop condition @ITERATION<=AllTermsInfo.numberOfPages is added in the Transformer stage, which generates numbers for as long as the condition holds. For example, if the value of NumberOfPages is 5, the GeneratePageNos stage creates the sequence 1, 2, 3, 4, 5 for the column. This column data is passed as input to the next stage.
Figure 6. Properties dialog of GeneratePageNos stage
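The Transformer loop can be pictured as the following Python generator. It is only an analogy for the @ITERATION<=AllTermsInfo.numberOfPages loop condition, with the variable iteration playing the role of @ITERATION.

```python
def generate_page_numbers(number_of_pages):
    """Yield 1, 2, ..., number_of_pages, like the Transformer loop."""
    iteration = 1  # the loop counter starts at 1
    while iteration <= number_of_pages:  # the loop condition
        yield iteration  # becomes the value of the PageNo output column
        iteration += 1


print(list(generate_page_numbers(5)))  # [1, 2, 3, 4, 5]
```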
The Identify_Terms_Having_CustomAttribute stage is designed with a REST step, XML Parser step, REST step, XML Parser step, and Switch step in a sequence, as in Figure 7. This design fetches the terms that have the custom attribute defined.
Figure 7. Assembly design of the Identify_Terms_Having_CustomAttribute stage
The REST step named GetAllTerms in the assembly design fetches all the terms that are available in the repository. Its URL takes one more query parameter, called PageNumber. Based on the PageNumber value, the step fetches the glossary terms on that page. The REST step has an option to create a local parameter that can be mapped to a column coming from the previous stage. The URI query parameter PageNumber takes its value from a local parameter called PageNo, which is mapped to the column of the GeneratePageNos stage in the Mappings tab, as in Figure 8. The request is sent to the server once for each value of pageNumber, and the output is captured in XML format as the response body. In the Response tab, the content type application/xml is selected.
Figure 8. Mappings details of GetAllTerms step
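The effect of mapping PageNo into the URL can be sketched in Python as appending one query parameter to the search URL for each incoming row. The base URL below is a placeholder, and the parameter is spelled pageNumber here as an assumption; use the spelling that the Search API expects.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_page_number(search_url, page_no):
    """Return search_url with a pageNumber query parameter appended."""
    parts = urlsplit(search_url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append(("pageNumber", str(page_no)))
    # safe="*" keeps an existing wildcard pattern unescaped.
    return urlunsplit(parts._replace(query=urlencode(query, safe="*")))


# Hypothetical base URL; one request URL is produced per page number.
base = "https://bg.example.com:9443/hypothetical/search?pattern=*"
page_urls = [with_page_number(base, page) for page in (1, 2, 3)]
print(page_urls[-1])
```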
The response body, which is in XML format, contains the URI details for all the terms available in the repository. To fetch the URI details, the XML data needs to be parsed. The response body is passed to the XML parser step named GetTermURI. The response body is parsed using the schema searchResource.xsd provided by Information Governance Catalog Glossary to extract the term URI, which is required to fetch the term details.
The term URI is available from the parser step GetTermURI. The REST step named GetTermDetails fetches the details of all the terms. As shown in Figure 9, the URL (https://#BGServer#:#BGPort#/<URI>) uses the local parameter URI that is mapped to the element coming from the previous Parser step. The response from the server is expected in XML format. The content type selected in the Response tab is application/xml.
Figure 9. General details of GetTermDetails step
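The hand-off from GetTermURI to GetTermDetails can be sketched in Python as follows. The sample XML and the element name uri are assumptions made for illustration, since the actual structure comes from the search schema; the URL pattern follows the https://#BGServer#:#BGPort#/<URI> form described above.

```python
import xml.etree.ElementTree as ET


def extract_term_uris(response_body, uri_element="uri"):
    """Collect the text of every element whose local name is uri_element."""
    root = ET.fromstring(response_body)
    return [
        element.text.strip()
        for element in root.iter()
        if element.tag.rsplit("}", 1)[-1] == uri_element and element.text
    ]


def term_details_url(server, port, term_uri):
    """Build the term details URL from the server, port, and term URI."""
    # Strip a leading slash so the result never contains a double slash.
    return f"https://{server}:{port}/{term_uri.lstrip('/')}"


# Invented sample response; the real element names may differ.
sample = (
    "<results>"
    "<term><uri>/a/terms/1</uri></term>"
    "<term><uri>a/terms/2</uri></term>"
    "</results>"
)
urls = [term_details_url("bg.example.com", 9443, u) for u in extract_term_uris(sample)]
print(urls)
```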
The output of the GetTermDetails step is passed to the XML Parser step named ParseTermDetails and is parsed using the termResource schema provided by Information Governance Catalog Glossary. The term details have the information about the custom attributes defined. The parsed content is used in the Switch step to find the terms.
The extracted custom attribute information is passed to the Switch step named FetchTermsHavingCustomAttribute, as shown in Figure 10. The Switch step routes terms to a separate list when a condition is met. A target named checkForCustomAttribute is created in the Switch step with the condition that the element containing the custom attribute name in the response body equals the custom attribute that the Information Governance Catalog Glossary administrator wants to delete. The records that meet the condition are written to the output. The target condition uses the job parameter customAttribute, whose value is provided at run time.
Figure 10. Details of FetchTermsHavingCustomAttribute step
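In plain Python, the filtering that the Switch target performs resembles the sketch below. The Term record and its fields are hypothetical simplifications of the parsed term details; the comparison is an exact match on the attribute name, mirroring the equality condition of the target.

```python
from typing import Iterable, List, NamedTuple


class Term(NamedTuple):
    name: str
    custom_attributes: List[str]  # names of custom attributes on the term


def check_for_custom_attribute(terms: Iterable[Term], custom_attribute: str) -> List[Term]:
    """Keep only the terms that define the given custom attribute."""
    return [
        term
        for term in terms
        if any(attribute == custom_attribute for attribute in term.custom_attributes)
    ]


# Made-up terms for illustration.
terms = [
    Term("Customer", ["Owner", "Region"]),
    Term("Account", []),
    Term("Invoice", ["Owner"]),
]
print([term.name for term in check_for_custom_attribute(terms, "Owner")])
```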
The elements of the checkForCustomAttribute target list are mapped to the output, as shown in Figure 11.
Figure 11. Details of Output step
The job design is complete after you've performed the previous tasks. Now you can compile and run the job. It prompts you to provide the values for the following job parameters:
- customAttribute: The custom attribute name that the Information Governance Catalog Glossary administrator wants to delete.
- BGServer: The server name or IP address of the machine where Information Governance Catalog Glossary services are running.
- BGPort: The port number on which the Information Governance Catalog Glossary services are running.
- CertificatePath: The path where the iis-client-truststore.p12 file is stored.
- outputDir: The path on the server where the output file should be created.
- username: The user who has access to the services.
- password: The password of the user to access the services.
- certificatePassword: The password to access the certificate iis-client-truststore.p12 file.
In this article, you learned how the DataStage Designer can help you design a job to search for terms to which a particular custom attribute is assigned. You can now apply this knowledge to different scenarios based on your needs.
The author would like to thank Jeff J. Li and Poonam Sharma for their feedback and review of this article.
- The InfoSphere Information Server Knowledge Center has information about the Hierarchical Data stage and reference documentation. The InfoSphere Information Server Knowledge Center also has information about the InfoSphere Business Glossary REST API.
- "Develop an Android application with the InfoSphere Business Glossary REST API" (developerWorks, 2013) discusses creating an Android application with the InfoSphere Business Glossary REST API.
- "Developing a Web 2.0 application using the InfoSphere Business Glossary REST API" (developerWorks, 2011) provides step-by-step instructions to develop a portable, dynamic read-write widget for the web that uses the IBM InfoSphere Business Glossary REST API.