
Enhance WebSphere Service Registry and Repository search

Use open source technologies to add full-text search


Level: Intermediate

Claudio Morgia (claudio.morgia@it.ibm.com), SOA Architect, IBM

23 Apr 2008

Learn how you can use Apache Lucene and the Spring Framework to create a keywords plug-in to add full-text search to WebSphere® Service Registry and Repository.

Introduction

IBM WebSphere Service Registry and Repository (hereafter called Service Registry) enables you to store, organize and search technical documents about services, their relationships and their life cycles, in a governed manner. Service Registry is based on layered database technology that stacks an XML database engine over a more traditional relational database management system (RDBMS), in order to provide easy management of XML files in close to their original XML form. Service Registry allows you to describe the logical relationships between sections in XML documents (for example, the interface and data structure descriptions in WSDL files) and to maintain internal relationships between the parts. It also allows you to set relationships between different documents and their logical parts.

In Service Registry, documents, logical objects and relationship life cycles can all be governed through a well-defined, customizable or replaceable process that is implemented through a state machine. Besides the structural static relationships between logical entities described above, Service Registry uses ontologies (specifically, taxonomies) to create dynamic relationships between these entities. These taxonomies induce semantic relationships among objects.

Another relevant function of Service Registry, derived from its internal XML representation, is the ability to search documents and logical entities with an XPath-like language, using both the structural and semantic relationship models to navigate the network of objects. However, Service Registry currently lacks a full-text search capability, which means that searching by keyword or by other more advanced criteria is not currently possible.

In this article, you'll learn how you can introduce full-text search functionality into Service Registry using its extensibility framework and two open source projects: Apache Lucene and the Spring Framework. We'll use scenarios to illustrate the trade-offs between ease of implementation and pervasiveness of the integration for each solution. The plug-in code for both solutions is available in the Downloads section.

You should have moderate knowledge of Java™ in order to fully understand the technical implementation, but it's not strictly required to follow the general discussion.

The Service Registry extensibility framework

Service Registry has an extensibility framework that allows you to develop plug-ins that enhance several aspects of the internal behavior of the product itself. The plug-in methods are invoked at specific stages of document manipulation (create, update and delete, hereafter called CRUD operations), and are implicitly intended to allow or deny object changes.

Service Registry allows three types of plug-ins:

  • Validator, which is called just before the CRUD operation is performed, so that the object is not yet persisted and you have the opportunity to change it before it reaches the storage area.
  • Modifier, which is called just after the CRUD operation, and can be used to change information related to the object, but not easily the object itself.
  • Notifier, which is called after completion of the CRUD operation, and can be used to implement a notification mechanism, but cannot modify the object itself.

Figure 1. Logical plug-in structure

We'll leverage the ability of a validator plug-in to alter the object representation in order to add a new property to the object. This property is filled with the keywords (or statistically relevant tokens) derived from the text form of the document being processed.

Using the XPath ability to search for substrings within properties, we'll be able to search for documents using a keyword-based approach to full-text search. An example of an XPath query for generic (binary) documents using an ontology is:

/WSRR/GenericDocument[classifiedByAnyOf(.,'<classification URI>')]

But using full-text search features, you can use a query like:

/WSRR/GenericDocument[classifiedByAnyOf(.,'<classification URI>') and matches(@keywords,'.*<my keyword>.*')]

As you can see, the second query uses both ontology (classifiedByAnyOf) and full-text based predicates, enabling integration between the two different search strategies (ontology and full-text). The cornerstone is the matches predicate, which allows for a regular expression match over an attribute (or all the attributes).

The role of the validator plug-in is to extract keywords from the submitted document, put them into the attribute keywords, and let the operator matches and XPath do the rest.
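
To make the role of the .* wildcards concrete, the following Java sketch mimics the matching step outside Service Registry. It assumes that matches tests the regular expression against the whole attribute value (which would explain why the keyword is wrapped in .*) and that the keywords attribute holds a space-separated token list. The class and method names are hypothetical, and the real attribute format is whatever the plug-in code writes.

```java
import java.util.regex.Pattern;

class KeywordQuerySketch {

    // Builds the mixed ontology and full-text query in the form shown above.
    // The arguments are inserted as-is, so they must not contain quotes.
    static String buildQuery(String classificationUri, String keyword) {
        return "/WSRR/GenericDocument[classifiedByAnyOf(.,'" + classificationUri
                + "') and matches(@keywords,'.*" + keyword + ".*')]";
    }

    // Mimics the predicate locally: the expression must match the whole value,
    // so the keyword is surrounded by .* to allow text before and after it.
    static boolean keywordMatches(String keywordsAttribute, String keyword) {
        return Pattern.matches(".*" + Pattern.quote(keyword) + ".*", keywordsAttribute);
    }

    public static void main(String[] args) {
        String keywords = "soa broker pattern esb"; // assumed attribute content
        System.out.println(keywordMatches(keywords, "soa"));
        System.out.println(keywordMatches(keywords, "lucene"));
        System.out.println(buildQuery("<classification URI>", "soa"));
    }
}
```

Because this is a substring match, a keyword such as soa would also match a longer token such as soap; a stricter expression is needed if whole-token matching matters.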

Lucene indexing

The other part of the challenge is to find a smart and reusable way to analyze documents, extract statistically relevant tokens from them, and make them available to Service Registry. We'll use Apache Lucene to do this. Specifically, we'll use a by-product of the main indexing function: token extraction.

When Lucene processes a document for indexing, it performs an analysis (or tokenization) that may be complex and language-dependent, but creates a map of tokens and their frequencies in the document. We'll intercept this phase of the Lucene indexing and extract the tokens using two different implementation strategies but producing the same result.
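
As a rough illustration of what "a map of tokens and their frequencies" means, here is a plain Java stand-in for the analysis phase. It is not Lucene's analyzer: it assumes that splitting on non-alphanumeric characters and lower-casing is good enough, and it ignores stop words and stemming. The class name is hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

class TokenFrequencySketch {

    // Splits the text on anything that is not a letter or digit,
    // lower-cases each token and counts how often it occurs.
    static Map<String, Integer> countTokens(String text) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (token.isEmpty()) {
                continue; // split() yields an empty token for empty or separator-led input
            }
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countTokens("Service registry stores service documents."));
    }
}
```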

One thing to keep in mind is that our solution requires us to enable the indexing of many different business documents, ranging from PDFs to Word and Excel documents. Lucene can index only plain text documents, so we need a way to extract the plain text from a rich document format like those above and feed it into Lucene. (Of course, XML files, being plain text, are not required to be transformed in any way.) To do this extraction, we'll leverage two other open source projects:

  1. Apache POI to handle Microsoft™ documents, like those produced by Word and Excel
  2. PDFBox to handle PDF documents

Figure 2 shows the general indexing process:


Figure 2. Indexing process

The same logical scheme is applied to both the RAM- and Spring-based implementations. In the second scenario, the document translator and indexing engine components and their subcomponents are externally defined through a descriptor and are instantiated using Spring.

Lightning fast Lucene in-memory indexing

The Lucene in-memory approach combines RAM-based storage with a very fast document analyzer that trades ease of customization for a highly optimized implementation. Basically, the RAM-based storage means that the file representation of the indexes is kept in RAM and therefore can't be shared among different applications. These indexes remain private and thus avoid the need for elaborate access control strategies. The highly optimized implementation decreases the customization flexibility of Lucene but provides a very fast indexing engine.

If you don't need to share the indexes among different applications, you don't care about the ability to externally declare the structure of the indexing engine, and you are only interested in providing full-text search to Service Registry, this is the strategy for you.

Otherwise, keep reading.



The order is coming and its name is Spring

One of the drawbacks of the previous solution is that, from a coding perspective, there's no easy way to provide flexibility without impacting performance and code complexity. Basically, the transformation flow is hard-coded and can't be easily changed without providing some sort of flexibility framework.

A very interesting feature of the Spring Framework is its implementation of the Inversion of Control principle (also known as the Hollywood Principle, as in "Don't call me, I'll call you"), which states that, instead of wiring components through explicit references into the code, you can simply structure the components by declaring their dependencies through interfaces and types, and then let some other component resolve the dependencies and inject the requested component into the requesting one.

From a coding perspective, if you have an explicit dependency like:

class A {
	public void theMethod() {
		B dependency = new B();
		dependency.someOtherMethod();
	}
}
      

you can transform it into something like:

class A {
	public void theMethod(B dependency) {
		dependency.someOtherMethod();
	}
}
      

In this way, you make the dependency that was hidden inside the method body of the first snippet explicit in the method signature, and you shift the responsibility for creating and injecting the real B object to another component, which can be driven by a configuration file instead of by the code itself. With this approach, simply altering a configuration file (in the Spring case, an XML descriptor) lets you change the component's implementation to use a different indexing strategy.
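
The same idea is often expressed with an interface and constructor injection. The sketch below wires the pieces by hand, which is the job a container such as Spring takes over; all type names (IndexingStrategy, RamIndexing, FileIndexing, KeywordExtractor) are hypothetical and only illustrate swapping an indexing strategy without touching the consumer.

```java
interface IndexingStrategy {
    String describe();
}

class RamIndexing implements IndexingStrategy {
    public String describe() { return "RAM-based index"; }
}

class FileIndexing implements IndexingStrategy {
    public String describe() { return "file-based index"; }
}

class KeywordExtractor {
    private final IndexingStrategy strategy;

    // The dependency is handed in from outside and never created here.
    KeywordExtractor(IndexingStrategy strategy) {
        this.strategy = strategy;
    }

    String report() {
        return "extracting keywords with a " + strategy.describe();
    }
}

class ManualWiringSketch {
    public static void main(String[] args) {
        // Choosing the implementation happens only at the wiring site.
        System.out.println(new KeywordExtractor(new RamIndexing()).report());
        System.out.println(new KeywordExtractor(new FileIndexing()).report());
    }
}
```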

Without going into great detail about the Spring Framework and its programming model, let's recap the basic logical work flow:

  1. Create an XML file, describing the beans, or components, you want to use and the way you want to use them.
  2. In your code, create an instance of an ApplicationContext or BeanFactory implementation, and pass it the reference to the XML descriptor.
  3. Start getting the beans.

Another important piece of the picture is that the Spring Framework provides an integration module and a programming model specific to Lucene integration that can greatly facilitate the development of Lucene applications by providing a set of beans that are ready to use in conjunction with the declarative approach of the Spring Framework.

In our case, we'll use Spring to declare the indexing engine, the storage type (RAM or file-based), the document translator (and the way it recognizes document types, in order to be able to dispatch the document to the appropriate handler) and the template bean, which, in Spring parlance, shields Lucene API complexities from the Spring user.



Create the plug-in

In this section, you'll learn step-by-step how to code the integration plug-in, as well as learn the technical aspects of the different frameworks involved.

The glue code

Since we'll provide two implementations of the same indexing function, we'll hide those implementations behind a factory (DocumentAnalyzerFactory) that exposes a single, common interface (DocumentAnalyzer), as shown here:

package com.ibm.luceneintegration;

public interface DocumentAnalyzer {
	public static final int DEFAULT_MAX_TERMS = 50;

	String[] getMostFrequentTermsAsArray(String name, byte[] content) throws Exception;
	String[] getMostFrequentTermsAsArray(String name, byte[] content, int maxTerms) throws Exception;
	String getMostFrequentTerms(String name, byte[] content) throws Exception;
	String getMostFrequentTerms(String name, byte[] content, int maxTerms) throws Exception;
}

The interface provides four variations of the same method, which logically translates binary content into a list of the most frequent tokens, optionally limiting the length of the list.

The name parameter is used to guess the document type from the file extension. It is possible to guess the type using different techniques, but for the sake of simplicity, we'll stick with the name-based guess.
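
One way to keep the four variations consistent is to let three of them delegate to the most general one. The sketch below is an assumption about how that could be done, not the downloadable code: AbstractDocumentAnalyzer and extensionOf are hypothetical names, the space separator in the string form is assumed, and the interface is repeated (without its package) only so the example compiles on its own.

```java
import java.util.Locale;

// Same contract as the DocumentAnalyzer interface shown above.
interface DocumentAnalyzer {
    int DEFAULT_MAX_TERMS = 50;
    String[] getMostFrequentTermsAsArray(String name, byte[] content) throws Exception;
    String[] getMostFrequentTermsAsArray(String name, byte[] content, int maxTerms) throws Exception;
    String getMostFrequentTerms(String name, byte[] content) throws Exception;
    String getMostFrequentTerms(String name, byte[] content, int maxTerms) throws Exception;
}

// Hypothetical base class: subclasses implement only the three-argument array method.
abstract class AbstractDocumentAnalyzer implements DocumentAnalyzer {

    public String[] getMostFrequentTermsAsArray(String name, byte[] content) throws Exception {
        return getMostFrequentTermsAsArray(name, content, DEFAULT_MAX_TERMS);
    }

    public String getMostFrequentTerms(String name, byte[] content) throws Exception {
        return getMostFrequentTerms(name, content, DEFAULT_MAX_TERMS);
    }

    public String getMostFrequentTerms(String name, byte[] content, int maxTerms) throws Exception {
        // Assumed string form: terms separated by single spaces.
        return String.join(" ", getMostFrequentTermsAsArray(name, content, maxTerms));
    }

    // Name-based type guess: lower-cased text after the last dot, or "" if there is no dot.
    static String extensionOf(String name) {
        int dot = name.lastIndexOf('.');
        return dot < 0 ? "" : name.substring(dot + 1).toLowerCase(Locale.ROOT);
    }
}
```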

The validation plug-in

The Service Registry validation plug-in will use the DocumentAnalyzerFactory to select a document analyzer, use the analyzer to parse each document and retrieve the token list, and then inject the list into an attribute of the XML description metadata associated with the document to enable the XPath mixed query. You'll learn two possible ways to implement this, but you could easily code other implementations that would fit into the plug-in infrastructure.

Figure 3 shows a simplified version of the plug-in code:


Figure 3. Plug-in code

This plug-in shows the implementation of one of the CRUD methods (3), just to illustrate the code.

If the passed object is of type Document (which means that it's a binary object rather than an XML-based technical document), it can be passed to the injectProperty method responsible for handling it. This method (2) retrieves the content and the document's name, then uses the document analyzer implementation (through the factory) to calculate the set of most frequent tokens, and finally calls the setPropertyValue method (1) to set the value of the keywords attribute to a string representation of the keywords set.

The lightning fast, RAM-based, Lucene implementation

This implementation is based on the Lucene core and a contributed package, namely Lucene Memory, which provides the in-memory indexing feature. As you can see in Figure 2, the indexing work flow includes:

  • Document tokenization, which is the phase that transforms a binary document into plain text, containing most of the document's content
  • An indexing engine, which is responsible for statistically analyzing the plain text (derived from the binary document) and computing a list of tokens, sorted by frequency

In our implementation, the first step is realized through the Converter class (which, for the sake of brevity, is not shown here, but is available in the included source). The Converter class extracts the extension from the document name (probably derived from the file name) and uses it to dispatch the document to the appropriate text extractor:

  • Apache POI for Microsoft Excel and Word documents
  • PDFBox for PDF documents
  • Java Swing for RTF documents
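
The dispatch described above can be pictured as a lookup table from extension to extractor. The following sketch is not the Converter class from the download; ConverterSketch and its methods are hypothetical, the extractors are passed in as plain functions rather than real POI, PDFBox or Swing calls, and falling back to UTF-8 text for unknown extensions is an assumption.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

class ConverterSketch {
    // Maps a lower-cased extension (for example "pdf") to a text extractor.
    private final Map<String, Function<byte[], String>> extractors = new HashMap<>();

    void register(String extension, Function<byte[], String> extractor) {
        extractors.put(extension.toLowerCase(Locale.ROOT), extractor);
    }

    // Picks the extractor from the document name and returns the plain text.
    String toPlainText(String name, byte[] content) {
        int dot = name.lastIndexOf('.');
        String extension = dot < 0 ? "" : name.substring(dot + 1).toLowerCase(Locale.ROOT);
        Function<byte[], String> extractor = extractors.get(extension);
        if (extractor == null) {
            // Assumed fallback: treat unknown types (such as XML) as UTF-8 text.
            return new String(content, StandardCharsets.UTF_8);
        }
        return extractor.apply(content);
    }

    public static void main(String[] args) {
        ConverterSketch converter = new ConverterSketch();
        // Placeholder extractor; a real one would call the PDF library here.
        converter.register("pdf", bytes -> "text extracted from " + bytes.length + " PDF bytes");
        System.out.println(converter.toPlainText("guide.pdf", new byte[16]));
        System.out.println(converter.toPlainText("service.xml", "<a/>".getBytes(StandardCharsets.UTF_8)));
    }
}
```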

The AnalyzerUtil class (from the Lucene Memory package) is used to extract the most frequent terms from the plain text. This class is responsible for creating on-the-fly, volatile indexes, directly in RAM. This enables very fast indexing operations, but the index is not accessible for further operations.

The Spring-based implementation

The Spring implementation is based on the use of several components, each providing a specific feature:

  • The Spring Framework to leverage the Inversion of Control pattern and the declarative approach
  • The Spring Modules for the Spring and Lucene integration
  • Lucene Core, for the indexing engine
  • PDFBox, for PDF document handling
  • Apache POI for Microsoft Word and Excel document handling

The Spring Framework is configured using an XML descriptor document like that shown in the simplified example in Figure 4:


Figure 4. Sample Spring configuration file conf.xml

The configuration file can be logically split into four major pieces:

  1. Storage beans, which define the storage areas for indexes. In this case, there are two: RAM-based (session only) and file-system-based (persistent).
  2. An indexing engine, which defines the strategy to use for text analysis (the analyzer) and the index storage to use to record indexes.
  3. Document transformations, which specify a strategy to use to guess the document type (in this case, it's based on the extension, but you can change this method), and a map that links extensions to document handlers that are ultimately responsible for transforming specific document types into plain text
  4. A template, which is the object façade that wraps the Lucene APIs into a simpler programming model and is linked to the indexing engine.

These beans are tied together using the dependency injection mechanism:

  1. The indexing engine (an IndexFactory bean) has a property of type Directory, which can be satisfied by either of the two storage beans, both of type Directory.
  2. The template component has a property of type IndexFactory, which the indexing engine bean, being of a type derived from IndexFactory, can satisfy.

When you put references as the value of a bean's property, you are really calling the dependency injection without having to write some sort of glue code to tie components together.

As you can probably see, the document handler manager bean is not directly linked to any bean, though it should be linked to the template. This is due to a defect in the Spring modules component code.

A Spring object (ApplicationContext or BeanFactory) can use this configuration file to create the beans, inject the dependencies, and do many other things not shown here.

The code for the Spring-based implementation basically uses ApplicationContext to initialize the Spring components and retrieve the beans to be used:

ctx = new ClassPathXmlApplicationContext("conf.xml");		
mgr = (DocumentHandlerManager)ctx.getBean("documentHandlerManager");
template = (LuceneIndexTemplate)ctx.getBean("template");

For each document processed, the document handler manager bean is used to automatically guess the document type and route it to the appropriate document handler for plain text extraction, as shown here:

Document doc = mgr.getDocumentHandler(name).getDocument(new HashMap(), 
new ByteArrayInputStream(content));
Field contents = doc.getField("contents");
Field myContents = new Field("contents",contents.stringValue(),
Store.YES,Index.TOKENIZED,TermVector.YES);
doc.removeField("contents");
doc.add(myContents);

The first line does the magic, using the externalized document handler manager configuration. The power of Spring is that the plug-in code is completely unaware of which document types it can handle and this set can be easily changed declaratively, with no impact on the existing plug-in code.

The remaining lines work around a default behavior of the document handler manager: the Document object it creates can be parsed but is not ready to be indexed, which means it wouldn't be possible to retrieve the list of the most frequent terms from it. Therefore, the code retrieves the contents field and replaces it with a new one that is stored, tokenized and carries a term vector.

Next the document is sent to the Template object, which does the real indexing, using the indexing engine specified in the configuration file. The template programming model states that, once a document has been inserted into it, the index can be read using a callback method. In our implementation, this method accesses the index, retrieves the terms and frequencies vector, and uses it to feed a SortedMap, which guarantees the sorting over the frequency value, allowing us to extract a subset of terms already ordered by frequencies, as shown here:

// Get the document number and its term-frequency vectors
int docNumber = reader.termDocs().doc();
TermFreqVector[] freqs = reader.getTermFreqVectors(docNumber);
TermFreqVector vector = freqs[0];
int[] frequencies = vector.getTermFrequencies();
String[] terms = vector.getTerms();

// Keyed by frequency, highest first; a term replaces any earlier
// term that has the same frequency
SortedMap<Integer, String> map =
	new TreeMap<Integer, String>(Collections.reverseOrder());

for (int i = 0; i < vector.size(); i++) {
	map.put(Integer.valueOf(frequencies[i]), terms[i]);
}

Once you have the ordered map, you can simply extract the limited set of terms and finally write them out as a single string (done by the glue integration code).
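
Because a map keyed by frequency can hold only one term per distinct frequency value, terms that occur equally often compete for the same slot. If you want to keep such ties, one possible variation (not the code in the download) is to sort positions by frequency instead. TopTermsSketch and its method names are hypothetical; the inputs mirror the terms and frequencies arrays of the snippet above.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class TopTermsSketch {

    // Returns up to maxTerms terms, most frequent first. Terms with equal
    // frequency keep their original relative order because List.sort is stable.
    static List<String> topTerms(String[] terms, int[] frequencies, int maxTerms) {
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < terms.length; i++) {
            order.add(i);
        }
        order.sort(Comparator.comparingInt((Integer i) -> frequencies[i]).reversed());

        List<String> result = new ArrayList<>();
        for (int i = 0; i < order.size() && i < maxTerms; i++) {
            result.add(terms[order.get(i)]);
        }
        return result;
    }

    // Writes the selected terms out as a single space-separated string.
    static String asKeywords(List<String> terms) {
        return String.join(" ", terms);
    }

    public static void main(String[] args) {
        String[] terms = {"soa", "esb", "broker", "pattern"};
        int[] frequencies = {5, 3, 5, 1};
        System.out.println(asKeywords(topTerms(terms, frequencies, 3)));
    }
}
```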



Deploy the plug-in

In this section, you'll learn how to deploy the plug-in in Eclipse, configure the application server, load the plug-in into Service Registry, and test the solution.

Import the plug-in into Eclipse

Once you've downloaded the complete plug-in source project, contained in wsrrplugin.zip, bundled with all dependencies except the Service Registry run-time client, you can import it into an Eclipse installation by completing the following steps:

  1. Launch Eclipse, pointing to an empty workspace.
  2. In Eclipse, select File => Import.
  3. In the Import dialog, select General => Existing Projects into Workspace, and click Next, as shown in Figure 5.

    Figure 5. Select import type

  4. In the Import Projects dialog, shown in Figure 6, browse to the downloaded archive file. Eclipse automatically selects the only project contained in that file.

    Figure 6. Select archive file

  5. Click Finish to build the project. The final workspace should look something like Figure 7.

    Figure 7. Initial workspace

  6. Now generate the binary plug-in JAR file by right-clicking WSRRplug-inExport.jardesc and selecting Open JAR Packager.
  7. In the JAR File Specification dialog, choose the appropriate JAR filename and click Finish, as shown in Figure 8.

    Figure 8. Specify the JAR file

Configure the application server

In order to run the plug-in, you need to customize the configuration of the application server, especially with respect to the classpath, as described in this section. You can use the solution described to test the plug-in, but you shouldn't use it in a production environment because of the kind of classpath manipulation required.

To test the solution, you'll need the following archive files, which are available for download from this article:

  • wsrrplugin.zip, which contains the plug-in code
  • WSRRFullTextDependencies.zip, which contains the run-time dependencies. Unzip this file to a directory on the machine where Service Registry is running. For example, on a Linux platform, you can create the directory /usr/local/WSRRplug-inDependencies and unzip the dependencies archive with the following command:
    unzip <dependency archive> -d /usr/local/WSRRplug-inDependencies
  • workspace.zip, which contains the workspace files

To customize the application server, do the following:

  1. If the application server hosting Service Registry is not yet running, start it. For example, on a typical Linux installation of Service Registry, enter the command:
    /opt/IBM/WebSphere/AppServer/bin/startServer.sh server1
          

  2. Point your browser to http://localhost:9060/ibm/console, and select Servers => Application servers from the left navigation, then select server1 in the right pane.
  3. In the right pane, select Server Infrastructure => Java and Process Management => Process Definition => Java Virtual Machine. The Java Virtual Machine Configuration dialog displays, as shown in Figure 9.
    • In the Classpath field, specify the full path of each JAR file contained in the directory in which you expanded the dependencies archive.
    • In the Generic JVM arguments field, append the following text:
      -Dcom.ibm.luceneintegration.factoryClass=com.ibm.luceneintegration.spring.FullSpringIntegration
      

      This sets a system-wide property, which is read by the factory class used to switch between integration implementations, this time using the Spring model.

    • Click OK to save the configuration.

      Figure 9. Application Server configuration

  4. Log out and restart the server by entering the commands:
    /opt/IBM/WebSphere/AppServer/bin/stopServer.sh server1
    /opt/IBM/WebSphere/AppServer/bin/startServer.sh server1
    

The application server is now configured to host the plug-in, which you can deploy into Service Registry.
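
The com.ibm.luceneintegration.factoryClass property set in the JVM arguments is what lets the factory pick an implementation at run time. The sketch below shows one way such a lookup could work using reflection. It is an assumption, not the factory from the download: the nested Analyzer types are hypothetical stand-ins, and defaulting to the RAM-based variant when the property is absent is a guess.

```java
class AnalyzerFactorySketch {
    static final String PROPERTY = "com.ibm.luceneintegration.factoryClass";

    // Hypothetical stand-ins for the two integration implementations.
    interface Analyzer { String kind(); }

    static class RamAnalyzer implements Analyzer {
        public String kind() { return "ram"; }
    }

    static class SpringAnalyzer implements Analyzer {
        public String kind() { return "spring"; }
    }

    // Reads the class name from the system property (assumed default: RAM-based)
    // and instantiates it through its no-argument constructor.
    static Analyzer create() throws ReflectiveOperationException {
        String className = System.getProperty(PROPERTY, RamAnalyzer.class.getName());
        return (Analyzer) Class.forName(className).getDeclaredConstructor().newInstance();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(create().kind());
        System.setProperty(PROPERTY, SpringAnalyzer.class.getName());
        System.out.println(create().kind());
    }
}
```

Setting the property on the command line with -D has the same effect as the System.setProperty call in main.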

Load the plug-in into Service Registry

You should now have your plug-in JAR file, whether you downloaded it or created it from the Eclipse workspace.

  1. Open the Service Registry Web user interface by pointing your browser to http://localhost:9080/ServiceRegistry.
  2. Select Perspective => Configuration, as shown in Figure 10.

    Figure 10. Select Configuration perspective

  3. In the left navigation pane, select Active Configuration Profile => Plug-ins => JARs, then click Load JAR Plug-In in the right pane, as shown in Figure 11.

    Figure 11. Load JAR plug-in

  4. In the next dialog, browse to the plug-in file, specify a name, such as WSRRLuceneplug-in, and click OK.
  5. When the Details pane displays, select Validation Properties in the left pane.
  6. In the right pane, click Validation properties plug-in (ValidationProperties), as shown in Figure 12.

    Figure 12. Select validation properties

  7. In the right pane, select the Content tab, and look for the following text:
    validators=com.ibm.sr.api.SRTemplateValidator,com.ibm.sr.api.SRBusinessModelValidator
          

    Append a comma and the following value (which is the class name of our new integration validator):
    com.ibm.luceneintegration.wsrr.FullTextValidator
          

  8. Click OK.

The validator is now in place and hooked to Service Registry. The next document you process will contain the new attribute keywords, allowing you to do full-text search in the XPath queries.

Test the solution

To make sure everything works as expected, do the following:

  1. Switch to the Service Registry Administrator perspective.
  2. In the left navigation pane, select Service Documents => Other Documents, then click Load Documents in the right pane.
  3. Browse to a document you want to search (PDF, Excel or Word), enter a namespace, description, and version for the document, and click OK, as shown in Figure 13.

    Figure 13. Import document

  4. Click Finish in the confirmation message.
  5. When the document is loaded, you should see a summary page with a link to the document. Click this link to go to a document description page.
  6. On the document description page, click the Properties link. The Properties dialog displays, as shown in Figure 14.

    Figure 14. Document properties with the new keywords property

You should see the new keywords property, containing the list of the most relevant tokens found in the document.



Now for the real fun! Try out a search

The Service Registry Web UI has a query wizard that walks you through creating an XPath query, but it currently doesn't accept wildcards, which prevents us from using it to test our new search feature. However, we can work around this limitation by leveraging the Service Registry saved searches feature.

  1. In the Service Registry Web UI, select Queries => Query Wizard in the left pane, then select All Entities and click Next in the right pane, as shown in Figure 15.

    Figure 15. Open the Query wizard

  2. In the next dialog, specify Dummy for the property name and Dummy for the property value, then click Next, then Finish.
  3. The query will likely return no results, but you will be given the option to save the query, as shown in Figure 16. In the Save this Search section, specify a name for the search, such as mysearch, and click Save.

    Figure 16. Save the query

  4. In the left pane, select MyService Registry => My Saved Searches => mysearch, as shown in Figure 17.

    Figure 17. Select saved search

  5. In the Details dialog, select Properties under Additional Properties and click queryExpression.
  6. Enter an XPath query, using the matches predicate to exploit the full-text search capability, and click OK. For example, to search for the specific IBM Redbook Patterns: SOA Design Using WebSphere Message Broker and WebSphere ESB that we imported in the previous section, you could use the following XPath query in the queryExpression property:
    /WSRR/GenericDocument[matches(@keywords,".*soa.*")]
          

  7. Re-run the query by selecting MyService Registry => My Saved Searches => mysearch and clicking Run Search, as shown in Figure 18.

    Figure 18. Re-run the saved query

You can use this process to test the integration plug-in using both ontology and full-text queries, for example:

/WSRR/GenericDocument[classifiedByAnyOf(.,'<classification URI>') 
and matches(@keywords,'.*<my keyword>.*')]



Summary

In this article, you learned how to create a Service Registry plug-in to extend the search functionality. Specifically, you saw how you can use the open source Apache Lucene project to add indexing features to Service Registry and enable you to do full-text searches. You also learned how to use the open source Spring Framework project to externalize some structural aspects of the plug-in code, adding flexibility and the ability to add and change features in a declarative way. Through the companion Spring Modules project, you learned a simplified programming model to interact with Lucene in an easier, Spring-enabled way.

You can use the techniques you've learned to add many other features, such as the ability to handle languages other than English and determine those languages from specific document properties or heuristics, or to integrate semantic searches based on the Lucene query language and its ability to integrate with Wikipedia and WordNet®. The file-based storage, which makes the indexes persistent, also allows you to come back to them and perform other queries, even outside of Service Registry.



Acknowledgments

The author would like to thank Tim Baldwin and the WebSphere Service Registry and Repository development team for their help in developing and verifying the content of this article.




Downloads

Description                 Name                            Size      Download method
Plug-in archive file        wsrrplugin.zip                  9KB       HTTP
Dependencies archive file   WSRRFullTextDependencies.zip    11.8MB    HTTP
Workspace archive file      workspace.zip                   17.5MB    HTTP


Resources


Get products and technologies
  • Lucene: Get complete information on Lucene, including downloads.

  • Spring Framework: Get complete information on the Spring Framework, including downloads.

  • Spring Modules: Get complete information on Spring Modules, including downloads.

  • Eclipse: Get complete information on Eclipse, including downloads.


About the author


Claudio Morgia is an SOA Architect in IBM Software Group in Italy. He works with several large Italian clients, providing architectural guidance for their SOA journeys, especially in the BPM, service management and governance areas. Claudio previously worked in the IBM Tivoli® Lab in Rome as a software designer, focusing on J2EE-based solutions for the IBM WebSphere platform. You can reach Claudio at claudio.morgia@it.ibm.com.



