Level: Intermediate
Claudio Morgia (claudio.morgia@it.ibm.com), SOA Architect, IBM
23 Apr 2008 Learn how you can use Apache Lucene and the Spring Framework to
create a keywords plug-in to add full-text search to WebSphere® Service
Registry and Repository.
Introduction
IBM WebSphere Service Registry and Repository (hereafter called Service Registry) enables
you to store, organize and search technical documents about services, their
relationships and their life cycles, in a governed manner. Service Registry is
based on layered database technology that stacks an XML database engine over a
more traditional relational database management system (RDBMS), in order to
provide easy management of XML files in close to their original XML form. Service
Registry allows you to describe the logical relationships between sections in XML
documents (for example, the interface and data structure descriptions in WSDL
files) and to maintain internal relationships between the parts. It also allows
you to set relationships between different documents and their logical parts.
In Service Registry, documents, logical objects and relationship life cycles can
all be governed through a well-defined, customizable or replaceable process that
is implemented through a state machine. Besides the structural static
relationships between logical entities described above, Service Registry uses
ontologies (specifically, taxonomies) to create dynamic relationships between
these entities. These taxonomies induce semantic relationships among objects.
Another relevant function of Service Registry, derived from its internal XML
representation, is the ability to search documents and logical entities with an
XPath-like language, using both the structural and semantic relationship models
to navigate the object network. However, Service Registry currently lacks a
full-text search capability, which means that searching by keywords or other
more advanced criteria is not currently possible.
In this article, you'll learn how you can introduce full-text search
functionality into Service Registry using its extensibility framework and two open
source projects: Apache Lucene and the Spring Framework. We'll use scenarios to
illustrate the trade-offs between ease of implementation and pervasiveness of the
integration for each solution. The plug-in code for both solutions is available in
the Download section.
You should have moderate knowledge of Java™ in order to fully understand
the technical implementation, but it's not strictly required to follow the general
discussion.
The Service Registry extensibility framework
Service Registry has an extensibility framework that allows you to develop
plug-ins that enhance several aspects of the internal behavior of the product
itself. The plug-in methods are invoked at specific stages of document
manipulation (create, update and delete, hereafter called CRUD operations), and
are implicitly intended to allow or deny object changes.
Service Registry allows three types of plug-ins:
- Validator, which is called just before the CRUD operation is performed,
so that the object is not yet persisted and you have the opportunity to change
it before it reaches the storage area.
- Modifier, which is called just after the CRUD operation, and can be used
to change information related to the object, but cannot easily change the
object itself.
- Notifier, which is called after completion of the CRUD operation, and can
be used to implement a notification mechanism, but cannot modify the object
itself.
Figure 1. Logical plug-in structure
We'll leverage the ability of a validator plug-in to alter the object
representation, in order to introduce a new object property, which will be
filled using the keywords (or statistically relevant tokens) derived from the text
form of the document that is being processed.
Using the XPath ability to search for substrings within property values, we'll be
able to search for documents using a keyword-based approach to full-text search. An
example of an XPath query for generic (binary) documents using an ontology is:
/WSRR/GenericDocument[classifiedByAnyOf(.,'<classification URI>')]
But using full-text search features, you can use a query like:
/WSRR/GenericDocument[classifiedByAnyOf(.,'<classification URI>') and matches(@keywords,'.*<my keyword>.*') ]
As you can see, the second query uses both ontology (classifiedByAnyOf) and
full-text based predicates, enabling integration between the two different search
strategies (ontology and full-text). The cornerstone is the predicate
matches, which allows for a regular expression match over an attribute (or
all the attributes).
The role of the validator plug-in is to extract keywords from the submitted
document, put them into the attribute keywords, and let the operator matches and
XPath do the rest.
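To make the idea concrete, here is a minimal plain-Java sketch that emulates the keyword match with java.util.regex instead of the Service Registry query engine. The class name, the method names and the space separator used to join the tokens are assumptions made for illustration; the sketch only shows why a pattern of the form .*keyword.* finds a token stored inside a single string property.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Emulates, in plain Java, a regular expression match over a keywords property. */
final class KeywordMatchSketch {

    /** Joins the extracted tokens into one string (the space separator is an assumption). */
    static String toKeywordsProperty(List<String> tokens) {
        return String.join(" ", tokens);
    }

    /** Returns true when the keyword occurs anywhere in the property value. */
    static boolean matchesKeyword(String keywordsProperty, String keyword) {
        // Pattern.quote keeps any regex metacharacters in the keyword literal.
        Pattern pattern = Pattern.compile(".*" + Pattern.quote(keyword) + ".*");
        return pattern.matcher(keywordsProperty).matches();
    }

    public static void main(String[] args) {
        String property = toKeywordsProperty(Arrays.asList("soa", "broker", "pattern"));
        System.out.println(matchesKeyword(property, "broker"));   // true
        System.out.println(matchesKeyword(property, "registry")); // false
    }
}
```

Note that this is a substring match: searching for soa would also select a document whose keywords contain soap.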
Lucene indexing
The other part of the challenge is to find a smart and reusable way to analyze
documents, extract statistically relevant tokens from them, and make them
available to Service Registry. We'll use Apache Lucene to do this. Specifically,
we'll use a side effect of the main indexing function, called token
extraction.
When Lucene processes a document for indexing, it performs an analysis (or
tokenization) that may be complex and language-dependent, but creates a map of
tokens and their frequencies in the document. We'll intercept this phase of the
Lucene indexing and extract the tokens using two different implementation
strategies but producing the same result.
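As a rough illustration of what a map of tokens and their frequencies looks like, the following sketch counts terms with plain JDK classes. It is a toy stand-in and does not use the Lucene API: the class name, the method name and the naive split on non-alphanumeric characters are assumptions, whereas a real Lucene analyzer can also handle stop words and language-specific rules.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Toy term counter; a stand-in for the analysis step, not the Lucene implementation. */
final class TermFrequencySketch {

    /** Lowercases the text, splits it into tokens and counts each token. */
    static Map<String, Integer> countTerms(String text) {
        Map<String, Integer> frequencies = new LinkedHashMap<>();
        // Split on any run of characters that are neither letters nor digits.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!token.isEmpty()) {
                frequencies.merge(token, 1, Integer::sum);
            }
        }
        return frequencies;
    }

    public static void main(String[] args) {
        System.out.println(countTerms("SOA patterns: SOA design"));
    }
}
```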
One thing to keep in mind is that our solution requires us to enable the indexing
of many different business documents, ranging from PDFs to Word and Excel
documents. Lucene can index only plain text documents, so we need a way to extract
the plain text from a rich document format like those above and feed it into
Lucene. (Of course, XML files, being plain text, are not required to be
transformed in any way.) To do this extraction, we'll leverage two other open
source projects:
- Apache POI to handle Microsoft™ documents, like those produced by
Word and Excel
- PDFBox to handle PDF documents
Figure 2 shows the general indexing process:
Figure 2. Indexing process
The same logical scheme is applied to both the RAM- and Spring-based
implementations. In the second scenario, the document translator and indexing
engine components and their subcomponents are externally defined through a
descriptor and are instantiated using Spring.
Lightning fast Lucene in-memory indexing
The Lucene in-memory approach combines RAM-based storage with a very fast
document analyzer that trades ease of customization for a highly optimized
implementation. Basically, the RAM-based storage means that the file
representation of the indexes is kept in RAM and therefore can't be shared among
different applications. These indexes remain private and thus avoid the need for
elaborate access control strategies. The highly optimized implementation decreases
the customization flexibility of Lucene but provides a very fast indexing engine.
If you don't need to share the indexes among different applications, you don't
care about the ability to externally declare the structure of the indexing engine,
and you are only interested in providing full-text search to Service Registry,
this is the strategy for you.
Otherwise, keep reading.
The order is coming and its name is Spring
One of the drawbacks of the previous solution is that, from a coding perspective,
there's no easy way to provide flexibility without impacting performance and code
complexity. Basically, the transformation flow is hard-coded and can't be easily
changed without providing some sort of flexibility framework.
A very interesting feature of the Spring Framework is its implementation of the
Inversion of Control principle (also known as the Hollywood
Principle, as in "Don't call me, I'll call you"), which states that, instead
of wiring components through explicit references into the code, you can simply
structure the components by declaring their dependencies through interfaces and
types, and then let some other component resolve the dependencies and inject the
requested component into the requesting one.
From a coding perspective, if you have a hard-coded dependency like:
class A {
    public void theMethod() {
        // A creates its own collaborator, so B cannot be swapped from outside
        B dependency = new B();
        dependency.someOtherMethod();
    }
}
you can transform it into something like:
class A {
    public void theMethod(B dependency) {
        // The collaborator is supplied by the caller
        dependency.someOtherMethod();
    }
}
In this way, you make the dependency hidden inside the first snippet explicit
through the method signature and shift the responsibility of creating and
injecting the real B object to another component, which can be driven by a
configuration file instead of by the code itself. With this approach, simply
altering a configuration file (in the Spring case, an XML descriptor) lets you
change the component's implementation to use a different indexing strategy.
Without going into great detail about the Spring Framework and its programming
model, let's recap the basic logical work flow:
- Create an XML file, describing the beans, or components, you want to use and
the way you want to use them.
- In your code, create an instance of either the
ApplicationContext or
BeanFactory class, and pass the reference to the XML
descriptor.
- Start getting the beans.
Another important piece of the picture is that the Spring Framework provides an
integration module and a programming model specific to Lucene integration that can
greatly facilitate the development of Lucene applications by providing a set of beans
that are ready to use in conjunction with the declarative approach of the Spring
Framework.
In our case, we'll use Spring to declare the indexing engine, the storage type
(RAM or file-based), the document translator (and the way it recognizes document
types, in order to be able to dispatch the document to the appropriate handler)
and the template bean, which, in Spring parlance, shields Lucene API complexities
from the Spring user.
Create the plug-in
In this section, you'll learn step-by-step how to code the integration plug-in,
as well as learn the technical aspects of the different frameworks involved.
The glue code
Since we'll provide two implementations of the same indexing function, we'll hide
those implementations behind a factory
(DocumentAnalyzerFactory) that exposes a single, common
interface (DocumentAnalyzer), as shown here:
package com.ibm.luceneintegration;

public interface DocumentAnalyzer {
    public static final int DEFAULT_MAX_TERMS = 50;

    String[] getMostFrequentTermsAsArray(String name, byte[] content) throws Exception;
    String[] getMostFrequentTermsAsArray(String name, byte[] content, int maxTerms) throws Exception;
    String getMostFrequentTerms(String name, byte[] content) throws Exception;
    String getMostFrequentTerms(String name, byte[] content, int maxTerms) throws Exception;
}
The interface provides four variations of the same method, which logically
translates binary content into a list of the most frequent tokens, optionally
limiting the length of the list.
The name parameter is used to guess the document type based on
its extension. It is possible to guess the type using different techniques, but
for the sake of simplicity, we'll stick with the name-based guess.
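One way to keep the four variations consistent is to let three of them delegate to a single method. The sketch below is an assumption about how this could be organized, not the code shipped in the download: AbstractDocumentAnalyzer is a hypothetical class, the space separator used to build the string form is also an assumption, and the interface is repeated (without its package) only so that the sketch compiles on its own.

```java
/** Local copy of the article's interface so that the sketch is self-contained. */
interface DocumentAnalyzer {
    int DEFAULT_MAX_TERMS = 50;
    String[] getMostFrequentTermsAsArray(String name, byte[] content) throws Exception;
    String[] getMostFrequentTermsAsArray(String name, byte[] content, int maxTerms) throws Exception;
    String getMostFrequentTerms(String name, byte[] content) throws Exception;
    String getMostFrequentTerms(String name, byte[] content, int maxTerms) throws Exception;
}

/** Hypothetical base class: subclasses implement only the array variant with an explicit limit. */
abstract class AbstractDocumentAnalyzer implements DocumentAnalyzer {

    public String[] getMostFrequentTermsAsArray(String name, byte[] content) throws Exception {
        return getMostFrequentTermsAsArray(name, content, DEFAULT_MAX_TERMS);
    }

    public String getMostFrequentTerms(String name, byte[] content) throws Exception {
        return getMostFrequentTerms(name, content, DEFAULT_MAX_TERMS);
    }

    public String getMostFrequentTerms(String name, byte[] content, int maxTerms) throws Exception {
        // The string form is the array form joined with spaces (an assumed separator).
        return String.join(" ", getMostFrequentTermsAsArray(name, content, maxTerms));
    }
}
```

With this arrangement, each concrete analyzer only has to provide getMostFrequentTermsAsArray(name, content, maxTerms).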
The validation plug-in
The Service Registry validation plug-in will use the DocumentAnalyzerFactory to
select a document analyzer, use the analyzer to parse each document and retrieve
the token list, and then inject the list into an attribute of the XML description
metadata associated with the document to enable the XPath mixed query. You'll
learn two possible ways to implement this, but you could easily code other
implementations that would fit into the plug-in infrastructure.
Figure 3 shows a simplified version of the plug-in code:
Figure 3. Plug-in code
This plug-in shows the implementation of only one of the CRUD methods (3), just to
illustrate the code.
If the passed object is of type Document (which means that it's a binary object
rather than an XML-based technical document), it can be passed to the
injectProperty method responsible for handling it. This method (2) retrieves the
content and the document's name, then uses the document analyzer implementation
(through the factory) to calculate the set of most frequent tokens, and finally
calls the setPropertyValue method (1) to set the value of the keywords attribute
to a string representation of the keywords set.
The lightning fast, RAM-based, Lucene implementation
This implementation is based on the Lucene core and a contributed package, namely
Lucene Memory, which provides the in-memory indexing feature. As you can see in
Figure 2, the indexing work flow includes:
- Document tokenization, which is the phase that transforms a binary document
into plain text, containing most of the document's content
- An indexing engine, which is responsible for statistically analyzing the plain
text (derived from the binary document) and computing a list of tokens, sorted by
frequency
In our implementation, the first step is realized through the Converter class
(which, for the sake of brevity, is not shown here, but is available in the
included source). The Converter class extracts the extension from the document
name (probably derived from the file name) and uses it to dispatch the document
to the appropriate text extractor:
- Apache POI for Microsoft Excel and Word documents
- PDFBox for PDF documents
- Java Swing for RTF documents
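The extension-based dispatch described above can be sketched as follows. This is not the Converter class from the download: TextExtractor and ExtensionDispatchSketch are hypothetical names, and the registered extractors are placeholders for the POI, PDFBox and Swing handlers. Treating unknown extensions as UTF-8 plain text is an assumption as well.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical stand-in for a format-specific text extractor. */
interface TextExtractor {
    String extract(byte[] content);
}

/** Routes a document to a text extractor chosen by the extension of its name. */
final class ExtensionDispatchSketch {
    private final Map<String, TextExtractor> extractors = new HashMap<>();
    // Assumption: anything without a registered handler is already plain text.
    private final TextExtractor fallback = content -> new String(content, StandardCharsets.UTF_8);

    void register(String extension, TextExtractor extractor) {
        extractors.put(extension.toLowerCase(Locale.ROOT), extractor);
    }

    /** Returns the lowercase extension, or an empty string when the name has none. */
    static String extensionOf(String name) {
        int dot = name.lastIndexOf('.');
        return dot < 0 ? "" : name.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    String toPlainText(String name, byte[] content) {
        return extractors.getOrDefault(extensionOf(name), fallback).extract(content);
    }
}
```

A caller would register one extractor per supported extension (for example pdf, doc, xls and rtf) and then call toPlainText with the document name and its bytes.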
The AnalyzerUtil class (from the Lucene Memory
package) is used to extract the most frequent terms from the plain text. This
class is responsible for creating on-the-fly, volatile indexes, directly in RAM.
This enables very fast indexing operations, but the index is not accessible for
further operations.
The Spring-based implementation
The Spring implementation is based on the use of several components, each
providing a specific feature:
- The Spring Framework to leverage the Inversion of Control pattern and the
declarative approach
- The Spring Modules for the Spring and Lucene integration
- Lucene Core, for the indexing engine
- PDFBox, for PDF document handling
- Apache POI for Microsoft Word and Excel document handling
The Spring Framework is configured using an XML descriptor document like that
shown in the simplified example in Figure 4:
Figure 4. Sample Spring configuration file conf.xml
The configuration file can be logically split into four major pieces:
- Storage beans, which define the storage areas for indexes. In this case,
there are two: RAM-based (session only) and file-system-based (persistent).
- An indexing engine, which defines the strategy to use for text analysis (the
analyzer) and the index storage to use to record indexes.
- Document transformations, which specify a strategy to use to guess the
document type (in this case, it's based on the extension, but you can change
this method), and a map that links extensions to document handlers that are
ultimately responsible for transforming specific document types into plain text
- A template, which is the object façade that wraps the Lucene APIs into a
simpler programming model and is linked to the indexing engine.
These beans are tied together using the dependency injection mechanism:
- The indexing engine (an IndexFactory bean) has a property of type Directory,
which can be satisfied by either of the two storage beans, both of type Directory.
- The template component has a property of type IndexFactory, which the indexing
engine bean, being an IndexFactory, can satisfy.
When you put references as the value of a bean's property, you are invoking
dependency injection, without having to write glue code to tie the components
together.
As you can probably see, the document handler manager bean is not directly linked
to any bean, though it should be linked to the template. This is due to a defect
in the Spring modules component code.
A Spring object (ApplicationContext or BeanFactory) can use this configuration
file to create the beans, inject the dependencies and many other things not shown
here.
As you can see, the code for the Spring-based implementation basically uses
ApplicationContext to initialize the Spring components
and retrieve the beans to be used:
ctx = new ClassPathXmlApplicationContext("conf.xml");
mgr = (DocumentHandlerManager)ctx.getBean("documentHandlerManager");
template = (LuceneIndexTemplate)ctx.getBean("template");
For each document processed, the document handler manager bean is used to
automatically guess the document type and route it to the appropriate document
handler for plain text extraction, as shown here:
Document doc = mgr.getDocumentHandler(name).getDocument(new HashMap(),
        new ByteArrayInputStream(content));
Field contents = doc.getField("contents");
Field myContents = new Field("contents", contents.stringValue(),
        Store.YES, Index.TOKENIZED, TermVector.YES);
doc.removeField("contents");
doc.add(myContents);
The first line does the magic, using the externalized document handler manager
configuration. The power of Spring is that the plug-in code is completely unaware
of which document types it can handle and this set can be easily changed
declaratively, with no impact on the existing plug-in code.
The remaining lines work around a default behavior of the document handler
manager, which creates a Document object that is suitable for parsing but not
for indexing, which means it wouldn't be possible to retrieve the list of the most
frequent terms. Therefore, the code retrieves the content field and then creates a
new one that is ready for indexing.
Next the document is sent to the Template object, which does the real indexing,
using the indexing engine specified in the configuration file. The template
programming model states that, once a document has been inserted into it, the
index can be read using a callback method. In our implementation, this method
accesses the index, retrieves the terms and frequencies vector, and uses it to
feed a SortedMap, which guarantees the sorting over the frequency value, allowing
us to extract a subset of terms already ordered by frequencies, as shown here:
int docNumber = reader.termDocs().doc();
TermFreqVector[] freqs = reader.getTermFreqVectors(docNumber);
TermFreqVector vector = freqs[0];
int[] frequencies = vector.getTermFrequencies();
String[] terms = vector.getTerms();
// Keys are frequencies, sorted in descending order.
// Note: terms that share a frequency overwrite each other in this map.
SortedMap<Integer, String> map =
        new TreeMap<Integer, String>(Collections.reverseOrder());
for (int i = 0; i < vector.size(); i++) {
    map.put(Integer.valueOf(frequencies[i]), terms[i]);
}
Once you have the ordered map, you can simply extract the limited set of terms
and then write them out as a single string (done by the glue integration
code).
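One caveat with any map keyed by frequency is that it can hold only one term per distinct frequency value, so terms with equal counts replace one another. If that matters for your documents, a possible alternative is to sort term positions instead. The sketch below is an assumption about how this could be done with plain JDK classes: TopTermsSketch and topTerms are hypothetical names, and the two arrays stand for the term and frequency arrays obtained from the term vector.

```java
import java.util.ArrayList;
import java.util.List;

/** Selects the most frequent terms without losing terms that share a frequency. */
final class TopTermsSketch {

    static List<String> topTerms(String[] terms, int[] frequencies, int maxTerms) {
        // Sort positions rather than values so both arrays stay aligned.
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < terms.length; i++) {
            order.add(i);
        }
        // Highest frequency first; ties are broken alphabetically.
        order.sort((a, b) -> {
            int byFrequency = Integer.compare(frequencies[b], frequencies[a]);
            return byFrequency != 0 ? byFrequency : terms[a].compareTo(terms[b]);
        });
        List<String> result = new ArrayList<>();
        for (int i = 0; i < Math.min(maxTerms, order.size()); i++) {
            result.add(terms[order.get(i)]);
        }
        return result;
    }
}
```

The resulting list can then be joined into the single string stored in the keywords property.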
Deploy the plug-in
In this section, you'll learn how to deploy the plug-in in Eclipse, configure the
application server, load the plug-in into Service Registry, and test the solution.
Import the plug-in into Eclipse
Once you've downloaded the complete plug-in source project, contained in
wsrrplugin.zip, bundled with all dependencies except the
Service Registry run-time client, you can import it into an Eclipse installation
by completing the following steps:
- Launch Eclipse, pointing to an empty workspace.
- In Eclipse, select File => Import.
- In the Import dialog, select General => Existing Projects into
Workspace, and click Next, as shown in Figure 5.
Figure 5. Select import type
- In the Import Projects dialog, shown in Figure 6, browse to the downloaded
archive file. Eclipse automatically selects the only project contained in that
file.
Figure 6. Select archive file
- Click Finish to build the project. The final workspace should look
something like Figure 7.
Figure 7. Initial workspace
- Now generate the binary plug-in JAR file by right-clicking
WSRRplug-inExport.jardesc and selecting Open JAR
Packager.
- In the JAR File Specification dialog, choose the appropriate JAR filename and
click Finish, as shown in Figure 8.
Figure 8. Specify the JAR file
Configure the application server
In order to run the plug-in, you need to customize the configuration of the
application server, especially with respect to the classpath, as described in this
section. You can use the solution described to test the plug-in, but you shouldn't
use it in a production environment because of the kind of classpath manipulation
required.
To test the solution, you'll need the following archive files, which are
available for download from this article:
- wsrrplugin.zip, which contains the plug-in code
- WSRRFullTextDependencies.zip, which contains the run-time dependencies. Unzip
this file to a directory on the machine where Service Registry is running. For
example, on a Linux platform, you can create the directory
/usr/local/WSRRplug-inDependencies and unzip the dependencies archive with the
following command:
unzip <dependency archive> -d /usr/local/WSRRplug-inDependencies
- workspace.zip, which contains the workspace files
To customize the application server, do the following:
- If the application server hosting Service Registry is not yet running, start
it. For example, on a typical Linux installation of Service Registry, enter the
command:
/opt/IBM/WebSphere/AppServer/bin/startServer.sh server1
- Point your browser to http://localhost:9060/ibm/console, and select Servers
=> Application servers from the left navigation, then select
server1 in the right pane.
- In the right pane, select Server Infrastructure => Java and Process
Management => Process Definition => Java Virtual Machine.
The Java Virtual Machine Configuration dialog displays, as shown in Figure 9.
- Log out and restart the server by entering the commands:
/opt/IBM/WebSphere/AppServer/bin/stopServer.sh server1
/opt/IBM/WebSphere/AppServer/bin/startServer.sh server1
The application server is now configured to host the plug-in, which you can
deploy into Service Registry.
Load the plug-in into Service Registry
You should now have your plug-in JAR file, whether you downloaded it or created
it from the Eclipse workspace.
- Open the Service Registry Web user interface by pointing your browser to
http://localhost:9080/ServiceRegistry.
- Select Perspective => Configuration, as shown in Figure 10.
Figure 10. Select Configuration perspective
- In the left navigation pane, select Active Configuration Profile =>
Plug-ins => JARs, then click Load JAR Plug-In in the right
pane, as shown in Figure 11.
Figure 11. Load JAR plug-in
- In the next dialog, browse to the plug-in file, specify a name, such as
WSRRLuceneplug-in, and click OK.
- When the Details pane displays, select Validation Properties in the
left pane.
- In the right pane, click Validation properties plug-in
(ValidationProperties), as shown in Figure 12.
Figure 12. Select validation properties
- In the right pane, select the Content tab, and look for the following
text:
validators=com.ibm.sr.api.SRTemplateValidator,com.ibm.sr.api.SRBusinessModelValidator
Append a comma and the following value (which is the class name of our new
integration validator):
com.ibm.luceneintegration.wsrr.FullTextValidator
- Click OK.
The validator is now in place and hooked to Service Registry. The next document
you process will contain the new attribute keywords,
allowing you to do full-text search in the XPath queries.
Test the solution
To make sure everything works as expected, do the following:
- Switch to the Service Registry Administrator perspective.
- In the left navigation pane, select Service Documents => Other
Documents, then click Load Documents in the right pane.
- Browse to a document you want to search (PDF, Excel or Word), enter a
namespace, description, and version for the document, and click OK, as
shown in Figure 13.
Figure 13. Import document
- Click Finish in the confirmation message.
- When the document is loaded, you should see a summary page with a link to the
document. Click this link to go to a document description page.
- On the document description page, click the Properties link. The
Properties dialog displays, as shown in Figure 14.
Figure 14. Document properties with the new keywords property
You should see the new keywords property, containing
the list of the most relevant tokens found in the document.
Now for the real fun! Try out a search
The Service Registry Web UI has a query wizard that walks you through creating an
XPath query, but it currently doesn't accept wildcards, which prevents us from
using it to test our new search feature. However, we can work around this
limitation by leveraging the Service Registry saved searches feature.
- In the Service Registry Web UI, select Queries => Query Wizard
in the left pane, then select All Entities and click Next in the
right pane, as shown in Figure 15.
Figure 15. Open the Query wizard
- In the next dialog, specify Dummy for the property name and Dummy for the
property value, then click Next, then Finish.
- The query will likely return no results, but you will be given the option to
save the query, as shown in Figure 16. In the Save this Search section,
specify a name for the search, such as mysearch, and click Save.
Figure 16. Save the query
- In the left pane, select MyService Registry => My Saved Searches
=> mysearch, as shown in Figure 17.
Figure 17. Select saved search
- In the Details dialog, select Properties under Additional
Properties and click queryExpression.
- Enter an XPath query, using the matches predicate to exploit the full-text
search capability, and click OK. For example, to search for the specific
IBM Redbook Patterns: SOA Design Using WebSphere Message Broker and WebSphere ESB
that we imported in the previous section, you could use the following XPath
query in the executeQuery property:
/WSRR/GenericDocument[matches(@keywords,".*soa.*")]
- Re-run the query by selecting MyService Registry => My Saved
Searches => mysearch and clicking Run Search, as shown in
Figure 18.
Figure 18. Re-run the saved query
You can use this process to test the integration plug-in using both ontology and
full-text queries, for example:
/WSRR/GenericDocument[classifiedByAnyOf(.,'<classification URI>') and matches(@keywords,'.*<my keyword>.*') ]
Summary
In this article, you learned how to create a Service Registry plug-in to extend
the search functionality. Specifically, you saw how you can use the open source
Apache Lucene project to add indexing features to Service Registry and enable you
to do full-text searches.
You also learned how to use the open source Spring Framework project to
externalize some structural aspects of the plug-in code, adding flexibility and
the ability to add and change features in a declarative way. Through the companion
Spring Modules project, you learned a simplified programming model to interact
with Lucene in an easier, Spring-enabled way.
You can use the techniques you've learned to add many other features, such as the
ability to handle languages other than English and determine those languages from
specific document properties or heuristics, or to integrate semantic searches
based on the Lucene query language and its ability to integrate with Wikipedia and
WordNet®. The file-based storage, which makes the indexes persistent, also allows
you to come back to them and perform other queries, even outside of Service
Registry.
Acknowledgments
The author would like to thank Tim Baldwin and the WebSphere Service Registry and
Repository development team for their help in developing and verifying the content
of this article.
Downloads
| Description | Name | Size | Download method |
|---|---|---|---|
| Plug-in archive file | wsrrplugin.zip | 9KB | HTTP |
| Dependencies archive file | WSRRFullTextDependencies.zip | 11.8MB | HTTP |
| Workspace archive file | workspace.zip | 17.5MB | HTTP |
Resources
Get products and technologies
- Lucene: Get complete information on Lucene, including downloads.
- Spring Framework: Get complete information on the Spring Framework, including downloads.
- Spring Modules: Get complete information on Spring Modules, including downloads.
- Eclipse: Get complete information on Eclipse, including downloads.
About the author
Claudio Morgia is an SOA Architect in IBM Software Group in Italy. He
works with several large Italian clients, providing architectural guidance for
their SOA journeys, especially in the BPM, service management and governance
areas. Claudio previously worked in the IBM Tivoli® Lab in Rome as a
software designer, focusing on J2EE-based solutions for the IBM WebSphere
platform. You can reach Claudio at claudio.morgia@it.ibm.com.