Semantic search in WebSphere Information Integrator OmniFind Edition: The case for semantic search

Part 1: Moving beyond keyword search

Thomas Hampp (thomas.hampp@de.ibm.com), Senior Software Engineer, IBM Germany
Thomas Hampp is the leading expert on Text Analysis in IBM Software Group with special focus on text analysis, text mining and language processing. He has been working in these areas for several years, both in IBM research and development. During his years in research, he developed and patented the first version of a text analysis framework that has now developed into UIMA. He is a member of the architecture boards for UIMA, WebSphere Information Integrator OmniFind, and Content Management. His current responsibility is defining and extending the integration architecture of UIMA into WebSphere Information Integrator OmniFind.
Alexander Lang (alexlang@de.ibm.com), Software Engineer, IBM Germany
Since joining IBM in 2000, Alexander Lang has worked on several IBM products in the content management, search, and text analysis space. For the current release, he developed the mapping of UIMA annotations to semantic search index structures, and led the team to support customer-defined synonyms for search terms. Besides his architecture and team lead responsibilities, Alexander works with customers and IBMers to identify the right mix of text analysis involved in custom semantic search solutions, and is investigating the use of taxonomies and ontologies in text search.

Summary:  Been searching for that customer phone number over and over again in your emails, but queries like "phone number" did not work? If so, semantic search would have made your life easier. Semantic search is a new search paradigm that overcomes the limitations of standard keyword search. This article describes the semantic search capabilities of IBM®'s enterprise search offering, WebSphere® Information Integrator OmniFind™ Edition Version 8.2.2, which integrates IBM's Unstructured Information Management Architecture (UIMA). It describes the key concepts involved, outlines a workflow for designing and implementing semantic search solutions, and introduces an upcoming tutorial for deploying semantic search solutions in OmniFind.

Date:  05 Aug 2005
Level:  Introductory


The need for semantic search

Employees produce an abundance of documents in their everyday work: customer emails, business reports, and product manuals, in all shapes and sizes, and stored on various content backends. To make sure that this information can be readily put to use, companies need to give their employees integrated access to it through text search.

This is where WebSphere Information Integrator OmniFind Edition comes into play. It is IBM's search offering, designed to support enterprise-scale search requirements over a wide variety of content sources. WebSphere Information Integrator OmniFind Edition provides extensive capabilities for searching diverse collections of business information from a single point of access, delivering highly relevant search results with sub-second response times while scaling to millions of documents and thousands of users.

For end users, this "single point of access" typically means typing one or more keywords into a search box and getting back results from various content backends. Keyword search is the search paradigm of the "big" Internet search engines. It is easy to learn and users are very familiar with it. However, it has some limitations:

  • I have to call the customer -- How can I find the e-mail with Alex’s phone number in it?
    He never uses the term "phone number," but writes things like "you can reach me at 555-641-1805."
  • I have to prepare a quality report -- How can I find the car repair reports that deal with brake problems in the north San Francisco area?
    The descriptions from the repair shop only talk about things like "shoe adjusted due to leakage in hydraulics." Moreover, they only contain the street address of the repair shop.
  • I want to research for a new drug -- How can I find documents that talk about a certain protein and any kind of disease in the same paragraph?
    There are twenty different names of the protein in the literature. Moreover, the documents may not contain the term "disease" at all, only the name of the disease itself.

All of the above examples share a common thread:

  • Keyword search can't search for information about higher-level concepts like people's names, telephone numbers, or car parts.
  • Keyword search can’t express relationships like "must occur in the same paragraph" or "the action caused the problem versus the action fixed the problem."

Semantic search is a new way of searching that addresses the problems we've just described. It allows the searcher to specify concepts and relationships within a search query; these are detected in the documents using text analysis. This article series shows you how to design and deploy a semantic search solution using the capabilities of WebSphere Information Integrator OmniFind Edition Version 8.2.2, and how such a solution enhances the effectiveness of text search.


Search and text analytics

Inside and outside IBM, a lot of research is being done on improving the keyword search paradigm. One key goal is to incorporate more knowledge about linguistics and about the domain the search solution is operating in. Technology that includes and applies this knowledge is called Text Analytics. Right from the first version, WebSphere Information Integrator OmniFind Edition included linguistic knowledge for the following tasks:

  • Determine the language of a document in order to narrow down search results to a particular language.
  • Determine the base form of a word in order to find a document containing mice when searching for mouse.
  • Determine sentences and paragraphs to provide a meaningful document summary in the search result.

WebSphere Information Integrator OmniFind Edition Version 8.2.2 significantly expands these capabilities in the following ways:

  • Run text analytics technology within WebSphere Information Integrator OmniFind Edition that extracts important domain concepts (like people's names), and relationships (like “this person works in that department”).
  • Provide semantic search capabilities that allow specifying these concepts and relationships in a search query.

The following diagram puts semantic search into the context of different search paradigms.


Figure 1. Search paradigms

Typically, the above technologies build on each other:

  • Semantic search builds on elements and operators (such as + and -) of keyword search, and adds the capability to search for concepts and relationships. To do that it introduces additional elements; for example, an XPath-style query syntax. Typically, a search application hides this query syntax from an end user.
  • Natural Language Search tries to find documents based on an end user’s question. This avoids the complexity of the semantic search query syntax, but may misinterpret the user question. It is often implemented by doing text analysis on the query and automatically translating a natural language query like "Alex phone number" into a semantic query like "Alex <phone_number>". This is done, for example, by detecting concepts in the user query, and mapping them to concepts found in the index.
  • Question Answering returns the part of a document that includes the relevant answer. It may also consolidate different facts, coming from different documents, into a single answer. In the example above, a question answering system would return the actual phone number, not just the document that contains it. Question Answering is often implemented by taking the results of Natural Language Search -- which still returns a regular set of documents -- and doing post processing on the result set of documents to extract the actual answer from the documents in the result set.
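
As a rough illustration of the natural language translation step described above, the following Python sketch rewrites a user query such as "Alex phone number" into "Alex <phone_number>". It is not how OmniFind implements natural language search. The phrase table `CONCEPT_PHRASES` and the function `to_semantic_query` are hypothetical names invented for this sketch, and a real system would detect concepts with text analysis rather than a fixed lookup.

```python
import re

# Hypothetical lookup table: phrases a user might type, mapped to the
# names of concepts assumed to exist in the semantic search index.
CONCEPT_PHRASES = {
    "phone number": "phone_number",
    "email address": "email_address",
}


def to_semantic_query(user_query, concept_phrases=CONCEPT_PHRASES):
    """Replace known concept phrases in the query with <concept> markers."""
    result = user_query
    # Try longer phrases first so that a longer phrase is not pre-empted
    # by a shorter phrase contained in it.
    for phrase in sorted(concept_phrases, key=len, reverse=True):
        pattern = r"\b" + re.escape(phrase) + r"\b"
        marker = "<" + concept_phrases[phrase] + ">"
        result = re.sub(pattern, marker, result, flags=re.IGNORECASE)
    return result


print(to_semantic_query("Alex phone number"))
```

Words that do not match a known phrase pass through unchanged and remain ordinary keywords. This also shows where the approach can misinterpret a question: a phrase that merely looks like a concept name is rewritten as well.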

WebSphere Information Integrator OmniFind Edition is an enterprise search engine combining extensible text analytics with semantic search capabilities. Therefore, it can serve as a platform for even more advanced search solutions that involve natural language search or question answering.

Semantic search syntax

Of course, text analytics alone does not provide semantic search. IBM's Search and Indexing API (SIAPI) supports two new kinds of search syntax: XML fragments and a subset of XPath. Subsequent articles in this series will go into the details of the query syntax. Basically, both XML fragment and XPath queries let you specify concepts, attributes of these concepts, and relationships between concepts within a single query. As seen in the first example below, a search query can contain both keywords and semantic search terms. Each semantic search term is marked as a so-called opaque term: @xmlf2 is used for fragment queries, and @xmlp is used for XPath queries. To give you a feeling for what such queries look like, here are some examples:

  1. "Documents that contain Alex's phone number"
    XPath query:
    Alex @xmlp::'phonenumber'
    XML Fragment query:
    Alex @xmlf2::'<phonenumber/>'
  2. "Reports about brake problems in the north San Francisco area"
    XPath query:
    @xmlp::'brakeProblem' @xmlp::'location[@city="San Francisco" and @orientation="north"]'
    XML Fragment query:
    @xmlf2::'<brakeProblem/> <location city="San Francisco" orientation="north"/>'
  3. "Articles that talk about a certain protein and a disease in the same paragraph"
    XPath query:
    @xmlp::'paragraph[protein ftcontains ("BIKE") and disease]'
    XML Fragment query:
    @xmlf2::'<paragraph><protein>BIKE</protein><disease/></paragraph>'

We do not expect most search end users to type in these queries themselves. A search application will most likely hide the query complexity behind UI elements like fields or drop down lists. This is similar to database applications today - not many people type in SQL directly, but just enter the values they are interested in using a (structured) search interface.
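
To illustrate how a search application might hide this syntax behind form fields, here is a minimal Python sketch that assembles query strings in the shape of the examples above. The helper names (`xml_fragment_term`, `xpath_term`, `build_query`) are hypothetical and are not part of SIAPI. The sketch only concatenates strings and assumes that concept names and attribute values contain no quote characters that would need escaping.

```python
def xml_fragment_term(concept, attributes=None, text=None):
    """Build an @xmlf2 opaque term for a single concept (hypothetical helper)."""
    attrs = "".join(f' {name}="{value}"' for name, value in (attributes or {}).items())
    if text is None:
        fragment = f"<{concept}{attrs}/>"
    else:
        fragment = f"<{concept}{attrs}>{text}</{concept}>"
    return f"@xmlf2::'{fragment}'"


def xpath_term(concept, attributes=None):
    """Build an @xmlp opaque term for a single concept (hypothetical helper)."""
    if attributes:
        conditions = " and ".join(f'@{name}="{value}"' for name, value in attributes.items())
        return f"@xmlp::'{concept}[{conditions}]'"
    return f"@xmlp::'{concept}'"


def build_query(keywords, terms):
    """Combine plain keywords and opaque semantic terms into one query string."""
    return " ".join(list(keywords) + list(terms))


# Values as they might arrive from a search form: a name typed into a text
# box and a concept selected with a checkbox.
print(build_query(["Alex"], [xml_fragment_term("phonenumber")]))
print(xpath_term("location", {"city": "San Francisco", "orientation": "north"}))
```

The end user only supplies the keyword and picks the concept; the application produces the opaque term and passes the complete string to the search engine.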

The following screenshot shows an example of a simple semantic search application for insurance accident reports. Here, text analysis is used to extract concepts like cars and their make, license plates, and people involved in car accidents. The user can now search for specific concepts, like accident reports that contain a person named "Lang," or use the checkboxes next to the concepts for queries like: "Reports that included males above 32 years of age, and a car with make 'Audi.'"


Figure 2. A simple semantic search application

WebSphere Information Integrator OmniFind Edition and the Unstructured Information Management Architecture (UIMA)

Text analytics plays a crucial role when stepping beyond keyword search. Like the search paradigms, text analysis technologies typically build on each other. For example, a text analysis step that generates a document summary will improve its results if words and sentences in the document have been identified before.

IBM has developed an architecture, the Unstructured Information Management Architecture (UIMA), that allows you to combine different text analysis steps, developed by different teams or companies. This way, the appropriate steps can be "plugged together" into advanced analysis capabilities that detect and locate information of interest in document collections.

The analysis logic component developed using UIMA is called an annotator. Each annotator performs a specific linguistic analysis task (for example, detecting cars and their make). The result of the analysis is modeled in so-called annotations. An analysis engine is the container for any number of annotators, which typically build on each other’s annotations.


Figure 3. Text analysis with UIMA annotators

The example above shows some of the processing involved in detecting brake problems in a car repair report. The base processing (not shown here) identifies the individual words of the sentence, which appear in the green boxes at the bottom. Based on this processing, three annotators contribute different information. They do this by adding annotations (like "Brake Part"), which cover certain areas of the input text. As annotators run in sequence, they build on one another's output:

  1. The Part Annotator just identifies brake parts, maybe with a simple lookup in a parts list.
  2. The Problem Annotator may also have a list for typical problem indicators and defects. Additionally, it can identify brake problems by checking whether a Brake Part annotation occurs next to a Problem Indicator annotation. It then creates a Brake Problem annotation in the CAS (the Common Analysis Structure, the UIMA data structure that holds the document and its annotations) to record this fact. Note that the detection rule involved here can be very generic, as it doesn't need to be concerned with the actual terms like "shoe" or "malfunction."
  3. Finally, the Problem-cause Annotator looks at the Brake Problem and Defect annotations, and uses the "due to" construct to identify a Problem-Cause relationship. This relationship is also modeled as an annotation, which has two attributes referencing the Brake Problem and Defect annotations.
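
The three steps above can be sketched in plain Python. This is an illustration of the layering idea only and does not use the UIMA SDK or its API: the `Annotation` class, the word lists, and the functions are all assumptions made for this sketch, tokens are produced by a simple whitespace split, and "next to" is simplified to "the part is immediately followed by the indicator".

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    type: str
    begin: int  # index of the first token covered
    end: int    # index one past the last token covered
    features: dict = field(default_factory=dict)


# Hypothetical word lists; real annotators would use domain dictionaries.
BRAKE_PARTS = {"shoe", "pad", "caliper"}
PROBLEM_INDICATORS = {"adjusted", "replaced", "malfunction"}
DEFECTS = {"leakage", "crack", "wear"}


def part_annotator(tokens, annotations):
    """Step 1: mark brake parts with a simple list lookup."""
    for i, token in enumerate(tokens):
        if token.lower() in BRAKE_PARTS:
            annotations.append(Annotation("BrakePart", i, i + 1))


def problem_annotator(tokens, annotations):
    """Step 2: mark indicators and defects, then combine them with parts."""
    for i, token in enumerate(tokens):
        if token.lower() in PROBLEM_INDICATORS:
            annotations.append(Annotation("ProblemIndicator", i, i + 1))
        elif token.lower() in DEFECTS:
            annotations.append(Annotation("Defect", i, i + 1))
    parts = [a for a in annotations if a.type == "BrakePart"]
    indicators = [a for a in annotations if a.type == "ProblemIndicator"]
    # The rule works on annotation types only, never on the words themselves.
    for part in parts:
        for indicator in indicators:
            if part.end == indicator.begin:
                annotations.append(Annotation("BrakeProblem", part.begin, indicator.end))


def problem_cause_annotator(tokens, annotations):
    """Step 3: link a brake problem to a defect joined by 'due to'."""
    problems = [a for a in annotations if a.type == "BrakeProblem"]
    defects = [a for a in annotations if a.type == "Defect"]
    for problem in problems:
        for defect in defects:
            between = [t.lower() for t in tokens[problem.end:defect.begin]]
            if between == ["due", "to"]:
                annotations.append(Annotation(
                    "ProblemCause", problem.begin, defect.end,
                    {"problem": problem, "cause": defect}))


def analyze(text):
    """Run the annotators in sequence over a shared annotation list."""
    tokens = text.split()
    annotations = []
    for annotator in (part_annotator, problem_annotator, problem_cause_annotator):
        annotator(tokens, annotations)
    return tokens, annotations


tokens, annotations = analyze("shoe adjusted due to leakage in hydraulics")
for a in annotations:
    print(a.type, tokens[a.begin:a.end])
```

The shared `annotations` list plays the role that the CAS plays in UIMA: each annotator reads what earlier annotators added and appends its own results.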

WebSphere Information Integrator OmniFind Edition incorporates and exposes the UIMA framework. It comes with a base stack of annotators for language identification, word segmentation, and categorization, which can be extended by third-party annotators to perform deeper analysis for specific languages and domains. Such annotators can be developed using the UIMA Software Development Kit (SDK). The UIMA SDK allows you to build and test your own annotators, which are then integrated into the WebSphere Information Integrator OmniFind Edition environment.

The following diagram shows how UIMA integrates WebSphere Information Integrator OmniFind Edition base annotators, third party annotators and semantic search indexing to build a semantic search solution:


Figure 4. OmniFind components involved in semantic search

In this diagram, documents are crawled by WebSphere Information Integrator OmniFind Edition from several sources, shown on the far left. The natural language content is extracted from the different document formats (like PDF, Microsoft Office™, XML, HTML, and so on) and sent to UIMA annotators for analysis. First, the sequence of WebSphere Information Integrator OmniFind Edition base annotators runs on each document. In the example, custom annotators build on the base result by identifying named entities (like person or product names) and their relationships (for example, "shoe" is a "brake part"). The result of the analysis is a document enriched with various annotations. This document is then stored in the search index. A normal keyword search index only holds information about the words in the document. The WebSphere Information Integrator OmniFind Edition semantic search index additionally contains information about the extracted concepts, which can then be used in semantic search. Hence, in the example above, users could now type in @xmlp::'brakePart' and get back documents that mention a brake shoe.
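
The difference between a keyword index and a semantic index can be illustrated with a toy Python sketch. It does not reflect OmniFind's actual index structures: the `SemanticIndex` class and its methods are assumptions for illustration, and the concept names are passed in directly where a real system would take them from annotator output.

```python
from collections import defaultdict


class SemanticIndex:
    """Toy index holding both words and extracted concepts per document."""

    def __init__(self):
        self.words = defaultdict(set)     # word -> ids of documents containing it
        self.concepts = defaultdict(set)  # concept name -> ids of annotated documents

    def add(self, doc_id, text, concepts):
        # A plain keyword index would stop after this loop.
        for word in text.lower().split():
            self.words[word].add(doc_id)
        # The semantic part: record which concepts were detected in the document.
        for concept in concepts:
            self.concepts[concept].add(doc_id)

    def search(self, keywords=(), concepts=()):
        """Return ids of documents matching all keywords and all concepts."""
        postings = [self.words.get(k.lower(), set()) for k in keywords]
        postings += [self.concepts.get(c, set()) for c in concepts]
        if not postings:
            return set()
        return set.intersection(*postings)


index = SemanticIndex()
index.add("report-1", "shoe adjusted due to leakage in hydraulics",
          ["brakePart", "brakeProblem"])
index.add("report-2", "oil change and tire rotation", [])

print(index.search(keywords=["brake"]))      # the word never occurs in the reports
print(index.search(concepts=["brakePart"]))  # found through the concept instead
```

The keyword lookup for "brake" finds nothing because neither report contains that word, while the concept lookup finds the first report because an annotator labeled "shoe" as a brake part.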


Building and deploying a semantic search solution

A semantic search solution is typically tailored to some domain or industry to provide the most benefit. Some examples of solution areas are:

  • Manufacturing
    • Automotive Quality Early Warning
    • Customer Support and Self Service
    • Media Monitoring and Competitor Analysis
  • Financial Services
    • Anti-Money Laundering
    • Insurance Fraud Analysis
    • Broker Self Service
  • Government
    • Intelligence for Anti-Terrorism and Law Enforcement
    • Case Management for Law Enforcement and Social Services

Many of these solution areas rely on domain-specific annotators and application code that is not part of WebSphere Information Integrator OmniFind Edition. For example, a solution for Intelligence for Law Enforcement could combine annotators that spot criminal events and the persons, places, and times involved. The results can be made available through semantic search, perhaps combined with statistical evaluations that compute trends and associations in events. We do not expect most customers to write such annotators or applications. UIMA is an open platform that fosters an ecosystem of industry and academic partners that can provide annotators and ready-to-use solutions for the above domains.

Consequently, the focus of this article series is not on the development of text analysis steps. This is covered in depth, together with many samples, in the UIMA SDK documentation. Our focus will be the steps involved in developing a semantic search solution within WebSphere Information Integrator OmniFind Edition. The next article in this series will be a tutorial showing the end-to-end steps involved in deploying such a solution. Subsequent parts will cover the use of pre-structured information like XML for semantic search, and go into the details of our semantic search capabilities. We will build these parts around everyday tasks in a police station, where police officers need to access police reports quickly and easily. Each part of the tutorial will discuss the steps involved in building such a solution, which are outlined below:

Plan and design

  1. Understand the users' search needs.
    What are the concepts and relationships needed in a particular search task? For example, product and employee names may be needed to enhance "general purpose" search on a pharma company's internal website, while people in Research & Development need to be able to use variants of drug names and "drug-causes-cure" relationships.
  2. Understand the document set.
    What kinds of documents are users dealing with? Do the documents already contain a structure that could be exploited in the search (for example, content within XML tags)?
  3. Clarify the technology options.
    What kind of text analysis is needed to extract the kind of information from document collections to be searched?
  4. Map text analysis results to semantic search.
    Using the information gathered in the steps above, determine which text analysis results should be accessible using semantic search.
  5. Design the semantic search application.
    How should the search user interact with the additional capabilities of semantic search? What should the user interface look like?

Buy or build text analysis technology

  1. Use the UIMA SDK to develop annotators for each analysis step. Alternatively, obtain text analysis technology from IBM or other companies. Embed the annotators in analysis engines and perform tests.

Deploy

  1. Deploy your text analysis components as an analysis engine archive within WebSphere Information Integrator OmniFind Edition.
  2. Associate one or more document collections with your analysis engine.
  3. For each collection, map the text analysis results to the semantic search index.
  4. If required, set up your custom semantic search application, for example, by deploying your browser-based search user interface into an application server.

Run

  1. Crawl, parse, and index your documents in your semantic search collection, just like in any keyword based collection.
  2. Start the search and exploit the semantic search capabilities.

Expand and maintain

  1. Deploy enhanced text analysis components that detect additional concepts.
  2. Adapt your search application to make use of the additional information.

The sample scenario in the tutorial (see Resources) is enhanced iteratively. Each part focuses on a specific aspect of the above workflow, and will make the police officer’s work easier by enhancing the search capabilities of the solution. So, even if you have just got a speeding ticket, we hope you will join us in looking beyond keyword search!


Resources
