Every day, employees produce an abundance of documents in their work: customer emails, business reports, product manuals, in all shapes and sizes, and stored on various content backends. To make sure that this information can be readily put to use, companies need to give their employees integrated access to it through text search.
This is where WebSphere Information Integrator OmniFind Edition comes into play: IBM's search offering that is designed to support enterprise scale search requirements over a wide variety of content sources. WebSphere Information Integrator OmniFind Edition provides extensive capabilities for searching diverse collections of business information from a single point of access, delivering highly relevant search results within sub-second response time while scaling to millions of documents and thousands of users.
For end users, this "single point of access" typically means typing one or more keywords into a search box and getting back results from various content backends. Keyword search is the search paradigm of the "big" Internet search engines. It is easy to learn and users are very familiar with it. However, it has some limitations:
- I have to call the customer -- How can I find the e-mail with Alex’s phone number in it?
He never uses the term "phone number," but writes things like "you can reach me at 555-641-1805."
- I have to prepare a quality report -- How can I find the car repair reports that deal with brake problems in the north San Francisco area?
The descriptions from the repair shop only talk about things like "shoe adjusted due to leakage in hydraulics." Moreover, they only contain the street address of the repair shop.
- I want to do research for a new drug -- How can I find documents that talk about a certain protein and any kind of disease in the same paragraph?
There are twenty different names of the protein in the literature. Moreover, the documents may not contain the term "disease" at all, only the name of the disease itself.
All of the above examples share a common thread:
- Keyword search can't search for information about higher-level concepts like people's names, telephone numbers, or car parts.
- Keyword search can’t express relationships like "must occur in the same paragraph" or "the action caused the problem versus the action fixed the problem."
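The first limitation can be made concrete with a small Python sketch. It is only an illustration, not OmniFind code: `keyword_match`, `find_phone_numbers`, and the phone number pattern are hypothetical names and assumptions introduced here, and the sample e-mail is built from the example above.

```python
import re

# Assumed pattern for numbers written like 555-641-1805 (illustration only).
PHONE_PATTERN = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")

def keyword_match(text, query):
    """Plain keyword search: every query word must occur in the text."""
    words = set(re.findall(r"[a-z0-9]+", text.lower()))
    return all(term in words for term in query.lower().split())

def find_phone_numbers(text):
    """Concept detection: return every span that looks like a phone number."""
    return PHONE_PATTERN.findall(text)

email = "Hi, you can reach me at 555-641-1805. Regards, Alex"

print(keyword_match(email, "Alex phone number"))  # False: the words never occur
print(find_phone_numbers(email))                  # ['555-641-1805']
```

The keyword query fails because the e-mail never uses the words "phone" or "number," while a detector that knows what a phone number looks like finds it regardless of the surrounding wording.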
Semantic search is a new way of searching that addresses the problems we've just described. It allows the searcher to specify concepts and relationships within a search query, which are detected using text analysis. This article series shows you how to design and deploy a semantic search solution using the capabilities of WebSphere Information Integrator OmniFind Edition Version 8.2.2, and how such a solution enhances the effectiveness of text search.
Inside and outside IBM, a lot of research is being done on improving the keyword search paradigm. One key goal is to incorporate more knowledge about linguistics and about the domain the search solution is operating in. Technology that includes and applies this knowledge is called Text Analytics. Right from the first version, WebSphere Information Integrator OmniFind Edition included linguistic knowledge for the following tasks:
- Determine the language of a document in order to narrow down search results to a particular language.
- Determine the base form of a word in order to find a document containing mice when searching for mouse.
- Determine sentences and paragraphs to provide a meaningful document summary in the search result.
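As a rough illustration of the base form task, the following Python sketch normalizes words before matching. It assumes a tiny hand-written lookup table (`BASE_FORMS`, a hypothetical name); a real product relies on full linguistic dictionaries, not on a table like this.

```python
import re

# Tiny illustrative lookup table; an assumption made for this sketch only.
BASE_FORMS = {"mice": "mouse", "geese": "goose", "brakes": "brake"}

def base_form(word):
    """Map a word to its base form, falling back to the lowercased word."""
    word = word.lower()
    return BASE_FORMS.get(word, word)

def index_terms(text):
    """Return the set of base forms occurring in a document."""
    return {base_form(w) for w in re.findall(r"[A-Za-z]+", text)}

def matches(text, query_word):
    """A document matches if it contains the base form of the query word."""
    return base_form(query_word) in index_terms(text)

print(matches("Two mice were found in the lab", "mouse"))  # True
```

Because both the document and the query are reduced to base forms, a search for "mouse" finds a document that only mentions "mice."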
WebSphere Information Integrator OmniFind Edition Version 8.2.2 significantly expands these capabilities in the following ways:
- Run text analytics technology within WebSphere Information Integrator OmniFind Edition that extracts important domain concepts (like people's names), and relationships (like “this person works in that department”).
- Provide semantic search capabilities that allow specifying these concepts and relationships in a search query.
The following diagram puts semantic search into the context of different search paradigms.
Figure 1. Search paradigms
Typically, the above technologies build on each other:
- Semantic search builds on elements and operators (such as + and -) of keyword search, and adds the capability to search for concepts and relationships. To do that it introduces additional elements; for example, an XPath-style query syntax. Typically, a search application hides this query syntax from an end user.
- Natural Language Search tries to find documents based on an end user’s question. This avoids the complexity of the semantic search query syntax, but may misinterpret the user question. It is often implemented by doing text analysis on the query and automatically translating a natural language query like "Alex phone number" into a semantic query like "Alex <phone_number>". This is done, for example, by detecting concepts in the user query, and mapping them to concepts found in the index.
- Question Answering returns the part of a document that includes the relevant answer. It may also consolidate different facts, coming from different documents, into a single answer. In the example above, a question answering system would return the actual phone number, not just the document that contains it. Question Answering is often implemented by taking the results of Natural Language Search -- which still returns a regular set of documents -- and post-processing that result set to extract the actual answer.
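The translation step described for Natural Language Search can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: `CONCEPT_PHRASES` is a hypothetical mapping from phrases a user might type to concept names in the index, and the output uses the informal "Alex <phone_number>" notation from the text above, not an actual query syntax.

```python
import re

# Hypothetical mapping from user phrases to concept names (assumption).
CONCEPT_PHRASES = {
    "phone number": "phone_number",
    "telephone number": "phone_number",
}

def to_semantic_query(user_query):
    """Replace known concept phrases in a user query with concept markers."""
    query = user_query
    # Try longer phrases first so that the most specific phrase wins.
    for phrase in sorted(CONCEPT_PHRASES, key=len, reverse=True):
        pattern = r"\b" + re.escape(phrase) + r"\b"
        marker = "<" + CONCEPT_PHRASES[phrase] + ">"
        query = re.sub(pattern, marker, query, flags=re.IGNORECASE)
    return query

print(to_semantic_query("Alex phone number"))  # Alex <phone_number>
```

A mapping like this also shows where misinterpretation can creep in: a phrase that happens to match a concept name is rewritten even when the user meant the literal words.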
WebSphere Information Integrator OmniFind Edition is an enterprise search engine combining extensible text analytics with semantic search capabilities. Therefore, it can serve as a platform for even more advanced search solutions that involve natural language search or question answering.
Of course, text analytics alone does not provide semantic search. IBM's Search and Indexing API (SIAPI) provides support for two new kinds of search syntax: XML fragments and a subset of XPath. Subsequent articles in this series will go into the details of the query syntax. Basically, both XML fragments and XPath queries let you specify concepts, attributes of these concepts, and relationships between concepts within a single query. As seen in the first example below, the search query can contain both keywords and semantic search terms. The semantic search terms are marked as a so-called opaque term: @xmlf2 is used for fragment queries, and @xmlp is used for XPath queries. To give you a feeling for what such queries look like, here are some examples:
- "Documents that contain Alex's phone number"
XPath query: Alex @xmlp::'phonenumber'
- "Reports about brake problems in the north San Francisco area"
XPath query: @xmlp::'brakeProblem' @xmlp::'location[@city="San Francisco" and @orientation="north"]'
XML Fragment query: @xmlf2::'<brakeProblem/> <location city="San Francisco" orientation="north"/>'
- "Articles that talk about a certain protein and a disease in the same paragraph"
XPath query: @xmlp::'paragraph[protein ftcontains ("BIKE") and disease]'
We do not expect most search end users to type in these queries themselves. A search application will most likely hide the query complexity behind UI elements like fields or drop-down lists. This is similar to database applications today: not many people type in SQL directly; they just enter the values they are interested in using a (structured) search interface.
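To illustrate how an application might hide the query syntax, here is a minimal Python sketch that assembles query strings from form values. It only does string formatting that mirrors the XPath-style examples in this article; `xpath_term` and `build_query` are hypothetical helpers, not part of SIAPI, and the sketch does not escape quotes inside values.

```python
def xpath_term(concept, **attributes):
    """Format one opaque @xmlp term for a concept and optional attributes."""
    if not attributes:
        return "@xmlp::'%s'" % concept
    conditions = " and ".join(
        '@%s="%s"' % (name, value) for name, value in attributes.items()
    )
    return "@xmlp::'%s[%s]'" % (concept, conditions)

def build_query(keywords, concept_terms):
    """Combine plain keywords and semantic terms into one query string."""
    return " ".join(list(keywords) + list(concept_terms))

# Values as they might arrive from a search form (illustrative only).
query = build_query(
    [],
    [
        xpath_term("brakeProblem"),
        xpath_term("location", city="San Francisco", orientation="north"),
    ],
)
print(query)
```

The end user only picks a city and an orientation from the form; the application produces the opaque terms and would hand the resulting string to the search API.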
The following screenshot is an example of a simple semantic search application on insurance accident reports. Here, text analysis is used to extract concepts like cars and their make, license plates, and people involved in car accidents. The user can now search for specific concepts, like accident reports that contain a person named "Lang," or use the checkboxes next to the concepts for queries like: "Reports that included males above 32 years of age, and a car with make 'Audi.'"
Figure 2. A simple semantic search application
Text analytics plays a crucial role when stepping beyond keyword search. Like the search paradigms, text analysis technologies typically build on each other. For example, a text analysis step that generates a document summary will improve its results if words and sentences in the document have been identified before.
IBM has developed an architecture, the Unstructured Information Management Architecture (UIMA), that allows you to combine different text analysis steps, developed by different teams or companies. This way, the appropriate steps can be "plugged together" into advanced analysis capabilities that detect and locate information of interest in document collections.
An analysis logic component developed using UIMA is called an annotator. Each annotator performs specific linguistic analysis tasks (for example, detecting cars and their make). The result of the analysis is modeled in so-called annotations. An analysis engine is the container for any number of annotators, which typically build on each other’s annotations.
Figure 3. Text analysis with UIMA annotators
The example above shows some of the processing involved in detecting brake problems in a car repair report. The base processing (not shown here) identifies the individual words of the sentence, shown in the green boxes at the bottom. Based on this processing, three annotators contribute different information. They do this by adding annotations (like "Brake Part"), which cover certain areas of the input text. As annotators run in sequence, they build on one another's output:
- The Part Annotator just identifies brake parts, maybe with a simple lookup in a parts list.
- The Problem Annotator may also have a list for typical problem indicators and defects. Additionally, it can identify brake problems by checking whether a Brake Part annotation occurs next to a Problem Indicator annotation. It then creates a Brake Problem annotation in the CAS (the Common Analysis Structure, in which UIMA holds the document and its annotations) to record this fact. Note that the detection rule involved here can be very generic, as it doesn't need to be concerned with the actual terms like "shoe" or "malfunction."
- Finally, the Problem-cause Annotator looks at the Brake Problem and Defect annotations, and uses the "due to" construct to identify a Problem-Cause relationship. This relationship is also modeled as an annotation, which has two attributes referencing the Brake Problem and Defect annotations.
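The way the three annotators build on one another can be sketched in plain Python. This is not the UIMA API: the `Annotation` class, the word lists, the annotation type names, and the sample sentence are all assumptions made for illustration, and a simple list stands in for the shared analysis structure.

```python
from dataclasses import dataclass, field

@dataclass
class Annotation:
    type: str
    begin: int  # index of the first token covered
    end: int    # index one past the last token covered
    features: dict = field(default_factory=dict)

# Illustrative word lists (assumptions, not taken from a real parts catalog).
BRAKE_PARTS = {"shoe", "pad", "disc"}
PROBLEM_INDICATORS = {"malfunction", "failure"}
DEFECTS = {"leakage", "crack"}

def part_annotator(tokens, annotations):
    """Mark brake parts with a simple list lookup."""
    for i, token in enumerate(tokens):
        if token in BRAKE_PARTS:
            annotations.append(Annotation("BrakePart", i, i + 1))

def problem_annotator(tokens, annotations):
    """Mark indicators and defects, then combine adjacent annotations."""
    for i, token in enumerate(tokens):
        if token in PROBLEM_INDICATORS:
            annotations.append(Annotation("ProblemIndicator", i, i + 1))
        elif token in DEFECTS:
            annotations.append(Annotation("Defect", i, i + 1))
    parts = [a for a in annotations if a.type == "BrakePart"]
    indicators = [a for a in annotations if a.type == "ProblemIndicator"]
    for part in parts:
        for indicator in indicators:
            # Generic rule: a part directly followed by a problem indicator.
            if part.end == indicator.begin:
                annotations.append(
                    Annotation("BrakeProblem", part.begin, indicator.end))

def problem_cause_annotator(tokens, annotations):
    """Link a brake problem to a defect connected by 'due to'."""
    problems = [a for a in annotations if a.type == "BrakeProblem"]
    defects = [a for a in annotations if a.type == "Defect"]
    for problem in problems:
        for defect in defects:
            if tokens[problem.end:defect.begin] == ["due", "to"]:
                annotations.append(Annotation(
                    "ProblemCause", problem.begin, defect.end,
                    {"problem": problem, "cause": defect}))

def analyze(text):
    """Run the annotators in sequence over whitespace-separated tokens."""
    tokens = text.lower().split()
    annotations = []
    for annotator in (part_annotator, problem_annotator,
                      problem_cause_annotator):
        annotator(tokens, annotations)
    return tokens, annotations

tokens, annotations = analyze("shoe malfunction due to leakage")
print([a.type for a in annotations])
# ['BrakePart', 'ProblemIndicator', 'Defect', 'BrakeProblem', 'ProblemCause']
```

Note that only the first annotator looks at concrete words like "shoe." The later rules work purely on annotation types, which is what keeps them generic.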
WebSphere Information Integrator OmniFind Edition incorporates and exposes the UIMA framework. It comes with a base stack of annotators for language identification, word segmentation, and categorization, which can be extended by third-party annotators to perform deeper analysis for specific languages and domains. Such annotators can be developed using the UIMA Software Development Kit (SDK). The UIMA SDK allows you to build and test your own annotators, which are then integrated into the WebSphere Information Integrator OmniFind Edition environment.
The following diagram shows how UIMA integrates WebSphere Information Integrator OmniFind Edition base annotators, third party annotators and semantic search indexing to build a semantic search solution:
Figure 4. OmniFind components involved in semantic search
In this diagram, documents are crawled by WebSphere Information Integrator OmniFind Edition from several sources shown at the very left. The natural language content is extracted from the different document formats (like PDF, Microsoft Office™, XML, HTML, and so on) and sent to UIMA annotators for analysis. First, the sequence of WebSphere Information Integrator OmniFind Edition base annotators runs on each document. In the example, custom annotators build on the base result by identifying named entities (like person or product names) and their relationships (for example, "shoe" is a "brake part"). The result of the analysis is a document enriched with various annotations. This document is then stored in the search index. A normal keyword search index only holds information about the words in the document. The WebSphere Information Integrator OmniFind Edition semantic search index additionally contains information about the extracted concepts which can then be used in semantic search. Hence, in the example above, users could now type in @xmlp::'brakePart' and get documents back talking about some brake shoe.
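The difference between a keyword index and a semantic search index can be pictured with a toy in-memory index in Python. This is a conceptual sketch only, not how OmniFind stores its index: `SemanticIndex` is a hypothetical class, and the concept names are assumed to come from an earlier annotation step.

```python
from collections import defaultdict

class SemanticIndex:
    """Toy index holding both word postings and concept postings."""

    def __init__(self):
        self.words = defaultdict(set)     # word -> ids of documents
        self.concepts = defaultdict(set)  # concept name -> ids of documents

    def add(self, doc_id, text, concepts):
        """Index the words of a document plus the concepts found in it."""
        for word in text.lower().split():
            self.words[word].add(doc_id)
        for concept in concepts:
            self.concepts[concept].add(doc_id)

    def search(self, keywords=(), concepts=()):
        """Return ids of documents matching all keywords and all concepts."""
        postings = [self.words.get(k.lower(), set()) for k in keywords]
        postings += [self.concepts.get(c, set()) for c in concepts]
        if not postings:
            return set()
        return set.intersection(*postings)

index = SemanticIndex()
# The concept lists stand in for the output of the annotators.
index.add("report-1", "shoe adjusted due to leakage in hydraulics",
          ["brakePart", "brakeProblem"])
index.add("report-2", "oil change and tire rotation", [])

print(index.search(keywords=["brake"]))      # set(): the word never occurs
print(index.search(concepts=["brakePart"]))  # {'report-1'}
```

The keyword "brake" finds nothing because the report never uses that word, while the concept lookup returns the report about the brake shoe.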
A semantic search solution is typically tailored to some domain or industry to provide the most benefit. Some examples of solution areas are:
- Automotive Quality Early Warning
- Customer Support and Self Service
- Media Monitoring and Competitor Analysis
- Financial Services
- Anti-Money Laundering
- Insurance Fraud Analysis
- Broker Self Service
- Intelligence for Anti-Terrorism and Law Enforcement
- Case Management for Law Enforcement and Social Services
Many of these solution areas rely on domain-specific annotators and application code that is not part of WebSphere Information Integrator OmniFind Edition. For example, a solution for Intelligence for Law Enforcement could combine annotators that spot criminal events and the persons, places, and times involved. The results can be made available through semantic search, perhaps combined with statistical evaluations that compute trends and associations in events. We do not expect most customers to write such annotators or applications. UIMA is an open platform that fosters an ecosystem of industry and academic partners who can provide annotators and ready-to-use solutions for the above domains.
Consequently, the focus of this article series is not on the development of text analysis steps. This is covered in depth, together with many samples, in the UIMA SDK documentation. Our focus will be the steps involved in developing a semantic search solution within WebSphere Information Integrator OmniFind Edition. The next article in this series will be a tutorial showing the end-to-end steps involved in deploying a semantic search solution within WebSphere Information Integrator OmniFind Edition. Subsequent parts will cover the use of pre-structured information like XML for semantic search, and go into the details of our semantic search capabilities. We will build these parts around everyday tasks in a police station, where police officers need to access police reports quickly and easily. Each part of the tutorial will discuss the steps involved in building such a solution, which are outlined below:
- Understand the users' search needs.
What are the concepts and relationships needed in a particular search task? For example, product and employee names may be needed to enhance "general purpose" search on a pharma company's internal website, while people in Research & Development need to be able to use variants of drug names and "drug-causes-cure" relationships.
- Understand the document set.
What kinds of documents are users dealing with? Do the documents already contain a structure that could be exploited in the search (for example, content within XML tags)?
- Clarify the technology options.
What kind of text analysis is needed to extract the kind of information from document collections to be searched?
- Map text analysis results to semantic search.
Using the information gathered in the steps above, determine which text analysis results should be accessible using semantic search.
- Design the semantic search application.
How should the search user interact with the additional capabilities of semantic search? What should the user interface look like?
- Use the UIMA SDK to develop annotators for each analysis step. Alternatively, obtain text analysis technology from IBM or other companies. Embed the annotators in analysis engines and perform tests.
- Deploy your text analysis components as an analysis engine archive within WebSphere Information Integrator OmniFind Edition.
- Associate one or more document collections with your analysis engine.
- For each collection, map the text analysis results to the semantic search index.
- If required, set up your custom semantic search application, for example, by deploying your browser-based search user interface into an application server.
- Crawl, parse, and index your documents in your semantic search collection, just like in any keyword based collection.
- Start the search and exploit the semantic search capabilities.
- Deploy enhanced text analysis components that detect additional concepts.
- Adapt your search application to make use of the additional information.
The sample scenario in the tutorial (see Resources) is enhanced iteratively. Each part focuses on a specific aspect of the above workflow, and will make the police officer’s work easier by enhancing the search capabilities of the solution. So, even if you have just got a speeding ticket, we hope you will join us in looking beyond keyword search!
DB2 Information Integrator OmniFind Edition was recently renamed WebSphere Information Integrator OmniFind Edition. You might see references to WebSphere Information Integrator OmniFind Edition on the product Web pages, but the published product documentation and support site still reflect the DB2 brand. The product is also frequently referred to by its functional description: enterprise search.
- For a related tutorial on this topic, see Semantic search in WebSphere II OmniFind Edition: Deploy a semantic search solution
- For information on what UIMA is, read about the Unstructured Information Management Architecture Project.
- The UIMA SDK Release 1.1 can currently be downloaded on the developerWorks Unstructured Information Management Architecture SDK site. This release is tested for integration and compatibility with WebSphere Information Integrator OmniFind Edition Version 8.2.2.
- For a guided tour of the WebSphere Information Integrator OmniFind Edition, register for the Intro to the WebSphere Information Integrator OmniFind Edition tutorial on developerWorks.
- For current product documentation and support, visit the WebSphere Information Integrator OmniFind Edition support Web site.
- For the latest product information, visit the WebSphere Information Integrator OmniFind Edition Web site.
- For information on building your own custom analysis, visit the developerWorks Unstructured Information Management Architecture SDK Web site.
- For documentation, visit the WebSphere Information Integrator OmniFind Edition info center.
Thomas Hampp is the leading expert on Text Analysis in IBM Software Group with special focus on text analysis, text mining and language processing. He has been working in these areas for several years, both in IBM research and development. During his years in research, he developed and patented the first version of a text analysis framework that has now developed into UIMA. He is a member of the architecture boards for UIMA, WebSphere Information Integrator OmniFind, and Content Management. His current responsibility is defining and extending the integration architecture of UIMA into WebSphere Information Integrator OmniFind.
Since joining IBM in 2000, Alexander Lang has worked on several IBM products in the content management, search, and text analysis space. For the current release, he developed the mapping of UIMA annotations to semantic search index structures, and led the team to support customer-defined synonyms for search terms. Besides his architecture and team lead responsibilities, Alexander works with customers and IBMers to identify the right mix of text analysis involved in custom semantic search solutions, and is investigating the use of taxonomies and ontologies in text search.