Discover concepts with Easy Semantic Search

Use OmniFind Enterprise Edition V8.4 to search for phone numbers, people, places, products, and more through simple keyword search

Get an introduction to "Easy Semantic Search," a new functionality introduced in IBM® OmniFind® Enterprise Edition V8.4 that enables end-users to leverage the power of semantic search to query for concepts through the well-accepted keyword query paradigm. Follow the steps outlined in this article to enable the Easy Semantic Search functionality with an example processing of telephone numbers. Then customize the Easy Semantic Search functionality by extending the sample configuration.

Markus Lorch, Software Engineer, IBM, Software Group

Markus Lorch has led a development team to implement and improve text analysis and semantic search functions for OmniFind Enterprise Edition 8.4. Previously, he was involved with performance and scalability engineering for IBM enterprise search products. Before joining IBM in 2005, he researched and developed authorization mechanisms for Grid computing environments and was actively involved in the standardization of Grid security architectures and protocols. Dr. Lorch holds a Ph.D. from Virginia Tech.



12 April 2007


Introduction

Searching for concepts instead of, or in addition to, keywords is a powerful means to improve the efficiency and usability of enterprise search solutions. With semantic understanding, a query for "phone number" can, in addition to documents containing the keywords "phone" and "number," also return documents that contain actual telephone numbers and highlight the occurrences for easy reference in the search results. "Easy Semantic Search" is a new functionality introduced in IBM OmniFind Enterprise Edition 8.4 that enables end-users to leverage the power of semantic search to query for concepts through the well-accepted keyword query paradigm. The end-user does not have to learn a complex query language, and no special-purpose user interface has to be developed. Semantic search requires text analysis components to analyze documents for concepts of relevance and create searchable metadata. OmniFind Enterprise Edition 8.4 ships with a powerful text analysis module that can be configured to detect a wide variety of concepts in processed documents. In this article, get an introduction to semantic search and follow the necessary steps to enable the Easy Semantic Search functionality with an example processing of telephone numbers. Furthermore, learn how to customize the Easy Semantic Search functionality by extending the sample configuration.


Overview – OmniFind processing and custom text analysis

OmniFind Enterprise Edition V8.4 provides a configurable text analysis module that can detect expressions and enumerable entities (like people, places, and things) when documents are indexed for enterprise search. This module is called the Regular Expression Annotator. Furthermore, a novel concept called "semantic synonyms" radically simplifies searching for concepts extracted from unstructured text by transforming the user's keyword query. Together, these mechanisms empower enterprise search users to discover information more effectively by searching for both keywords and semantic concepts through the familiar keyword-based query interface. Neither complicated query syntax nor tedious option forms are necessary.

For example, a search query for "laptop product number" will not only discover documents containing the stated keywords, but will also locate documents that contain "laptop" together with a detected product number such as 2668-H2G. The detected product numbers in this example are also highlighted in the search result. The functionality can also be used to detect and later search for enumerable sets of named entities, such as product, country, or company names. A query for "G8 country climate regulations" would discover documents that contain "climate" and "regulations" together with the name of a country that is part of the group of G8 countries.

The Easy Semantic Search functionality is provided by three distinct components in the OmniFind system:

First, the regular expression annotator is added to OmniFind's Unstructured Information Management Architecture (UIMA) processing pipeline to detect instances of concepts in the document text, based on a set of extensible rules. The discovered meta information is stored in the search index.

Second, OmniFind's synonym dictionary mechanism is used to define semantic synonyms for keywords. The semantic synonyms take the form of XML fragment query expressions and can be used together with ordinary synonyms. For example, the keywords "phone number" can have the synonym "telephone number" as well as the XML fragment query @xmlf2::'<#phonenumber/>'.

Third, semantic synonym-expansion code located in the search application retrieves synonyms for the user-provided keywords from the OmniFind back end and transforms the original query into an XML fragment query that combines the original keywords with regular and semantic synonyms. The expansion logic does not replace the original keyword but rather creates an OR expression to locate either the keyword or a concept instance relating to the keyword. For example, the query 'IBM phone number' becomes 'ibm <.or> phone <#phonenumber/> </.or> <.or> number <#phonenumber/> </.or>'. Because the keyword phrase in this example consists of two terms (phone number) for which a semantic synonym exists, the expansion logic follows Boolean algebra rules to create two OR parts connected by AND: (A ^ B) v C becomes (A v C) ^ (B v C), which can easily be represented in an XML fragment expression.
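The distribution rule can be sketched in a few lines of Python. This is only an illustration of the Boolean rewriting, not the OmniFind implementation: the function name expand_query, the in-memory dictionary, and the lowercasing of the query are assumptions chosen to reproduce the example above, and regular (non-semantic) synonyms are left out for brevity.

```python
from typing import Dict, List, Tuple

# Hypothetical in-memory stand-in for the synonym dictionary: a keyword phrase
# (a tuple of terms) maps to the XML fragment concepts that serve as its
# semantic synonyms.
SEMANTIC_SYNONYMS: Dict[Tuple[str, ...], List[str]] = {
    ("phone", "number"): ["<#phonenumber/>"],
}


def expand_query(
    query: str,
    synonyms: Dict[Tuple[str, ...], List[str]] = SEMANTIC_SYNONYMS,
) -> str:
    """Expand a plain keyword query into an XML-fragment-style expression.

    For a phrase A B with semantic synonym C, (A ^ B) v C is rewritten as
    (A v C) ^ (B v C): each term of the phrase gets its own OR block.
    """
    terms = query.lower().split()
    parts: List[str] = []
    i = 0
    while i < len(terms):
        matched = False
        # Prefer the longest dictionary phrase that starts at position i.
        for length in range(len(terms) - i, 0, -1):
            phrase = tuple(terms[i:i + length])
            if phrase in synonyms:
                concepts = " ".join(synonyms[phrase])
                for term in phrase:
                    parts.append(f"<.or> {term} {concepts} </.or>")
                i += length
                matched = True
                break
        if not matched:
            # No synonym known: the keyword is passed through unchanged.
            parts.append(terms[i])
            i += 1
    return " ".join(parts)


print(expand_query("IBM phone number"))
```

Terms without a dictionary entry are passed through untouched, which mirrors the statement that the original keywords are never replaced.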

Enable Easy Semantic Search

OmniFind Enterprise Edition, Version 8.4 ships with three files required to enable the Easy Semantic Search functionality in a sample configuration. The files are located under the OmniFind Enterprise Edition installation root directory in the packages/uima/regex subdirectory. The procedure is described in detail in the Text Analysis Integration manual section "Easy semantic search using the regular expression annotator" (see Resources) and outlined here:

  1. The first step enables the text analytics: Upload the PEAR file containing the pre-configured regular expression annotator (of_regex.pear) into the OmniFind Enterprise Edition system using the OmniFind administrative console (system tab, edit parser). Then associate the annotator with the collection it is to be used with (collection's parser settings).
  2. In step two, the text analysis results are mapped to the index: For this, upload the common analysis structure to index mapping file (of_sample_regex_cas2index.xml) and associate it with the collection (also in parser settings). With both the annotator and the mapping file linked to the collection, documents can be crawled, parsed, and indexed.
  3. Step three configures the semantic synonym expansion of keywords: For this, upload a provided sample synonym dictionary (of_sample_synonym_dic.dic) to the OmniFind system (system tab, edit search) and associate it with the search component of this collection (collection's search settings).

After you enable the option "automatically search for synonyms using semantic expansion" in the OmniFind search application preferences screen, the collection can be searched with semantic synonyms. With the functionality enabled, basic keyword queries for "phone number" are expanded to the semantic concept <phonenumber/>, in addition to a few regular synonyms like "telephone number." When an actual telephone number is present in a processed document, this phone number is annotated with the concept <phonenumber/> during the text analysis step, and the document is returned for a matching query with the phone number itself highlighted in the result set. Documents containing the words "phone" and "number," or one of the regular synonyms, are also found. The semantic synonym expansion mechanism is limited to expanding simple keyword queries. Fielded queries and queries with advanced query terms are currently not expanded. The additional text analysis performed by the regular expression annotator does have a performance impact and will reduce the maximum parser throughput.

Figure 1. Search result with semantic search for telephone numbers on a Web collection
Search result identifying telephone numbers

Customizing the regular expression annotator

Discovering telephone numbers and URLs in documents is a great way to get an idea of what is possible with the Easy Semantic Search functionality of OmniFind Enterprise Edition 8.4. In production scenarios, other concepts may be of importance. The flexibility of the system allows the detection of a large variety of concepts or entities. To detect additional concepts, the rules that govern the text analysis of the regular expression annotator, the index mapping file that instructs the system how to store discovered facts in the index, and the synonym dictionary that defines the binding between keywords and semantic search concepts must be adapted.

The remainder of this section explains how to extend the example rules, mappings, and synonyms. It also discusses a rule evaluation approach. The examples provided show how you can configure OmniFind Enterprise Edition to also discover and search for IBM laptop product numbers when the keywords "IBM laptop" or "thinkpad" are given and to also be able to detect and search for occurrences of country names in texts.

Customizing the rules

The regular expression annotator is configured by means of an XML file. This file contains a set of rules that define on what type of character or number sequences the annotator should act and how it should act. The detailed description of the file and rule format can be found in the OmniFind Enterprise Edition 8.4 Text Analysis Integration guide (see Resources). The rule file is part of the annotator PEAR file. A PEAR file is actually a ZIP archive file that contains the annotator code and configuration in a well-defined directory structure. The XML rule file resides in a subdirectory named "xml" within the PEAR file and can be extracted from there with any ZIP file tool (for example, the jar command provided with a Java SDK). For convenience, the sample rule file that is part of the regular expression annotator PEAR (named of_sample_regex_rules.xml) is also provided in the subdirectory packages/uima/regex under the OmniFind Enterprise Edition installation root directory. This directory also contains the XML schema definition (ruleSet.xsd) that can be used to validate changes against the schema.
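Because a PEAR file is a ZIP archive, the rule file can also be pulled out with a few lines of Python instead of the jar command. The following is a minimal sketch; the helper names are hypothetical, and it assumes that the xml subdirectory sits at the top level of the archive.

```python
import zipfile
from pathlib import Path
from typing import List


def list_xml_members(pear_path: str) -> List[str]:
    """Return the files stored under the xml/ directory of a PEAR archive."""
    with zipfile.ZipFile(pear_path) as pear:
        return sorted(
            name
            for name in pear.namelist()
            if name.startswith("xml/") and not name.endswith("/")
        )


def extract_pear_member(pear_path: str, member: str, dest_dir: str) -> Path:
    """Extract one member of the PEAR archive and return the extracted path."""
    with zipfile.ZipFile(pear_path) as pear:
        return Path(pear.extract(member, path=dest_dir))
```

A call such as extract_pear_member("of_regex.pear", "xml/of_sample_regex_rules.xml", "work") would then place the rule file under a local work directory (the directory name is an arbitrary choice).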

The sample rule file contains four rules: phonenumber, potential_phonenumber, url, and email. The URL and e-mail rules are fairly simple regular expressions, while the two phone number rules are more complex, as they aim to detect a multitude of alternative representations of international telephone numbers.

A simple approach to customizing the regular expression rules is to copy the complete rule definition of, for example, the URL rule and modify it. Simple rules require changes to the regular expression, the annotation id to be created (unique for each rule), and the type of the annotation to be created.

In Listing 1, a simple rule to match product numbers of Thinkpad laptops is created from the URL rule definition. The regular expression is changed to locate a four-digit number, followed by a dash, followed by a three-character alphanumeric sequence. For example, the sequence 2668-H2G identifies a Thinkpad T43p. Furthermore, the id of the annotation to be created is changed to "thinkpad" with a type of com.ibm.es.uima.Thinkpad.

Listing 1. Sample rule to detect IBM laptop product numbers
<!-- IBM Laptop Product Number e.g. 2668-H2G --> 
<rule regEx="([0-9]{4}\x20?-\x20?[A-Z0-9]{3})"
        matchStrategy="matchAll" matchType="uima.tcas.DocumentAnnotation">
      <createAnnotation id="thinkpad" type="com.ibm.es.uima.Thinkpad">
          <begin group="0"/>
          <end group="0"/>
      </createAnnotation>
</rule>
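Before a rule goes into the rule file, its expression can be tried out in isolation. The sketch below uses Python's re module as a stand-in for the annotator's regular expression engine, which is an assumption: the two dialects agree on the constructs used here, but they are not identical in general. The pattern is the one from Listing 1, written with the character class [A-Z0-9] so that a literal comma is not accepted; the function name and sample sentence are made up for illustration.

```python
import re
from typing import List, Tuple

# Four digits, optional space, dash, optional space, three characters from A-Z or 0-9.
THINKPAD_RULE = re.compile(r"([0-9]{4}\x20?-\x20?[A-Z0-9]{3})")


def find_product_numbers(text: str) -> List[Tuple[int, int, str]]:
    """Return (begin, end, covered text) for every match, using group 0
    in the same way as the <begin group="0"/> and <end group="0"/> elements."""
    return [(m.start(0), m.end(0), m.group(0)) for m in THINKPAD_RULE.finditer(text)]


sample = "The T43p (2668-H2G) replaces model 2373 - 4WU; call 555-1234."
for begin, end, covered in find_product_numbers(sample):
    print(begin, end, covered)
```

The expression has no word boundaries, so it would also fire inside longer sequences (for example, on the tail of a five-digit number followed by a dash and letters). Whether that matters depends on the document collection and is worth checking during rule evaluation.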

In Listing 2, a few countries are detected through a simple list of names that comprise the regular expression. Regular expression operators are used to allow for long and short versions (for example, People's Republic of China versus China) and to govern that trailing punctuation characters are allowed.

Note that some countries are detected by the first rule, which creates an annotation of type "com.ibm.es.uima.Country". Countries that are members of the G8 are not detected by the first rule, but rather by a second rule that creates the same country annotation and additionally sets a feature of that annotation to identify these countries as belonging to the group of G8 countries. This enables queries for arbitrary countries, where all named countries (from both rules) are found, as well as queries restricted to countries that are part of the G8 group. Simple hierarchies (another example is products that belong to a specific brand) can be implemented this way.

Listing 2. Two rules to detect countries
<!-- Country -->
<rule regEx="(Australia|Brazil|Spain|(People's Republic of )?China)s?(?!\w)"
            matchStrategy="matchAll" matchType="uima.tcas.DocumentAnnotation">
   <createAnnotation id="country" type="com.ibm.es.uima.Country">
      <begin group="0"/>
      <end group="0"/>
   </createAnnotation>
</rule>

<!-- G8 Country -->
<rule regEx="(Canada|France|Germany|Italy|Japan|Russian Federation|United Kingdom|United States of America|USA|U\.S\.A\.)s?(?!\w)"
            matchStrategy="matchAll" matchType="uima.tcas.DocumentAnnotation">
   <createAnnotation id="country-G8" type="com.ibm.es.uima.Country">
      <begin group="0"/>
      <end group="0"/>
      <setFeature name="Group" type="String">G8</setFeature>
   </createAnnotation>
</rule>

When copying these example rules into the of_sample_regex_rules.xml file, take care not to introduce line breaks into the regular expressions.
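The interplay of the two country rules can be tried out with a short Python sketch. It closely follows the patterns of Listing 2, each written on a single line and with the periods in U.S.A. escaped. Python's re module stands in for the annotator's engine, and the dictionary layout used to represent an annotation is purely illustrative; both are assumptions.

```python
import re
from typing import Dict, List, Optional

COUNTRY_RULE = re.compile(
    r"(Australia|Brazil|Spain|(People's Republic of )?China)s?(?!\w)"
)
G8_RULE = re.compile(
    r"(Canada|France|Germany|Italy|Japan|Russian Federation|United Kingdom"
    r"|United States of America|USA|U\.S\.A\.)s?(?!\w)"
)


def annotate_countries(text: str) -> List[Dict[str, Optional[object]]]:
    """Apply both rules; every match becomes a Country annotation, and
    matches of the second rule additionally carry Group = "G8"."""
    annotations: List[Dict[str, Optional[object]]] = []
    for rule, group in ((COUNTRY_RULE, None), (G8_RULE, "G8")):
        for m in rule.finditer(text):
            annotations.append({
                "type": "com.ibm.es.uima.Country",
                "begin": m.start(0),
                "end": m.end(0),
                "covered": m.group(0),
                "Group": group,
            })
    return sorted(annotations, key=lambda a: a["begin"])


text = "Germany, Italy and the People's Republic of China met in Spain."
for a in annotate_countries(text):
    print(a["covered"], a["Group"])
```

All four names receive the same annotation type, so a query for any country finds them all, while only the first two carry the Group feature that a G8 query can filter on.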

Extending the type system

The annotation types used in the new rules must exist as part of the UIMA type system. The regular expression annotator's type system file is named of_sample_regex_typesystem.xml (also present in the xml subdirectory of the annotator PEAR file, as well as in the packages/uima/regex directory of the OmniFind installation) and must be extended with the additional types. The type system already contains types for phonenumber and the other annotation types needed by the sample regular expression annotator rules. Listing 3 illustrates the two new type definitions:

Listing 3. Type system definitions that need to be added for the two new annotation types
<!-- Thinkpad Annotation -->
<typeDescription>
    <name>com.ibm.es.uima.Thinkpad</name>
    <description>IBM laptop product numbers</description>
    <supertypeName>uima.tcas.Annotation</supertypeName>
</typeDescription>

<!-- Country Annotation -->
<typeDescription>
    <name>com.ibm.es.uima.Country</name>
    <description>Countries of the world</description>
    <supertypeName>uima.tcas.Annotation</supertypeName>
    <features>
        <featureDescription>
            <name>Group</name>
            <description/>
            <rangeTypeName>uima.cas.String</rangeTypeName>
        </featureDescription>
    </features>
</typeDescription>
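A quick consistency check can catch a rule that refers to an annotation type missing from the type system. The sketch below is a hypothetical helper, not part of OmniFind or the UIMA SDK. It only assumes what the listings show: createAnnotation elements carry a type attribute, and typeDescription elements have a name child. XML namespaces are ignored so that the check does not depend on the exact descriptor header.

```python
import xml.etree.ElementTree as ET
from typing import Set


def _local(tag: str) -> str:
    """Strip an optional '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def declared_types(typesystem_xml: str) -> Set[str]:
    """Collect the type names declared by <typeDescription> elements."""
    names: Set[str] = set()
    for el in ET.fromstring(typesystem_xml).iter():
        if _local(el.tag) == "typeDescription":
            # Only direct <name> children count; feature names are nested deeper.
            for child in el:
                if _local(child.tag) == "name" and child.text:
                    names.add(child.text.strip())
    return names


def referenced_types(rules_xml: str) -> Set[str]:
    """Collect the annotation types that the rules ask the annotator to create."""
    return {
        el.attrib["type"]
        for el in ET.fromstring(rules_xml).iter()
        if _local(el.tag) == "createAnnotation" and "type" in el.attrib
    }


def undeclared_types(rules_xml: str, typesystem_xml: str) -> Set[str]:
    """Types used by the rules but absent from the type system."""
    return referenced_types(rules_xml) - declared_types(typesystem_xml)
```

An empty result from undeclared_types, called with the contents of the rule file and the type system file, indicates that every created annotation type is declared.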

Application, evaluation, and refinement

Modifying and extending the rules of the regular expression annotator to detect additional concept instances in unstructured texts often requires an iterative approach: an initial set of rules is written and tested on sample documents, then the rules are refined and retested. When the outcome is satisfactory, the rules in the original annotator PEAR file can be updated with the new rules and the resulting version installed in OmniFind Enterprise Edition and put to use. The UIMA SDK can be used to run the regular expression annotator outside of OmniFind on selected texts in order to evaluate modified or newly created rules.


The UIMA SDK is available for download from IBM developerWorks for Windows and UNIX operating systems for deployment on a developer workstation. The PEAR installer utility (runPearInstaller command) that is part of the SDK can be used to install the regular expression annotator PEAR that shipped with OmniFind Enterprise Edition 8.4 (of_regex.pear found in the subdirectory packages/uima/regex under the OmniFind Enterprise Edition installation root directory) into a local directory and to run the annotator in the CAS Visual Debugger tool (CVD). CVD runs the annotator on sample texts that can be copied into the CVD interface. After a successful run, the created annotations are listed and the corresponding original text sections highlighted. Figure 2 illustrates the CVD interface:

Figure 2. The CAS Visual Debugger (CVD) interface
The CAS Visual Debugger can be used to inspect and validate regular expression annotator rules.

The example provided in Figure 2 hints at the power of semantic search. The sample document could be located with a query for "IBM laptop" even though these keywords never appear in the document text directly. Also, a search for documents talking about G8 countries would return the document due to a match on Germany, Russia, and Italy.

Five easy steps to investigate annotations using CVD:

  1. Execute the runPearInstaller shell or batch script.
  2. Specify the PEAR file to be installed and provide a destination directory, then click on Install.
  3. Once installation has finished, click on Run your AE in the CAS Visual Debugger.
  4. Enter some sample text, including a telephone number like (800) 555-1234, in the right text window, and select Run > Run Regular Expression Annotator.
  5. Now browse the annotations in the "Analysis Results" window: Select AnnotationIndex, then select the detected instance of com.ibm.es.uima.PhoneNumber in the window below.

The rule and type system changes discussed in the previous section must be applied to the XML rule and type system files in the installed version of the regular expression annotator PEAR. During the installation process performed by the UIMA SDK PEAR installer tool, the PEAR file is extracted to a directory specified in the installation wizard. The rule and type system files are located in the subdirectory named jedii_an_regex/xml. When modifying these files, it is a good idea to make backup copies of the working versions and to make incremental changes and evaluate often.

After the rules and the type system have been modified, the CAS Visual Debugger tool (CVD) from the UIMA SDK can be used to validate the changes. Before starting CVD from the command line, the Java classpath environment variable must be augmented with the list of resources provided in the jedii_an_regex/metadata/setenv.txt file. (Note: When CVD is started by the PEAR installer, the classpath is automatically set, but not when CVD is started directly from the command line.) With the augmented classpath in place, CVD is started using the cvd command. To load the annotator, select Run > Load TAE from the CVD menu, then browse for the descriptor file of the Regular Expression Annotator jedii_an_regex/desc/jregex.xml.

Sample text can be typed or copied into the content field on the right-hand side of the CVD window and the processing by the regular expression annotator started ("Run", "Run Regular Expression Annotator"). The left-hand Analysis Results window then displays the annotation instances created by the annotator on the sample text, sorted by annotation type. If a particular annotation is selected, the original text covered by this annotation is highlighted. This is particularly useful when small differences have to be investigated (for example, is the trailing white space part of the annotation?). The rules can now be modified or augmented as needed to obtain the desired result. More information on regular expressions is available in several online resources, including many tutorials. Rules can be changed and evaluated without restarting CVD: simply reload the annotator descriptor and rerun the annotator.

PEAR packaging

The installed version of the regular expression annotator now has the desired set of rules configured. In order to deploy this configuration, you must build a PEAR file with the new rules. The simplest way to do this is to make a copy of the original regular expression annotator PEAR and insert the two XML files into the subdirectory xml of the archive. Any ZIP archive tool that is capable of replacing files in a ZIP archive can be used (for example, the jar utility that is provided with most Java JDK distributions). When inserting the new versions of of_sample_regex_typesystem.xml and of_sample_regex_rules.xml, it is important that the file names are not altered and that the original versions are overwritten. You can now use the administrative console to upload the new PEAR file to the OmniFind system and assign it to collections.
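The repackaging step can also be scripted. Python's zipfile module cannot replace a member in place, so the sketch below writes a new archive and swaps in the edited files while copying everything else unchanged. The function name is hypothetical, and the member paths in the usage note assume that the xml subdirectory sits at the top level of the PEAR.

```python
import zipfile
from typing import Dict


def rebuild_pear(original_pear: str, new_pear: str, replacements: Dict[str, str]) -> None:
    """Copy original_pear to new_pear, replacing selected members.

    replacements maps archive member names to local files holding the new
    content. Member names are kept exactly as they are in the original.
    """
    with zipfile.ZipFile(original_pear) as src:
        missing = set(replacements) - set(src.namelist())
        if missing:
            # Refuse to add files under new names; only overwriting is intended.
            raise KeyError(f"not present in archive: {sorted(missing)}")
        with zipfile.ZipFile(new_pear, "w", zipfile.ZIP_DEFLATED) as dst:
            for info in src.infolist():
                if info.filename in replacements:
                    with open(replacements[info.filename], "rb") as fh:
                        dst.writestr(info.filename, fh.read())
                else:
                    dst.writestr(info, src.read(info.filename))
```

A call might look like rebuild_pear("of_regex.pear", "of_regex_custom.pear", {"xml/of_sample_regex_rules.xml": "work/of_sample_regex_rules.xml", "xml/of_sample_regex_typesystem.xml": "work/of_sample_regex_typesystem.xml"}), where the output file name and the work directory are arbitrary choices.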

Annotations to index mapping

For OmniFind to not only create the new annotations but also store them in the search index, the Common Analysis Structure to index mapping file also needs to be augmented with rules for the new annotation types. For each annotation, you need to add an <indexBuildItem> that maps the annotation to a searchable span (annotation style), a field (field style), or both. The example in Listing 4 creates a searchable span for every "thinkpad" annotation as well as a field that holds all product numbers that were located in a particular document. You need to add both examples in Listings 4 and 5 to the of_sample_regex_cas2index.xml file (found in the packages/uima/regex directory of the OmniFind installation).

Listing 4. Mapping of the Thinkpad annotation to a searchable span as well as a field in the index
<indexBuildItem>
    <name>com.ibm.es.uima.Thinkpad</name>
    <indexRule>
       <style name="Annotation">
           <attribute name="fixedName" value="thinkpad"/>
       </style>
       <style name="Field">
           <attribute name="fixedName" value="thinkpad"/>
           <attribute name="fieldSearchable" value="true"/>
           <attribute name="returnable" value="true"/>
       </style>
    </indexRule>
</indexBuildItem>

The second example in Listing 5 creates searchable spans for "country" annotations, including a mapping of the value provided by the feature "Group" to the span attribute "group."

Listing 5. Mapping of the Country annotation to a searchable span in the index
<indexBuildItem>
    <name>com.ibm.es.uima.Country</name>
    <indexRule>
       <style name="Annotation">
           <attribute name="fixedName" value="country"/>
           <attributemappings>
               <mapping>
                   <feature>Group</feature>
                   <indexName>group</indexName>
               </mapping>
           </attributemappings>
       </style>
    </indexRule>
</indexBuildItem>

Using the administrative console, you need to upload the extended of_sample_regex_cas2index.xml file to the parser configuration of each collection that has the new regular expression annotator activated. Now the collection is ready to process documents with the customized annotator and store the results in its search index. This is a good time to perform end-to-end tests by processing a few documents that are known to contain product numbers and country names (for example, from the tests with the CVD tool during rule refinement). Once the documents are indexed, an XML fragment query, like one of the following examples, can be used to validate that annotations were created and indexed:

  • @xmlf2::'<#phonenumber/>' to search for annotations from the sample telephone number rule
  • @xmlf2::'<#country/>' to search for a country
  • @xmlf2::'<#country group="G8"/>' to search only for countries belonging to the G8 group
  • @xmlf2::'<#thinkpad/>' to search for thinkpad annotations

As thinkpad annotations are also mapped to returnable and field-searchable fields, you can issue a fielded search for "thinkpad:" to receive all documents with a thinkpad annotation. If the detailed view ("Show Details" link above the search results) is activated, the field values (the detected product numbers) are also visible.

Writing semantic synonyms

To make the new annotations available with keyword search, the synonym dictionary for this collection must be extended. The XML source for the sample synonym dictionary (of_sample_synonym_dic.xml) is also provided with the OmniFind installation and can be used as a starting point. The set of <synonymgroup> elements in the file needs to be extended with three more synonym groups, as illustrated in Listing 6.

Listing 6. Definition of semantic synonyms.
<synonymgroup>
    <synonym>thinkpad</synonym>
    <synonym>laptop</synonym>
    <synonym>@xmlf2::'&lt;#thinkpad/&gt;'</synonym>
</synonymgroup>
<synonymgroup>
    <synonym>country</synonym>
    <synonym>nation</synonym>
    <synonym>@xmlf2::'&lt;#country/&gt;'</synonym>
</synonymgroup>
<synonymgroup>
    <synonym>g8 country</synonym>
    <synonym>g8 nation</synonym>
    <synonym>@xmlf2::'&lt;#country group="G8"/&gt;'</synonym>
</synonymgroup>

Once the synonym file is extended, a binary dictionary needs to be built from it. It is important to keep the angle brackets of the XML fragment expression encoded as &lt; and &gt; in the dictionary XML file. The essyndictbuilder tool available on the OmniFind server must be used to create the dictionary (for example, the command essyndictbuilder.sh of_sample_synonym_dic.xml of_sample_synonym_dic.dic on a Linux or AIX OmniFind server). The resulting dictionary can then be uploaded to the OmniFind system using the administrative console and associated with the search function.
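When synonym groups are generated by a script rather than typed by hand, the encoding of the angle brackets is easy to get right by letting an XML library do it. The following sketch uses xml.sax.saxutils.escape from the Python standard library; the helper name synonym_group is hypothetical, and only the element names shown in Listing 6 are assumed.

```python
from typing import Iterable
from xml.sax.saxutils import escape


def synonym_group(synonyms: Iterable[str]) -> str:
    """Render one <synonymgroup> element; '&', '<' and '>' in each synonym
    are encoded so that XML fragment expressions survive as plain text."""
    lines = ["<synonymgroup>"]
    for synonym in synonyms:
        lines.append(f"    <synonym>{escape(synonym)}</synonym>")
    lines.append("</synonymgroup>")
    return "\n".join(lines)


print(synonym_group(["thinkpad", "laptop", "@xmlf2::'<#thinkpad/>'"]))
```

The generated text can be pasted into the dictionary XML source before the binary dictionary is built.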

Summary

This article provides an overview of the steps necessary to enable and customize the Easy Semantic Search functionality, which allows end-users to issue powerful semantic queries through the well-known keyword search paradigm. Many text analysis tasks can be addressed by extending the rules of the regular expression annotator. For more complex text analysis tasks, the development of a custom annotator may be of interest. The tutorial "Semantic search in IBM OmniFind Enterprise Edition, Part 2: Semantic search with UIMA and OmniFind" (developerWorks, December 2006) provides step-by-step instructions on how to develop a custom annotator. It also covers the incorporation of custom annotators into the OmniFind system and the process of mapping annotations to the index in more depth. The Easy Semantic Search functionality introduced in this article can be used together with any annotator, including custom developments. A custom-developed annotator would replace or be deployed together with the provided regular expression annotator.

Resources

Learn

Get products and technologies

  • UIMA SDK, Version 1.4.4: Download, run, and evaluate the regular expression annotator rules.
  • Build your next development project with IBM trial software, available for download directly from developerWorks.
