Creating the common analysis structure to index mapping file

Use the common analysis structure to index mapping file to specify which analysis results in the common analysis structure you want to index.

About this task

The common analysis structure to index mapping file is in XML. The sample common analysis structure to index mapping file is based on the type system defined for the police report scenario.

<?xml version="1.0" encoding="UTF-8"?>
<indexBuildSpecification
xmlns="http://www.ibm.com/of/822/consumer/index/xml">
<skipCondition>
    <type>com.ibm.uima.tt.DocumentAnnotation</type>
    <filter syntax="FeatureValue">toBeProcessed = 0</filter>
</skipCondition>
<indexBuildItem>
  <name>com.ibm.analytics.types.Person</name>
  <indexRule>
    <style name="Annotation">
    <attributemappings>
      <mapping>
        <feature>role</feature>
        <indexName>role</indexName>
      </mapping>
      <mapping>
        <feature>title</feature>
        <indexName>title</indexName>
      </mapping>
      <mapping>
        <feature>gender</feature>
        <indexName>gender</indexName>
      </mapping>
    </attributemappings>
    </style>
  </indexRule>
</indexBuildItem>
<indexBuildItem>
  <name>com.ibm.analytics.types.Suspect</name>
  <indexRule>
    <style name="Annotation"/>
    <style name="Field"/>
    <style name="Facet">
    </style>
  </indexRule>
</indexBuildItem>
<indexBuildItem>
  <name>com.ibm.analytics.types.City</name>
  <indexRule>
    <style name="Annotation">
      <attributemappings>
        <mapping>
          <feature>cityDistrict</feature>
          <indexName>district</indexName>
        </mapping>
      </attributemappings>
    </style>
  </indexRule>
</indexBuildItem>
<indexBuildItem>
  <name>com.ibm.analytics.types.Date</name>
  <indexRule>
    <style name="Field">
      <attribute name="fixedName" value="Date"/>
    </style>
    <style name="Field">
      <attribute name="fixedName" value="hour"/>
    </style>
  </indexRule>
  <filter syntax="FeatureValue">year="2005"</filter>
</indexBuildItem>
<indexBuildItem>
  <name>com.ibm.analytics.types.PoliceReport</name>
  <indexRule>
    <style name="Annotation">
      <attribute name="fixedName" value="PoliceReport"/>
      <attributemappings>
        <mapping>
          <feature>crimeDescription</feature>
          <indexName>crimeDescription</indexName>
        </mapping>
        <mapping>
          <feature>time/coveredText()</feature>
          <indexName>time</indexName>
        </mapping>
        <mapping>
          <feature>date/englDate</feature>
          <indexName>date</indexName>
        </mapping>
        <mapping>
          <feature>location/coveredText()</feature>
          <indexName>location</indexName>
        </mapping>
        <mapping>
          <feature>knownSuspects[]/com.ibm.analytics.types.Suspect:surName</feature>
          <indexName>suspectsLastNames</indexName>
        </mapping>
      </attributemappings>
    </style>
  </indexRule>
</indexBuildItem>
<indexBuildItem>
  <name>com.ibm.lang.LastName</name>
  <indexRule>
    <style name="Facet">
      <attribute name="fixedName" value="$.lastName"/>
    </style>
  </indexRule>
</indexBuildItem>
</indexBuildSpecification>

The common analysis structure to index mapping file must contain all of the analysis results that you want to be able to query and view as facets in the content analytics miner and enterprise search applications.

Procedure

To create the common analysis structure to index mapping file:

  1. Create an XML file.
    To avoid XML syntax errors, use an XML editor or XML authoring tool of your choice. The XSD schema for the mapping file is called CasToIndexMapping.xsd in the ES_INSTALL_ROOT/configurations/parserservice/jediidata directory.
  2. Include your mappings in a <indexBuildSpecification xmlns="http://www.ibm.com/of/822/consumer/index/xml"> element. The namespace (specified in the xmlns attribute) must be exactly as shown.
  3. Add a <skipCondition> element to prohibit certain documents from being indexed, based on a certain feature value. This element is optional. In the example, documents that contain a data structure of type com.ibm.uima.tt.DocumentAnnotation with a feature named toBeProcessed set to zero are not indexed.
  4. Add one or more <indexBuildItem> elements, each of which contains the mapping of one particular feature structure in the common analysis structure to a structure in the index.
  5. Save and validate the XML file.
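
As a rough sketch of steps 2 to 4, a minimal mapping file might look like the following. The namespace, the skip condition, and the Person type follow the sample above; which items you actually include depends on your own type system.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Step 2: root element with the required namespace -->
<indexBuildSpecification
    xmlns="http://www.ibm.com/of/822/consumer/index/xml">
  <!-- Step 3 (optional): skip documents whose toBeProcessed feature is 0 -->
  <skipCondition>
    <type>com.ibm.uima.tt.DocumentAnnotation</type>
    <filter syntax="FeatureValue">toBeProcessed = 0</filter>
  </skipCondition>
  <!-- Step 4: one item per feature structure type to index -->
  <indexBuildItem>
    <name>com.ibm.analytics.types.Person</name>
    <indexRule>
      <style name="Annotation"/>
    </indexRule>
  </indexBuildItem>
</indexBuildSpecification>
```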

Example

The <indexBuildItem> element
The common analysis structure to index mapping file contains one or more <indexBuildItem> elements. Each element describes the mapping of one particular feature structure in the common analysis structure to a structure in the index (a span, field, or facet).
The <name> element contains the feature structure type. There are two ways to specify a type:
  • The full type name. For example, com.ibm.analytics.types.Suspect
  • A wildcard. For example, com.ibm.analytics.types.*. The wildcard character can be added only at the end of the type specification.
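
For illustration, a wildcard item might be written as in the following sketch. It assumes that every annotation type in the com.ibm.analytics.types package should be indexed as a span and that, because no <attribute> element is given, each matching type is indexed under its own short name. Verify this behavior against your installation.

```xml
<indexBuildItem>
  <!-- The wildcard is allowed only at the end of the type specification -->
  <name>com.ibm.analytics.types.*</name>
  <indexRule>
    <style name="Annotation"/>
  </indexRule>
</indexBuildItem>
```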

Use only subtypes of uima.tcas.Annotation as index build items. If a feature structure is a subtype of uima.cas.TOP (and not of uima.tcas.Annotation), you can access this feature structure by using a feature path starting from an annotation.

If type A is a subtype of type B (in the sample, com.ibm.analytics.types.Suspect is a subtype of com.ibm.analytics.types.Person), and <indexBuildItem> elements Ia and Ib are defined for types A and B respectively, processing is as follows:
  • Each index rule that is defined in Ib is applied to feature structures of type B and feature structures of type A
  • Each index rule that is defined in Ia is applied to feature structures of type A only

In the example, the <indexBuildItem> element that is defined for com.ibm.analytics.types.Person annotations also applies to com.ibm.analytics.types.Suspect annotations. Two spans are created for a suspect annotation: one named Person and the other Suspect.

The <filter> element is optional and is used to restrict the <indexBuildItem> mapping only to feature structures that have a certain attribute value. This element is useful if you want an attribute to act as a switch for what to index. For example, persons and organizations might be recorded in an annotation of type EntityAnnotation. Its feature called type is set to either person or organization. To extract only the persons, and not the organizations, you can add the following filter:
<filter syntax="FeatureValue">type = "person"</filter>

You can also index persons and organizations under different span names, for example, person and organization. To do so, define two <indexBuildItem> elements of type EntityAnnotation, each with a filter on the type feature that selects either the persons or the organizations.
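
A sketch of two such items follows. It assumes the fully qualified type name com.ibm.tt.EntityAnnotation, uses the fixedName attribute (described in the <indexRule> section) to set the span names, and places each <filter> element after the <indexRule> element, as in the sample file.

```xml
<!-- Entities whose type feature is "person" become person spans -->
<indexBuildItem>
  <name>com.ibm.tt.EntityAnnotation</name>
  <indexRule>
    <style name="Annotation">
      <attribute name="fixedName" value="person"/>
    </style>
  </indexRule>
  <filter syntax="FeatureValue">type = "person"</filter>
</indexBuildItem>
<!-- Entities whose type feature is "organization" become organization spans -->
<indexBuildItem>
  <name>com.ibm.tt.EntityAnnotation</name>
  <indexRule>
    <style name="Annotation">
      <attribute name="fixedName" value="organization"/>
    </style>
  </indexRule>
  <filter syntax="FeatureValue">type = "organization"</filter>
</indexBuildItem>
```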

The <indexRule> element
Each <indexBuildItem> element contains one <indexRule> element. Each <indexRule> element contains all the information that is needed to map a feature structure in the common analysis structure to the index as a field, annotation, or facet. The Annotation, Field, and Facet styles support a number of attributes.
Restriction: Watson Explorer Content Analytics does not support the Term style that is predefined in the UIMA Software Development Kit.
For the Annotation, Field, and Facet styles, the following alternatives exist when you specify the annotation, field, or facet name to include in the index:
  • Use fixedName if you want each feature structure to be accessible in the index under the same name. In the following example, each feature structure of type com.ibm.analytics.types.Person is mapped to a span named "Person" in the index.
    <indexBuildItem>
      <name>com.ibm.analytics.types.Person</name>
      <indexRule>
        <style name="Annotation">
          <attribute name="fixedName" value="Person" />
        </style>
      </indexRule>
    </indexBuildItem>

    This example enables queries like "Give me documents where Boss is contained as a person name". The query is expressed as follows by using XML fragments: @xmlf2::'<Person>Boss</Person>'

  • Use nameFeature if the annotation stores different entities that you want to be able to access by using different spans depending on the value of a certain feature of the annotation. In the following example, com.ibm.tt.EntityAnnotation is indexed as a person or organization span, depending on the value of the feature named type. The feature can also be a feature path.
    <indexBuildItem>
      <name>com.ibm.tt.EntityAnnotation</name>
      <indexRule>
        <style name="Annotation">
          <attribute name="nameFeature" value="type" />
        </style>
      </indexRule>
    </indexBuildItem>

    This example enables queries like "Give me documents about the organization WHO" (as opposed to the English term "who"). The query is expressed as follows in limited XPath syntax: @xmlp::'/organization[ftcontains="WHO"]'

  • If the <attribute> element is not specified, the short name of the annotation type in the <indexBuildItem> element is used by default. For example:
    <indexBuildItem>
      <name>com.ibm.uima.tutorial.RoomNumber</name>
      <indexRule>
        <style name="Annotation" />
        <style name="Field" />
        <style name="Facet" />
      </indexRule>
    </indexBuildItem>
    This <indexBuildItem> element results in annotations, fields, and facets named RoomNumber that are populated with the text covered by the com.ibm.uima.tutorial.RoomNumber annotation.
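
The nameFeature attribute can also take a feature path, as noted for the second alternative. In the following sketch, the feature names entity and category are hypothetical and are not part of the sample type system. The span name would be taken from the category feature of the feature structure that the entity feature references.

```xml
<indexBuildItem>
  <name>com.ibm.tt.EntityAnnotation</name>
  <indexRule>
    <style name="Annotation">
      <!-- Hypothetical feature path: follow entity, then read category -->
      <attribute name="nameFeature" value="entity/category"/>
    </style>
  </indexRule>
</indexBuildItem>
```
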
The <style name="Annotation" /> element
Annotation in the <style> element specifies how you can access span information. Besides allowing the use of the fixedName and nameFeature attributes, this style also supports the <attributemappings> element. Within this element, it is possible to map the value of a feature to an attribute of the resulting span in the index, which you can later use in a search expression.
Each mapping is done within a separate <mapping> element. The <feature> element contains a feature path, and the <indexName> element contains the name of the attribute that is used in the index to store the value of <feature>. For example,
<mapping>
	<feature>make/companyname</feature>
	<indexName>company</indexName>
</mapping>
This <mapping> element stores the value of the feature in the path make/companyname directly in the index attribute company.

Mapping feature values to index attributes is especially useful if the type system used during text analysis is complex, including many nested feature structures. Using the <mapping> element, relevant attributes can be exposed, allowing you to use them in queries without detailed knowledge of the original type system structure.
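
For context, a <mapping> element sits inside the <attributemappings> element of an Annotation style. The following sketch assumes a hypothetical annotation type com.example.types.Car whose make feature references a feature structure with a companyname feature.

```xml
<indexBuildItem>
  <!-- Hypothetical type, used only for illustration -->
  <name>com.example.types.Car</name>
  <indexRule>
    <style name="Annotation">
      <attributemappings>
        <mapping>
          <!-- Feature path: follow make, then read companyname -->
          <feature>make/companyname</feature>
          <!-- Attribute name under which the value is stored on the span -->
          <indexName>company</indexName>
        </mapping>
      </attributemappings>
    </style>
  </indexRule>
</indexBuildItem>
```

Because no fixedName or nameFeature attribute is set, each annotation would be indexed under the short name of its type (here, Car) with a company attribute.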

The <style name="Field" /> element

Field in the <style> element specifies the fields that you want to access. You can use the fixedName and nameFeature attributes. To use Field style rules in the index mapping file, you must use the administration console to define index fields for the fields that you want to search or analyze and configure the appropriate attributes, such as parametric or returnable.

Field information is always content searchable, that is, field information is accessible through keyword queries.

The optional attribute valueFeature defines which feature value to take as the field value. If the feature structure is an annotation, and the attribute is not set, the covered text of the annotation is used as the field value. In the example,
<indexBuildItem>
  <name>com.ibm.analytics.types.Date</name>
  <indexRule>
    <style name="Field">
      <attribute name="fixedName" value="date"/>
    </style>
    <style name="Field">
      <attribute name="fixedName" value="hour"/>
      <attribute name="valueFeature" value="hour"/>
    </style>
  </indexRule>
  <filter syntax="FeatureValue">year="2005"</filter>
</indexBuildItem>
Two fields are generated for com.ibm.analytics.types.Date. The field named date contains the covered text, for example, 5:15pm. The field named hour contains the value of the hour feature, so you can run a query such as 'hour::<17'.
The <style name="Facet" /> element
Facet in the <style> element specifies the facets to map to feature structures. You can use the fixedName and nameFeature attributes to specify the names of the facets to include in the index. When you specify the value for the fixedName attribute, ensure that you use the correct identifier depending on the type of collection. For enterprise search collections, specify the name of the facet as specified in the administration console. For content analytics collections, specify the facet path.
Requirement: When you include facet mappings in the index mapping file, you must use the administration console to define the facets that are stored in the index.
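
As a sketch of the difference between the two identifiers, the following fragments name a facet for each collection type. The facet name lastName in the first fragment is an assumed name, as it might be defined in the administration console. The facet path $.lastName in the second fragment follows the sample file.

```xml
<!-- Enterprise search collection: facet name (assumed to be lastName) -->
<style name="Facet">
  <attribute name="fixedName" value="lastName"/>
</style>

<!-- Content analytics collection: facet path, as in the sample file -->
<style name="Facet">
  <attribute name="fixedName" value="$.lastName"/>
</style>
```
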
In addition to the <attribute> element, this style supports the <pathComponent> element. Use this element to specify how to construct the value of the facet. The name attribute of the <pathComponent> element accepts the values feature and literal. With the <pathComponent name="feature"/> element, the value of the feature that is specified by the value attribute is used as the value of the facet in the index. In the following example, the value of the cityName feature of the com.ibm.analytics.types.City annotation is used for the value of the City facet.
<indexBuildItem>
  <name>com.ibm.analytics.types.City</name>
  <indexRule>
    <style name="Facet">
      <attribute name="fixedName" value="City"/>
      <pathComponent name="feature" value="cityName"/>
    </style>
  </indexRule>
</indexBuildItem>
With the <pathComponent name="literal"/> element, the specified string in the value attribute is used as the value of the facet in the index. In the following example, the string Energy is used for the value of the technology facet.
<indexBuildItem>
  <name>com.ibm.analytics.types.Technology</name>
  <indexRule>
    <style name="Facet">
      <attribute name="fixedName" value="technology"/>
      <pathComponent name="literal" value="Energy"/>
    </style>
  </indexRule>
</indexBuildItem>

If no <pathComponent> element is specified, the entire text span that is covered by the annotation is used as the facet value.
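
For comparison, the following sketch reuses the City type from the earlier example without a <pathComponent> element, so the text covered by each City annotation would become the value of the City facet.

```xml
<indexBuildItem>
  <name>com.ibm.analytics.types.City</name>
  <indexRule>
    <style name="Facet">
      <!-- No pathComponent: the covered text becomes the facet value -->
      <attribute name="fixedName" value="City"/>
    </style>
  </indexRule>
</indexBuildItem>
```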

For enterprise search collections only, you can specify multiple <pathComponent> elements per <style name="Facet"> element to produce hierarchical facets. In the following example, the value of the role feature of the com.ibm.lang.Expert annotation is used for the first-level value of the expert facet, and the value of the division feature is used for the second-level value of the facet to produce a hierarchical facet value such as Expert > Role > Division.
<indexBuildItem>
  <name>com.ibm.lang.Expert</name>
  <indexRule>
    <style name="Facet">
      <attribute name="fixedName" value="expert"/>
      <pathComponent name="feature" value="role"/>
      <pathComponent name="feature" value="division"/>
    </style>
  </indexRule>
</indexBuildItem>
If you specify multiple <pathComponent> elements in a <style/> element for a content analytics collection, a flat facet is created according to the first <pathComponent> element that is specified. All additional <pathComponent> elements in that style are ignored.

What to do next

After you create the XML file, you must use the administration console to upload it to Watson Explorer Content Analytics and select the common analysis structure to index mapping file when you configure text analysis engine options.