Extracting Metadata

About this task

Once you have defined the resources that you want to crawl and index, you must next define any special ways in which you want to process those resources. In this tutorial, based on our analysis of the input files, we want to extract certain information from the files that we are crawling and create specific metadata elements from that data.

Processing the content from the resources that you are crawling is known as converting. Once converted, this content is delivered to the search engine for subsequent use and eventual indexing. Watson Explorer Engine includes a number of built-in converters for web pages, PDF documents, Microsoft Office documents, and many other formats. However, because the goal of this tutorial is to identify, extract, and use metadata that is specific to a given set of documents, we must add a custom converter to extract this information for our use.

Procedure

  1. Click the Converting tab to display the screen on which you can define your own custom converter, as shown in Figure 1.
    Figure 1. The Converting Tab in the Watson Explorer Engine Administration Tool
  2. Click Add a new converter, select Custom Converter from the dialog that displays, and click Add to create a new converter, as shown in Figure 2.
    Figure 2. Adding a Custom Converter for Retrieved Content

    The fields on this screen and the values to which you should set each field are the following:

    • Type-In - The type of input data (HTTP content-type) that this converter should process. Set this to text/html.
    • Type-Out - The type of output that this converter should create. In this case, we want to produce generic, unnormalized VXML (a custom XML implementation that is used within Watson Explorer Engine) that can subsequently be processed by other portions of Watson Explorer Engine. Set this to application/vxml-unnormalized.
    • Name - A name for this converter. For the purposes of this tutorial, set this to Create Metadata from Content.
    • Action - The type of processing to perform on the input. In this case, we want to parse the input as HTML and process it with XSL. Set this to html-xsl.
    • XSL Parser Code - The actual XSL code for processing the input and producing output. When writing an XSL parser for deployment, you would usually also set content element attributes such as weight, action, and output-action, but those attributes are not used here in order to streamline this tutorial. Enter the parser code shown in the following listing:
      Note: The XSL provided here converts the contents of the HTML file into usable Watson Explorer index documents. A Watson Explorer index document consists of a <document> node with child <content> nodes defining the contents of the indexed items. See the Watson Explorer Engine Schema for a detailed definition of the document and content nodes.
      <xsl:template match="/">
        <document>
          <content name="title" output-action="bold" weight="3">
            <xsl:value-of select="html/body/h1" />
          </content>
          <xsl:apply-templates select="//td/@id" />
        </document>
      </xsl:template>
      <xsl:template match="td/@id">
        <xsl:choose>
          <xsl:when test=" . = 'synopsis'">
            <content name="snippet">
              <xsl:value-of select=".." />
            </content>
          </xsl:when>
          <xsl:when test=" . = 'year'">
            <content name="year">
              <xsl:value-of select=".." />
            </content>
            <content name="datesecs">
              <xsl:value-of select="viv:parse-date(..,'%Y')" />
            </content>
          </xsl:when>
          <xsl:otherwise>
            <content name="{.}">
              <xsl:value-of select=".." />
            </content>
          </xsl:otherwise>
        </xsl:choose>
      </xsl:template>
    Note: This parser extracts content elements with the names title, year, datesecs (a calculated version of the year that will be used in later tutorials), genre, author, publisher, sales_20(08-11), hero, and snippet. Both title and snippet are predefined content elements in Watson Explorer Engine and are automatically extracted by the Watson Explorer Engine search engine during a crawl, but in this case, we want to explicitly set them to specific portions of the input data. A sketch of a sample input file and the output that this parser would produce for it follows this procedure.
  3. Once you have entered these settings, click OK to save your changes and close the converter that you have just created.
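
To make the converter's behavior concrete, the following sketch shows a hypothetical input file, shaped like the tutorial's data files (an h1 title plus td cells whose id attributes name each metadata field), and the approximate VXML document that the XSL above would produce for it. The file contents, the order of the content elements, and the value returned by viv:parse-date are illustrative assumptions, not output copied from an actual crawl.

A hypothetical input file:

  <html>
    <body>
      <h1>A Sample Book Title</h1>
      <table>
        <tr><td id="year">2009</td></tr>
        <tr><td id="genre">Mystery</td></tr>
        <tr><td id="synopsis">A short description of the book.</td></tr>
      </table>
    </body>
  </html>

Approximate converter output:

  <document>
    <content name="title" output-action="bold" weight="3">A Sample Book Title</content>
    <content name="year">2009</content>
    <content name="datesecs"><!-- value returned by viv:parse-date("2009", '%Y') --></content>
    <content name="genre">Mystery</content>
    <content name="snippet">A short description of the book.</content>
  </document>

Because the second template matches td/@id directly, any td cell whose id is not synopsis or year (for example genre, author, or publisher) falls through to the xsl:otherwise branch and becomes a content element named after that id.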

Results

Note: If your search collection is seeded from a URL and the files that the search collection references are HTML files, you must add a converter that extracts links from HTML files and enqueues them for crawling and indexing. This converter, known as the HTML Link Extractor, is especially important because the crawler begins by retrieving the index.html file in the collection's seed and must extract links to the other files from it. For any search collection that crawls HTML files and needs to follow links in those files, the HTML Link Extractor must appear as the first converter in the list shown in Figure 1. In general, whenever you add your own converter for HTML files to the conversion process, you must also add the HTML Link Extractor as the first converter. If you have not added your own converter whose input type is text/html, HTML link extraction is still done internally during the conversion process.

To add the HTML Link Extractor, click the down arrow to expose the list of available converters, select HTML Link Extractor from that list, and click Add. An edit screen displays, showing the variables that you can set for this converter. Because we only want its default behavior, you do not need to modify anything. Click OK to accept the converter as-is.

To proceed to the next section of this tutorial, click Creating a Fast Index for Metadata.