Building Virtual Documents

About this task

To create richer search results, you must specify converters that parse the content on the summary page for each graduate thesis and associate that content with the appropriate PDF file. First, a converter must be specified to enqueue the articles that will be crawled. Next, a converter must be specified to parse selected pages as they are crawled, generating VXML content elements that hold the metadata to be associated with each thesis.

The following procedure creates these converters, beginning with the one that extracts the links from the pages that are being crawled.

Procedure

  1. In the Configuration section, select the Converting tab.
  2. Click Add a new converter.
  3. Select Custom Converter from the dialog that displays.
  4. Click Add.
  5. Set the following values on the new converter:
    • Name: Enqueue Articles - This names the converter so that we can easily identify it in the list of converters.
    • Type-In: text/html - This converter will run when the source file is an HTML page.
    • Type-Out: application/vxml-unnormalized - This converter outputs unnormalized VXML.
    • Output forking: (leave unset) - This converter does not use forking, since its output needs no further processing and will be associated with specific document elements.
    • Test url with wc-set:
      http://d-scholarship.pitt.edu/view/year/2013.html
    • Action: html-xsl - This is an XSL parser that expects HTML input.
  6. In the Action text box, enter the following XSL:
    <xsl:template match="//p/a">
      <xsl:value-of select="viv:crawl-enqueue-url(@href)" />
    </xsl:template>
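
    The template above can be sketched in Python to show its logic: collect the href of every link that appears inside a p element (the //p/a match) and hand each one off for enqueueing. This is only an illustrative sketch; the HTML snippet and URL below are hypothetical, and viv:crawl-enqueue-url is stood in for by simply collecting the links.

```python
# Sketch of the Enqueue Articles template: find every link inside a <p>
# element (as //p/a does) and collect its href. In the real converter,
# each href is passed to viv:crawl-enqueue-url instead of being stored.
from html.parser import HTMLParser

class ParagraphLinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_p = 0      # depth of open <p> elements
        self.hrefs = []    # links found inside <p> elements

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_p += 1
        elif tag == "a" and self.in_p:
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)

    def handle_endtag(self, tag):
        if tag == "p" and self.in_p:
            self.in_p -= 1

# Hypothetical page fragment: one link inside a <p>, one outside.
page = ('<p><a href="http://d-scholarship.pitt.edu/18000/">Thesis A</a></p>'
        '<a href="http://example.com/ignored">outside any paragraph</a>')
parser = ParagraphLinkExtractor()
parser.feed(page)
print(parser.hrefs)  # → ['http://d-scholarship.pitt.edu/18000/']
```

    Only the link inside the p element is collected, which mirrors why the XSL matches //p/a rather than //a: it restricts enqueueing to the links in the body of the listing page.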

  7. At this point, crawling and indexing this search collection and performing a test search would show standard title/snippet search results that map to the PDF file of each thesis. While this is perfectly usable, it is not optimal. Watson Explorer Engine enables you to create much richer search results by extracting additional information from the pages that you are indexing and using it to build virtual documents.

  8. In the Configuration section, select the Converting tab.
  9. Click Add a new converter.
  10. Select Custom Converter from the dialog that displays.
  11. Click Add.
  12. Set the following values on the new converter:
    • Type-In: text/html - This converter will run when the source file is an HTML page.
    • Type-Out: application/vxml-unnormalized - This converter outputs unnormalized VXML.
    • Output forking: (leave unset) - This converter does not use forking, since its output needs no further processing and will be associated with specific document elements.
    • Test url with regex:
      ^http://d-scholarship.pitt.edu/[0-9]{5}/$

      This regular expression matches the URLs of all thesis description pages.

    • Action: html-xsl - This is an XSL parser that expects HTML input.
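
    As a quick sanity check, the test regular expression can be tried against sample URLs, for example with Python's re module. The record ID 18231 below is hypothetical, and the dots are escaped here for strictness (an unescaped dot matches any character, although in practice both forms select the same pages):

```python
# Sketch: verify which URLs the description-page pattern matches.
import re

# Five digits after the hostname, with a trailing slash.
pattern = re.compile(r"^http://d-scholarship\.pitt\.edu/[0-9]{5}/$")

print(bool(pattern.match("http://d-scholarship.pitt.edu/18231/")))              # True
print(bool(pattern.match("http://d-scholarship.pitt.edu/view/year/2013.html"))) # False
print(bool(pattern.match("http://d-scholarship.pitt.edu/123/")))                # False
```

    The yearly listing page crawled by the Enqueue Articles converter does not match, so this converter runs only on the per-thesis description pages.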
  13. In the Action text box, enter the following XSL (you can skip the XSL comments; they are present only to explain the code):

    <xsl:template match="/">
     <vce>
      <!-- Store the external links to the pieces of this article -->
      <xsl:variable name="urls" select="//a[contains(text(),'Download')]/@href" />
      <!-- Enqueue each of those links for crawling -->
      <xsl:value-of select="viv:crawl-enqueue-url($urls)" />
      <!-- Store metadata fields from each row of the table -->
      <xsl:variable name="fields" select="//div[@class='ep_summary_content_main']/table[3]/tr" />
      <xsl:for-each select="$urls">
       <xsl:apply-templates select="$fields">
        <xsl:with-param name="key" select="." />
       </xsl:apply-templates>
      </xsl:for-each>
     </vce>
    </xsl:template>

    <!-- Emit one content element per table row, named after the row's header -->
    <xsl:template match="tr">
     <xsl:param name="key" />
     <content name="{viv:str-to-lower(viv:replace(substring-before(th,':'), '[^a-zA-Z]', '-', 'g'))}" action="none" add-to="{$key}" output-action="bold">
      <xsl:value-of select="td" />
     </content>
    </xsl:template>

    <!-- Suppress rows that have no header cell, and the Title row -->
    <xsl:template match="tr[not(th) or th = 'Title:']" />
    <!-- <xsl:template match="tr[not(th) or th = 'Files' or th = 'Advisory Committee']" /> -->
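
    The name attribute of each content element is derived from the row's header cell by the chain of substring-before, viv:replace, and viv:str-to-lower. The following Python sketch shows that chain; the field labels used below are hypothetical examples, and note that viv:replace with the 'g' flag replaces every match, which is what Python's re.sub does by default:

```python
# Sketch of how each content element's name is derived from a <th> label:
# take the text before the colon, replace every non-alphabetic character
# with '-', and lowercase the result.
import re

def content_name(th_text: str) -> str:
    label = th_text.split(":", 1)[0]          # substring-before(th, ':')
    label = re.sub(r"[^a-zA-Z]", "-", label)  # viv:replace(..., '[^a-zA-Z]', '-', 'g')
    return label.lower()                      # viv:str-to-lower(...)

print(content_name("Date Type:"))  # → date-type
print(content_name("Degree:"))     # → degree
```

    This normalization guarantees that each generated content element has a simple lowercase, hyphenated name regardless of how the label is written on the page.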
  14. Click OK to save.

    Reorganize the converters so that the Enqueue Articles converter is first in the list of converters. To do this:

  15. Select the number in the '#' column (the left-most column of the converter list) beside the Enqueue Articles converter.

  16. Drag the number for the Enqueue Articles converter to the beginning of the converter list.
  17. Click the Searching tab.
  18. Click Edit to examine the value of the Output field in the Contents subsection.

    The default value specifies that all content, with the exception of the URL, host-name, crawled date, and snippet, will be shown in the output.

  19. Click Cancel to return to the Searching tab without making any changes.
  20. Click the Overview tab.
  21. If you have not previously performed a crawl, click Start to the right of the Live Status label to begin a new crawl.
  22. If you have previously performed a crawl, click Start beside the Staging Status label.

    Because all of the documents that we are crawling reside on a single server, the crawl might take a little while to complete.

  23. When the crawl finishes, click Search.

    The system displays a search page showing both dynamically generated snippets from the PDF document text and the added content elements from the description pages.

Results

Scrolling down to examine the search result for any thesis shows a rich selection of fields in the virtual documents, including detailed information about the author, the degree program for which the document was written, the school the author attended at Pitt, keywords, an abstract, and so on.

To proceed with this tutorial, click Final Results Showing Virtual Documents.