About this task
To create richer search results, converters must be specified to properly parse the content
on the summary pages for each graduate and thesis, and to associate that content with the
appropriate PDF file. First, a converter must be specified to enqueue the articles that we
will be crawling. Next, a converter must be specified to parse selected pages when crawling
them, generating VXML content elements that hold the metadata to be associated with each
thesis.
The first converter extracts the links from the pages that are being crawled.
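As a rough illustration of what the first converter does, the following Python sketch collects the `href` of every `a` element that is a direct child of a `p` element, mirroring the `//p/a` match pattern used in the procedure. It is not Watson Explorer Engine code: the sample page, the example.com URLs, and the `links_to_enqueue` helper are assumptions made for illustration, and the returned list only stands in for the URLs that `viv:crawl-enqueue-url` would add to the crawl queue.

```python
import xml.etree.ElementTree as ET

# Hypothetical listing page; the real pages and URLs are not shown in this tutorial.
SAMPLE_PAGE = """
<html><body>
  <p><a href="http://example.com/1234/">Thesis one</a></p>
  <p><a href="http://example.com/5678/">Thesis two</a></p>
  <div><a href="http://example.com/about">About</a></div>
</body></html>
"""


def links_to_enqueue(page):
    """Return the href of each <a> that is a direct child of a <p>, in document order."""
    root = ET.fromstring(page)
    # ".//p/a" is the ElementTree equivalent of the XPath pattern //p/a
    return [a.get("href") for a in root.findall(".//p/a") if a.get("href")]


enqueued = links_to_enqueue(SAMPLE_PAGE)
print(enqueued)
```

The link inside the `div` is ignored because its parent is not a `p` element, which is the same filtering that the match pattern applies.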
Procedure
-
In the Configuration section, select the Converting
tab.
-
Click Add a new converter.
-
Select Custom Converter from the dialog that displays.
-
Click Add.
-
Set the following values on the new converter:
-
In the Action text box, enter the following XSL:
<xsl:template match="//p/a">
  <xsl:value-of select="viv:crawl-enqueue-url(@href)" />
</xsl:template>
-
At this point, crawling and indexing this search collection and then performing a test
search would return standard title and snippet results, each of which maps to the PDF
file for a thesis. While this is perfectly usable, it is not optimal. Watson™ Explorer
Engine enables you to create much richer search results by extracting additional
information from the pages that you are indexing and using that information to create
virtual documents.
-
In the Configuration section, select the Converting
tab.
-
Click Add a new converter.
-
Select Custom Converter from the dialog that displays.
-
Click Add.
-
Set the following values on the new converter:
-
In the Action text box, enter the following XSL
(you can omit the XSL comments, which are present only to explain the XSL):
<xsl:template match="/">
  <vce>
    <!-- Store the external links to the pieces of this article -->
    <xsl:variable name="urls" select="//a[contains(text(),'Download')]/@href" />
    <xsl:value-of select="viv:crawl-enqueue-url(//a[contains(text(),'Download')]/@href)" />
    <!-- Store metadata fields from each row of the table -->
    <xsl:variable name="fields" select="//div[@class='ep_summary_content_main']/table[3]/tr" />
    <xsl:for-each select="$urls">
      <xsl:apply-templates select="$fields">
        <xsl:with-param name="key" select="." />
      </xsl:apply-templates>
    </xsl:for-each>
  </vce>
</xsl:template>
<xsl:template match="tr">
  <xsl:param name="key" />
  <content name="{viv:str-to-lower(viv:replace(substring-before(th,':'), '[^a-zA-Z]', '-', 'g'))}" action="none" add-to="{$key}" output-action="bold">
    <xsl:value-of select="td" />
  </content>
</xsl:template>
<xsl:template match="tr[not(th) or th = 'Title:']" />
<!-- <xsl:template match="tr[not(th) or th = 'Files' or th = 'Advisory Committee']" /> -->
-
Click OK to save.
Reorganize the converters so that the Enqueue Articles converter is first in the
list of converters. To do this:
-
Select the number in the '#' column (the left-most column of the converter list)
beside the Enqueue Articles converter.
-
Drag the number for the Enqueue Articles converter to the beginning of the
converter list.
-
Click the Searching tab.
-
Click Edit to examine the value of the Output field in the
Contents subsection.
The default value specifies that all content, with the exception of the URL, host-name,
crawled date, and snippet, will be shown in the output.
-
Click Cancel to return to the Searching tab without
making any changes.
-
Click the Overview tab.
-
If you have not previously performed a crawl, click Start to the
right of the Live Status label to begin a new crawl.
-
If you have previously performed a crawl, click Start beside the
Staging Status label.
Since all of the documents that we are crawling reside on one server, the crawl might
need a little while to complete.
-
When the crawl finishes, click Search.
The system displays a search page showing both dynamically generated snippets from the
PDF document text, and the added content elements from the description pages.
Results
Scrolling down to examine the search result for any thesis shows a rich selection of fields
in the virtual documents, including detailed information about the author, the degree
program for which the document was written, the school the author attended at Pitt,
keywords, an abstract, and so on.
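The names of these fields are derived from the header cells of the metadata table on each summary page. The following Python sketch approximates that naming rule as it appears in the converter's `content` template: take the header text before the colon, replace every non-letter character with a hyphen, and convert the result to lower case. It is an illustration only. The sample rows are invented, the helper names are hypothetical, and it assumes that `viv:replace` with the `'g'` flag performs a global regular-expression replacement and that `viv:str-to-lower` lower-cases its argument.

```python
import re


def content_name(header):
    """Approximate the name expression used on the converter's content element."""
    head, sep, _ = header.partition(":")
    if not sep:
        # XPath substring-before() returns an empty string when ':' is absent
        head = ""
    # Assumed equivalent of viv:replace(..., '[^a-zA-Z]', '-', 'g') plus viv:str-to-lower
    return re.sub(r"[^a-zA-Z]", "-", head).lower()


def row_is_skipped(header):
    """Rows with no header cell, or with the header 'Title:', produce no content."""
    return header is None or header == "Title:"


# Hypothetical (header, value) pairs standing in for the rows of the metadata table
ROWS = [
    ("Title:", "A Sample Thesis"),
    ("Degree Program:", "Computer Science"),
    ("Date of Defense:", "1 April 2010"),
    (None, "row without a header cell"),
    ("Keywords:", "search; indexing"),
]

fields = {content_name(h): v for h, v in ROWS if not row_is_skipped(h)}
print(fields)
```

In the converter itself, each generated content element is also attached to a PDF URL through the `add-to` attribute, which this sketch does not model.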
To proceed with this tutorial, click Final Results Showing Virtual Documents.