We have created a source for you with the form already filled out, so you can
concentrate on the parser. Click the List icon beside the Sources
menu item, and locate and open the source named BBC-Tutorial.
Select the Parser tab.
Click Add Parser Component and select XSL
Parser from the pop-up list to create a new XSL parser. This is a predefined
function used by the Watson Explorer Engine software. You only
need to provide the XPath expressions of the various nodes you wish to parse.
Look at the HTML source of a BBC News search results page, for example http://www.bbc.co.uk/search?filter=news&q=test. You will begin to see
patterns in the markup. We will use these patterns in our XSL parser.
The first field is results-select, the XPath expression that
defines the entirety of each result. All other fields are relative to this one, and each
occurrence of this XPath expression in the search page will create a new document for
Watson Explorer Engine to cluster. For this tutorial, we can
see that all results are contained in the <ol> element whose
class attribute is search-results results, and that the useful content for each result is
contained in the <div> inside each li/article of that
<ol>. Enter //ol[@class='search-results results']/li/article/div for the results-select.
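To see what this expression selects, the following minimal Python sketch runs it against a small stand-in document. The markup is a hypothetical simplification of the BBC page, written as well-formed XML, and the example.org URLs are placeholders. Python's standard xml.etree.ElementTree module supports only a subset of XPath, so the expression is prefixed with a dot; the XSL parser itself takes the expression exactly as given above.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the BBC search results markup.
SAMPLE = """
<html><body>
  <ol class="search-results results" data-total-results="2">
    <li><article><div><h1><a href="http://example.org/a">First</a></h1></div></article></li>
    <li><article><div><h1><a href="http://example.org/b">Second</a></h1></div></article></li>
  </ol>
</body></html>
""".strip()

root = ET.fromstring(SAMPLE)

# ElementTree requires a leading "." for a descendant search from an element.
results = root.findall(".//ol[@class='search-results results']/li/article/div")

# Each matched <div> corresponds to one document handed to the engine.
print(len(results))
```

Every other expression in this parser is evaluated relative to one of these matched nodes.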
The second field is url-select. Notice that the result URL is
always in the first h1 element of the result. Therefore, the XPath
expression to enter would be: h1/a/@href
For title-select, we want the text of the link that we
identified for the URL, rather than its HREF attribute. The XPath
expression is therefore very similar: h1/a. Dropping the /@href portion
captures the content of the tag instead of the attribute value.
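The difference between the two expressions can be sketched in Python as follows. The result markup and URL are hypothetical. Because xml.etree.ElementTree cannot select attributes through XPath, the sketch finds h1/a and then reads the attribute or the text from that element, which approximates what h1/a/@href and h1/a return.

```python
import xml.etree.ElementTree as ET

# Hypothetical single result node, as selected by results-select.
RESULT = (
    '<div><h1><a href="http://example.org/story">'
    "<em>Test</em> match report</a></h1></div>"
)
result = ET.fromstring(RESULT)

link = result.find("h1/a")

# Stands in for h1/a/@href: the value of the HREF attribute.
url = link.get("href")

# Stands in for h1/a: the text content of the link, including nested tags.
title = "".join(link.itertext())

print(url)
print(title)
```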
The snippet-select content gives us three choices, each
contained in its own <p> element. For the purpose of this
tutorial we will select the longest snippet, which happens to be the third one. As you
would expect, the XPath expression we need is p[3].
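As a quick illustration of the positional predicate, this Python sketch selects the third <p> child of a hypothetical result node. The snippet texts are invented for the example. XPath positions are 1-based, so p[3] is the third paragraph, not the fourth.

```python
import xml.etree.ElementTree as ET

# Hypothetical result node with three snippets of increasing length.
RESULT = """
<div>
  <p>Short.</p>
  <p>A medium snippet.</p>
  <p>The longest snippet of the three.</p>
</div>
""".strip()
result = ET.fromstring(RESULT)

# p[3] picks the third <p> child (positions start at 1).
snippet = result.find("p[3]").text
print(snippet)
```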
For a standard source, we would be done now, but BBC News has some additional content
that could be useful to the user and will help display differences between Regular
Expressions (Regex) and XSL. First, notice that each result is listed with the date that
it was last updated. We can capture this data and use it in the
date-select field. The date on which the item was last modified
is contained in a definition list in the <footer> element of
each result. Notice the extra whitespace in this element; we can make sure that this
content is recognized as a date by removing that whitespace with the
normalize-space function. The XPath expression we need is:
normalize-space(footer/dl/dd/time).
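The effect of normalize-space can be approximated in Python as shown below. The markup and the date value are hypothetical. xml.etree.ElementTree has no XPath functions, so the sketch emulates normalize-space by splitting on whitespace and rejoining with single spaces; this is an approximation, since Python's split treats a slightly wider set of characters as whitespace than XPath does.

```python
import xml.etree.ElementTree as ET

# Hypothetical result node; note the padding around the date text.
RESULT = """<div><footer><dl><dd><time>
      12 Mar   2015
  </time></dd></dl></footer></div>"""
result = ET.fromstring(RESULT)

# Raw text of footer/dl/dd/time, including line breaks and indentation.
raw = result.findtext("footer/dl/dd/time")

# Emulates normalize-space(): trim both ends, collapse internal runs.
date = " ".join(raw.split())

print(repr(raw))
print(repr(date))
```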
Next, we can return the category of the result to the user using the
category-select field. Again, this is in the
<footer> element of the result, but this time inside a
span. The entry into the category-select field will be:
footer/dl/dd/span.
Finally, we can return the total number of results for this source by providing a value
for the total-results field. The total results information is in
the data-total-results attribute of the <ol class='search-results results'> element.
This translates to the following XPath expression:
//ol[@class='search-results results']/@data-total-results.
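Unlike the other fields, this expression is absolute: it starts from the document root rather than from a result node. A minimal Python sketch, using hypothetical markup and an invented total, looks like this. As before, xml.etree.ElementTree cannot select an attribute through XPath, so the sketch locates the <ol> element and reads the attribute from it.

```python
import xml.etree.ElementTree as ET

# Hypothetical page skeleton; the total of 1234 is made up for the example.
PAGE = """
<html><body>
  <ol class="search-results results" data-total-results="1234">
    <li><article><div><h1><a href="http://example.org/a">First</a></h1></div></article></li>
  </ol>
</body></html>
""".strip()
root = ET.fromstring(PAGE)

# Locate the results list anywhere in the document.
results_list = root.find(".//ol[@class='search-results results']")

# Stands in for .../@data-total-results; attribute values are strings.
total_results = int(results_list.get("data-total-results"))
print(total_results)
```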