Writing an XSL Parser

About this task

Let's try parsing the output from BBC News.

Procedure

  1. We have created a source for you with the form already filled out, so you can concentrate on the parser. Click the List icon beside the Sources menu item, and locate and open the source named BBC-Tutorial.
  2. Select the Parser tab.
  3. Click Add Parser Component and select XSL Parser from the pop-up list to create a new XSL parser. This is a predefined function used by the Watson Explorer Engine software. You only need to provide the XPath expressions of the various nodes you wish to parse.
  4. Look at the HTML output of a BBC News Search at http://www.bbc.co.uk/search?filter=news&q=test. You will begin to see patterns (in the output, that is). We will use these in our XSL parser.
  5. The first field is results-select, the XPath expression that defines the entirety of each result. All other fields are relative to this one, and each occurrence of this XPath expression in the search page will create a new document for Watson Explorer Engine to cluster. For this tutorial we can see that all results are contained in the <ol> element whose class attribute is search-results results, and that the useful content for each result is contained in the li/article/div of that <ol>. Enter //ol[@class='search-results results']/li/article/div for the results-select.
  6. The second field is url-select. Notice that the result URL is always in the first h1 element of the result. Therefore, the XPath expression to enter would be: h1/a/@href
  7. For title-select, we want the text of the link that we identified for the URL. Because we want the contents of that link rather than its HREF attribute, we simply drop the /@href portion of the previous expression. The XPath expression to enter is: h1/a
  8. The result offers three candidate snippets for snippet-select, each in its own <p> element. For the purpose of this tutorial we will select the longest snippet, which happens to be the third one. As you would expect, the XPath expression we need is p[3].
  9. For a standard source, we would be done now, but BBC News has some additional content that could be useful to the user and will help show the differences between Regular Expressions (Regex) and XSL. First, notice that each result is listed with the date on which it was last updated. We can capture this data with the date-select field. The date is contained in a definition list in the <footer> element of each result. Notice the extra whitespace in this element; we can make sure that the content is recognized as a date by using the normalize-space function, which strips leading and trailing whitespace and collapses internal runs of whitespace to single spaces. The XPath expression we need is: normalize-space(footer/dl/dd/time).
  10. Next, we can return the category of the result to the user using the category-select field. Again, this is in the <footer> element of the result, but this time inside a span. The entry into the category-select field will be: footer/dl/dd/span.
  11. Finally, we can return the total number of results for this source by providing a value for the total-results field. The total results information is in the data-total-results attribute of the <ol class='search-results results'> element. This translates to the following XPath expression: //ol[@class='search-results results']/@data-total-results.
  12. Click OK and test the source.
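
To check the expressions outside the product, the steps above can be sketched in Python. This is only an illustration, not how Watson Explorer Engine evaluates the parser: SAMPLE_PAGE is a hypothetical, simplified stand-in for the BBC search page, and parse_results is a made-up helper name. Python's standard xml.etree.ElementTree supports only a subset of XPath, so the //ol path is written as .//ol, the /@href and /@data-total-results steps are emulated with .get(), and normalize-space is emulated by splitting and rejoining on whitespace.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the search page (not real BBC markup).
SAMPLE_PAGE = """
<html><body>
<ol class="search-results results" data-total-results="2">
  <li><article><div>
    <h1><a href="http://example.com/news/1">First test story</a></h1>
    <p>Short.</p>
    <p>Medium snippet.</p>
    <p>The longest snippet for the first story.</p>
    <footer><dl>
      <dd><time>
          1 Jan 2015
      </time></dd>
      <dd><span>News</span></dd>
    </dl></footer>
  </div></article></li>
  <li><article><div>
    <h1><a href="http://example.com/news/2">Second test story</a></h1>
    <p>Short.</p>
    <p>Medium snippet.</p>
    <p>The longest snippet for the second story.</p>
    <footer><dl>
      <dd><time>
          2 Jan 2015
      </time></dd>
      <dd><span>Sport</span></dd>
    </dl></footer>
  </div></article></li>
</ol>
</body></html>
"""

RESULTS_LIST = ".//ol[@class='search-results results']"


def parse_results(page):
    """Apply the tutorial's selectors to a well-formed page (hypothetical helper)."""
    root = ET.fromstring(page.strip())

    # total-results: //ol[@class='search-results results']/@data-total-results
    total = root.find(RESULTS_LIST).get("data-total-results")

    docs = []
    # results-select: one document per matching node
    for result in root.findall(RESULTS_LIST + "/li/article/div"):
        link = result.find("h1/a")
        raw_date = result.find("footer/dl/dd/time").text
        docs.append({
            "url": link.get("href"),                        # url-select: h1/a/@href
            "title": link.text,                             # title-select: h1/a
            "snippet": result.find("p[3]").text,            # snippet-select: p[3]
            "date": " ".join(raw_date.split()),             # emulates normalize-space(...)
            "category": result.find("footer/dl/dd/span").text,  # category-select
        })
    return total, docs


if __name__ == "__main__":
    total, docs = parse_results(SAMPLE_PAGE)
    print("total results:", total)
    for doc in docs:
        print(doc)
```

Every field other than total-results is looked up relative to the node matched by results-select, which mirrors how the relative expressions in steps 6 through 10 depend on step 5. Real search pages are rarely well-formed XML, so this sketch only works on a cleaned-up fragment like the one shown.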

Results

To proceed with this tutorial, click Writing a Regular Expression Parser.