Parse Source Results

Once results have been fetched from the various sources, they must be transformed into a standardized format for subsequent processing. This transformation is done by a source-specific parser. A Watson™ Explorer Engine parser converts a string into XML; in this case, it transforms the string returned from a call to a source into XML that is a standardized representation of the search results. If a source returns an HTML page, the parser must extract the relevant content from the HTML. Similarly, if a source returns XML, the parser must extract content from the XML structure. Parsers are based on either XSL or regular expressions, both of which can effectively locate the necessary information within HTML or XML result strings.
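The idea of a regular-expression parser can be sketched outside the product. The following Python example is an illustration only, not the Watson Explorer Engine parser API: the HTML layout, the parse_results function, and the 1-based rank are assumptions made for this sketch. It shows a result string being matched with a regular expression and each match being emitted as a document element with content sub-elements.

```python
import html
import re
import xml.etree.ElementTree as ET

# Hypothetical HTML returned by a source; this layout is an assumption.
SAMPLE_HTML = """
<div class="result"><a href="http://example.com/a">First title</a><p>First snippet</p></div>
<div class="result"><a href="http://example.com/b">Second title</a><p>Second snippet</p></div>
"""

# One match per search result; named groups capture the pieces to keep.
RESULT_RE = re.compile(
    r'<div class="result">\s*'
    r'<a href="(?P<url>[^"]+)">(?P<title>.*?)</a>\s*'
    r'<p>(?P<snippet>.*?)</p>\s*</div>',
    re.DOTALL,
)


def parse_results(result_html, source):
    """Convert a source's HTML result string into a list of document elements."""
    documents = []
    # Starting the rank at 1 is an assumption of this sketch.
    for rank, match in enumerate(RESULT_RE.finditer(result_html), start=1):
        doc = ET.Element(
            "document",
            {"URL": match.group("url"), "source": source, "rank": str(rank)},
        )
        for name in ("title", "snippet"):
            content = ET.SubElement(doc, "content", {"name": name})
            # Decode HTML entities; ElementTree re-escapes text on output.
            content.text = html.unescape(match.group(name).strip())
        documents.append(doc)
    return documents


if __name__ == "__main__":
    for document in parse_results(SAMPLE_HTML, "example"):
        print(ET.tostring(document, encoding="unicode"))
```

A regular expression of this kind is tied to one source's markup, which is why each source needs its own parser.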

Watson Explorer Engine uses the document XML element as its standardized representation of an individual search result. A document element contains a set of content sub-elements that represent the components of a result, such as its title, snippet, and author. A sample result looks like the following:

<document URL="http://www.epa.gov/greenpower/buygreenpower/guide.htm"
  source="greenpower" parse-ref="2" rank="6"
  score="0.066667" id="Ndoc69" base-score="0.066667">
  <content name="title" output-action="bold" type="HTML" action="cluster" weight="1.000000">
  EPA - GPP - Guide to Purchasing Green Power
  </content>
  <content name="snippet" output-action="bold" type="HTML" action="cluster" weight="1.000000">
  The U.S. EPA's Green Power Partnership is a voluntary program
  designed to reduce the environmental ... Guide to Purchasing Green
  Power. This Guide to Purchasing Green Power provides information
  about ...
  </content>
</document>

Note that some information about the document, such as its URL, the source used to obtain it, and its score (used for ranking), is provided as attributes of the document element rather than as content elements. See the description of the document element in the online documentation for details.
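To make the split between attributes and content elements concrete, the following Python sketch reads the sample result shown above with the standard xml.etree.ElementTree module. It is an illustration only: the summarize function is a hypothetical helper, not part of Watson Explorer Engine, and the whitespace normalization is a choice made for this example.

```python
import xml.etree.ElementTree as ET

# The sample result from this topic, copied as a string.
SAMPLE_DOCUMENT = """<document URL="http://www.epa.gov/greenpower/buygreenpower/guide.htm"
  source="greenpower" parse-ref="2" rank="6"
  score="0.066667" id="Ndoc69" base-score="0.066667">
  <content name="title" output-action="bold" type="HTML" action="cluster" weight="1.000000">
  EPA - GPP - Guide to Purchasing Green Power
  </content>
  <content name="snippet" output-action="bold" type="HTML" action="cluster" weight="1.000000">
  The U.S. EPA's Green Power Partnership is a voluntary program
  designed to reduce the environmental ... Guide to Purchasing Green
  Power. This Guide to Purchasing Green Power provides information
  about ...
  </content>
</document>"""


def summarize(document_xml):
    """Return (metadata, contents) for one document element (hypothetical helper)."""
    doc = ET.fromstring(document_xml)
    # Result-level metadata is stored as attributes of the document element.
    metadata = {
        "url": doc.get("URL"),
        "source": doc.get("source"),
        "score": float(doc.get("score")),
    }
    # Result components are content sub-elements, keyed by their name attribute.
    contents = {
        content.get("name"): " ".join((content.text or "").split())
        for content in doc.findall("content")
    }
    return metadata, contents


if __name__ == "__main__":
    metadata, contents = summarize(SAMPLE_DOCUMENT)
    print(metadata["source"], metadata["score"])
    print(contents["title"])
```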

For the next processing step, see Combine Result Documents.