Clustering a Set of Web Pages

Our first scenario involves clustering a set of web pages (a set of text documents is handled similarly). The outermost format of the XML input file will be:

<?xml version="1.0" encoding="UTF-8"?>
<vce>
  (document goes here)
</vce>

Within the XML, each document is marked up separately, together with its 'contents'. Each content represents one field of the document. A document is represented by XML markup in the following format:

<document>
  <content type="html" action="cluster" output-action="summarize">
    <![CDATA[(document text here)]]>
  </content>
</document>

A few rules restrict what may appear in the document text:

  1. All characters must be encoded using UTF-8. ASCII characters are already "encoded" in UTF-8.
  2. The sequence ]]> may not appear. You must write ]]]><![CDATA[]> instead.
  3. The following Unicode characters may not appear in an XML document:
    • Characters with codes lower than 32, other than TAB (9) and CR/LF (13, 10).
    • The Unicode character 0xFFFE.
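The three rules above can be applied mechanically before document text is placed in the input file. The following is a minimal Python sketch, not part of the product: the function names to_cdata and round_trip are hypothetical, and the sketch assumes that every control character below 32 other than TAB, LF, and CR should simply be dropped (code 31 included, as XML 1.0 requires).

```python
import re
import xml.etree.ElementTree as ET

# Rule 3: control characters below 32 other than TAB (9), LF (10), CR (13),
# plus the Unicode character 0xFFFE. Dropping them is an assumption of this
# sketch; replacing them with a space would work equally well.
_DISALLOWED = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\ufffe]")


def to_cdata(text: str) -> str:
    """Wrap document text in a CDATA section that follows the rules above."""
    cleaned = _DISALLOWED.sub("", text)
    # Rule 2: a literal "]]>" would end the CDATA section early, so it is
    # split across two adjacent CDATA sections.
    safe = cleaned.replace("]]>", "]]]><![CDATA[]>")
    return "<![CDATA[" + safe + "]]>"


def round_trip(text: str) -> str:
    """Parse the wrapped text back, to confirm the parser sees the original."""
    element = ET.fromstring("<content>" + to_cdata(text) + "</content>")
    return element.text or ""
```

Rule 1 is satisfied when the finished file is written out, for example by opening it with encoding="utf-8". Calling round_trip on a string that contains ]]> returns the string unchanged, which shows that the split form is read back as the original text.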

In Watson™ Explorer Engine XML, tags (e.g., document and content) are called elements, and they may have attributes (such as type and action). Attribute values are subject to the three rules above, plus one more: special characters must be escaped. The special characters are the newline characters (character codes 13 and 10) and & < " '. To include one of these characters in an attribute value, write &#__;, substituting the decimal code of the character you are escaping for the blank (for example, &#38; for &).
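As an illustration of the attribute escaping rule, here is a short Python sketch. The function name escape_attribute and the title attribute used in the check are hypothetical and chosen only for this example.

```python
import xml.etree.ElementTree as ET

# The special characters named above: CR (13), LF (10), and & < " '
SPECIAL = {"\r", "\n", "&", "<", '"', "'"}


def escape_attribute(value: str) -> str:
    """Replace each special character with a decimal reference such as &#38;."""
    return "".join(f"&#{ord(ch)};" if ch in SPECIAL else ch for ch in value)


# Usage: the escaped value can be placed between double quotes in a tag.
# "title" is a made-up attribute name for this check.
escaped = escape_attribute('Tom & "Jerry"')
element = ET.fromstring('<content title="' + escaped + '"/>')
```

Here escaped is Tom &#38; &#34;Jerry&#34;, and element.get("title") returns the original string, so the parser sees the value that was intended.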

In the XML above, notice that the action attribute on the content node is set to cluster and the output-action is summarize. The action specifies that this content will be used for clustering. The output-action specifies that this content's information will be used to generate a summary in the output.

A very simple example input is a pair of documents:

<?xml version="1.0" encoding="UTF-8"?>
<vce>
  <document>
    <content type="html" action="cluster" output-action="summarize">
      <![CDATA[
        <h2>This is an example</h2>
        This web page is going to be clustered.
      ]]>
    </content>
  </document>
  <document>
    <content type="text" action="cluster" output-action="summarize">
      <![CDATA[
        This text document is also going to be clustered.
      ]]>
    </content>
  </document>
</vce>
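An input file of this shape can also be generated by a script. The Python sketch below is one way to do so under stated assumptions: build_input is a hypothetical helper, each document has a single content, and the content type is one of the plain values used in this section (html or text), so it needs no attribute escaping. The sketch then parses its own output to check that it is well-formed.

```python
import xml.etree.ElementTree as ET


def build_input(documents):
    """Build the XML input from a list of (content_type, text) pairs."""
    lines = ['<?xml version="1.0" encoding="UTF-8"?>', "<vce>"]
    for content_type, text in documents:
        # Keep a literal "]]>" from ending the CDATA section early.
        safe = text.replace("]]>", "]]]><![CDATA[]>")
        lines.append("  <document>")
        lines.append(
            f'    <content type="{content_type}" action="cluster" '
            'output-action="summarize">'
        )
        lines.append(f"      <![CDATA[{safe}]]>")
        lines.append("    </content>")
        lines.append("  </document>")
    lines.append("</vce>")
    return "\n".join(lines)


xml_text = build_input([
    ("html", "<h2>This is an example</h2> This web page is going to be clustered."),
    ("text", "This text document is also going to be clustered."),
])

# Parse the UTF-8 bytes to confirm the result is well-formed.
root = ET.fromstring(xml_text.encode("utf-8"))
contents = root.findall("document/content")
```

After parsing, root is the vce element and contents holds the two content elements. The HTML markup in the first document is returned as plain text because it sits inside a CDATA section.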

Other attributes (some used above) that may appear on the document and content elements are documented in the schema summary in the online documentation.

To proceed with this tutorial, click Clustering Search Engine Results.