document

A single document: one to be clustered (on input), one that has been clustered (in tree in the output), or a duplicate (in documents in the output).

Attributes

async (Boolean default: true) - Asynchronous processing. For parse tags, specifies whether the request should be enqueued (false) or processed before its next sibling. For other elements, this attribute only makes a difference when they contain asynchronous requests that need to be processed before the element itself is processed. In this case, when false, the element's next sibling is processed only after the current element; when true, Watson Explorer does not wait for the current element to be processed before processing its next sibling.
elt-id (Integer) - Usage: Internal
max-elt-id (Integer) - Usage: Internal
execute-acl (Text)
process (Text) - An XPath determining which of the attributes and/or children will be processed. Currently only "", "*", "@*" and "*|@*" are supported.
boost-levels (Text default: -1) - Space-separated list of clustering levels in which the document will be boosted (0 for the top level, 1 for the top-level clusters, etc.). -1 can be used to specify all levels.
boost-display (Any of: boost-and-list, boost-only default: boost-and-list) - Specifies what to do when a document may be boosted and present in the current result list at the same time.
- boost-and-list: Duplicate the document and put it both under a schema.x.element.boost element and the schema.x.element.list.
- boost-only: Just move the document under a schema.x.element.boost element and remove it from the schema.x.element.list (possibly adding a new lower-ranked document to the list to keep the count right).
key (NMToken) - Key to use for key-duplicate elimination (only one document with the same key will be used for clustering).
vse-key (Restricted form of xs:string: Pattern [^"]+) - Key used by the search engine while indexing to allow contents from one URL to be added to a document generated from another URL. If you specify a vse-key on a document, you can add a content to it by using the add-to attribute on a content.
vse-doc-hash (Text) - A string that uniquely identifies this document. Usage: Internal
vse-key-normalized (May only be: vse-key-normalized) - In the search engine, use this attribute to mark the fact that the key is already absolute and should not be normalized. If this attribute is not specified, the key will be treated as a relative URL and normalized with respect to the URL that was crawled.
vse-n-collapsed (Integer) - If key collapsing is enabled in the search-engine, this attribute represents the number of documents that were collapsed.
vse-auto-url (May only be: first-document) - Usage: Internal
authorization-url (Text) - This attribute is currently only supported for the search engine. Before returning this document, the query-service will attempt to download this URL to verify authorization.
default-content-acl (Text) - This attribute is used by the search-engine to index ACLs. ACL list to apply to each content contained within this document element that does not itself provide an acl attribute. This attribute has no effect on contents that are not contained within this specific document element. Usage: Search engine module only
full-document-acl (Text) - This attribute is a place-holder and not currently used. Usage: Search engine module only. Experimental feature that may be officially supported in a subsequent release.
cache-acl (Text) - This attribute is created by the search-engine to indicate the ACLs for each cache document. Usage: Search engine module only
duplicates-priority (Decimal number) - Priority to use when selecting which of a set of duplicates to keep in the clustering. If two documents are key duplicates or are near duplicates, the one with higher priority will be used for clustering. If they have equal priority, the document's score will be used to select the document to keep.
url (Text) - A URL where the content of the document can be found. Used by default as the title link for this document when displaying the result list.
display-url (Text) - URL to show to the user when displaying the result list, if it needs to be different from the actual URL in the title link (for example, sponsored listings usually have two different URLs, the actual link URL being unintelligible to the end user).
rank (Text) - Rank of the document in the results. Not used if score is also specified.
score (Decimal number) - Score of this document in the ranked list. If unspecified, it will be computed based on the rank.
matched (Integer default: 0) - Whether this document has been matched during a sub-search. Usage: Read-only
binned (Text) - Whether this document has been matched during binning. Usage: Read-only
source (NMToken) - The name of the source from which this document comes.
display-source (Text) - The display name of the source from which this document comes.
source-url (URI) - Usage: This functionality is deprecated
source-type (Text) - Informal label describing the type of information contained in the document.
parse-ref (NMToken) - The id of the schema.x.element.parse element from which this document has been extracted.
id (NMToken) - Unique identifier on each document. Specifying IDs on input is a deprecated feature of the software.
paid (Boolean) - Bid price for the paid listing or unspecified for normal results.
original-id (NMToken) - Generated by the clustering engine whenever a second or later copy of the document was inserted into the tree. This is the id attribute of the original document that this is a copy of.
cache (Text) - URL that points to a cached copy of this document.
base-score (Decimal number) - Original score before any meta-searching score adjustments are computed for the document.
vse-base-score (Decimal number) - Specifies (when debugging is enabled) the ranking (base) score for the document. Usage: Search engine module only
la-score (Decimal number) - Specify a link-analysis score for a document that is being indexed by the search-engine. If specified in any other context, it will have no effect. Usage: Search engine module only
la-score-multiplier (Decimal number) - Specify a multiplier for the generated la-score when indexing documents in the search-engine. If specified in any other context, it will have no effect. Usage: Search engine module only
vertex (Integer) - Used internally to associate a document with the URL that generated it. Usage: Search engine module only
query (Text) - The name of the query from which this document is issued. Usage: Read-only
duplicates (NMTokens) - List of documents that are duplicates of this document.
duplicate-of (NMToken) - Identifier of the document that this is a duplicate of (or unspecified if this document is not a duplicate).
duplicate-type (Any of: near, key) - key duplicates are detected based on the key attribute while near duplicates are documents that have very similar contents (see the option near_duplicates).
boost-name (NMToken) - When specified and when the document is part of the document list associated with the active schema.x.element.node, the document will be "boosted", i.e., added to a schema.x.element.boost element with the specified name in the rendering state. See also the boost-levels and boost-display attributes, which affect when and how this boosting happens.
boost-score (Decimal number) - Score used to sort boost documents inside their containing schema.x.element.boost element.
collapse-key (NMToken) - A key used for "site-collapsing". If not specified, the host of the URL is used.
collapse-type (Any of: first, subsequent, hidden) - When documents are collapsed, they should be presented differently based on their rank in the collapsed set.
- first: The first document shown in a collapsed set.
- subsequent: Subsequent documents in a collapsed set.
- hidden: Documents in a collapsed set that should be hidden by default.
stub (May only be: stub) - Mark this document as a stub document containing no contents and a limited set of attributes. Usage: Internal
boosted (May only be: boosted) - Usage: This functionality is deprecated
cookie-jar (Text) - Specify a set of cookies that need to be passed when retrieving this document. Because security limitations make it impossible to set cookies for a third-party site, this will trigger the use of the "proxy" mode in the default setup (i.e., the user will download the search result through the query-meta script and not directly).
headers (Text) - Specify a set of headers that need to be passed when retrieving this document. Because security limitations make it impossible to set request headers in the browser, this will trigger the use of the "proxy" mode in the default setup.
parser (Text) - Specify the name of a parser that needs to be applied when retrieving this document. This will trigger the use of the "proxy" mode in the default setup.
top-paid (May only be: top-paid) - Usage: Internal
mwi-shingle (NMToken) - Usage: Internal
vse (May only be: vse) - Flag specifying that the document comes from a Watson Explorer Engine source. Usage: Internal
vse-key-check
Any user-defined attribute
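
The duplicates-priority rule described above (the duplicate with the higher priority is kept, and equal priorities fall back to the document's score) can be illustrated with a short Python sketch. This is not the engine's implementation. The Doc class and the pick_duplicate_to_keep function are hypothetical names, and two behaviors are assumptions made for the example: a missing duplicates-priority is treated as 0, and on a priority tie the higher score is preferred.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Doc:
    """Hypothetical stand-in for the relevant attributes of a <document>."""
    id: str
    score: float
    duplicates_priority: Optional[float] = None  # None: attribute not specified


def pick_duplicate_to_keep(duplicates: Iterable[Doc]) -> Doc:
    """Return the document to keep from a set of key or near duplicates."""
    def sort_key(doc: Doc):
        # Assumption: an unspecified duplicates-priority counts as 0.0.
        priority = 0.0 if doc.duplicates_priority is None else doc.duplicates_priority
        # Priority is compared first; score only breaks ties
        # (assumption: the higher score wins the tie).
        return (priority, doc.score)
    return max(duplicates, key=sort_key)


docs = [
    Doc("Ndoc0", score=0.9, duplicates_priority=1.0),
    Doc("Ndoc1", score=0.4, duplicates_priority=2.0),
    Doc("Ndoc2", score=0.7, duplicates_priority=2.0),
]
kept = pick_duplicate_to_keep(docs)
print(kept.id)  # Ndoc2
```

Ndoc0 has the best score but the lowest priority, so it loses to both other documents. Ndoc1 and Ndoc2 share the top priority, and the score decides between them.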

Children

Use these in the listed order. The sequence may not repeat.
- sort-key: (Zero or more) - Keys used to sort documents using new XML tags
- vse-index-stream: (Zero or more) - Specifies how the content text will be tokenized and how terms will be created from the tokenized result.
- content: (Zero or more) - Contents hold the data for clustering
- advanced-content: (Zero or more) - A container object that allows specification of both textual content and how that text will be tokenized and converted to terms to be indexed.
- cache: (Zero or more) - Indicates that this document has text cache data available.
- vse-collapsed: (At most 1) - A container element for documents that have been collapsed by the search-engine using the key collapsing feature.
- duplicate-documents: (Zero or more) - A container element for the key- and near-duplicate documents of the containing document.
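
The ordering constraint on children can be checked mechanically before a document is submitted. The following Python sketch uses only the standard library. The check_children function is a hypothetical helper, not part of the product, and it makes one assumption: child elements not named in the list above are ignored.

```python
import xml.etree.ElementTree as ET

# Child element names in the order the list above gives them.
CHILD_ORDER = [
    "sort-key", "vse-index-stream", "content", "advanced-content",
    "cache", "vse-collapsed", "duplicate-documents",
]
AT_MOST_ONCE = {"vse-collapsed"}


def check_children(document: ET.Element) -> list:
    """Return a list of ordering problems found among a <document>'s children."""
    problems = []
    last_index = 0
    seen = set()
    for child in document:
        if child.tag not in CHILD_ORDER:
            continue  # assumption: children not in the list are skipped
        index = CHILD_ORDER.index(child.tag)
        if index < last_index:
            problems.append(
                f"<{child.tag}> appears after <{CHILD_ORDER[last_index]}>"
            )
        else:
            last_index = index
        if child.tag in AT_MOST_ONCE and child.tag in seen:
            problems.append(f"<{child.tag}> may appear at most once")
        seen.add(child.tag)
    return problems


good = ET.fromstring('<document url="url"><content name="title"/><cache/></document>')
bad = ET.fromstring('<document url="url"><cache/><content name="title"/></document>')
print(check_children(good))  # []
print(check_children(bad))   # ['<content> appears after <cache>']
```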

Examples

Input Example:

  <document url="url">
  <content name="title" type="html" action="cluster-bold" weight="3">
  <_cdata_>
  ... the document title would appear here ....
  </_cdata_>
  </content>
  <content name="snippet" type="html" action="cluster" output-action="bold">
  <_cdata_>
  ... the document summary would appear here ....
  </_cdata_>
  </content>
  </document>

Output Example:

  <tree>
  <node id="N0" level="0" sep="0.0000" cohesion="0.0000" ndocs="1" score="0.000000" instances="Ndoc0">
  <document url="url" id="Ndoc0">
  <content name="title" type="html" action="discard" weight="3.000000">
  ... the document title would appear here ....
  </content>
  <content name="title" type="html" action="cluster" weight="3.000000" output-action="bold">
  <_cdata_>
  ... the document title would appear here ....
  </_cdata_>
  </content>
  <content name="snippet" type="html" action="discard" weight="1.000000">
  ... the document summary would appear here ....
  </content>
  <content name="snippet" type="html" action="cluster" output-action="bold" weight="1.000000">
  <_cdata_>
  ... the document summary would appear here ....
  </_cdata_>
  </content>
  </document>
  </node>
  </tree>

Input Example:

  <vce>
  <meta query="companies"/>
  <document url="http://vivisimo.com/">
  <content name="title" type="html" action="cluster-bold" weight="3">
  <_cdata_>
  Vivisimo
  </_cdata_>
  </content>
  <content name="snippet" type="html" action="cluster-bold">
  <_cdata_>
  Groups the results by topic via document clustering technology.
  Options include Web or news search, selection of sources, language
  restriction, and filtering.
  </_cdata_>
  </content>
  </document>
  <document url="http://sportsillustrated.cnn.com/hockey/">
  <content name="title" type="html" action="cluster-bold" weight="3">
  <_cdata_>
  CNN/SI: Hockey
  </_cdata_>
  </content>
  <content name="snippet" type="html" action="cluster-bold">
  <_cdata_>
  Daily news, scores, feature stories, statistics, standings, player
  profiles, polls, and chat.
  </_cdata_>
  </content>
  </document>
  </vce>

Output Example:

  <meta query="companies"/>
  <tree>
  <node id="N2" level="0" sep="0.0000" cohesion="0.0000" ndocs="2" score="0.000000" instances="Ndoc0 Ndoc1">
  <node id="N0" level="1" sep="0.0000" cohesion="0.0000" ndocs="1" score="0.000000" instances="Ndoc0">
  <descriptor string="Vivisimo" sep="1.000000" ratio="1.000000"/>
  <descriptor string="Sources" sep="0.577350" ratio="1.000000"/>
  <document url="http://vivisimo.com/" id="Ndoc0">
  <content name="title" type="html" action="discard" weight="3.000000">
  <span class=b1>Vivisimo</span>
  </content>
  <content name="title" type="html" action="cluster" weight="3.000000" output-action="bold">
  <_cdata_>
  Vivisimo
  </_cdata_>
  </content>
  <content name="snippet" type="html" action="discard" weight="1.000000">
  Groups the results by topic via document clustering technology.
  Options include Web or news search, selection of <span class=b1>sources</span>, language
  restriction, and filtering.
  </content>
  <content name="snippet" type="html" action="cluster" output-action="bold" weight="1.000000">
  <_cdata_>
  Groups the results by topic via document clustering technology.
  Options include Web or news search, selection of sources, language
  restriction, and filtering.
  </_cdata_>
  </content>
  </document>
  </node>
  <node id="N1" level="1" sep="0.0000" cohesion="0.0000" ndocs="1" score="0.000000" instances="Ndoc1">
  <descriptor string="Hockey" sep="1.000000" ratio="1.000000"/>
  <descriptor string="Statistics" sep="0.577350" ratio="1.000000"/>
  <document url="http://sportsillustrated.cnn.com/hockey/" id="Ndoc1">
  <content name="title" type="html" action="discard" weight="3.000000">
  CNN/SI: <span class=b1>Hockey</span>
  </content>
  <content name="title" type="html" action="cluster" weight="3.000000" output-action="bold">
  <_cdata_>
  CNN/SI: Hockey
  </_cdata_>
  </content>
  <content name="snippet" type="html" action="discard" weight="1.000000">
  Daily news, scores, feature stories, <span class=b1>statistics</span>, standings, player
  profiles, polls, and chat.
  </content>
  <content name="snippet" type="html" action="cluster" output-action="bold" weight="1.000000">
  <_cdata_>
  Daily news, scores, feature stories, statistics, standings, player
  profiles, polls, and chat.
  </_cdata_>
  </content>
  </document>
  </node>
  </node>
  </tree>
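
As a sketch of how a client might read such output, the following Python code maps each node to the URLs of the documents directly beneath it. It parses a trimmed copy of the tree above in which only the node and document elements are kept, because the highlighted content in the full output contains unquoted span attributes that a strict XML parser would reject. The documents_by_node function is a hypothetical helper written for this illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the output above: only <node> and <document> are kept.
TREE_XML = """
<tree>
  <node id="N2" level="0" ndocs="2" instances="Ndoc0 Ndoc1">
    <node id="N0" level="1" ndocs="1" instances="Ndoc0">
      <document url="http://vivisimo.com/" id="Ndoc0"/>
    </node>
    <node id="N1" level="1" ndocs="1" instances="Ndoc1">
      <document url="http://sportsillustrated.cnn.com/hockey/" id="Ndoc1"/>
    </node>
  </node>
</tree>
"""


def documents_by_node(tree: ET.Element) -> dict:
    """Map each node id to the URLs of the documents directly under that node."""
    result = {}
    for node in tree.iter("node"):
        # findall() only looks at direct children, so documents nested in
        # sub-nodes are reported under the sub-node, not the parent.
        result[node.get("id")] = [doc.get("url") for doc in node.findall("document")]
    return result


mapping = documents_by_node(ET.fromstring(TREE_XML))
for node_id, urls in mapping.items():
    print(node_id, urls)
```

The top-level node N2 has no documents of its own in this output, so it maps to an empty list, while N0 and N1 each map to one URL.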