One document to be clustered (input), one that has been clustered (in tree in the output),
or a duplicate (in documents in the output)
Attributes
- async (Boolean default: true) - Asynchronous processing. For parse tags,
determines whether the request is enqueued (true) or processed before its next sibling
(false). For other elements, this attribute only makes a difference when they contain
asynchronous requests which need to be processed before the element is processed. In this
case, when false, the element's next sibling is only processed after the current element;
when true, Watson Explorer does not wait for the current element to be processed before
processing its next sibling.
- elt-id (Integer) - Usage: Internal
- max-elt-id (Integer) - Usage: Internal
- execute-acl (Text)
- process (Text) - An XPath determining which of the attributes and/or
children will be processed. Currently only "", "*", "@*" and "*|@*" are supported.
- boost-levels (Text default: -1) - Space separated list of clustering
levels in which the document will be boosted (0 for the top level, 1 for the top level
clusters, etc). -1 can be used to specify all levels.
- boost-display (Any of: boost-and-list, boost-only default: boost-and-list)
- Specifies what to do when a document may be boosted and present in the current result
list at the same time.
- boost-and-list: Duplicate the document and put it both under a
schema.x.element.boost element and the schema.x.element.list.
- boost-only: Just move the document under a schema.x.element.boost element and
remove it from the schema.x.element.list (possibly adding to it a new lower-ranked
document to ensure the right count).
- key (NMToken) - Key to use for key-duplicate elimination (only one
document with the same key will be used for clustering).
- vse-key ( Restricted form of xs:string: Pattern [^"]+) - Key used by the
search engine while indexing to allow contents from one URL to be added to a document
generated from another URL. If you specify a vse-key on a document, you can add a content
to it by using the add-to attribute on a content.
- vse-doc-hash (Text) - A string that uniquely identifies this document.
Usage: Internal
- vse-key-normalized (May only be: vse-key-normalized) - In the search
engine, use this attribute to mark the fact that the key is already absolute and should
not be normalized. If this attribute is not specified, the key will be treated as a
relative URL and normalized with respect to the URL that was crawled.
- vse-n-collapsed (Integer) - If key collapsing is enabled in the
search-engine, this attribute represents the number of documents that were collapsed.
- vse-auto-url (May only be: first-document) - Usage: Internal
- authorization-url (Text) - This attribute is currently only supported for
the search engine. Before returning this document, the query-service will attempt to
download this URL to verify authorization.
- default-content-acl (Text) - This attribute is used by the search-engine
to index ACLs. ACL list to apply to each content contained within this document element
that does not itself provide an acl attribute. This attribute has no effect on contents
that are not contained within this specific document element. Usage: Search engine module
only
- full-document-acl (Text) - This attribute is a place-holder and not
currently used. Usage: Search engine module only Experimental feature which may be
officially supported in a subsequent release
- cache-acl (Text) - This attribute is created by the search-engine to
indicate the ACLs for each cache document. Usage: Search engine module only
- duplicates-priority (Decimal number) - Priority to use when selecting
which of a set of duplicates to keep in the clustering. If two documents are key
duplicates or are near duplicates, the one with higher priority will be used for
clustering. If they have equal priority, the document's score will be used to select the
document to keep.
- url (Text) - A URL where the content of the document can be found. Used by
default as the title link for this document when displaying the result list.
- display-url (Text) - URL to show to the user when displaying the result
list, if it needs to be different from the actual URL in the title link (for example,
sponsored listings usually have two different URLs, the actual link URL being
unintelligible to the end user).
- rank (Text) - Rank of the document in the results. Not used if score is
also specified.
- score (Decimal number) - Score of this document in the ranked list. If
unspecified, it will be computed based on the rank.
- matched (Integer default: 0) - Has this document been matched during a
sub-search. Usage: Read-only
- binned (Text) - Has this document been matched during binning. Usage:
Read-only
- source (NMToken) - The name of the source from which this document
comes.
- display-source (Text) - The display name of the source from which this
document comes.
- source-url (URI) - Usage: This functionality is deprecated
- source-type (Text) - Informal label describing the type of information
contained in the document.
- parse-ref (NMToken) - The id of the schema.x.element.parse element from
which this document has been extracted.
- id (NMToken) - Unique identifier on each document. Specifying IDs on input
is a deprecated feature of the software.
- paid (Boolean) - Bid price for the paid listing or unspecified for normal
results.
- original-id (NMToken) - Generated by the clustering engine whenever a
second or later copy of the document was inserted into the tree. This is the id attribute
of the original document that this is a copy of.
- cache (Text) - URL that points to a cached copy of this document.
- base-score (Decimal number) - Original score before any meta-searching
score adjustments are computed for the document.
- vse-base-score (Decimal number) - Specifies (when debugging is enabled)
the ranking (base) score for the document. Usage: Search engine module only
- la-score (Decimal number) - Specify a link-analysis score for a document
that is being indexed by the search-engine. If specified in any other context, it will
have no effect. Usage: Search engine module only
- la-score-multiplier (Decimal number) - Specify a multiplier for the
generated la-score when indexing documents in the search-engine. If specified in any other
context, it will have no effect. Usage: Search engine module only
- vertex (Integer) - Used internally to associate a document with the URL that
generated it. Usage: Search engine module only
- query (Text) - The name of the query from which this document is issued.
Usage: Read-only
- duplicates (NMTokens) - List of documents that are duplicates of this
document.
- duplicate-of (NMToken) - Identifier of the document that this is a
duplicate of (or unspecified if this document is not a duplicate).
- duplicate-type (Any of: near, key) - key duplicates are detected based on
the key attribute while near duplicates are documents that have very similar contents (see
the option near_duplicates).
- boost-name (NMToken) - When specified and when the document is part of the
document list associated with the active schema.x.element.node, the document will be
"boosted", i.e., added to a schema.x.element.boost element with the specified name in the
rendering state. See also the boost-levels and boost-display attributes, which affect when
and how this boosting happens.
- boost-score (Decimal number) - Score used to sort boost documents inside
their containing schema.x.element.boost element.
- collapse-key (NMToken) - A key used for "site-collapsing". If not
specified, the host of the URL is used.
- collapse-type (Any of: first, subsequent, hidden) - When documents are
collapsed, they should be presented differently based on their rank in the collapsed set.
- first: The first document shown in a collapsed set.
- subsequent: Subsequent documents in a collapsed set.
- hidden: Documents in a collapsed set that should be hidden by
default.
- stub (May only be: stub) - Mark this document as a stub document
containing no contents and a limited set of attributes. Usage: Internal
- boosted (May only be: boosted) - Usage: This functionality is
deprecated
- cookie-jar (Text) - Specify a set of cookies that needs to be passed when
retrieving this document. Given that due to security limitations it is not possible to set
cookies for a third-party site, this will trigger the use of the "proxy" mode in the
default setup (i.e., the user will download the search result through the query-meta
script and not directly).
- headers (Text) - Specify a set of headers that needs to be passed when
retrieving this document. Given that due to security limitations it is not possible to set
request headers in the browsers, this will trigger the use of the "proxy" mode in the
default setup.
- parser (Text) - Specify the name of a parser that needs to be applied when
retrieving this document. This will trigger the use of the "proxy" mode in the default
setup.
- top-paid (May only be: top-paid) - Usage: Internal
- mwi-shingle (NMToken) - Usage: Internal
- vse (May only be: vse) - Flag specifying that the document comes from a
Watson Explorer Engine source. Usage: Internal
- vse-key-check
- Any user-defined attribute
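The following is a minimal sketch of how the key-duplicate and boosting attributes above might be combined on input. The URLs, the key value "example-key", the boost name "featured", and the priority and score values are all hypothetical, chosen only for illustration; the content layout follows the input examples on this page.

```xml
<vce>
  <!-- Hypothetical: boosted at the top level (boost-levels="0") and moved out of the list (boost-only) -->
  <document url="http://example.com/a" key="example-key" duplicates-priority="2"
            boost-name="featured" boost-levels="0" boost-display="boost-only" boost-score="10">
    <content name="title" type="html" action="cluster-bold" weight="3">
      <_cdata_>
        Example page A
      </_cdata_>
    </content>
  </document>
  <!-- Same key, lower duplicates-priority: per the key and duplicates-priority
       descriptions, the higher-priority document is the one kept for clustering -->
  <document url="http://example.com/b" key="example-key" duplicates-priority="1">
    <content name="title" type="html" action="cluster-bold" weight="3">
      <_cdata_>
        Example page B
      </_cdata_>
    </content>
  </document>
</vce>
```

If duplicates-priority were omitted or equal on both documents, the descriptions above indicate that the document score would decide which one is kept.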
Children
- Use these in the listed order. The sequence may not repeat.
- sort-key: (Zero or more) - Keys used to sort documents using new XML tags
- vse-index-stream: (Zero or more) - Specifies how the content text will be tokenized and
how terms will be created from the tokenized result.
- content: (Zero or more) - Contents hold the data for clustering
- advanced-content: (Zero or more) - A container object that allows specification of both
textual content and how that text will be tokenized and converted to terms to be
indexed.
- cache: (Zero or more) - Indicates that this document has text cache data
available.
- vse-collapsed: (At most 1) - A container element for documents that have been
collapsed by the search-engine using the key collapsing feature.
- duplicate-documents: (Zero or more) - A container element for the key and near
duplicates of the containing document.
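As a hedged illustration of the child ordering and the duplicate-related attributes, the fragment below sketches what a clustered document carrying one key duplicate might look like in the output. The exact nesting under duplicate-documents is an assumption based on the descriptions above, not a confirmed output format; the URLs, key value, and identifiers are hypothetical.

```xml
<!-- Hypothetical output fragment: content children come before duplicate-documents,
     following the listed child order -->
<document url="http://example.com/a" id="Ndoc0" key="example-key" duplicates="Ndoc1">
  <content name="title" type="html" action="cluster" weight="3.000000" output-action="bold">
    <_cdata_>
      Example page A
    </_cdata_>
  </content>
  <duplicate-documents>
    <!-- Assumed shape: the eliminated key duplicate points back to the kept document -->
    <document url="http://example.com/b" id="Ndoc1" key="example-key"
              duplicate-of="Ndoc0" duplicate-type="key"/>
  </duplicate-documents>
</document>
```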
Examples
Input Example:
<document url="url">
<content name="title" type="html" action="cluster-bold" weight="3">
<_cdata_>
... the document title would appear here ....
</_cdata_>
</content>
<content name="snippet" type="html" action="cluster" output-action="bold">
<_cdata_>
... the document summary would appear here ....
</_cdata_>
</content>
</document>
Output Example:
<tree>
<node id="N0" level="0" sep="0.0000" cohesion="0.0000" ndocs="1" score="0.000000" instances="Ndoc0">
<document url="url" id="Ndoc0">
<content name="title" type="html" action="discard" weight="3.000000">
... the document title would appear here ....
</content>
<content name="title" type="html" action="cluster" weight="3.000000" output-action="bold">
<_cdata_>
... the document title would appear here ....
</_cdata_>
</content>
<content name="snippet" type="html" action="discard" weight="1.000000">
... the document summary would appear here ....
</content>
<content name="snippet" type="html" action="cluster" output-action="bold" weight="1.000000">
<_cdata_>
... the document summary would appear here ....
</_cdata_>
</content>
</document>
</node>
</tree>
Input Example:
<vce>
<meta query="companies"/>
<document url="http://vivisimo.com/">
<content name="title" type="html" action="cluster-bold" weight="3">
<_cdata_>
Vivisimo
</_cdata_>
</content>
<content name="snippet" type="html" action="cluster-bold">
<_cdata_>
Groups the results by topic via document clustering technology.
Options include Web or news search, selection of sources, language
restriction, and filtering.
</_cdata_>
</content>
</document>
<document url="http://sportsillustrated.cnn.com/hockey/">
<content name="title" type="html" action="cluster-bold" weight="3">
<_cdata_>
CNN/SI: Hockey
</_cdata_>
</content>
<content name="snippet" type="html" action="cluster-bold">
<_cdata_>
Daily news, scores, feature stories, statistics, standings, player
profiles, polls, and chat.
</_cdata_>
</content>
</document>
</vce>
Output Example:
<meta query="companies"/>
<tree>
<node id="N2" level="0" sep="0.0000" cohesion="0.0000" ndocs="2" score="0.000000" instances="Ndoc0 Ndoc1">
<node id="N0" level="1" sep="0.0000" cohesion="0.0000" ndocs="1" score="0.000000" instances="Ndoc0">
<descriptor string="Vivisimo" sep="1.000000" ratio="1.000000"/>
<descriptor string="Sources" sep="0.577350" ratio="1.000000"/>
<document url="http://vivisimo.com/" id="Ndoc0">
<content name="title" type="html" action="discard" weight="3.000000">
<span class=b1>Vivisimo</span>
</content>
<content name="title" type="html" action="cluster" weight="3.000000" output-action="bold">
<_cdata_>
Vivisimo
</_cdata_>
</content>
<content name="snippet" type="html" action="discard" weight="1.000000">
Groups the results by topic via document clustering technology.
Options include Web or news search, selection of <span class=b1>sources</span>, language
restriction, and filtering.
</content>
<content name="snippet" type="html" action="cluster" output-action="bold" weight="1.000000">
<_cdata_>
Groups the results by topic via document clustering technology.
Options include Web or news search, selection of sources, language
restriction, and filtering.
</_cdata_>
</content>
</document>
</node>
<node id="N1" level="1" sep="0.0000" cohesion="0.0000" ndocs="1" score="0.000000" instances="Ndoc1">
<descriptor string="Hockey" sep="1.000000" ratio="1.000000"/>
<descriptor string="Statistics" sep="0.577350" ratio="1.000000"/>
<document url="http://sportsillustrated.cnn.com/hockey/" id="Ndoc1">
<content name="title" type="html" action="discard" weight="3.000000">
CNN/SI: <span class=b1>Hockey</span>
</content>
<content name="title" type="html" action="cluster" weight="3.000000" output-action="bold">
<_cdata_>
CNN/SI: Hockey
</_cdata_>
</content>
<content name="snippet" type="html" action="discard" weight="1.000000">
Daily news, scores, feature stories, <span class=b1>statistics</span>, standings, player
profiles, polls, and chat.
</content>
<content name="snippet" type="html" action="cluster" output-action="bold" weight="1.000000">
<_cdata_>
Daily news, scores, feature stories, statistics, standings, player
profiles, polls, and chat.
</_cdata_>
</content>
</document>
</node>
</node>
</tree>