Document/Content Keys

The Converting section discusses relaxing the one-to-one relationship between documents and URLs. This section explains how the vse-key attribute of documents and the add-to attribute of contents are handled. Of particular importance is how default values are filled in from the anchor, the URL, and the containing XML nodes.

At a basic level, every document served by the index has a unique vse-key associated with it. If you want to add content to a document, you create a Content element that has an add-to attribute that identifies the vse-key of the destination document.

As documents are converted, they develop an anchor. For example, if you expand a tar file, each file it contains is given an anchor formed from the URL (or anchor) of the original tar file with #{filename-i} appended.

New anchors can also be created during conversion by specifying a fork action of fork-with-new-name. This creates a copy of the current data, and both branches of processing acquire a new anchor name formed by appending #n, where n is an increasing number.

By default, all vse-key attributes are URL-normalized relative to the anchor developed during conversion (which, by default, is the original fetched URL). For example, if you specify a key of fred flintstone while crawling http://vivisimo.com/news.html, the resulting key is http://vivisimo.com:80/fred_flintstone/.
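The example above can be approximated in a few lines of Python. This is only a sketch that reproduces the one documented case. The function name normalize_key and every rule in it (spaces become underscores, the default port is made explicit, a trailing slash is appended) are assumptions inferred from that single example, not the product's actual normalization rules.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit

# Default ports assumed for this sketch only.
DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_key(key: str, anchor: str) -> str:
    """Hypothetical approximation of key normalization against an anchor."""
    # Assumption: spaces in the key are replaced with underscores.
    cleaned = key.strip().replace(" ", "_")
    # Treat the key as a potentially relative URL and resolve it
    # against the anchor.
    parts = urlsplit(urljoin(anchor, cleaned))
    # Assumption: the default port for the scheme is made explicit.
    host = parts.hostname or ""
    port = parts.port or DEFAULT_PORTS.get(parts.scheme)
    netloc = host if port is None else f"{host}:{port}"
    # Assumption: a trailing slash is appended when the path lacks one.
    path = parts.path if parts.path.endswith("/") else parts.path + "/"
    return urlunsplit((parts.scheme, netloc, path, parts.query, parts.fragment))


print(normalize_key("fred flintstone", "http://vivisimo.com/news.html"))
```

With the inputs from the text, the sketch prints http://vivisimo.com:80/fred_flintstone/. Other inputs may well be normalized differently by the real system.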

Internally, the indexed information passed to the search engine is XML: a crawl-url element contains one or more crawl-data elements, which in turn contain the actual XML to be indexed. When documents are processed from a crawl-data node, they first inherit new attributes (attributes that they do not already have) from their containing crawl-data element and then inherit new attributes from the containing crawl-url element. Thus, a vse-key on a crawl-data element becomes the vse-key of the document if the document does not already have one.
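The inheritance order can be illustrated with a minimal Python sketch. The element name document, the sample XML, the origin attribute, and the helper inherit_attributes are hypothetical, and the sketch assumes that every attribute is eligible for inheritance, which the text does not spell out.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; "origin" is an invented attribute used only
# to show inheritance from the crawl-url element.
SAMPLE = """
<crawl-url vse-key="from-crawl-url" origin="feed">
  <crawl-data vse-key="from-crawl-data">
    <document/>
    <document vse-key="own-key"/>
  </crawl-data>
</crawl-url>
"""


def inherit_attributes(crawl_url: ET.Element) -> None:
    """Fill in attributes each document lacks, nearest container first."""
    for crawl_data in crawl_url.findall("crawl-data"):
        for document in crawl_data.findall("document"):
            # crawl-data is consulted before crawl-url, so its values win.
            for source in (crawl_data, crawl_url):
                for name, value in source.attrib.items():
                    # Existing attributes are never overwritten.
                    document.attrib.setdefault(name, value)


root = ET.fromstring(SAMPLE.strip())
inherit_attributes(root)
docs = root.findall("./crawl-data/document")
for doc in docs:
    print(dict(doc.attrib))
```

In this sample the first document takes its vse-key from the crawl-data element, while the second keeps its own vse-key. Both pick up origin from the crawl-url element.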

The actual vse-key that is used is based on the following values, in order of priority:

  1. forced-vse-key attribute on the containing crawl-url node.
  2. vse-key attribute on the document.
  3. key attribute on the document.
  4. anchor of the containing data node, provided that the anchor is not the same as the url of the containing crawl-url node.
  5. vse-key attribute on the containing crawl-url node.
  6. url of the document.
  7. url of the fetched URL. If a single fetched URL contains more than one document that uses the fetched URL as its key, each subsequent such document has a #n sequence number appended to form its key.
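The priority list above amounts to a first-match lookup, which can be sketched in Python as follows. The function pick_document_key and its parameter names are hypothetical. The sketch assumes that the url of the crawl-url node and the fetched URL are the same value, and it leaves out both the #n sequence numbering and the later normalization step.

```python
from typing import Optional


def pick_document_key(
    fetched_url: str,
    forced_vse_key: Optional[str] = None,
    doc_vse_key: Optional[str] = None,
    doc_key: Optional[str] = None,
    data_anchor: Optional[str] = None,
    crawl_url_vse_key: Optional[str] = None,
    doc_url: Optional[str] = None,
) -> str:
    """Return the first available key, in the documented priority order."""
    # The anchor only counts when it differs from the crawl-url's url
    # (assumed here to be the fetched URL).
    anchor = data_anchor if data_anchor != fetched_url else None
    candidates = [
        forced_vse_key,     # 1. forced-vse-key on the crawl-url node
        doc_vse_key,        # 2. vse-key on the document
        doc_key,            # 3. key on the document
        anchor,             # 4. anchor of the containing data node
        crawl_url_vse_key,  # 5. vse-key on the crawl-url node
        doc_url,            # 6. url of the document
        fetched_url,        # 7. the fetched URL
    ]
    for candidate in candidates:
        if candidate:
            return candidate
    raise ValueError("no key available; fetched_url must not be empty")


print(pick_document_key(
    "http://example.com/a.tar",
    data_anchor="http://example.com/a.tar#notes.txt",
    crawl_url_vse_key="archive",
))
```

Here the anchor differs from the fetched URL, so it takes priority over the vse-key on the crawl-url node. If the anchor were equal to the fetched URL, the crawl-url's vse-key would be chosen instead.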

When a content element appears within a document, the only potential key is its add-to attribute. If that attribute is absent, the content is added to the containing document node.

Finally, when a content element appears outside of a document, the following priorities apply for determining the add-to key:

  1. forced-vse-key attribute on the containing crawl-url node.
  2. add-to attribute on the content.
  3. anchor of the containing data node.
  4. url of the fetched URL.
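Both cases for content elements can be sketched as small Python functions. The function and parameter names are hypothetical, and the returned keys are the values before any normalization is applied.

```python
from typing import Optional


def add_to_key_inside_document(
    add_to: Optional[str], containing_document_key: str
) -> str:
    """Content inside a document: add-to if given, else the containing document."""
    return add_to or containing_document_key


def add_to_key_outside_document(
    fetched_url: str,
    forced_vse_key: Optional[str] = None,
    add_to: Optional[str] = None,
    data_anchor: Optional[str] = None,
) -> str:
    """Content outside a document: first available value in priority order."""
    candidates = [
        forced_vse_key,  # 1. forced-vse-key on the crawl-url node
        add_to,          # 2. add-to on the content
        data_anchor,     # 3. anchor of the containing data node
        fetched_url,     # 4. the fetched URL
    ]
    for candidate in candidates:
        if candidate:
            return candidate
    raise ValueError("no key available; fetched_url must not be empty")


print(add_to_key_inside_document(None, "http://example.com/doc"))
print(add_to_key_outside_document("http://example.com/page.html", add_to="other-doc"))
```

Note that a forced-vse-key on the crawl-url node overrides an explicit add-to only for content outside a document. Inside a document, the text names add-to as the only potential key.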

Each resulting key is then treated as a potentially relative URL and normalized against the anchor of the data. If you want to use your own keys unchanged, also add the attribute vse-key-normalized, forced-vse-key-normalized, or vse-add-to-normalized (whichever applies) to disable the normalization step.

Contents are indexed even if no document has the matching key. A content with no matching document is never returned as a search result, but it still consumes time and space in the index. To avoid this, set the attribute vse-add-to-check="crawlable" on the content element: the content is then added to the index only if its (normalized) key corresponds to a URL that could potentially be crawled by the current configuration, whether or not it has actually been crawled.