crawl-data
Node that encapsulates all crawler state corresponding to a particular document.
Attributes
- encoding (Any of: base64, xml, qp-utf8, text) - Usage: Internal
- db-row (xs:unsignedInt) - The row in the cache table for the text cache content for this crawl-data.
- rich-db-row (xs:unsignedInt) - The row in the cache table for the rich cache content for this crawl-data.
- db-file (Text) - The file path for the cache content for this crawl-data.
- content-type (Text) - The type of content that this crawl-data stores.
- anchor (Text) - If specified, this indicates this document's location within a URL.
- url (Text) - The default URL used when a document in this crawl-data is indexed and the data itself does not specify one. If this attribute is not set, the url attribute on the crawl-url is used instead.
- light-crawler-url (Text) - When indexing data using the light-crawler mode, this attribute may be placed on the crawl-data element to disassociate this data from the parent crawl-url. Effectively, this allows updates to be batched.
- light-crawler-enqueue-id (Text) -
- light-crawler-arena (Text) - When indexing data using the light-crawler mode, this attribute may be placed on the crawl-data element in conjunction with the light-crawler-url to indicate that this disassociated set of data is in a different arena.
- error (Text) - If an error occurred while converting this document, this attribute will describe the error.
- fallback (Text) - If a fallback was triggered, this is the content-type that caused the fallback to occur. If multiple fallbacks occur, this will be the last content-type that failed. Usage: Internal
- fallback-error (Text) - When a fallback occurs, the error attribute is moved to the fallback-error attribute to record the problem and clear the error state. Usage: Internal
- acl (Text) - The value of the ACL for this document.
- i (Integer default: 0) - Associates the indexed crawl-data with the cached crawl-data. The value is the index of the corresponding cached crawl-data. Usage: Internal
- ct (Text) - Indicates that this crawl-data has corresponding text cache content, and that the text cache content is of this type. Usage: Internal
- rich-ct (Text) - Indicates that this crawl-data has corresponding rich cache content, and that the rich cache content is of this type. Usage: Internal
- input (May only be: input) - Temporary flag used to distinguish crawl-data nodes present on the input crawl-url from those that are fetched by the crawler. Usage: Internal
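The attribute rules above can be illustrated with a short sketch. This is not the crawler's implementation. It uses Python's standard xml.etree.ElementTree, and the helper names (make_crawl_data, effective_url, apply_fallback) are hypothetical. It also assumes that a crawl-data with encoding="text" carries its payload as the element's text content. The sketch shows the default-URL rule for the url attribute and how error, fallback, and fallback-error relate when a fallback is recorded.

```python
import xml.etree.ElementTree as ET


def make_crawl_data(text, content_type="text/plain", url=None):
    """Build a crawl-data element for plain-text content (hypothetical helper)."""
    attrs = {"content-type": content_type, "encoding": "text"}
    if url is not None:
        attrs["url"] = url
    node = ET.Element("crawl-data", attrs)
    # Assumption: the payload is stored as the element's text content.
    node.text = text
    return node


def effective_url(crawl_data, crawl_url):
    """Default URL rule: the crawl-data url if set, else the crawl-url url."""
    return crawl_data.get("url") or crawl_url.get("url")


def apply_fallback(crawl_data, failed_content_type):
    """Record a fallback as the attribute descriptions state it.

    The error attribute is moved to fallback-error (clearing the error
    state), and fallback is set to the content-type that failed. Calling
    this again overwrites fallback, so it holds the last failure.
    """
    error = crawl_data.attrib.pop("error", None)
    if error is not None:
        crawl_data.set("fallback-error", error)
    crawl_data.set("fallback", failed_content_type)


if __name__ == "__main__":
    # Hypothetical parent crawl-url with one crawl-data child.
    crawl_url = ET.Element("crawl-url", {"url": "http://example.com/page"})
    data = make_crawl_data("hello world")
    crawl_url.append(data)

    print(effective_url(data, crawl_url))
    print(ET.tostring(crawl_url, encoding="unicode"))
```

Because fallback and fallback-error are marked Usage: Internal, the last helper is only meant to make the described bookkeeping concrete, not to suggest that these attributes should be set by hand.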
Children
- Use these in the listed order. The sequence may not repeat.
- xml: (Exactly 1)
Children
- Use these in the listed order. The sequence may not repeat.
- From 0 to 1 XML nodes of any kind can be used in this context.
- Use these in the listed order. The sequence may not repeat.
- vxml: (Exactly 1)
Children
- Use these in the listed order. The sequence may not repeat.
- document: (Zero or more) - A document to be clustered (input), a document that has been clustered (in tree in the output), or a duplicate (in documents in the output).
- content: (Zero or more) - Contents hold the data for clustering.
- advanced-content: (Zero or more) - A container object that allows specification of both textual content and how that text will be tokenized and converted to terms to be indexed.
- Use these in the listed order. The sequence may not repeat.
- base64: (Exactly 1) Type: Text
- qp-utf8: (Exactly 1) Type: Text
- text: (Exactly 1) Type: Text
- xml: (Exactly 1)
Children