crawl-data

Node that encapsulates all crawler state corresponding to a particular document.

Attributes

  • encoding (Any of: base64, xml, qp-utf8, text) - Usage: Internal
  • db-row (xs:unsignedInt) - The row in the cache table for the text cache content for this crawl-data.
  • rich-db-row (xs:unsignedInt) - The row in the cache table for the rich cache content for this crawl-data.
  • db-file (Text) - The file path for the cache content for this crawl-data.
  • content-type (Text) - The type of content that this crawl-data stores.
  • anchor (Text) - If specified, this indicates this document's location within a URL.
  • url (Text) - The default URL for documents indexed from this crawl-data when the data itself does not specify one. If this attribute is not specified, the url attribute on the crawl-url is used instead.
  • light-crawler-url (Text) - When indexing data using the light-crawler mode, this attribute may be placed on the crawl-data element to disassociate this data from the parent crawl-url. Effectively, this allows updates to be batched.
  • light-crawler-enqueue-id (Text) -
  • light-crawler-arena (Text) - When indexing data using the light-crawler mode, this attribute may be placed on the crawl-data element in conjunction with the light-crawler-url to indicate that this disassociated set of data is in a different arena.
  • error (Text) - If an error occurred while converting this document, this attribute will describe the error.
  • fallback (Text) - If a fallback was triggered, this is the content-type that caused the fallback to occur. If multiple fallbacks occur, this will be the last content-type that failed. Usage: Internal
  • fallback-error (Text) - When a fallback occurs, the error attribute is moved to the fallback-error attribute to record the problem and clear the error state. Usage: Internal
  • acl (Text) - The value of the ACL for this document.
  • i (Integer default: 0) - Associates the indexed crawl-data with the cached crawl-data. The value is the index of the corresponding cached crawl-data. Usage: Internal
  • ct (Text) - Indicates that this crawl-data has corresponding text cache content, and that the text cache content is of this type. Usage: Internal
  • rich-ct (Text) - Indicates that this crawl-data has corresponding rich cache content, and that the rich cache content is of this type. Usage: Internal
  • input (May only be: input) - Temporary flag used to distinguish crawl-data elements present on the input crawl-url from those fetched by the crawler. Usage: Internal
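
The url fallback and the fallback / fallback-error bookkeeping described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the crawler's implementation: the helper names effective_url and record_fallback are hypothetical, the example URLs and error text are made up, and the sketch assumes that the failing content-type is the one stored in the content-type attribute.

```python
import xml.etree.ElementTree as ET


def effective_url(crawl_data, crawl_url):
    """Return the default URL for documents indexed from crawl_data (hypothetical helper)."""
    # Prefer the url attribute on crawl-data; otherwise use the one on crawl-url.
    return crawl_data.get("url") or crawl_url.get("url")


def record_fallback(crawl_data):
    """Mimic the described fallback bookkeeping (hypothetical helper)."""
    # Assumption: the content-type that failed is the current content-type attribute.
    failed_type = crawl_data.get("content-type")
    if failed_type is not None:
        # Overwrites any earlier value, so the last failing type wins.
        crawl_data.set("fallback", failed_type)
    # Move error to fallback-error, which clears the error state.
    error = crawl_data.attrib.pop("error", None)
    if error is not None:
        crawl_data.set("fallback-error", error)


# Made-up input: one crawl-url holding two crawl-data elements.
crawl_url = ET.fromstring(
    '<crawl-url url="http://example.com/a">'
    '<crawl-data content-type="application/pdf" error="conversion failed"/>'
    '<crawl-data url="http://example.com/a?page=2" content-type="text/plain"/>'
    '</crawl-url>'
)
first, second = crawl_url.findall("crawl-data")
```

Here effective_url(first, crawl_url) resolves to the url on the crawl-url, because the first crawl-data has no url attribute of its own, while the second crawl-data keeps its own value.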

Children

  • Use these in the listed order. The sequence may not repeat.
    • xml: (Exactly 1) Children
      • Use these in the listed order. The sequence may not repeat.
        • From 0 to 1 XML nodes of any kind can be used in this context.
    • vxml: (Exactly 1) Children
      • Use these in the listed order. The sequence may not repeat.
        • document: (Zero or more) - A document to be clustered (input), a document that has been clustered (in the tree in the output), or a duplicate (in documents in the output).
        • content: (Zero or more) - Holds the data for clustering.
        • advanced-content: (Zero or more) - A container object that allows specification of both textual content and how that text will be tokenized and converted to terms to be indexed.
    • base64: (Exactly 1) Type: Text
    • qp-utf8: (Exactly 1) Type: Text
    • text: (Exactly 1) Type: Text