crawl-url
A node that encapsulates all crawler state for a particular URL.
Attributes
- __internal__ (May only be: dns_robots) - Flag indicating that the crawl-url is used for internal processing only and should not contribute to the logs or the index. A typical use for these crawl-urls is fetching the robots.txt file. Usage: Internal
- n-redirs (Integer default: 0) - Number of times the crawler was redirected to arrive at a particular URL.
- orig-url (Text) - The URL that the crawler originally tried to crawl before being redirected to the one stored in the url attribute.
- url (Text) - The URL of the resource represented by this crawl-url.
- crawl-url (Text) - Usage: Internal
- redir-from (Text) - The URL that this URL was redirected from.
- redir-to (Text) - The URL that this URL was redirected to.
- state (Any of: pending, success, warning, error) - The state of this crawl-url. This attribute is set by the status query and can have the following possible values:
- pending: the resource is currently in the crawler pipeline.
- success: the resource exited the crawler pipeline and processing did not result in an error or warning.
- warning: the resource exited the crawler pipeline but some data was not successfully indexed.
- error: the resource exited the crawler pipeline but no data was successfully indexed. If the enqueue did not reach the indexer, the siphoned attribute will indicate why; otherwise the log child will indicate the errors.
- status (Any of: starting, applying changes, stopping, refreshing, resuming, input, complete, redir, disallowed by robots.txt, filtered, error, duplicate, killed, none default: none) - The status of this crawl-url (see the first example following this attribute list):
- input: the resource is in the process of being fetched by the crawler.
- complete: the resource was successfully fetched and converted.
- redir: attempting to fetch the resource resulted in a redirection.
- disallowed by robots.txt: the robots.txt file does not allow the resource to be crawled.
- filtered: the crawl conditions do not allow the resource to be crawled, but the crawl-url will be recorded in the logs.
- killed: the crawl conditions do not allow the resource to be crawled, and the crawl-url will not be recorded in the logs.
- error: there was an error attempting to fetch or convert the URL.
- duplicate: the resource is an exact duplicate of a resource that was previously crawled.
- Other status values (applying changes, refreshing, resuming, starting, and stopping) are for internal use only.
- output-destination (Any of: cache, indexer) - String used to route a crawl-url to a particular destination in the crawler. Usage: Internal
- http-status (xs:long) - The HTTP status code returned when the crawler attempted to fetch the resource at this URL.
- input-at (xs:long) - Seconds since the epoch at which this crawl-url was written to the logs in its input state.
- recorded-at (xs:long) - Seconds since the epoch at which this crawl-url was written to the logs in its input state.
- at-datetime (Date/Time) - Time at which this crawl-url was fetched or failed to be fetched.
- at (xs:long) - Seconds since the epoch at which this crawl-url was fetched or failed to be fetched.
- filetime (xs:long) - Seconds since the epoch at which the resource at this URL was last updated.
- batch-id (Text) - Value used to determine if multiple URLs with a shared vse-key were enqueued during a single crawler instance.
- change-id (Text) - A protocol-dependent checksum of various metadata that is used to determine whether the resource at this URL has changed since the last crawl. Usage: Internal
- input-purged - A flag indicating that the input crawl-url for this crawl-url is not available because it has been previously purged from the database. Usage: Internal
- content-type (Text) - The content type of the resource at this URL.
- size (xs:long) - Total size in bytes of the fetched resource at this URL.
- n-sub (Integer) - Total number of schema.x.element.crawl-data children under this crawl-url.
- conversion-time (Integer) - Total time in seconds spent converting the fetched resource at this URL.
- converted-size (Decimal number) - Total size in bytes after performing all conversion steps on the fetched resource at this URL.
- speed (Decimal number) - Total size in bytes of the resource at this URL divided by the total fetch time in seconds.
- error (Text) - A string indicating the reason why the crawler was unable to fetch or convert the resource at this URL.
- warning (Text) - A string indicating any problems the crawler encountered while attempting to fetch or convert the resource at this URL.
- hops (Integer default: 0) - The number of hops between this URL and the seed URL that caused the crawler to first encounter it.
- vertex (xs:unsignedInt) - A unique identifier that the crawler assigns to every URL that it encounters.
- exact (Text) - A string representing a checksum on the contents of all schema.x.element.crawl-data children and their ACL attributes. Used for determining whether a resource at a URL is an exact duplicate of a resource at a previously crawled URL.
- error-msg (Text) - Used to temporarily pass around error messages in the crawler. Usage: Internal
- exact-duplicate (May only be: exact-duplicate) - A flag indicating that the URL's content is an exact duplicate of a previously crawled URL.
- verbose (May only be: verbose) - The presence of this flag indicates that the progress of this crawl-url should be tracked in the crawler's debugging log.
- uncrawled (Any of: unexpired, unchanged, error, unknown) - A general reason why this resource was not recrawled: unexpired: a previous copy of this crawl-url was still valid at the time of the crawl, unchanged: metadata indicated that the resource at this URL was unchanged since it was last fetched, error: attempting to fetch the resource at this URL yielded an error and the previously fetched copy was still valid at the time of the crawl, unknown: the resource was not recrawled for an unknown reason. Used in conjunction with the uncrawled-why attribute.
- uncrawled-why (Text) - A specific reason why this resource was not recrawled. Used in conjunction with the uncrawled attribute.
- crawled-locally (May only be: crawled-locally) - Flag used to indicate that no contact with a remote server was required and that this URL shouldn't be involved in the delay computation.
- priority (Integer default: 0) - An integer indicating the priority of this crawl-url relative to other crawl-urls and crawl-deletes in the crawler's queues. A larger value indicates a higher priority.
- input-priority (Integer) - Stores the actual priority on URLs enqueued by the resume operation. Set internally by the crawler for temporary use. Usage: Internal
- default-acl (Text) - The ACL that will be applied to the resource when no other ACL is available.
- ip (Text) - The IP address used to fetch the resource at the URL.
- i-ip (Integer) - Integer identifier used to associate a particular IP address with a DNS entry that corresponds to multiple IP addresses.
- forced-vse-key (Text) - Force the crawler to assign this vse-key to this crawl-url rather than allow the crawler to assign one automatically.
- forced-vse-key-normalized (May only be: forced-vse-key-normalized) - Flag that indicates the forced-vse-key should not be normalized by automatically including the base-url in its value.
- synchronization (Any of: none, enqueued, to-be-indexed, indexed, indexed-no-sync; Deprecated values: to-be-crawled; default: enqueued) - Indicates at which point the crawler should return success for an enqueued crawl-url (see the enqueue example following this attribute list). All synchronizations other than none will cause the enqueue to be committed to secondary storage before a synchronous reply is issued.
- none: immediately after receiving the enqueue.
- enqueued: after the crawl-url is found to satisfy the crawl conditions and an attempt will be made to fetch it.
- to-be-indexed: after the resource at the URL has been crawled and converted. This synchronization mode forces the indexer to do additional work so that the synchronous reply is issued as promptly as possible.
- indexed: after the converted resource has been recorded by the indexer.
- indexed-no-sync: after the converted resource has been recorded by the indexer, but without forcing the indexer to do additional work.
- force-indexed-sync - Flag used to force the indexer to acknowledge document changes in the audit log only after indexing is complete and the changes will be reflected in search results. Usage: Internal
- enqueue-id (Text) - Unique string that identifies a particular enqueue.
- enqueue-id-for-audit-log (Text) - String that will be used to identify this enqueue in the audit-log instead of the value of the enqueue-id attribute. Usage: Internal
- originator (Text) - Unique string that identifies the update originator.
- arena (Text) - The name of the arena in which to include the data. If specified, the indexer-service for this collection must have its arena option enabled before any data is added to it. If that option is enabled, this attribute is required.
- parent-url (Text) - The parent URL that this URL should be associated with. This is used to keep the internal graph consistent when updates must be made outside of a normal crawl workflow.
- parent-url-normalized (May only be: parent-url-normalized) - A flag indicating that the parent-url attribute has already been normalized. If this is not present, the crawler will attempt to normalize that value.
- remote-time (xs:long) - Time at which the resource was fetched on the remote server as seconds since the epoch. Usage: Internal
- remote-dependent (Any of: delete, uncrawled) - Indicates that this update is dependent on a previous update: delete: this update deletes an existing crawl-url. uncrawled: this update updates the expiry time of an existing crawl-url. Usage: Internal
- remote-previous-collection (Text) - The collection for the previous update to this crawl-url. Usage: Internal
- remote-previous-counter (Integer) - The counter value for the previous update to this crawl-url. Usage: Internal
- remote-depend-collection (Text) - The collection for the update that this crawl-url is predicated upon. Usage: Internal
- remote-depend-counter (Integer) - The counter value for the update that this crawl-url is predicated upon. Usage: Internal
- remote-collection-id (Integer) - The internal ID for the collection name that this crawl-url came from. Usage: Internal
- siphoned (Any of: duplicate, killed, filtered, terminated, unexpired, uncrawled, unchanged, error, unretrievable, rebasing, replaced, input-full, needed-gatekeeper, aborted, nonexistent, invalid, lc-too-long, remote-conflict, unknown) - Indicates that the crawler encountered an obstacle that prevented the crawl-url from meeting its requested synchronization:
- duplicate: The resource at this URL has already been crawled.
- killed: The URL was filtered by the crawl-conditions.
- filtered: The URL was filtered by the crawl-conditions and logged.
- terminated: The crawl-url could not be processed because the crawler was stopped after the enqueue entered the pipeline.
- rebasing: The crawl-url could not be processed because the crawler is attempting a rebase operation.
- unexpired: The previous crawl-url is not yet expired.
- unchanged: The resource at the URL is unchanged from its previously fetched copy.
- error: The resource at the URL could not be fetched, but the previously fetched copy is still valid.
- unretrievable: The resource at the URL could not be fetched.
- replaced: The enqueue was replaced with a newer one.
- input-full: The enqueue could not be processed because the input queue is full.
- needed-gatekeeper: The enqueue was the child of an index-atomic node but would have needed to be placed in the gatekeeper to proceed.
- aborted: The enqueue was aborted as part of a transaction.
- nonexistent: The crawl-url does not correspond to any crawl-urls in the crawler's database.
- lc-too-long: The size of the url attribute exceeds the 499-byte limit set in light crawler mode.
- remote-conflict: The crawl-url could not be processed because the collection has a more recent update for this URL, either from the collection itself or another distributed indexing node.
- unknown: The requested synchronization could not be met due to an unknown reason.
- enqueued-offline (May only be: enqueued-offline) - Flag that indicates that the crawl-url was enqueued offline.
- orphaned-atomic (May only be: orphaned-atomic) - Flag that indicates this crawl-url could not be indexed atomically due to a system error. As a result, this URL had no effect on the index. Usage: Internal
- enqueue-type (Any of: none, forced, reenqueued, export, status default: none) - Indicates how an enqueued crawl-url should be processed by the crawler (see the enqueue example following this attribute list):
- none: The crawl-url is subject to all the standard checks: deduplication, URL limits, and expiration.
- forced: Ignore the duplicates check and URL limits when processing the crawl-url.
- reenqueued: Ignore the duplicates check, URL limits, and all expiration options when processing the crawl-url.
- export: Fetch the resource located at the URL and return it to the caller. The resource will not be converted or indexed, and the crawler's persistent state will not be modified in any way as a result of this enqueue.
- status: Fetch the current status of a particular URL from the crawler's database.
- deleted - Temporary flag used by the crawler to track crawl-urls queued for deletion. Usage: Internal
- ignore-expires - Temporary flag used by the crawler to force directories to always be recrawled. Usage: Internal
- enqueued (Text) - A checksum representing the outgoing links from this crawl-url. This value is used internally to determine if the links have changed on refresh.
- referrer-vertex (Integer) - A temporary attribute used by the crawler to build the link-analysis table. Usage: Internal
- remote-collection (Text) - The name of the collection that this remote update originated from. Usage: Internal
- remote-counter (Integer) - Remote update's counter value. Used to ensure updates are applied sequentially. Usage: Internal
- remote-packet-id (Integer) - Temporary attribute used to keep track of an update that will eventually be added to the journal. Usage: Internal
- referree-url (Text) - Temporary attribute used to track exact duplicate information for remote updates. Usage: Internal
- request-queue-redir (Any of: output, indexer-output) - Temporary attribute to ensure that outgoing links are recorded as input before the enqueueing crawl-url is recorded as complete. Usage: Internal
- prodder (Any of: abort, index) - Attribute indicating that the crawl-url is not a real crawl-url but a prodder for an index-atomic, used to tell the indexer_output thread to abort the index-atomic or send it to the indexer. Usage: Internal
- gatekeeper-action (Any of: reject, replace, add-to-queue) - Indicates the
action that the gatekeeper will take if it encounters this crawl-url while another
crawl-url sharing the url attribute is in the crawler's pipeline.
- reject: the gatekeeper will reject this crawl-url and prevent it from entering the pipeline. This is the default behavior for crawl-urls enqueued as children of an index-atomic in the non-distributed case.
- replace: the gatekeeper will reject all crawl-urls currently in its queue that share the value of this crawl-url's url attribute, replacing them with this single crawl-url. This is the default behavior in all other cases.
- add-to-queue: the gatekeeper will add this crawl-url to the tail of its queue. This is the default behavior for crawl-urls sent to a distributed indexing client as children of an index-atomic node.
- index-atomically - Attribute used to indicate the crawl-url is part of an atomic operation. Usage: Internal
- gatekeeper-list - Temporary attribute used to allow a URL to bypass the gatekeeper mechanism if it was released from the gatekeeper or reenqueued. Usage: Internal
- gatekeeper-id (xs:unsignedInt) - Temporary attribute used to associate nodes from the gatekeeper with their location in the persistent XML store. Usage: Internal
- offline-id (xs:unsignedInt) - Temporary attribute used to associate nodes from the offline queue with their location in the offline store. Usage: Internal
- offline-initialize - Temporary attribute used when initializing offline nodes. Usage: Internal
- input-on-resume (Boolean) - Temporary attribute used to inform the crawler's input thread that the crawl-url was input on resume, and thus requires special processing. Usage: Internal
- switched-status (Boolean) - Temporarily used by the apply changes operation to indicate that a crawl-url has switched its status during the operation. Usage: Internal
- from-input - Usage: Internal
- input-stub - Usage: Internal
- re-events (Integer) - Usage: Internal
- remembered (Boolean) - Usage: Internal
- notify-id (Integer) - Usage: Internal
- reply-id (Integer) - Usage: Internal
- obey-no-follow - Usage: Internal
- normalized - Temporary flag used to instruct the input processing thread to avoid normalizing the URL or applying the crawl conditions. Usage: Internal
- url-normalized - Temporary flag used to instruct the input processing thread to avoid normalizing the URL while still applying the crawl conditions. This is set on nodes that are reenqueued due to an indexer disconnection. Usage: Internal
- wait-on-enqueued - Usage: Internal
- graph-id-high-water (xs:unsignedInt) - Usage: Internal
- last-at (xs:long) - Usage: Internal
- indexed-n-docs (xs:unsignedInt) - Number of indexed documents corresponding to this URL.
- indexed-n-contents (xs:unsignedInt) - Number of indexed contents corresponding to this URL.
- indexed-n-bytes (xs:long) - Number of indexed bytes corresponding to this URL.
- light-crawler (May only be: light-crawler) - Usage: Internal
- remove-xml-data (Any of: always, on-success, input) - Usage: Internal
- disguised-delete (May only be: disguised-delete) - Temporary flag used to indicate that the crawl-url is really a light crawler crawl-delete for a URL that the crawler has no record of. Usage: Internal
- remote-counter-increased (May only be: remote-counter-increased) - Temporary flag used to indicate that the update caused the remote counter for its collection to increase. Usage: Internal
- delete-enqueue-id (Text) - Usage: Internal
- delete-originator (Text) - Usage: Internal
- delete-index-atomically (May only be: delete-index-atomically) - Usage: Internal
- purge-pending (May only be: purge-pending) - Temporary flag used to indicate that the crawl-url is deleted from the logs but not the index. Usage: Internal
- only-input - Temporary attribute used to indicate the crawl-url was never logged to the authority table. Usage: Internal
- Any user-defined attribute
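
For illustration, a crawl-url for a successfully fetched and indexed resource might appear in the crawler logs roughly as follows. This is a minimal sketch, not actual crawler output; the example URL, the attribute values, and the exact serialization are assumptions, and only a few of the attributes documented above are shown.

    <crawl-url
        url="http://www.example.com/index.html"
        state="success"
        status="complete"
        http-status="200"
        content-type="text/html"
        size="18231"
        n-sub="1"
        hops="1"
        at="1400000000">
      <!-- child elements (crawl-header, log, crawl-data, and so on) omitted -->
    </crawl-url>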
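
A crawl-url submitted as an enqueue typically carries only the url attribute plus the options that control how the enqueue is handled. The sketch below, again with hypothetical values, asks the crawler to ignore the duplicates check and URL limits (enqueue-type="forced") and to delay the synchronous reply until the converted resource has been recorded by the indexer (synchronization="indexed"):

    <crawl-url
        url="http://www.example.com/docs/report.pdf"
        enqueue-type="forced"
        synchronization="indexed"
        priority="10"/>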
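
The current status of a URL already known to the crawler can be retrieved with an enqueue of type status; the state attribute of the result then reports where the URL stands, and the siphoned attribute or log child explains any failure. The request and reply below are hypothetical sketches of that exchange, not verbatim crawler output:

    <!-- request: look up the URL in the crawler's database -->
    <crawl-url url="http://www.example.com/index.html" enqueue-type="status"/>

    <!-- possible reply for a URL that was indexed without problems -->
    <crawl-url url="http://www.example.com/index.html" state="success" status="complete"/>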
Children
- Use these in the listed order; the sequence may not repeat. A skeleton example follows this list.
- crawl-pipeline: (Exactly 1) - Container node for profiling data.
- curl-options: (Exactly 1) - Container for options used in fetching a particular URL.
- crawl-header: (Exactly 1) - Node containing HTTP header data for an associated URL.
- old-crawl: (Exactly 1) - A container for the previous copy of a crawl-url.
- crawl-links: (Exactly 1) - Used by distributed search.
- completed-crawl: (Exactly 1) - Used by distributed search.
- indexed-crawl: (Exactly 1) - Used by distributed search.
- log: (Exactly 1) - Tag in which the log nodes are collected.
- crawl-data: (At least 1) - Node that encapsulates all crawler state corresponding to a particular document.
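
A skeleton of the child ordering described above, with empty elements standing in for their real content and a hypothetical url value; which containers carry meaningful data depends on the crawl:

    <crawl-url url="http://www.example.com/index.html">
      <crawl-pipeline/>   <!-- profiling data -->
      <curl-options/>     <!-- options used to fetch this URL -->
      <crawl-header/>     <!-- HTTP header data -->
      <old-crawl/>        <!-- previous copy of this crawl-url -->
      <crawl-links/>      <!-- distributed search -->
      <completed-crawl/>  <!-- distributed search -->
      <indexed-crawl/>    <!-- distributed search -->
      <log/>              <!-- collected log nodes -->
      <crawl-data/>       <!-- at least one; per-document crawler state -->
    </crawl-url>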