A node that encapsulates all crawler state for a particular URL.
Attributes
- __internal__ (May only be: dns_robots) - Flag indicating that the
crawl-url is used for internal processing only and should not contribute to the logs or
the index. A typical use for these crawl-urls is fetching the robots.txt file. Usage:
Internal
- n-redirs (Integer default: 0) - Number of times the crawler was redirected
to arrive at a particular URL.
- orig-url (Text) - The URL that the crawler originally tried to crawl before
being redirected to the one stored in the url attribute.
- url (Text) - The URL of the resource represented by this crawl-url.
- crawl-url (Text) - Usage: Internal
- redir-from (Text) - The URL that this URL was redirected from.
- redir-to (Text) - The URL that this URL was redirected to.
- state (Any of: pending, success, warning, error) - The state of this
crawl-url. This attribute is set by the status query and can have the following possible
values:
- pending: the resource is currently in the crawler pipeline.
- success: the resource exited the crawler pipeline and processing did not result in an
error or warning.
- warning: the resource exited the crawler pipeline but some data was not successfully
indexed.
- error: the resource exited the crawler pipeline but no data was successfully indexed.
If the enqueue did not reach the indexer, the siphoned attribute will indicate why.
Otherwise the log child will indicate the errors.
- status (Any of: starting, applying changes, stopping, refreshing, resuming,
input, complete, redir, disallowed by robots.txt, filtered, error, duplicate, killed,
none default: none) - The status of this crawl-url:
- input: the resource is in the process of being fetched by the crawler.
- complete: the resource was successfully fetched and converted.
- redir: attempting to fetch the resource resulted in a redirection.
- disallowed by robots.txt: the robots.txt file does not allow the resource to be
crawled.
- filtered: the crawl conditions do not allow the resource to be crawled, but the
crawl-url will be recorded in the logs.
- killed: the crawl conditions do not allow the resource to be crawled, and the
crawl-url will not be recorded in the logs.
- error: there was an error attempting to fetch or convert the URL.
- duplicate: the resource is an exact duplicate of a resource that was previously
crawled.
Other possible status values, such as applying changes, refreshing, resuming, starting,
and stopping, are for internal use only.
- output-destination (Any of: cache, indexer) - String used to route a
crawl-url to a particular destination in the crawler. Usage: Internal
- http-status (xs:long) - The HTTP status code returned when the crawler
attempted to fetch the resource at this URL.
- input-at (xs:long) - Seconds since the epoch at which this crawl-url was
written to the logs in its input state.
- recorded-at (xs:long) - Seconds since the epoch at which this crawl-url
was written to the logs in its input state.
- at-datetime (Date/Time) - Time at which this crawl-url was fetched or
failed to be fetched.
- at (xs:long) - Seconds since the epoch at which this crawl-url was fetched
or failed to be fetched.
- filetime (xs:long) - Seconds since the epoch at which the resource at this
URL was last updated.
- batch-id (Text) - Value used to determine if multiple URLs with a shared
vse-key were enqueued during a single crawler instance.
- change-id (Text) - A protocol-dependent checksum of various metadata that
is used to determine whether the resource at this URL has changed since the last crawl.
Usage: Internal
- input-purged - A flag indicating that the input crawl-url for this crawl-url is
not available because it has been previously purged from the database. Usage:
Internal
- content-type (Text) - The content type of the resource at this URL.
- size (xs:long) - Total size in bytes of the fetched resource at this
URL.
- n-sub (Integer) - Total number of schema.x.element.crawl-data children
under this crawl-url.
- conversion-time (Integer) - Total time in seconds spent converting the
fetched resource at this URL.
- converted-size (Decimal number) - Total size in bytes after performing all
conversion steps on the fetched resource at this URL.
- speed (Decimal number) - Total size in bytes of the resource at this URL
divided by the total fetch time in seconds.
- error (Text) - A string indicating the reason why the crawler was unable
to fetch or convert the resource at this URL.
- warning (Text) - A string indicating any problems the crawler encountered
while attempting to fetch or convert the resource at this URL.
- hops (Integer default: 0) - The number of hops between this URL and the
seed URL that caused the crawler to first encounter it.
- vertex (xs:unsignedInt) - A unique identifier that the crawler assigns to
every URL that it encounters.
- exact (Text) - A string representing a checksum on the contents of all
schema.x.element.crawl-data children and their ACL attributes. Used for determining
whether a resource at a URL is an exact duplicate of a resource at a previously crawled
URL.
- error-msg (Text) - Used to temporarily pass around error messages in the
crawler. Usage: Internal
- exact-duplicate (May only be: exact-duplicate) - A flag indicating that
the URL's content is an exact duplicate of a previously crawled URL.
- verbose (May only be: verbose) - The presence of this flag indicates that
the progress of this crawl-url should be tracked in the crawler's debugging log.
- uncrawled (Any of: unexpired, unchanged, error, unknown) - A general
reason why this resource was not recrawled:
- unexpired: a previous copy of this crawl-url was still valid at the time of the crawl.
- unchanged: metadata indicated that the resource at this URL was unchanged since it was
last fetched.
- error: attempting to fetch the resource at this URL yielded an error and the previously
fetched copy was still valid at the time of the crawl.
- unknown: the resource was not recrawled for an unknown reason.
Used in conjunction with the uncrawled-why attribute.
- uncrawled-why (Text) - A specific reason why this resource was not
recrawled. Used in conjunction with the uncrawled attribute.
- crawled-locally (May only be: crawled-locally) - Flag used to indicate
that no contact with a remote server was required and that this URL shouldn't be involved
in the delay computation.
- priority (Integer default: 0) - An integer indicating the priority of this
crawl-url relative to other crawl-urls and crawl-deletes in the crawler's queues. A larger
value indicates a higher priority.
- input-priority (Integer) - Stores the actual priority on URLs enqueued by
the resume operation. Set internally by the crawler for temporary use. Usage:
Internal
- default-acl (Text) - The ACL that will be applied to the resource when no
other ACL is available.
- ip (Text) - The IP address used to fetch the resource at the URL.
- i-ip (Integer) - Integer identifier used to associate a particular IP
address with a DNS entry that corresponds to multiple IP addresses.
- forced-vse-key (Text) - Force the crawler to assign this vse-key to this
crawl-url rather than allow the crawler to assign one automatically.
- forced-vse-key-normalized (May only be: forced-vse-key-normalized) - Flag
that indicates the forced-vse-key should not be normalized by automatically including the
base-url in its value.
- synchronization (Any of: none, enqueued, to-be-indexed, indexed,
indexed-no-sync Deprecated values: to-be-crawled default: enqueued) - Indicates at
which point the crawler should return success for an enqueued crawl-url. All
synchronizations other than none will cause the enqueue to be committed to secondary
storage before a synchronous reply is issued.
- none: immediately after receiving the enqueue.
- enqueued: after the crawl-url is found to satisfy the crawl conditions and the crawler
will attempt to fetch it.
- to-be-indexed: after the resource at the URL has been crawled and converted. This
synchronization mode forces the indexer to do additional work to issue the synchronous
reply as promptly as possible.
- indexed: after the converted resource has been recorded by the indexer.
- indexed-no-sync: after the converted resource has been recorded by the indexer, but
does not force the indexer to do additional work.
- force-indexed-sync - Flag used to force the indexer to acknowledge document changes
in the audit log only after indexing is complete and the changes will be reflected in
search results. Usage: Internal
- enqueue-id (Text) - Unique string that identifies a particular
enqueue.
- enqueue-id-for-audit-log (Text) - String that will be used to identify
this enqueue in the audit-log instead of the value of the enqueue-id attribute. Usage:
Internal
- originator (Text) - Unique string that identifies the update
originator.
- arena (Text) - The name of the arena in which to include the data. If specified,
the indexer-service for this collection must have its arena option enabled before any data
is added to it. If the option is enabled, this attribute is required.
- parent-url (Text) - The parent URL that this URL should be associated
with. This is used to keep the internal graph consistent when updates must be made outside
of a normal crawl workflow.
- parent-url-normalized (May only be: parent-url-normalized) - A flag
indicating that the parent-url attribute has already been normalized. If this is not
present, the crawler will attempt to normalize that value.
- remote-time (xs:long) - Time at which the resource was fetched on the
remote server as seconds since the epoch. Usage: Internal
- remote-dependent (Any of: delete, uncrawled) - Indicates that this update
is dependent on a previous update: delete: this update deletes an existing crawl-url.
uncrawled: this update updates the expiry time of an existing crawl-url. Usage:
Internal
- remote-previous-collection (Text) - The collection for the previous update
to this crawl-url. Usage: Internal
- remote-previous-counter (Integer) - The counter value for the previous
update to this crawl-url. Usage: Internal
- remote-depend-collection (Text) - The collection for the update that this
crawl-url is predicated upon. Usage: Internal
- remote-depend-counter (Integer) - The counter value for the update that
this crawl-url is predicated upon. Usage: Internal
- remote-collection-id (Integer) - The internal ID for the collection name
that this crawl-url came from. Usage: Internal
- siphoned (Any of: duplicate, killed, filtered, terminated, unexpired,
uncrawled, unchanged, error, unretrievable, rebasing, replaced, input-full,
needed-gatekeeper, aborted, nonexistent, invalid, lc-too-long, remote-conflict,
unknown) - Indicates that the crawler encountered an obstacle that prevented the
crawl-url from meeting its requested synchronization:
- duplicate: The resource at this URL has already been crawled.
- killed: The URL was filtered by the crawl-conditions.
- filtered: The URL was filtered by the crawl-conditions and logged.
- terminated: The crawl-url could not be processed because the crawler was stopped after
the enqueue entered the pipeline.
- rebasing: The crawl-url could not be processed because the crawler is attempting a
rebase operation.
- unexpired: The previous crawl-url is not yet expired.
- unchanged: The resource at the URL is unchanged from its previously fetched copy.
- error: The resource at the URL could not be fetched, but the previously fetched copy is
still valid.
- unretrievable: The resource at the URL could not be fetched.
- replaced: The enqueue was replaced with a newer one.
- input-full: The enqueue could not be processed because the input queue is full.
- needed-gatekeeper: The enqueue was the child of an index-atomic node but would have
needed to be placed in the gatekeeper to proceed.
- aborted: The enqueue was aborted as part of a transaction.
- nonexistent: The crawl-url does not correspond to any crawl-urls in the crawler's
database.
- lc-too-long: The size of the url attribute exceeds the 499 byte limit set in light
crawler mode.
- remote-conflict: The crawl-url could not be processed because the collection has a more
recent update for this URL, either from the collection itself or another distributed
indexing node.
- unknown: The requested synchronization could not be met due to an unknown reason.
- enqueued-offline (May only be: enqueued-offline) - Flag that indicates
that the crawl-url was enqueued offline.
- orphaned-atomic (May only be: orphaned-atomic) - Flag that indicates this
crawl-url could not be indexed atomically due to a system error. As a result, this URL had
no effect on the index. Usage: Internal
- enqueue-type (Any of: none, forced, reenqueued, export, status default:
none) - Indicates how an enqueued crawl-url should be processed by the crawler:
- none: The crawl-url is subject to all the standard checks: deduplication, URL limits
and expiration.
- forced: Ignore the duplicates check and URL limits when processing the
crawl-url.
- reenqueued: Ignore the duplicates check, URL limits, and all expiration options when
processing the crawl-url.
- export: Fetch the resource located at the URL and return it to the caller. The
resource will not be converted or indexed, and the crawler's persistent state will not
be modified in any way as a result of this enqueue.
- status: Fetch the current status of a particular URL from the crawler's
database.
- deleted - Temporary flag used by the crawler to track crawl-urls queued for
deletion. Usage: Internal
- ignore-expires - Temporary flag used by the crawler to force directories to
always be recrawled. Usage: Internal
- enqueued (Text) - A checksum representing the outgoing links from this
crawl-url. This value is used internally to determine if the links have changed on
refresh.
- referrer-vertex (Integer) - A temporary attribute used by the crawler to
build the link-analysis table. Usage: Internal
- remote-collection (Text) - The name of the collection that this remote
update originated from. Usage: Internal
- remote-counter (Integer) - Remote update's counter value. Used to ensure
updates are applied sequentially. Usage: Internal
- remote-packet-id (Integer) - Temporary attribute used to keep track of an
update that will eventually be added to the journal. Usage: Internal
- referree-url (Text) - Temporary attribute used to track exact duplicate
information for remote updates. Usage: Internal
- request-queue-redir (Any of: output, indexer-output) - Temporary attribute
to ensure that outgoing links are recorded as input before the enqueueing crawl-url is
recorded as complete. Usage: Internal
- prodder (Any of: abort, index) - Attribute indicating that the crawl-url
isn't a 'real' crawl-url: it's a prodder for an index-atomic that will be used to tell the
indexer_output thread to abort an index-atomic or send it to the indexer. Usage:
Internal
- gatekeeper-action (Any of: reject, replace, add-to-queue) - Indicates the
action that the gatekeeper will take if it encounters this crawl-url while another
crawl-url sharing the url attribute is in the crawler's pipeline.
- reject: the gatekeeper will reject this crawl-url and prevent it from entering the
pipeline. This is the default behavior for crawl-urls enqueued as children of an
index-atomic in the non-distributed case.
- replace: the gatekeeper will reject all crawl-urls currently in its queue that share
the value of this crawl-url's url attribute, replacing them with this single
crawl-url. This is the default behavior.
- add-to-queue: the gatekeeper will add this crawl-url to the tail of its queue. This
is the default behavior for crawl-urls sent to a distributed indexing client as
children of an index-atomic node.
Usage: Internal
- index-atomically - Attribute used to indicate the crawl-url is part of an atomic
operation. Usage: Internal
- gatekeeper-list - Temporary attribute used to allow a URL to bypass the
gatekeeper mechanism if it was released from the gatekeeper or reenqueued. Usage:
Internal
- gatekeeper-id (xs:unsignedInt) - Temporary attribute used to associate
nodes from the gatekeeper with their location in the persistent XML store. Usage:
Internal
- offline-id (xs:unsignedInt) - Temporary attribute used to associate nodes
from the offline queue with their location in the offline store. Usage: Internal
- offline-initialize - Temporary attribute used when initializing offline nodes.
Usage: Internal
- input-on-resume (Boolean) - Temporary attribute used to inform the
crawler's input thread that the crawl-url was input on resume, and thus requires special
processing. Usage: Internal
- switched-status (Boolean) - Temporarily used by the apply changes
operation to indicate that a crawl-url has switched its status during the operation.
Usage: Internal
- from-input - Usage: Internal
- input-stub - Usage: Internal
- re-events (Integer) - Usage: Internal
- remembered (Boolean) - Usage: Internal
- notify-id (Integer) - Usage: Internal
- reply-id (Integer) - Usage: Internal
- obey-no-follow - Usage: Internal
- normalized - Temporary flag used to instruct the input processing thread to avoid
normalizing the URL or applying the crawl conditions. Usage: Internal
- url-normalized - Temporary flag used to instruct the input processing thread to
avoid normalizing the URL while still applying the crawl conditions. This is set on nodes
that are reenqueued due to an indexer disconnection. Usage: Internal
- wait-on-enqueued - Usage: Internal
- graph-id-high-water (xs:unsignedInt) - Usage: Internal
- last-at (xs:long) - Usage: Internal
- indexed-n-docs (xs:unsignedInt) - Number of indexed documents
corresponding to this URL.
- indexed-n-contents (xs:unsignedInt) - Number of indexed contents
corresponding to this URL.
- indexed-n-bytes (xs:long) - Number of indexed bytes corresponding to this
URL.
- light-crawler (May only be: light-crawler) - Usage: Internal
- remove-xml-data (Any of: always, on-success, input) - Usage: Internal
- disguised-delete (May only be: disguised-delete) - Temporary flag used to
indicate that the crawl-url is really a light crawler crawl-delete for a URL that the
crawler has no record of. Usage: Internal
- remote-counter-increased (May only be: remote-counter-increased) -
Temporary flag used to indicate that the update caused the remote counter for its
collection to increase. Usage: Internal
- delete-enqueue-id (Text) - Usage: Internal
- delete-originator (Text) - Usage: Internal
- delete-index-atomically (May only be: delete-index-atomically) - Usage:
Internal
- purge-pending (May only be: purge-pending) - Temporary flag used to
indicate that the crawl-url is deleted from the logs but not the index. Usage:
Internal
- only-input - Temporary attribute used to indicate the crawl-url was never logged
to the authority table. Usage: Internal
- Any user-defined attribute
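As an illustration of how the attributes above fit together, the following is a minimal Python sketch that builds a crawl-url element for an enqueue and summarizes one read back from a status reply. It uses only the standard library. The helper names build_crawl_url and summarize are hypothetical, as is the idea of validating enumerated values on the client side; the attribute names and allowed values are taken from the list above, and nothing here describes how the crawler itself processes the node.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# Enumerated values copied from the attribute list above.
SYNCHRONIZATION = {"none", "enqueued", "to-be-indexed", "indexed", "indexed-no-sync"}
ENQUEUE_TYPE = {"none", "forced", "reenqueued", "export", "status"}
STATE = {"pending", "success", "warning", "error"}


def build_crawl_url(url, synchronization="enqueued", enqueue_type="none", priority=0):
    """Hypothetical helper: build a crawl-url element with a few documented attributes."""
    if synchronization not in SYNCHRONIZATION:
        raise ValueError(f"unknown synchronization: {synchronization!r}")
    if enqueue_type not in ENQUEUE_TYPE:
        raise ValueError(f"unknown enqueue-type: {enqueue_type!r}")
    return ET.Element("crawl-url", {
        "url": url,
        "synchronization": synchronization,
        "enqueue-type": enqueue_type,
        "priority": str(int(priority)),  # a larger value means a higher priority
    })


def summarize(node):
    """Hypothetical helper: pull commonly inspected attributes from a crawl-url element."""
    state = node.get("state")
    if state is not None and state not in STATE:
        raise ValueError(f"unknown state: {state!r}")
    at = node.get("at")  # seconds since the epoch, per the attribute list
    return {
        "url": node.get("url"),
        "state": state,
        "status": node.get("status", "none"),        # documented default: none
        "siphoned": node.get("siphoned"),            # why the synchronization was not met
        "n_redirs": int(node.get("n-redirs", "0")),  # documented default: 0
        "hops": int(node.get("hops", "0")),          # documented default: 0
        "fetched_at": (
            datetime.fromtimestamp(int(at), tz=timezone.utc) if at is not None else None
        ),
    }


if __name__ == "__main__":
    node = build_crawl_url("http://example.com/", synchronization="indexed", priority=5)
    print(ET.tostring(node, encoding="unicode"))
```

When state is error, the summary's siphoned value is the first place to look, since the attribute list says it indicates why an enqueue did not reach the indexer.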
Children
- Use these in the listed order. The sequence may not repeat.
- crawl-pipeline: (Exactly 1) - Container node for profiling data.
- curl-options: (Exactly 1) - Container for options used in fetching a particular
URL.
- crawl-header: (Exactly 1) - Node containing HTTP header data for an associated
URL.
- old-crawl: (Exactly 1) - A container for the previous copy of a crawl-url.
- crawl-links: (Exactly 1) - Used by distributed search.
- completed-crawl: (Exactly 1) - Used by distributed search.
- indexed-crawl: (Exactly 1) - Used by distributed search.
- log: (Exactly 1) - Tag in which the log nodes are collected.
- crawl-data: (At least 1) - Node that encapsulates all crawler state corresponding
to a particular document.
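The ordering rule for children can be checked mechanically. The sketch below, in Python with only the standard library, reports children that are unknown, out of the listed order, or repeated where the list says "Exactly 1". The function name child_order_problems is hypothetical. The sketch deliberately does not report missing children; that check would need to be added if the stated cardinalities are to be enforced strictly.

```python
import xml.etree.ElementTree as ET

# Child element names in the order given in the list above.
CHILD_ORDER = [
    "crawl-pipeline", "curl-options", "crawl-header", "old-crawl",
    "crawl-links", "completed-crawl", "indexed-crawl", "log", "crawl-data",
]
# crawl-data is listed as "At least 1"; every other child as "Exactly 1".
REPEATABLE = {"crawl-data"}


def child_order_problems(node):
    """Hypothetical helper: list ordering problems among a crawl-url's children."""
    problems = []
    seen = set()
    last_rank = -1  # position in CHILD_ORDER of the latest-ranked child seen so far
    for child in node:
        if child.tag not in CHILD_ORDER:
            problems.append(f"unexpected child: {child.tag}")
            continue
        rank = CHILD_ORDER.index(child.tag)
        if rank < last_rank:
            problems.append(f"{child.tag} appears after {CHILD_ORDER[last_rank]}")
        if child.tag in seen and child.tag not in REPEATABLE:
            problems.append(f"{child.tag} appears more than once")
        seen.add(child.tag)
        last_rank = max(last_rank, rank)
    return problems


if __name__ == "__main__":
    doc = ET.fromstring(
        '<crawl-url url="http://example.com/"><log/><curl-options/></crawl-url>'
    )
    for problem in child_order_problems(doc):
        print(problem)
```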