crawler-status

A node used to communicate the crawler's current state and crawl progress.

Attributes

  • from-cache (May only be: from-cache) - Flag indicating that the status was served from an earlier cached copy. This occurs if the status is fetched while the crawler is saving its state to the crawler database.
  • version (Text) - Version of the crawler binary.
  • id (Text) - A unique identifier for this particular status node.
  • stopping-time (xs:long) - Time at which the stop command was received as seconds since the epoch.
  • expected-stop-time (xs:long) - Time at which the crawler's idle time is expected to end as seconds since the epoch.
  • time (xs:long, required) - Time at which this status node was produced by the crawler as seconds since the epoch.
  • host (Text) - Usage: This functionality is deprecated.
  • n-input (xs:unsignedLong default: 0) - Total number of unique URLs input to the crawler for the purpose of indexing.
  • n-output (xs:unsignedLong default: 0) - Total number of unique URLs successfully crawled and indexed.
  • n-errors (xs:unsignedLong default: 0) - Total number of unique URLs that resulted in a fetch error or a conversion error.
  • n-error-rows (xs:unsignedLong default: 0) - TBD: number of rows in 'errors' table in log.sqlt.
  • n-http-errors (xs:unsignedLong default: 0) - Total number of unique URLs that resulted in an HTTP fetch error.
  • n-http-location (xs:unsignedLong default: 0) - Total number of unique URLs that resulted in an HTTP redirection.
  • n-filtered (xs:unsignedLong default: 0) - Total number of unique URLs that were filtered by the crawler conditional settings.
  • n-robots (xs:unsignedLong default: 0) - Total number of unique URLs that were not crawled due to the server's robots.txt file.
  • n-pending (Integer default: 0) - Total number of unique URLs input for indexing that are still being processed by the crawler.
  • n-pending-internal (Integer default: 0) - Total number of unique URLs, crawl-delete nodes, and native file export requests that are still being processed by the crawler.
  • n-awaiting-gate (Integer default: 0) - Total number of crawl-urls or crawl-deletes that are waiting to be processed because the crawler is currently processing another node with the same url or vertex attribute.
  • n-awaiting-input (Integer default: 0) - Total number of crawl-urls or crawl-deletes that are waiting for initial processing by the crawler.
  • n-offline-queue (xs:unsignedLong default: 0) - Total number of crawl-urls or crawl-deletes that are waiting in the offline queue for processing.
  • n-awaiting-index-input (Integer default: 0) - Total number of crawl-urls or crawl-deletes that are waiting to be sent to the indexer.
  • n-awaiting-index-reply (Integer default: 0) - Total number of crawl-urls or crawl-deletes that have been sent to the indexer but have not yet been confirmed as indexed.
  • conversion-time (Integer default: 0) - Total time spent converting data, in milliseconds.
  • n-sub (Integer default: 0) - Total number of crawl-datas processed by the crawler.
  • n-bytes (Decimal number default: 0) - Total size of all resources crawled, in bytes. Value includes URLs that were crawled from cache.
  • n-dl-bytes (Decimal number default: 0) - Total size of all resources crawled, in bytes. Value excludes URLs that were crawled from cache.
  • n-redirect (Integer default: 0) - Total number of redirected URLs processed by the crawler.
  • n-duplicates (Integer default: 0) - Total number of exact duplicate URLs processed by the crawler.
  • n-deleted (Integer default: 0) - Total number of URLs deleted by the crawler.
  • n-cache-complete (Integer default: 0) - Total number of URLs crawled from the cache.
  • converted-size (Decimal number default: 0) - Total size of all converted data in bytes. Value excludes URLs that were crawled from cache.
  • elapsed (Integer default: 0) - Total elapsed time for this crawl in seconds. On resume, all previous crawl times are included.
  • this-elapsed (Integer default: 0) - Total elapsed time for this crawl in seconds. On resume, all previous crawl times are excluded.
  • upgrade-schema (May only be: upgrade-schema) - When set, this flag indicates that the crawler is in the process of updating its log schema as part of a backward compatibility procedure.
  • sanitize-rebase (May only be: sanitize-rebase) - When set, this flag indicates that the crawler is in the process of sanitizing records obtained from another crawler as a result of a successful rebase request.
  • request-rebase (May only be: request-rebase) - When set, this flag indicates that the crawler is in the process of requesting a rebase from a remote collection.
  • copy-rebase (May only be: copy-rebase) - When set, this flag indicates that the crawler is in the process of copying files in order to service a rebase request from a remote collection.
  • receive-rebase (May only be: receive-rebase) - When set, this flag indicates that the crawler is in the process of receiving files from a remote collection as part of the rebase operation.
  • resume (May only be: resume) - When set, this flag indicates that the crawler is in the process of a resume operation.
  • complete (Any of: complete, aborted, unexpected, docs-limit, urls-limit, input-urls-limit, time-limit) - When set, this flag indicates that the crawler has finished crawling the seed URLs. The reason is stored as the attribute's value:
    • complete: the seeds have been completely crawled. This is independent of the work done after crawling the seed URLs, such as processing externally enqueued URLs.
    • aborted: the crawl was aborted.
    • unexpected: the seed was completely crawled, but the crawler received additional work in the form of an enqueue.
    • docs-limit: the crawler exceeded the maximum number of documents to be crawled. Deprecated.
    • urls-limit: the crawler exceeded the maximum number of completed URLs.
    • input-urls-limit: the crawler exceeded the maximum number of input URLs.
    • time-limit: the crawler exceeded the maximum crawling time.
  • idle (May only be: idle) - When set, this flag indicates that the crawler is waiting for additional work in an idle state.
  • final (May only be: final) - When set, this flag indicates that this is the last value of the crawler-status node that will be recorded on this run. Additional status requests will result in this node being returned until the crawler is restarted.
  • performing-vacuum (May only be: performing-vacuum) - When set, this flag indicates that the crawler is performing a requested vacuum operation to compact its database. The vacuum operation may take a long time to perform, so enqueue operations should be suspended until it is finished.
  • error - Usage: Internal
  • config-md5
  • service-status (Any of: stopped, running) - Provides a simple way to determine if the service is running or stopped.
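Because most counter attributes default to 0 and flag attributes are simply absent when unset, a consumer should read attributes defensively. A minimal sketch of reading a status snapshot with Python's standard library (the sample document and all attribute values below are invented for illustration, not taken from a real crawl):

```python
import xml.etree.ElementTree as ET

# Hypothetical crawler-status snapshot; attribute values are illustrative only.
SAMPLE = """\
<crawler-status version="1.0" id="example-1" time="1700000000"
                n-input="120" n-output="100" n-errors="5"
                n-pending="15" idle="idle"/>
"""

status = ET.fromstring(SAMPLE)

def counter(name):
    # Counter attributes default to 0 when absent.
    return int(status.get(name, "0"))

n_input = counter("n-input")
done = counter("n-output") + counter("n-errors")   # finished, for better or worse
pending = counter("n-pending")

# Flag attributes (idle, final, complete, ...) are present only when set.
is_idle = status.get("idle") is not None

print("progress: %d/%d (%d pending)" % (done, n_input, pending))
print("idle:", is_idle)
```

Testing for attribute presence (rather than a particular value) keeps the check valid for every flag attribute, since each flag's value equals its name.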

Children

  • Use these in the listed order. The sequence may not repeat.
    • converter-timings: (At most 1) - Container for all the timing status.
    • crawl-thread: (Zero or more) - A node that indicates the state of a crawler thread.
    • crawl-remote-status: (Zero or more) - A node indicating the status of a distributed search collection that this collection is requesting or serving.
    • crawl-client-status: (Zero or more) - A node indicating the status of the distributed search clients.
    • crawler-status: (Zero or more) - A node used to communicate the crawler's current state and crawl progress.
    • crawl-hops-output: (At most 1) - Contains hop statistics for all URLs completely processed by the crawler.
    • crawl-hops-input: (At most 1) - Contains hop statistics for all URLs currently being processed by the crawler.
    • queues: (At most 1) - Container for detailed request queue status information.
    • crawl-remote-all-status: (At most 1) - A container for nodes describing the state of a distributed search collection's clients and servers.
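Since the children appear in a fixed, non-repeating sequence, repeated children such as crawl-thread can be collected directly by tag name. A short sketch (the sample document and its state attribute values are hypothetical, used only to show the access pattern):

```python
import xml.etree.ElementTree as ET

# Hypothetical status document with children in the documented order.
SAMPLE = """\
<crawler-status time="1700000000">
  <converter-timings/>
  <crawl-thread state="idle"/>
  <crawl-thread state="fetching"/>
  <queues/>
</crawler-status>
"""

status = ET.fromstring(SAMPLE)

# findall returns the crawl-thread children in document order.
threads = status.findall("crawl-thread")
states = [t.get("state") for t in threads]

print("%d threads: %s" % (len(threads), states))
```

The same pattern applies to the other repeatable children (crawl-remote-status, crawl-client-status, crawler-status); the at-most-one children can be fetched with find(), which returns None when the child is absent.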