Handling Enqueueing Errors

Enqueueing a crawl-urls node to the crawler returns a crawler-service-enqueue-response element. This element summarizes the status of the enqueue request as a whole, and also reports the status of each crawl-url, crawl-delete, or index-atomic element that was associated with that request.

The crawler-service-enqueue-response element provides a number of attributes, each of which reports on a different aspect of the enqueue operation:

Note: See the documentation for the crawl-url or crawl-delete elements for detailed information about the synchronization properties for an enqueue operation. Complete information for all Watson Explorer Engine elements is available in the Watson Explorer Engine Schema Reference manual.

Enqueue requests that raise exceptions, such as those for which the error attribute is present on the returned crawler-service-enqueue-response node, are obvious indications that the enqueue operation did not succeed. The exception handler for this case should examine the value of the error attribute to determine how to proceed. Unless the value of the error attribute is invalid, you will usually want to re-send the data (after making sure that the crawler is able to receive enqueue requests). An error value of invalid means that the XML of the enqueue request itself is malformed, in which case resending the same request will not help.
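The retry decision described above can be sketched as a small helper. Only the invalid value is taken from this document; the "busy" value used in the example is hypothetical, so consult the Watson Explorer Engine Schema Reference for the actual set of error values:

```java
public class EnqueueRetryPolicy {
    /**
     * Decides whether an enqueue request should be re-sent, based on the
     * error attribute of the crawler-service-enqueue-response node.
     * Returns true if re-sending the request may succeed.
     */
    public static boolean shouldRetry(String errorValue) {
        if (errorValue == null) {
            return false;  // no error attribute: the request was accepted
        }
        // "invalid" means the request XML itself is malformed, so
        // re-sending the identical payload cannot succeed.
        return !"invalid".equals(errorValue);
    }

    public static void main(String[] args) {
        System.out.println(shouldRetry(null));       // false: nothing to do
        System.out.println(shouldRetry("busy"));     // true: hypothetical transient error
        System.out.println(shouldRetry("invalid"));  // false: fix the XML first
    }
}
```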

Next, check the value of the n-failed attribute. If it is greater than 0, iterate over the nodes that you attempted to enqueue to identify the ones that failed.

To get accurate and detailed information about the source of an enqueue problem, you must check more than the error and n-failed attributes of a crawler-service-enqueue-response node. For example, the n-failed attribute is not incremented for problems such as the following:

Applications should therefore still iterate through all of the crawl-delete, crawl-url, and index-atomic nodes in a crawler-service-enqueue-response node and check the state attribute of each to identify any top-level operations that are not set to pending or success. The application should then examine the error attribute of each such node to determine the actual result of that individual enqueue operation.
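As an illustration of this traversal, the following self-contained sketch walks a raw crawler-service-enqueue-response document with the standard Java DOM API and flags any child element whose state attribute is neither pending nor success. The sample response, its URLs, and the error value conversion are invented for illustration; in a real application you would traverse the generated binding objects instead of raw XML, as in the C# and Java examples later in this section.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class EnqueueResponseCheck {
    public static void main(String[] args) throws Exception {
        // Hypothetical response: both URLs were counted as enqueued
        // (n-failed="0"), but the second one still ended up in an
        // error state.
        String xml =
              "<crawler-service-enqueue-response n-success=\"2\" n-failed=\"0\">"
            + "<crawl-url url=\"http://example.com/a\" state=\"success\"/>"
            + "<crawl-url url=\"http://example.com/b\" state=\"error\""
            + " error=\"conversion\"/>"
            + "</crawler-service-enqueue-response>";
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(
                        xml.getBytes(StandardCharsets.UTF_8)));

        NodeList children = doc.getDocumentElement().getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node n = children.item(i);
            if (n.getNodeType() != Node.ELEMENT_NODE) {
                continue;
            }
            Element e = (Element) n;
            String state = e.getAttribute("state");
            // Any state other than pending or success warrants a closer
            // look at the error attribute of that node.
            if (!"pending".equals(state) && !"success".equals(state)) {
                System.out.println(e.getAttribute("url") + " failed: "
                        + e.getAttribute("error"));
            }
        }
    }
}
```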

Tip: As mentioned previously, when enqueueing data with a synchronization mode of to-be-crawled or stronger, there is a guarantee that if the data can be indexed, it will be. This does not mean that errors cannot occur later in the process without being reported in the enqueue response. For example, when enqueueing in the enqueued mode (the recommended mode, which guarantees a low-latency response), a conversion error could occur later in the process. In this type of error case, however, it would not be wise to re-enqueue the data, because it would most likely fail to be processed the second time as well.

Errors for which you do not receive a synchronous response do not require any immediate action. You can subsequently examine and report them by querying the system logs.

The enqueueing functions provide an exception-on-failure argument that can be set to true to cause an exception to be thrown if any of the enqueued URLs cannot be processed. In general, it is preferable to leave this option set to false so that you receive a response that can be traversed and analyzed.

Each crawl-url or crawl-delete element that is enqueued is returned without its data but with status information. See the Watson Explorer Engine Schema Reference for information about these and other elements in the Watson Explorer Engine schema.

XML message:

    <crawler-service-enqueue-response n-success="0" n-failed="2">
      <crawl-url url="http://vivisimo.com" siphoned="duplicate"
                 hops="0" vertex="5" priority="0" />
    </crawler-service-enqueue-response>

In C#:

    try
    {
        // Enqueue the request and inspect the returned response rather
        // than relying solely on exceptions.
        SearchCollectionEnqueueResponse scer = port.SearchCollectionEnqueue(sce);
        if (scer != null && scer.crawlerserviceenqueueresponse.nfailedSpecified
              && scer.crawlerserviceenqueueresponse.nfailed > 0)
        {
            // Report every crawl-url or crawl-delete that the crawler
            // siphoned (that is, rejected) instead of accepting.
            foreach (Object o in scer.crawlerserviceenqueueresponse.Items)
            {
                if (o is crawlurl)
                {
                    crawlurl cu = (crawlurl)o;
                    if (cu.siphonedSpecified)
                        System.Console.WriteLine(cu.url + " failed: " + cu.siphoned);
                }
                if (o is crawldelete)
                {
                    crawldelete cd = (crawldelete)o;
                    if (cd.siphonedSpecified)
                        System.Console.WriteLine(cd.url + " failed: " + cd.siphoned);
                }
            }
        }
    }
    catch (System.Exception ex)
    {
        handleException(ex);
    }

In Java:

    SearchCollectionEnqueueResponse scer =
        port.searchCollectionEnqueue(sce);
    CrawlerServiceEnqueueSuccess cses =
        scer.getCrawlerServiceEnqueueSuccess();
    if (cses.getNFailed() > 0) {
        // Report every crawl-url or crawl-delete that the crawler
        // siphoned (that is, rejected) instead of accepting.
        for (Object o : cses.getCrawlUrlOrCrawlDelete()) {
            if (o.getClass() == CrawlUrl.class) {
                CrawlUrl cu = (CrawlUrl) o;
                if (cu.getSiphoned() != null)
                    System.out.println(cu.getUrl() + " failed: "
                            + cu.getSiphoned());
            } else if (o.getClass() == CrawlDelete.class) {
                CrawlDelete cd = (CrawlDelete) o;
                if (cd.getSiphoned() != null)
                    System.out.println(cd.getUrl() + " failed: "
                            + cd.getSiphoned());
            }
        }
    }