HTTP status codes returned to the Web crawler

When you monitor a Web crawler, you can view information about the HTTP status codes that the crawler receives from the pages that it attempts to crawl.

Table summary

When you monitor the Web crawler history, or monitor the status of a specific URL, you can see information about the HTTP status codes that were returned to the crawler. You can use this information to manage the crawl space and optimize crawler performance. For example, if the crawler receives a large number of HTTP status codes for a URL, and the status codes indicate that pages at that location cannot be crawled, you can improve performance by removing that URL from the crawl space.

The following table lists the HTTP status codes and how the Web crawler interprets them. Values from 100 to 505 are standard HTTP status codes (see the Hypertext Transfer Protocol standard for more information). The remaining HTTP status codes are proprietary to Watson Explorer Content Analytics and the Web crawler.

Table 1. HTTP status codes from the Web crawler
Code         Description
NULL         Uncrawled
100          Continue
101          Switching protocols
200          Successful
201          Created
202          Accepted
203          Non-authoritative information
204          No content
205          Reset content
206          Partial content
300          Multiple choices
301          Moved permanently
302          Found
303          See other
304          Not modified
305          Use proxy
306          (Unused)
307          Temporary redirect
400          Bad request
401          Unauthorized
402          Payment required
403          Forbidden
404          Not found
405          Method not allowed
406          Not acceptable
407          Proxy authentication required
408          Request timeout
409          Conflict
410          Gone
411          Length required
412          Precondition failed
413          Request entity too large
414          Request URI is too long
415          Unsupported media type
416          Requested range not satisfiable
417          Expectation failed
500          Internal server error
501          Not implemented
502          Bad gateway
503          Service unavailable
504          Gateway timeout
505          HTTP version not supported
611          Read error
612          Connect error
613          Read timeout
614          SSL handshake failed
615          Other read error
616          FBA anomaly
617          Encoding error
618          Redirect with no redirect URL
680          DNS lookup failure
690          Malformed URL
691          Lost connection (URLFetcher)
692          Write timeout (URLFetcher)
693          Select fail (URLFetcher)
694          Write error (URLFetcher)
695          Incomplete block header (URLFetcher)
699          Unexpected error (URLFetcher)
700          Parse error (no header end)
710          Parse error (header)
720          Parse error (no HTTP code)
730          Parse error (body)
740 or 4044  Excluded by robots.txt file
741          Robots temporarily unavailable
760          Excluded by crawl space definition
761          Disallowed by local crawl space; allowed by global
770          Bad protocol or nonstandard system port
780          Excluded by file type exclusions
786          Invalid URL
2004         No index META tag
3020         Soft redirect

Table notes

4xx status codes
You will rarely see a 400 (bad request) code. According to the HTTP status code standard, 4xx codes indicate that the client (the crawler) is at fault. However, the problem is usually at the server or in the URL that the crawler received as a link. For example, some Web servers do not tolerate URLs that try to navigate up from the site root (such as http://xyz.ibm.com/../../sales). Other Web servers have no problem with this upward navigation and ignore the parent directory operator (..) when the crawler is already at the root.
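As an illustration, the following Python sketch (not part of the product) shows how a lenient RFC 3986 resolver collapses parent directory operators that would otherwise climb above the site root; the host name is only an example:

    from urllib.parse import urljoin

    # RFC 3986 resolution discards ".." segments that would climb above
    # the root, which is how tolerant Web servers treat such links.
    print(urljoin("http://xyz.ibm.com/", "../../sales"))
    # -> http://xyz.ibm.com/sales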

Some servers treat a request for the site root as an error, and some obsolete links might request operations that are no longer recognized or implemented. When asked for a page that it no longer serves, the application server throws an exception, which causes the Web server to return the HTTP status code 400 because the request is no longer considered valid.

615
Indicates that the crawler server that downloads data from Web sites encountered an unexpected exception. A large number of these status codes might indicate that there is a problem with the crawler.
61x status codes
Except for 615, the 61x status codes indicate problems that can be expected during crawling, such as timeouts. The following status codes might require corrective action:
611, 612, and 613
Slow sites or poor network performance might be the cause of these problems (see the sketch after these entries).
611
Indicates that an error occurred when the crawler retrieved a document.
612
Indicates that an error occurred when the crawler attempted to connect to a Web server.
613
Indicates that a timeout occurred while the crawler was retrieving a document.
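For example, the following Python sketch (an illustration only, not the product's code) shows one way that a fetcher could map these network failures onto the 61x codes:

    import socket
    import urllib.error
    import urllib.request

    def fetch_status(url, timeout=10):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                resp.read()
                return resp.status              # standard HTTP status code
        except urllib.error.HTTPError as exc:
            return exc.code                     # 4xx or 5xx from the server
        except urllib.error.URLError as exc:
            if isinstance(exc.reason, socket.timeout):
                return 613                      # read timeout
            if isinstance(exc.reason, ConnectionRefusedError):
                return 612                      # connect error
            return 611                          # other read error
        except socket.timeout:
            return 613                          # read timed out mid-response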
614
Indicates that the crawler is unable to crawl secure (HTTPS) sites. If you believe that these sites should be accessible, verify that the certificates are set up correctly on the crawler server and on the target Web server. For example, if a site is certified by a certificate authority (CA) that is not yet recognized, you can add the new CA to the trust store that is used by the crawler.

Also look at how self-signed certificates are configured on the sites that you are trying to crawl. The crawler is configured to accept self-signed certificates. Some sites create a self-signed certificate for a root URL (such as http://sales.ibm.com/), and then try to use that certificate on subdomains (such as http://internal.sales.ibm.com/). The crawler cannot accept certificates that are used in this manner. It accepts self-signed certificates only if the domain name of the subject (sales.ibm.com) and the signer of the certificate match the domain name of the page that is being requested.
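To diagnose a 614 outside the crawler, you can attempt the same TLS handshake yourself. The following Python sketch is illustrative only; the host name is an example, and the commented-out line marks where a newly trusted CA could be loaded:

    import socket
    import ssl

    host = "sales.ibm.com"                      # example host
    ctx = ssl.create_default_context()          # uses the system trust store
    # ctx.load_verify_locations("new-ca.pem")   # a CA that you choose to trust

    with socket.create_connection((host, 443), timeout=10) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            # Success means the handshake and host-name check both passed.
            print(tls.getpeercert()["subject"])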

616
Indicates that the login form for form-based authentication (FBA) still appears in the download after reauthentication.

If the information provided in the FBA configuration file (login form, plus authentication data such as the user name, password, and so on) fails to authenticate the crawler, status code 616 is assigned to all pages dependent on the form-based authentication. The administrator should investigate to find out why the FBA configuration is not working.
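One plausible detection of this condition, shown here as an assumption for illustration rather than the product's actual check, is to look for the login form in the page that was downloaded after authentication:

    # Hypothetical heuristic: if the page fetched after logging in still
    # contains the login form, treat the download as an FBA anomaly (616).
    def is_fba_anomaly(html: str, form_action: str = "/login") -> bool:
        lowered = html.lower()
        return "<form" in lowered and form_action.lower() in lowered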

617
Indicates that the crawler could not create a string from a document's byte content, either because the encoding name (charset) is invalid or because the document contains bytes that are not valid for that encoding.
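Both failure modes are easy to reproduce. The following Python sketch shows an invalid byte sequence and an unknown charset name:

    data = b"\xff\xfe not valid UTF-8"

    try:
        data.decode("utf-8")            # bytes invalid for the declared charset
    except UnicodeDecodeError as exc:
        print("invalid bytes:", exc)

    try:
        data.decode("no-such-charset")  # the charset name itself is invalid
    except LookupError as exc:
        print("unknown encoding:", exc)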
618
Indicates that the crawler received one of the following HTTP status codes, but the redirect URL is not valid, possibly because the Location header of the HTTP response is not valid:
301 Moved Permanently
302 Found
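A sketch of such a check, using only the Python standard library (the host and path are examples; http.client does not follow redirects, so the raw 301 or 302 response is visible):

    import http.client

    conn = http.client.HTTPConnection("example.com", timeout=10)
    conn.request("GET", "/old-page")
    resp = conn.getresponse()

    if resp.status in (301, 302):
        location = resp.getheader("Location")
        if not location:
            print("618: redirect with no redirect URL")
        else:
            print("follow redirect to", location)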
680
Indicates that the crawler was not able to obtain IP addresses for hosts in the crawl space, perhaps because of network access problems. This type of error means that the crawler cannot crawl entire sites, not just individual URLs. A large number of these status codes greatly reduces throughput.
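The failing step is the name resolution that precedes every connection. For example (the host name is deliberately unresolvable):

    import socket

    try:
        socket.getaddrinfo("no-such-host.invalid", 80)
    except socket.gaierror as exc:
        # This failure corresponds to code 680 and blocks the whole site.
        print("680: DNS lookup failure:", exc)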
69x status codes
Status codes 690 through 699 are never recorded in the crawler's persistent database. These codes do not reflect the true outcome of a download from a remote host, but rather a temporary condition inside the crawler, such as one component shutting down while another is waiting for or sending a result. These status codes appear in some logs, but not in the persistent record, and so should not be used as selection-set values.
7xx status codes
The 7xx codes are mostly due to rules in the crawl space:
710 - 730
Indicate that problems prevented the crawler from doing a complete download, or that the crawler encountered invalid HTML data at a site. If you see a large number of these types of status codes, contact your support representative for assistance.
740 or 4044
Indicate that the content of a file cannot be indexed because the document is excluded by restrictions in the site's robots.txt file.
740
Indicates that anchor links that point to the excluded document can be included in the index.
4044
Indicates that the anchor links in documents that point to the excluded document are also excluded from the index.
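The following Python sketch shows the kind of robots.txt check that produces these codes; the URLs and user agent name are examples:

    from urllib import robotparser

    rp = robotparser.RobotFileParser("http://example.com/robots.txt")
    rp.read()

    if not rp.can_fetch("MyCrawler", "http://example.com/private/report.html"):
        # Maps to 740 or 4044, depending on how anchor links are handled.
        print("excluded by robots.txt")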
741
Indicates that a site has a robots.txt file that allows the crawl, but the download failed. If the crawler is repeatedly unable to crawl the URL, the URL is removed from the crawl space. If you see a large number of this type of status code, check whether the target site is temporarily or permanently unavailable. If the target site is no longer available, remove it from the crawl space.
The remaining 7xx status codes mostly occur when you make changes to the crawl space after the crawler has been running for a while. These status codes typically do not indicate problems that you need to address.
3020
Indicates that a document that is returned with status code 200 contains a Location header that refers the user agent to another URL.
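A sketch of how such a response might be recognized (the host and path are examples):

    import http.client

    conn = http.client.HTTPConnection("example.com", timeout=10)
    conn.request("GET", "/page")
    resp = conn.getresponse()

    if resp.status == 200 and resp.getheader("Location"):
        # A 200 response that nevertheless redirects: a soft redirect (3020).
        print("3020: soft redirect to", resp.getheader("Location"))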