HTTP status codes returned to the Web crawler

When you monitor a Web crawler, you can view information about the HTTP status codes that the crawler receives from the pages that it attempts to crawl.

Table summary

When you monitor the Web crawler history, or monitor the status of a specific URL, you can see information about the HTTP status codes that were returned to the crawler. You can use this information to manage the crawl space and optimize crawler performance. For example, if the crawler receives a large number of HTTP status codes for a URL, and the status codes indicate that pages at that location cannot be crawled, you can improve performance by removing that URL from the crawl space.

The following table lists the HTTP status codes and how the Web crawler interprets them. Values from 100 to 505 are standard HTTP status codes (see the Hypertext Transfer Protocol standard for more information). The remaining HTTP status codes are proprietary to Watson Explorer Content Analytics and the Web crawler.

Table 1. HTTP status codes from the Web crawler
Code         Description
NULL         Uncrawled
100          Continue
101          Switching protocols
200          Successful
201          Created
202          Accepted
203          Non-authoritative information
204          No content
205          Reset content
206          Partial content
300          Multiple choices
301          Moved permanently
302          Found
303          See other
304          Not modified
305          Use proxy
306          (Unused)
307          Temporary redirect
400          Bad request
401          Unauthorized
402          Payment required
403          Forbidden
404          Not found
405          Method not allowed
406          Not acceptable
407          Proxy authentication required
408          Request timeout
409          Conflict
410          Gone
411          Length required
412          Precondition failed
413          Request entity too large
414          Request URI is too long
415          Unsupported media type
416          Requested range not satisfiable
417          Expectation failed
500          Internal server error
501          Not implemented
502          Bad gateway
503          Service unavailable
504          Gateway timeout
505          HTTP version not supported
611          Read error
612          Connect error
613          Read timeout
614          SSL handshake failed
615          Other read error
616          FBA anomaly
617          Encoding error
618          Redirect with no redirect URL
680          DNS lookup failure
690          Malformed URL
691          Lost connection (URLFetcher)
692          Write timeout (URLFetcher)
693          Select fail (URLFetcher)
694          Write error (URLFetcher)
695          Incomplete block header (URLFetcher)
699          Unexpected error (URLFetcher)
700          Parse error (no header end)
710          Parse error (header)
720          Parse error (no HTTP code)
730          Parse error (body)
740 or 4044  Excluded by robots.txt file
741          Robots temporarily unavailable
760          Excluded by crawl space definition
761          Disallowed by local crawl space; allowed by global
770          Bad protocol or nonstandard system port
780          Excluded by file type exclusions
786          Invalid URL
2004         No index META tag
3020         Soft redirect

Table notes

4xx status codes
You will rarely see a 400 (bad request) code. According to the HTTP status code standard, 4xx codes indicate that the client (the crawler) is at fault. However, the problem is usually at the server or in the URL that the crawler received as a link. For example, some Web servers do not tolerate URLs that try to navigate up from the site root (such as http://xyz.ibm.com/../../sales). Other Web servers have no problem with this upward navigation and ignore the parent directory operator (..) when the crawler is already at the root.
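As an illustration, the following Python sketch (not part of the product) shows how a lenient RFC 3986 resolver collapses parent directory operators that would otherwise climb above the site root; the host name is only an example:

    from urllib.parse import urljoin

    # RFC 3986 resolution discards ".." segments that would climb above
    # the root, which is how tolerant Web servers treat such links.
    print(urljoin("http://xyz.ibm.com/", "../../sales"))
    # -> http://xyz.ibm.com/sales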

Some servers treat a request for the site root as an error, and some obsolete links might request operations that are no longer recognized or implemented. When asked for a page that it no longer serves, the application server throws an exception, which causes the Web server to return the HTTP status code 400 because the request is no longer considered valid.

615
Indicates that the crawler server that downloads data from Web sites encountered an unexpected exception. A large number of these status codes might indicate that there is a problem with the crawler.
61x status codes
Except for 615, the 61x status codes indicate problems that can be expected during crawling, such as timeouts. The following status codes might require corrective action:
611, 612, and 613
Slow sites or poor network performance might be the cause of these problems (see the sketch after these entries).
611
Indicates that an error occurred when the crawler retrieved a document.
612
Indicates that an error occurred when the crawler attempted to connect to a Web server.
613
Indicates that a timeout occurred while the crawler was retrieving a document.
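For example, the following Python sketch (an illustration only, not the product's code) shows one way that a fetcher could map these network failures onto the 61x codes:

    import socket
    import urllib.error
    import urllib.request

    def fetch_status(url, timeout=10):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                resp.read()
                return resp.status              # standard HTTP status code
        except urllib.error.HTTPError as exc:
            return exc.code                     # 4xx or 5xx from the server
        except urllib.error.URLError as exc:
            if isinstance(exc.reason, socket.timeout):
                return 613                      # read timeout
            if isinstance(exc.reason, ConnectionRefusedError):
                return 612                      # connect error
            return 611                          # other read error
        except socket.timeout:
            return 613                          # read timed out mid-response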
614
Indicates that the crawler is unable to crawl secure (HTTPS) sites. If you believe that these sites should be accessible, verify that the certificates are set up correctly on the crawler server and on the target Web server. For example, if a site is certified by a certificate authority (CA) that is not yet recognized, you can add the new CA to the trust store that is used by the crawler.

Also look at how self-signed certificates are configured on the sites that you are trying to crawl. The crawler is configured to accept self-signed certificates. Some sites create a self-signed certificate for a root URL (such as http://sales.ibm.com/), and then try to use that certificate on subdomains (such as http://internal.sales.ibm.com/). The crawler cannot accept certificates that are used in this manner. It accepts self-signed certificates only if the domain name of the subject (sales.ibm.com) and the signer of the certificate match the domain name of the page that is being requested.
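To diagnose a 614 outside the crawler, you can attempt the same TLS handshake yourself. The following Python sketch is illustrative only; the host name is an example, and the commented-out line marks where a newly trusted CA could be loaded:

    import socket
    import ssl

    host = "sales.ibm.com"                      # example host
    ctx = ssl.create_default_context()          # uses the system trust store
    # ctx.load_verify_locations("new-ca.pem")   # a CA that you choose to trust

    with socket.create_connection((host, 443), timeout=10) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            # Success means the handshake and host-name check both passed.
            print(tls.getpeercert()["subject"])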

616
Indicates that the login form for form-based authentication (FBA) still appears in the download after reauthentication.

If the information provided in the FBA configuration file (login form, plus authentication data such as the user name, password, and so on) fails to authenticate the crawler, status code 616 is assigned to all pages dependent on the form-based authentication. The administrator should investigate to find out why the FBA configuration is not working.
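One plausible detection of this condition, shown here as an assumption for illustration rather than the product's actual check, is to look for the login form in the page that was downloaded after authentication:

    # Hypothetical heuristic: if the page fetched after logging in still
    # contains the login form, treat the download as an FBA anomaly (616).
    def is_fba_anomaly(html: str, form_action: str = "/login") -> bool:
        lowered = html.lower()
        return "<form" in lowered and form_action.lower() in lowered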

617
Indicates that the crawler could not create a string from a document's byte content, either because the encoding name (charset) is invalid or because the document contains bytes that are not valid for that encoding.
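Both failure modes are easy to reproduce. The following Python sketch shows an invalid byte sequence and an unknown charset name:

    data = b"\xff\xfe not valid UTF-8"

    try:
        data.decode("utf-8")            # bytes invalid for the declared charset
    except UnicodeDecodeError as exc:
        print("invalid bytes:", exc)

    try:
        data.decode("no-such-charset")  # the charset name itself is invalid
    except LookupError as exc:
        print("unknown encoding:", exc)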
618
Indicates that the crawler received one of the following HTTP status codes, but the redirect URL is not valid, possibly because the Location header of the HTTP response is not valid:
301 Moved Permanently
302 Found
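A sketch of such a check, using only the Python standard library (the host and path are examples; http.client does not follow redirects, so the raw 301 or 302 response is visible):

    import http.client

    conn = http.client.HTTPConnection("example.com", timeout=10)
    conn.request("GET", "/old-page")
    resp = conn.getresponse()

    if resp.status in (301, 302):
        location = resp.getheader("Location")
        if not location:
            print("618: redirect with no redirect URL")
        else:
            print("follow redirect to", location)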
680
Indicates that the crawler was not able to obtain IP addresses for hosts in the crawl space, perhaps because of network access problems. This type of error means that the crawler cannot crawl entire sites, not just individual URLs. A large number of these status codes greatly reduces throughput.
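The failing step is the name resolution that precedes every connection. For example (the host name is deliberately unresolvable):

    import socket

    try:
        socket.getaddrinfo("no-such-host.invalid", 80)
    except socket.gaierror as exc:
        # This failure corresponds to code 680 and blocks the whole site.
        print("680: DNS lookup failure:", exc)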
69x status codes
Status codes 690 through 699 are never recorded in the crawler's persistent database. These codes do not reflect the true outcome of a download from a remote host, but rather a temporary condition inside the crawler, such as one component shutting down while another is waiting for or sending a result. These status codes appear in some logs, but not in the persistent record, and so should not be used as selection-set values.
7xx status codes
The 7xx codes are mostly due to rules in the crawl space:
710 - 730
Indicate that problems prevented the crawler from doing a complete download, or that the crawler encountered invalid HTML data at a site. If you see a large number of these types of status codes, contact your support representative for assistance.
740 or 4044
Indicate that the content of a file cannot be indexed because the document is excluded by restrictions in the site's robots.txt file.
740
Indicates that anchor links that point to the excluded document can be included in the index.
4044
Indicates that the anchor links in documents that point to the excluded document are also excluded from the index.
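The following Python sketch shows the kind of robots.txt check that produces these codes; the URLs and user agent name are examples:

    from urllib import robotparser

    rp = robotparser.RobotFileParser("http://example.com/robots.txt")
    rp.read()

    if not rp.can_fetch("MyCrawler", "http://example.com/private/report.html"):
        # Maps to 740 or 4044, depending on how anchor links are handled.
        print("excluded by robots.txt")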
741
Indicates that a site has a robots.txt file that allows the crawl, but the download failed. If the crawler is repeatedly unable to crawl the URL, the URL is removed from the crawl space. If you see a large number of this type of status code, check whether the target site is temporarily or permanently unavailable. If the target site is no longer available, remove it from the crawl space.
The remaining 7xx status codes mostly occur when you make changes to the crawl space after the crawler has been running for a while. These status codes typically do not indicate problems that you need to address.
3020
Indicates that a document that is returned with status code 200 contains a Location header that refers the user agent to another URL.
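A sketch of how such a response might be recognized (the host and path are examples):

    import http.client

    conn = http.client.HTTPConnection("example.com", timeout=10)
    conn.request("GET", "/page")
    resp = conn.getresponse()

    if resp.status == 200 and resp.getheader("Location"):
        # A 200 response that nevertheless redirects: a soft redirect (3020).
        print("3020: soft redirect to", resp.getheader("Location"))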