HTTP status codes returned to the Web crawler
When you monitor a Web crawler, you can view information about the HTTP status codes that the crawler receives from the pages that it attempts to crawl.
When you monitor the Web crawler history, or monitor the status of a specific URL, you can see information about the HTTP status codes that were returned to the crawler. You can use this information to manage the crawl space and optimize crawler performance. For example, if the crawler receives a large number of HTTP status codes for a URL, and the status codes indicate that pages at that location cannot be crawled, you can improve performance by removing that URL from the crawl space.
The following table lists the HTTP status codes and how the Web crawler interprets them. Values from 100 to 505 are standard HTTP status codes (see the Hypertext Transfer Protocol standard for more information). The remaining HTTP status codes are proprietary to Watson Explorer Content Analytics and the Web crawler.
Code | Description | Code | Description | Code | Description | Code | Description |
---|---|---|---|---|---|---|---|
NULL | Uncrawled | 400 | Bad Request | 500 | Internal server error | 693 | Select fail (URLFetcher) |
100 | Continue | 401 | Unauthorized | 501 | Not implemented | 694 | Write error (URLFetcher) |
101 | Switching protocols | 402 | Payment required | 502 | Bad gateway | 695 | Incomplete block header (URLFetcher) |
200 | Successful | 403 | Forbidden | 503 | Service unavailable | 699 | Unexpected error (URLFetcher) |
201 | Created | 404 | Not found | 504 | Gateway timeout | 700 | Parse error (no header end) |
202 | Accepted | 405 | Method not allowed | 505 | HTTP version not supported | 710 | Parse error (header) |
203 | Non-authoritative information | 406 | Not acceptable | 611 | Read error | 720 | Parse error (no HTTP code) |
204 | No content | 407 | Proxy authentication required | 612 | Connect error | 730 | Parse error (body) |
205 | Reset content | 408 | Request timeout | 613 | Read timeout | 740 or 4044 | Excluded by robots.txt file |
206 | Partial content | 409 | Conflict | 614 | SSL handshake failed | 741 | Robots temporarily unavailable |
300 | Multiple choices | 410 | Gone | 615 | Other read error | 760 | Excluded by crawl space definition |
301 | Moved permanently | 411 | Length required | 616 | FBA anomaly | 761 | Disallowed by local crawl space; allowed by global |
302 | Found | 412 | Precondition failed | 617 | Encoding error | 770 | Bad protocol or nonstandard system port |
303 | See other | 413 | Request entity too large | 618 | Redirect with no redirect URL | 780 | Excluded by file type exclusions |
304 | Not modified | 414 | Request URI is too long | 680 | DNS lookup failure | 786 | Invalid URL |
305 | Use proxy | 415 | Unsupported media type | 690 | Malformed URL | 2004 | No index META tag |
306 | (Unused) | 416 | Requested range not satisfiable | 691 | Lost connection (URLFetcher) | 3020 | Soft redirect |
307 | Temporary redirect | 417 | Expectation failed | 692 | Write timeout (URLFetcher) | | |
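The ranges in the table can be summarized in a small helper. The following sketch is illustrative only (the function name and category labels are assumptions, not part of the product): it distinguishes standard HTTP status codes from the crawler-specific ranges that the notes below describe.

```python
def classify_crawler_code(code):
    """Classify a crawler status code into the ranges used in the table.

    Hypothetical helper for illustration; the range boundaries follow
    the table above, the category names are made up.
    """
    if code is None:                    # NULL: page not yet crawled
        return "uncrawled"
    if 100 <= code <= 505:              # standard HTTP status codes
        return "standard-http"
    if 611 <= code <= 618 or code in (680, 690):
        return "download-error"         # network, DNS, and URL problems
    if 691 <= code <= 699:              # temporary internal conditions;
        return "internal-transient"     # never recorded persistently
    if 700 <= code <= 786:              # parse errors and crawl-space rules
        return "parse-or-crawl-space"
    return "other"                      # e.g. 2004, 3020, 4044

print(classify_crawler_code(404))   # standard-http
print(classify_crawler_code(740))   # parse-or-crawl-space
```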
Table notes
- 4xx status codes
- You will rarely see a 400 (Bad Request) code. According to the HTTP status code standard, 4xx codes are supposed to indicate an error on the part of the client (the crawler). However, the problem is usually at the server or in the URL that the crawler received as a link. For example, some Web servers do not tolerate URLs that try to navigate up from the site root (such as http://xyz.ibm.com/../../sales). Other Web servers have no problem with this upward navigation and simply ignore the parent directory operator (..) when the crawler is already at the root.
Some servers treat a request for the site root as an error, and some obsolete links might request operations that are no longer recognized or implemented. When asked for a page that it no longer serves, the application server throws an exception, which causes the Web server to return the HTTP status code 400 because the request is no longer considered valid.
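To see how a client that follows RFC 3986 treats upward navigation past the site root, you can resolve such a link with Python's standard library. The excess parent directory operators are discarded during resolution; a server that instead takes the path literally may answer with 400. (The URL is the example from the text; this is a general illustration, not product code.)

```python
from urllib.parse import urljoin

# RFC 3986 path resolution drops ".." segments that would climb above
# the site root, so the link from the example resolves cleanly.
resolved = urljoin("http://xyz.ibm.com/", "../../sales")
print(resolved)  # http://xyz.ibm.com/sales
```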
- 615
- Indicates that the crawler server that downloads data from Web sites encountered an unexpected exception. A large number of this type of status code might indicate that there is a problem with the crawler.
- 61x status codes
- Except for 615, the 61x status codes indicate problems that can
be expected in crawling, such as timing out. The following status
codes might require corrective action:
- 611, 612, and 613
- Slow sites or poor network performance might be the cause of these
problems.
- 611
- Indicates that an error occurred when the crawler retrieved a document.
- 612
- Indicates that an error occurred when the crawler attempted to connect to a Web server.
- 613
- Indicates that a timeout occurred while the crawler was retrieving a document.
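Because 611, 612, and 613 often reflect transient conditions such as slow sites, one common pattern is to retry the download with backoff before giving up. The sketch below is a generic illustration, not part of the product: the code values come from the table above, while `fetch_with_retry`, its `fetch` callable, and the retry policy are assumptions.

```python
import time

TRANSIENT_CODES = {611, 612, 613}   # read error, connect error, read timeout

def fetch_with_retry(fetch, url, attempts=3, base_delay=1.0):
    """Retry a download whose transient failures map to codes 611-613.

    `fetch` is a hypothetical callable returning (status_code, body).
    """
    for attempt in range(attempts):
        code, body = fetch(url)
        if code not in TRANSIENT_CODES:
            return code, body            # success, or a non-transient error
        if attempt < attempts - 1:
            time.sleep(base_delay * (2 ** attempt))  # exponential backoff
    return code, body                    # still failing after all attempts
```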
- 614
- Indicates that the crawler is unable to crawl secure (HTTPS) sites. If you believe that these sites should be accessible, verify that the certificates are set up correctly on the crawler server and on the target Web server. For example, if a site is certified by a certificate authority (CA) that the crawler does not yet recognize, you can add that CA to the trust store that is used by the crawler.
Also look at how self-signed certificates are configured on the sites that you are trying to crawl. The crawler is configured to accept self-signed certificates. Some sites create a self-signed certificate for a root URL (such as http://sales.ibm.com/), and then try to use that certificate on subdomains (such as http://internal.sales.ibm.com/). The crawler cannot accept certificates that are used in this manner. It accepts self-signed certificates only if the domain name of the subject (sales.ibm.com) and the signer of the certificate match the domain name of the page that is being requested.
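The matching rule for self-signed certificates can be made concrete with a tiny check. This is a hypothetical helper, not the crawler's code: it compares the certificate subject's common name with the host being requested, which is why a certificate created for sales.ibm.com is rejected on internal.sales.ibm.com.

```python
def self_signed_cert_acceptable(cert_subject_cn, requested_host):
    """Illustrates the matching rule described above (hypothetical helper).

    A self-signed certificate is acceptable only when the subject's
    common name matches the host of the page being requested; reusing
    a root site's certificate on a subdomain does not match.
    """
    return cert_subject_cn == requested_host

# A certificate whose subject is sales.ibm.com:
print(self_signed_cert_acceptable("sales.ibm.com", "sales.ibm.com"))           # True
print(self_signed_cert_acceptable("sales.ibm.com", "internal.sales.ibm.com"))  # False
```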
- 616
- Indicates that the login form for form-based authentication (FBA)
still appears in the download after reauthentication.
If the information provided in the FBA configuration file (login form, plus authentication data such as the user name, password, and so on) fails to authenticate the crawler, status code 616 is assigned to all pages dependent on the form-based authentication. The administrator should investigate to find out why the FBA configuration is not working.
- 617
- Indicates the inability to create a String from a document's byte content because the encoding string (charset) is invalid or the document contains invalid bytes.
- 618
- Indicates that the redirect URL is not valid when the crawler receives one of the following HTTP status codes. It is possible that the Location field of the HTTP response header is not valid.
301 Moved Permanently
302 Found
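The check behind status 618 can be pictured as follows. This is an illustrative sketch, not the crawler's actual logic: given a 301 or 302 response, it verifies that a usable, absolute URL is present in the Location header and reports 618 otherwise. (`redirect_status` and the plain-dict header handling are assumptions.)

```python
from urllib.parse import urlparse

def redirect_status(http_code, headers):
    """Map a redirect response to 618 when its Location header is unusable.

    `headers` is a plain dict of response headers (simplified).
    """
    if http_code not in (301, 302):
        return http_code
    location = headers.get("Location", "").strip()
    parsed = urlparse(location)
    if not location or not parsed.scheme or not parsed.netloc:
        return 618           # redirect with no (valid) redirect URL
    return http_code

print(redirect_status(301, {"Location": "http://example.com/new"}))  # 301
print(redirect_status(302, {}))                                      # 618
```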
- 680
- Indicates that the crawler was not able to obtain IP addresses for hosts in the crawl space, perhaps because of network access problems. This type of error means that the crawler is not able to crawl entire sites, not just that it was unable to crawl some URLs. A large number of this type of status code greatly reduces throughput.
- 69x status codes
- Status codes 690 through 699 are never recorded in the crawler's persistent database. These codes represent outcomes that do not reflect the true outcome of a download from a remote host, but rather a temporary condition inside the crawler, such as one component that shuts down while another is waiting for a result or sending a result. These status codes appear in some logs, but not in the persistent record, and so should not be used as selection-set values.
- 7xx status codes
- The 7xx codes are mostly due to rules in the crawl space:
- 710 - 730
- Indicate that problems prevented the crawler from doing a complete download, or that the crawler encountered invalid HTML data at a site. If you see a large number of these types of status codes, contact your support representative for assistance.
- 740 or 4044
- Indicate that the content of a file cannot be indexed because
the document is excluded by restrictions in the site's robots.txt file.
- 740
- Indicates that anchor links that point to the excluded document can be included in the index.
- 4044
- Indicates that the anchor links in documents that point to the excluded document are also excluded from the index.
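The exclusion behind both 740 and 4044 comes from the site's robots.txt rules. Python's standard urllib.robotparser can show how such a rule is evaluated; the robots.txt lines and URLs below are made up for illustration and are not tied to the product.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt that disallows one directory for all agents.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

print(rp.can_fetch("*", "http://example.com/public/page.html"))   # True
print(rp.can_fetch("*", "http://example.com/private/page.html"))  # False
```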
- 741
- Indicates that a site has a robots.txt file that allows the crawl, but the download failed. If the crawler is repeatedly unable to crawl the URL, the URL is removed from the crawl space. If you see a large number of this type of status code, check whether the target site is temporarily or permanently unavailable. If the target site is no longer available, remove it from the crawl space.
- 3020
- Indicates that a document with status code 200 contains a location header that refers the user agent to another URL.
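A soft redirect can be detected with a check like the following. This is an illustrative helper with simplified header handling, not product code: a response that reports 200 yet carries a Location header is flagged as 3020.

```python
def effective_status(http_code, headers):
    """Flag a 200 response that carries a Location header as a soft
    redirect (3020). Illustrative helper; `headers` is a plain dict."""
    if http_code == 200 and headers.get("Location"):
        return 3020
    return http_code

print(effective_status(200, {"Location": "http://example.com/moved"}))  # 3020
print(effective_status(200, {}))                                        # 200
```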