On this page:
- Where to find the crawl status
- What the codes mean
- What to do if you see an error code
Where to find the crawl status
Each crawl's Seeds report includes a table in which each seed is listed with an accompanying status. This status indicates whether the seed was crawled successfully, redirected (and then crawled), or not crawled because of robots.txt exclusions or other crawling obstacles. When the crawler encounters a specific, known error, a corresponding code accompanies the crawl status "Not Crawled."
What the codes mean
The status codes that may appear next to the seeds in your Seeds report are listed and explained below. Some of these codes are specific to our crawler, while others are general HTTP response codes used universally on the Web:
Heritrix web crawler error codes
- -1 - DNS lookup failed
- -2 - HTTP connection to site failed
- -3 - HTTP connection to site broken
- -4 - HTTP timeout (before any meaningful response received)
- -5 - Unexpected runtime exception; ask web archivist to check runtime error log
- -6 - Prerequisite domain-lookup failed, so site could not be crawled
- -7 - URI recognized as unsupported or illegal
- -8 - Multiple retries all failed, retry limit reached
- -50 - Temporary status assigned to URIs awaiting preconditions; contact web archivist for more information
- -60 - Failure status assigned to URIs which could not be queued by the Frontier (and may in fact be unfetchable)
- -61 - Prerequisite robots.txt check failed, so site could not be crawled. This likely means that the site is blocking our crawler's IP range, so we cannot access the site to check its robots.txt file and begin capturing content.
- -404 - Empty HTTP response interpreted for Heritrix's purposes as a 404 error
- -5000 / -5001 - Seed is out of scope. Check any host constraint rules you've added to your collection to see if any are blocking your seed.
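The crawler codes listed above can be kept in a small lookup table, which is handy when reviewing many seeds at once. The following is only a minimal sketch: the names `CRAWLER_CODES` and `describe_crawler_code` are hypothetical and not part of Heritrix or any reporting tool, and the descriptions are shortened from the list above.

```python
# Hypothetical lookup table built from the crawler codes listed above.
# This is an illustration only, not an official Heritrix or reporting API.
CRAWLER_CODES = {
    -1: "DNS lookup failed",
    -2: "HTTP connection to site failed",
    -3: "HTTP connection to site broken",
    -4: "HTTP timeout (before any meaningful response received)",
    -5: "Unexpected runtime exception",
    -6: "Prerequisite domain-lookup failed",
    -7: "URI recognized as unsupported or illegal",
    -8: "Multiple retries all failed, retry limit reached",
    -50: "Temporary status: URI awaiting preconditions",
    -60: "URI could not be queued by the Frontier",
    -61: "Prerequisite robots.txt check failed",
    -404: "Empty HTTP response interpreted as a 404 error",
    -5000: "Seed is out of scope",
    -5001: "Seed is out of scope",
}


def describe_crawler_code(code: int) -> str:
    """Return the short description for a crawler code, or a fallback."""
    return CRAWLER_CODES.get(code, "Unknown crawler code")


if __name__ == "__main__":
    # Example: look up a few codes as they might appear in a Seeds report.
    for code in (-1, -61, -5000, -999):
        print(code, "-", describe_crawler_code(code))
```

Unrecognized codes fall back to a generic message rather than raising an error, on the assumption that a report may contain codes not covered here.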
General HTTP response codes
- 400 - Bad Request: The request had bad syntax or was inherently impossible to be satisfied.
- 401 - Unauthorized: Access to the requested URL requires log-in credentials.
- 403 - Forbidden: Access to the URL requested has been completely forbidden by the responding server.
- 404 - Not Found: The server has not found anything matching the URL requested.
- 500 - Internal Server Error: The server encountered an unexpected problem which prevented it from serving the requested URL.
- 501 - Not Implemented: The server does not support the facility required.
- 502 - Bad Gateway: The server, acting as a gateway or proxy, received an invalid response from the upstream server.
- 503 - Service Unavailable: The server cannot process the request due to a high load (whether HTTP servicing or other requests) or maintenance. This could be a temporary issue.
- 504 - Gateway Timeout: The server, acting as a gateway or proxy, did not receive a response from the upstream server within the time that it was prepared to wait.
You can learn more about standard HTTP response codes at w3.org.
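To tell the two families of codes apart quickly, note that the crawler-specific codes above are all negative, while the HTTP response codes are positive: by HTTP convention, codes in the 400s are client errors and codes in the 500s are server errors. A minimal sketch of that rule, assuming a hypothetical helper named `classify_status` (not part of any crawler or reporting tool):

```python
def classify_status(code: int) -> str:
    """Roughly classify a status code from a Seeds report.

    Assumption: negative codes come from the crawler itself, while
    positive codes are standard HTTP response codes.
    """
    if code < 0:
        return "crawler error"
    if 400 <= code <= 499:
        return "HTTP client error"
    if 500 <= code <= 599:
        return "HTTP server error"
    return "other"


if __name__ == "__main__":
    # Example: classify a mix of codes from the two lists above.
    for code in (-61, 403, 404, 500):
        print(code, "-", classify_status(code))
```

The split is only a first pass: it shows whether the problem arose in the crawler or was reported by the site's server, not what caused it.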
What to do if you see an error code
If you see an error code in your Seeds report and have any questions about it, please contact a Web Archivist.