Where to find a seed's status
A crawl's seeds report lists each seed in the crawl along with a seed status, which indicates whether the seed was crawled, redirected, or not crawled. When a seed is not crawled, the report notes why; if, for example, it was blocked by robots.txt, the seed status will read "Blocked." When the crawler encounters a specific, known error, a corresponding code accompanies the "Not Crawled" status. Some of these codes are HTTP statuses, while others are Heritrix error codes; they can reflect what is happening on the live web, something that occurred during the crawling process, or an issue with Wayback.
What the codes mean
The status codes that may appear next to the seeds in your Seeds report are listed and explained below. Some of these codes are specific to our crawler, while others are general HTTP response codes used universally on the Web:
Live web issue
Sometimes Heritrix has difficulty connecting to a site on the live web. In these cases, the first step is to confirm that the seed is still accessible on the live web (see the sketch after the list below). If it is, recrawling the seed will often succeed.
- 404 - Not found: The server has not found anything matching the URL requested.
- -404 - Empty HTTP response interpreted by Heritrix as a 404 error
- 501 - Not implemented: The server does not support the functionality required to fulfill the request.
- 502 - Bad gateway: The server, while acting as a gateway or proxy, received an invalid response from the upstream server. This could be a temporary issue.
- 503 - Service unavailable: The server cannot process the request, typically because it is temporarily overloaded or down for maintenance. Recrawling later may succeed.
- -1 - DNS lookup failed
- -2 - HTTP connection to site failed
- -3 - HTTP connection to site broken
- -4 - HTTP timeout (before any meaningful response received)
- -5 - Unexpected runtime exception; ask your web archivist to check the runtime error log
- -6 - Prerequisite domain-lookup failed, so site could not be crawled
- -7 - URI recognized as unsupported or illegal
- -8 - Multiple retries all failed, retry limit reached
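If you want to verify a seed's availability yourself before requesting a recrawl, a short script can reproduce the two most common live-web checks. Below is a minimal Python sketch (not part of our crawler; the seed URL is a placeholder) that performs a DNS lookup and an HTTP request, roughly mirroring the -1 and HTTP-status cases above:

```python
# A minimal sketch for confirming that a seed is still reachable on the
# live web. This is an illustration only, not our crawler's logic; the
# seed URL below is a placeholder to replace with your own.
import socket
import urllib.error
import urllib.request
from urllib.parse import urlparse

def check_seed(seed_url: str) -> None:
    host = urlparse(seed_url).hostname
    # A failed DNS lookup on the live web mirrors Heritrix status -1.
    try:
        socket.gethostbyname(host)
    except socket.gaierror:
        print(f"{seed_url}: DNS lookup failed (compare Heritrix -1)")
        return
    # An HTTP error code such as 404 or 503 mirrors the code shown in
    # the seeds report.
    try:
        with urllib.request.urlopen(seed_url, timeout=30) as response:
            print(f"{seed_url}: HTTP {response.status}")
    except urllib.error.HTTPError as err:
        print(f"{seed_url}: HTTP {err.code}")
    except urllib.error.URLError as err:
        print(f"{seed_url}: connection failed ({err.reason})")

check_seed("https://example.com/")
```

If the script reports a successful status such as 200 but the seeds report still shows an error, the problem more likely occurred during the crawling process than on the live web.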
Crawler blocked
Site owners sometimes restrict automated users from visiting parts of their websites for security, search engine indexing, or other reasons. In these cases, changing the seed scope to ignore robots.txt, or adding log-in credentials for password-protected content, may help. If these suggestions do not resolve the issue, it may be necessary to ask the site administrator to add our crawlers or IP ranges to their "Allow list." The following codes may indicate that our crawler is being blocked (a sketch for reproducing the robots.txt check follows this list):
- -61 - Prerequisite robots.txt check failed, so the site could not be crawled. This usually means the site is blocking our crawler's IP range, so we cannot access the site to check its robots.txt file and begin capturing content.
- 401 - Unauthorized: Access to the requested URL requires log-in credentials.
- 403 - Forbidden: Access to the URL requested has been completely forbidden by the responding server.
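To see whether a robots.txt exclusion is the cause, you can open the site's /robots.txt in a browser or reproduce the check with a short script. Below is a minimal Python sketch; "archive.org_bot" is used here only as an example user-agent string, and you would substitute the one your crawler actually sends:

```python
# A minimal sketch for checking how a site's robots.txt treats a crawler.
# The site URL and the "archive.org_bot" user-agent are placeholders.
import urllib.error
from urllib.robotparser import RobotFileParser

site = "https://example.com/"          # placeholder seed
parser = RobotFileParser(site + "robots.txt")
try:
    parser.read()                      # fetch and parse the robots.txt rules
except urllib.error.URLError:
    # The robots.txt file itself could not be fetched, which is the
    # situation the -61 status describes.
    print("robots.txt unreachable (compare Heritrix -61)")
else:
    if parser.can_fetch("archive.org_bot", site):
        print("robots.txt allows this user-agent")
    else:
        print("robots.txt blocks this user-agent")
```

If the file blocks the crawler, a scoping rule that ignores robots.txt may resolve the issue; if the file itself is unreachable, contacting the site administrator is usually the next step.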
Rare statuses
Other less common errors are sometimes reported by the crawler; these require a support ticket to be submitted for follow-up:
- -50 - Temporary status assigned to URIs awaiting preconditions
- -60 - Failure status assigned to URIs that could not be queued by the crawler (and may be unfetchable)
Further resources
You can learn more about standard HTTP response codes at w3.org and about Heritrix crawler status codes in the Heritrix documentation.