The Seed Status column in a Seeds Report gives you information about how Archive-It’s crawlers were able to interact with a seed during that crawl.
Seed Statuses
The following are codes you may see, what they mean, and what (if any) actions you may want to take.
Crawled
What does it mean?
The crawler was able to access and crawl the site.
What should you do?
Proceed with your regular QA steps.
Crawled (Error ###)
What does it mean?
The crawler accessed a page that returned an error, but the error didn’t prevent the crawler from accessing more of the site.
What should you do?
Check the Wayback capture generated by this crawl. If it looks OK, you don't need to take any additional action. Crawls can generate more than one capture of a given URL/document, so it's possible that the document returned an error to the crawler for one capture but not for another.
If any of your Wayback captures resolve to an error page, you can check a document's CDX index to see if any captures were made with an error status code (see https://support.archive-it.org/hc/en-us/articles/115001790023#howitworks).
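To spot-check this outside the web application, you can query the CDX index directly. Below is a minimal Python sketch; the endpoint form, the collection ID 1234, and the document URL are all assumptions for illustration (see the linked article for the exact query form your collection supports), and the field positions assume the common 11-field CDX layout.

    import urllib.request

    # Hypothetical endpoint: substitute your collection ID and document URL.
    CDX_URL = ("https://wayback.archive-it.org/1234/timemap/cdx"
               "?url=http://example.org/page")

    with urllib.request.urlopen(CDX_URL, timeout=30) as resp:
        for line in resp.read().decode("utf-8").splitlines():
            fields = line.split()
            if len(fields) < 5:
                continue
            # In the common CDX layout, field 2 is the capture timestamp
            # and field 5 is the HTTP status code returned to the crawler.
            timestamp, status = fields[1], fields[4]
            marker = "" if status.startswith("2") else "  <-- check this capture"
            print(timestamp, status, marker)

A capture listed with a 2xx status should replay normally even if another capture of the same document returned an error.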
Redirected
What does it mean?
The crawler followed the same redirect that a browser would follow when loading this seed URL. It may, for example, have been redirected from http to https, or from the URL with or without a trailing slash. This may also indicate that the live site now resolves to a new URL. In general, the crawlers and Wayback should follow these redirects and collect content from the redirected URL.
What should you do?
Most redirects aren't cause for concern. Check the Wayback capture that the crawl generated. If it looks OK, you probably don't need to change anything.
You may need to change your seed URL if:
- Wayback captures from crawls in which your seed returned a Redirected status don’t work or replay poorly.
- The seed URL redirects to a completely different URL that isn’t being crawled.
If you'd like help figuring out the best seed URL to use, please contact the Archive-It team.
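Before changing a seed URL, it can help to see the full redirect chain for yourself. Here is a minimal sketch using Python's third-party requests library; http://example.org/ is a hypothetical seed URL, so substitute your own.

    import requests

    resp = requests.get("http://example.org/", timeout=30)
    for hop in resp.history:
        # Each hop is a redirect response; Location says where it pointed.
        print(hop.status_code, hop.url, "->", hop.headers.get("Location"))
    print("Final:", resp.status_code, resp.url)

If the final URL lands on a different host than your seed, that new URL may be the better seed to crawl.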
Not Crawled (Blocked by robots.txt)
What does it mean?
The site you're trying to crawl has a robots.txt exclusion in place that prevents Archive-It crawlers from accessing it entirely.
What can you do?
You can add an Ignore robots.txt rule to that seed or to the seed URL’s host at the collection level.
If using Ignore robots.txt rules isn't an option, you will need to contact the site owner and ask them to allow Archive-It crawlers to access the site. Contact the Archive-It team for more information.
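To confirm that robots.txt is what's blocking the crawler, you can test the seed against the site's robots.txt with Python's standard library. This is a sketch only: the seed URL is hypothetical, and the user-agent string is an assumption about how Archive-It's crawler identifies itself.

    from urllib.parse import urlsplit
    from urllib.robotparser import RobotFileParser

    SEED = "https://example.org/some/page/"  # hypothetical seed URL
    UA = "archive.org_bot"                   # assumed Archive-It crawler user agent

    # robots.txt always lives at the root of the seed's host.
    parts = urlsplit(SEED)
    rp = RobotFileParser(f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()
    print(f"Can {UA} fetch {SEED}?", rp.can_fetch(UA, SEED))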
Not Crawled (Queued)
What does it mean?
The crawl hit one of its limits (time, data, or documents) before the crawler could begin crawling the seed.
What can you do?
Start a new crawl with a longer time limit, higher data/document limits, or fewer seeds, so that the crawler has enough time to reach this seed.
Not Crawled (Error ###)
What does it mean?
The crawler encountered an error that prevented it from crawling the seed URL.
See the list of error codes below for additional context.
What can you do?
In general, we recommend the following steps:
- Check to make sure the site is available on the live web (a quick availability check is sketched after this list).
- Assuming it is available, consider the following when starting a new crawl:
  - If it's an option and you haven't already, add an Ignore robots.txt rule to the seed.
  - Use a different crawling technology.
If, after going through these steps, your new crawl returns the same error, please contact the Archive-It team for more help.
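For the first step above, a quick scripted check can tell you whether the site responds at all and with what status. A minimal Python sketch (the seed URL is hypothetical):

    import urllib.request
    from urllib.error import HTTPError, URLError

    SEED = "https://example.org/"  # hypothetical seed URL

    try:
        with urllib.request.urlopen(SEED, timeout=30) as resp:
            print("Live web OK:", resp.status, resp.geturl())
    except HTTPError as err:
        # The server answered, but with an error status (e.g., 403, 404, 503).
        print("Server returned an error:", err.code, err.reason)
    except URLError as err:
        # No HTTP response at all: DNS failure, refused connection, timeout.
        # These roughly correspond to Heritrix's negative codes listed below.
        print("Could not reach the site:", err.reason)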
Error codes
HTTP Status codes
- 401 - Unauthorized: Access to the requested URL requires log-in credentials.
- 403 - Forbidden: Access to the requested URL has been completely forbidden by the responding server.
- 404 - Not Found: The server has not found anything matching the URL requested.
- 406 - Not Acceptable: The server cannot produce a response matching the list of acceptable values defined in the request's proactive content negotiation headers.
- 426 - Upgrade Required: The server refuses to perform the request using the current protocol, but might be willing to do so after the client upgrades to a different protocol.
- 429 - Too Many Requests: The client has sent too many requests in a given amount of time ("rate limiting").
- 501 - Not Implemented: The server does not support the functionality required to fulfill the request.
- 502 - Bad Gateway: The server, acting as a gateway or proxy, received an invalid response from an upstream server.
- 503 - Service Unavailable: The server cannot process the request due to a high load or temporary maintenance. This could be a temporary issue.
- 999 - Request Denied: A non-standard code used by some social media platforms to refuse automated requests.
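If you want the standard name and description for any of these codes while reviewing a report, Python's standard library includes them (999 is non-standard, so it has no entry):

    from http import HTTPStatus

    for code in (401, 403, 404, 406, 426, 429, 501, 502, 503):
        status = HTTPStatus(code)
        print(code, status.phrase, "-", status.description)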
Heritrix response codes
- -1 - DNS lookup failed.
- -2 - HTTP connection to site failed.
- -3 - HTTP connection to site broken.
- -4 - HTTP timeout (before any meaningful response received).
- -5 - Unexpected runtime exception.
- -6 - Prerequisite domain-lookup failed, so site could not be crawled.
- -7 - URI recognized as unsupported or illegal.
- -8 - Multiple retries all failed; retry limit reached.
- -50 - Temporary status assigned to URIs awaiting preconditions.
- -60 - Failure status assigned to URIs that could not be queued by the crawler (and may be unfetchable).
- -61 - Prerequisite robots.txt check failed, so site could not be crawled.
- -404 - Empty HTTP response interpreted by Heritrix as a 404 error.
You can learn more about standard HTTP response codes at w3.org, and more about Heritrix crawler response codes in the Heritrix documentation.
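If you download a seed report for a large crawl, you can triage it with a short script instead of scanning by eye. Below is a Python sketch under loud assumptions: the file name and the "seed" and "status" column names are hypothetical, so adjust them to match your actual export's headers.

    import csv
    import re

    with open("seed-report.csv", newline="") as f:
        for row in csv.DictReader(f):
            status = row["status"]  # e.g. "Crawled" or "Not Crawled (Error 404)"
            # Pull the numeric code out of statuses like "Crawled (Error 503)".
            match = re.search(r"Error (-?\d+)", status)
            if match:
                print(row["seed"], "->", match.group(1))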