Overview
While we continuously investigate and implement capture improvements, some websites are not created in a way that is "archive-friendly" and can be difficult to capture or replay in their entirety. These difficulties affect all web crawlers, not just ours. When selecting seed URLs and reviewing your archived content, please keep these limitations in mind.
For more information on what makes sites archive-friendly, see the in-depth guide available from Stanford University Libraries.
On this page:
- About
- Troubleshooting
- Outcome
About
A webmaster can use a robots.txt exclusion to prevent certain content from being crawled. The Archive-It crawlers respect all robots.txt exclusions by default.
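As an illustration only (the directory name and the "SomeBot" user-agent below are hypothetical, not taken from any particular site), a webmaster who wanted to keep all crawlers out of one directory and block a single crawler entirely might publish a robots.txt file like this at the root of the site:

    # Keep every crawler out of the /private/ directory
    User-agent: *
    Disallow: /private/

    # Block one specific crawler from the whole site
    User-agent: SomeBot
    Disallow: /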
Troubleshooting
How to determine if your crawl is blocked by a robots.txt file:
- To see if an entire site you wish to crawl is being blocked, check your seed site for a robots.txt exclusion file before you crawl (a quick way to do this is sketched after this list).
- To check whether your crawl was blocked by a robots.txt file after it runs, check your seed status report once the crawl is complete.
- To check if part of your website or embedded content is blocked, please check your Hosts report.
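If you want to test a seed before crawling, the sketch below uses Python's standard urllib.robotparser module to ask whether a given URL is disallowed for our crawler's user-agent. The seed URL shown is a placeholder; this is a convenience check you can run yourself, not a feature of the Archive-It application.

    # Check whether a seed URL is blocked by robots.txt for archive.org_bot.
    from urllib import robotparser
    from urllib.parse import urljoin

    seed = "https://www.example.com/collection/"   # hypothetical seed URL; substitute your own
    robots_url = urljoin(seed, "/robots.txt")      # robots.txt always lives at the site root

    parser = robotparser.RobotFileParser()
    parser.set_url(robots_url)
    parser.read()                                  # fetch and parse the live robots.txt

    if parser.can_fetch("archive.org_bot", seed):
        print("archive.org_bot is allowed to crawl", seed)
    else:
        print("archive.org_bot is blocked by", robots_url)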
Outcome
If you wish to crawl a site blocked by robots.txt, we encourage you to contact the webmaster of the blocked website and ask them to allow the Archive-It crawler. The name (user-agent) of our crawler is archive.org_bot. There is also an Archive-It feature that allows users to override robots.txt blocks; it can be enabled upon request by submitting a support ticket.
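If the webmaster agrees, one common approach (shown here only as an illustrative sketch; the exact rules depend on how the site's existing robots.txt is written) is to add an explicit group for the archive.org_bot user-agent alongside any blanket exclusion, since crawlers follow the most specific user-agent group that matches them:

    # Allow the Archive-It crawler everywhere (an empty Disallow permits all paths)
    User-agent: archive.org_bot
    Disallow:

    # All other crawlers remain blocked from the whole site
    User-agent: *
    Disallow: /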
Related content
Read your crawl's seeds report