Overview
While we continuously investigate and implement improvements, some websites are not created in a way that is "archive-friendly" and can be difficult to collect or replay in their entirety. These difficulties affect all web crawlers, not just Archive-It's. When selecting seed URLs and reviewing your archived content, please keep these limitations in mind. For more information on what makes sites archive-friendly, see the Library of Congress's Creating Preservable Websites.
About
A webmaster can use a robots.txt exclusion to prevent certain content from being crawled. Archive-It's crawlers respect robots.txt exclusions by default.
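For illustration, a robots.txt file lives at the root of a site (for example, https://example.com/robots.txt) and lists which paths each crawler may fetch. A minimal sketch, with placeholder paths and crawler names:

```
# Block all crawlers from one directory
User-agent: *
Disallow: /private/

# Block one specific crawler from the entire site
User-agent: ExampleBot
Disallow: /
```

A crawler that honors these rules, as Archive-It's does by default, will skip the disallowed paths, so documents under them will be missing from your archive.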
Troubleshooting
How to determine if your crawl is blocked by a robots.txt file:
- To see whether an entire site you wish to crawl is blocked, check the site for a robots.txt exclusion file before you crawl (see the sketch after this list).
- To check whether your crawl was blocked by a robots.txt file, review your seed status in the crawl's Seeds report after the crawl completes.
- To check whether part of your website or embedded content is blocked, check your crawl's Hosts report.
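One way to check a site before crawling is Python's standard-library robots.txt parser. The sketch below assumes a placeholder seed (https://example.com) and tests whether Archive-It's user-agent may fetch a couple of sample URLs:

```python
from urllib import robotparser

# Load the site's robots.txt file; the seed URL here is a placeholder.
rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

# Archive-It's crawler identifies itself as "archive.org_bot".
for url in ("https://example.com/", "https://example.com/private/page.html"):
    allowed = rp.can_fetch("archive.org_bot", url)
    print(url, "->", "allowed" if allowed else "blocked by robots.txt")
```

You can also simply open https://example.com/robots.txt (substituting your seed's domain) in a browser and read the rules directly.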
Outcome
If you want to crawl a site blocked by robots, try contacting the site owner to let them know you’d like to archive their site. Archive-It’s crawlers are “polite”, meaning they should not impact the site's performance or security in any way. The name (user-agent) of our crawler is archive.org_bot.
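If the site owner agrees, one option they have is to allow the crawler explicitly in their robots.txt file. A minimal sketch of such an entry (an empty Disallow line permits everything for that user-agent):

```
# Permit Archive-It's crawler, regardless of rules for other agents
User-agent: archive.org_bot
Disallow:
```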
If contacting the site owner is not an option, you can add an Ignore Robots.txt scoping rule at the seed or collection level. If only a few documents are missing and you don't need to crawl the site again, you can identify and patch in the missing documents using your crawl's Hosts report or Wayback QA.
Related content
Robots.txt exclusions and how they can impact your web archives
Reading your crawl's seeds report