Overview
While we continuously investigate and implement capture improvements, some websites are not created in a way that is "archive-friendly" and can be difficult to capture or replay in their entirety. These difficulties affect all web crawlers, not just ours. When selecting seed URLs and reviewing your archived content, please keep these limitations in mind.
About
A webmaster can use a robots.txt exclusion to prevent certain content from being crawled. The Archive-It crawlers respect all robots.txt exclusions by default.
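For example, a webmaster who wanted to keep all crawlers out of a (hypothetical) /private/ directory might publish a robots.txt file like this at the root of the site:

```
User-agent: *
Disallow: /private/
```

A crawler that respects robots.txt, like Archive-It's, will skip every URL under /private/ on that site.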
Troubleshooting
How to determine if your crawl is blocked by a robots.txt file:
- To see if an entire site you wish to crawl is being blocked, check your seed site for a robots.txt exclusion file before you crawl.
- To check if your crawl was blocked by a robots.txt file after it runs, check your seed status report once your crawl is complete.
- To check if part of your website or embedded content is blocked, please check your Hosts report.
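As a quick pre-crawl check, you can test a site's robots.txt rules yourself with Python's standard-library `urllib.robotparser`. This is a hedged sketch, not an Archive-It tool: the robots.txt contents and the example.org URLs below are assumptions for illustration; in practice you would fetch the real file from `https://<your-seed-site>/robots.txt`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents (assumption for illustration).
# In practice, view https://<your-seed-site>/robots.txt directly,
# or load it over the network with set_url() and read().
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Archive-It's crawler identifies itself as archive.org_bot.
blocked_url = "https://example.org/private/page.html"
allowed_url = "https://example.org/public/page.html"

print(parser.can_fetch("archive.org_bot", blocked_url))  # False: matches Disallow: /private/
print(parser.can_fetch("archive.org_bot", allowed_url))  # True: no rule blocks this path
```

If `can_fetch` returns False for your seed URL, the crawl will be blocked unless the site owner changes the file or you apply an Ignore Robots.txt scoping rule.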
Outcome
If you wish to crawl a site blocked by robots, try contacting the site owner to let them know you’d like to archive their site. Archive-It’s crawlers are “polite”, meaning they should not impact the site's performance or security in any way. The name (user-agent) of our crawler is archive.org_bot.
If contacting the site owner is not an option, you can add an Ignore Robots.txt scoping rule at the seed or collection (host) level. If only a few documents are missing and you don't need to crawl the site again, you can identify and patch in the missing documents using your crawl's Hosts report or Wayback QA.
Related content
Read your crawl's seeds report