Overview
While we continuously investigate and implement improvements, some websites are not created in a way that is "archive-friendly" and can be difficult to collect or replay in their entirety. These difficulties affect all web crawlers, not just Archive-It's. When selecting seed URLs and reviewing your archived content, please keep these limitations in mind. For more information on what makes sites archive-friendly, see the Library of Congress's Creating Preservable Websites.
About
A webmaster can use a robots.txt exclusion to prevent certain content from being crawled. Archive-It's crawlers respect robots.txt exclusions by default.
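For illustration, a robots.txt file lives at the root of a site (for example, https://example.com/robots.txt) and lists which paths each crawler may fetch. A minimal sketch, with placeholder paths and crawler names:

```
# Block all crawlers from one directory
User-agent: *
Disallow: /private/

# Block one specific crawler from the entire site
User-agent: ExampleBot
Disallow: /
```

A crawler that honors these rules, as Archive-It's does by default, will skip the disallowed paths, so documents under them will be missing from your archive.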
Troubleshooting
How to determine if your crawl is blocked by a robots.txt file:
- To see whether an entire site you wish to crawl is blocked, check the site for a robots.txt exclusion file before you crawl (see the sketch after this list).
- To check whether your crawl was blocked by a robots.txt file, review your seed status in the crawl's Seeds report after the crawl completes.
- To check whether part of your website or embedded content is blocked, check your crawl's Hosts report.
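One way to check a site before crawling is Python's standard-library robots.txt parser. The sketch below assumes a placeholder seed (https://example.com) and tests whether Archive-It's user-agent may fetch a couple of sample URLs:

```python
from urllib import robotparser

# Load the site's robots.txt file; the seed URL here is a placeholder.
rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

# Archive-It's crawler identifies itself as "archive.org_bot".
for url in ("https://example.com/", "https://example.com/private/page.html"):
    allowed = rp.can_fetch("archive.org_bot", url)
    print(url, "->", "allowed" if allowed else "blocked by robots.txt")
```

You can also simply open https://example.com/robots.txt (substituting your seed's domain) in a browser and read the rules directly.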
Outcome
If you want to crawl a site blocked by robots, try contacting the site owner to let them know you’d like to archive their site. Archive-It’s crawlers are “polite”, meaning they should not impact the site's performance or security in any way. The name (user-agent) of our crawler is archive.org_bot.
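If the site owner agrees, one option they have is to allow the crawler explicitly in their robots.txt file. A minimal sketch of such an entry (an empty Disallow line permits everything for that user-agent):

```
# Permit Archive-It's crawler, regardless of rules for other agents
User-agent: archive.org_bot
Disallow:
```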
If contacting the site owner is not an option, you can add an Ignore Robots.txt scoping rule at the seed or collection level. If only a few documents are missing and you don't need to crawl the site again, you can identify and patch in the missing documents using your crawl's Hosts report or Wayback QA.
Related content
Robots.txt exclusions and how they can impact your web archives
Reading your crawl's seeds report