Robots.txt exclusions can prevent Archive-It crawlers from accessing part or all of a website. This article explains how robots.txt files can affect your web archives, and when and how to work around them.
On this page:
- What is a robots.txt exclusion?
- How to find and read a robots exclusion request
- How do Archive-It crawlers handle robots.txt exclusions?
- How can I tell if my web archives are incomplete because of a robots.txt exclusion?
- What can I do if a site I want to crawl has a robots.txt exclusion?
- There are documents in the Blocked column of my crawl's Hosts report but my web archives look ok. Is this a problem?
- Further information
What is a robots.txt exclusion?
Site owners use a /robots.txt file to give instructions about their site to web crawlers (or robots), like the ones Archive-It uses; this is called the Robots Exclusion Protocol. A robots.txt file can tell crawlers not to access all or specific parts of the site.
How to find and read a robots exclusion request
A robots.txt file is always located at the topmost level of a website and is always called robots.txt. To view any website's robots.txt file, add /robots.txt to the site's address. For example, you can see Internet Archive's robots.txt file at: www.archive.org/robots.txt
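As a minimal sketch of the rule above, the following Python helper builds the robots.txt address for any page on a site. The function name robots_url is hypothetical, and assuming https when no scheme is given is a choice made for this example only.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(site: str) -> str:
    """Return the robots.txt address for the site that hosts `site`."""
    # Assumption for this sketch: default to https when no scheme is given,
    # e.g. "www.archive.org".
    if "://" not in site:
        site = "https://" + site
    parts = urlsplit(site)
    # robots.txt always sits at the top level, so drop any path, query, or fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))
```

Calling robots_url("www.archive.org") gives the same address as the example above, with https:// in front.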
The following are common examples of directives you might see in a site's robots.txt file.
All crawlers are excluded from crawling the site:
    User-agent: *
    Disallow: /
All crawlers are allowed to crawl the site:
    User-agent: *
    Disallow:
Archive-It's crawler is allowed into the site, but all other crawlers are not:
    User-agent: archive.org_bot
    Disallow:

    User-agent: *
    Disallow: /
All crawlers are excluded from crawling images on a site (assuming here that the images live in an /images/ directory):
    User-agent: *
    Disallow: /images/
Webmasters can put a crawl delay on their site (in seconds). Below, all crawlers must wait 10 seconds between page requests on a site:
    User-agent: *
    Crawl-delay: 10
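The five cases described above can be checked with Python's standard urllib.robotparser module. This is a sketch under stated assumptions, not how Archive-It evaluates robots.txt: the rule sets are typical ways to write each case, the /images/ directory and the example.org addresses are placeholders, and the names RULES, load, and allowed are made up for this example. The archive.org_bot user agent is the one named later in this article.

```python
from urllib.robotparser import RobotFileParser

# Illustrative rule sets, one per case. The /images/ path is an assumption.
RULES = {
    "block_all": "User-agent: *\nDisallow: /\n",
    "allow_all": "User-agent: *\nDisallow:\n",
    "archive_it_only": (
        "User-agent: archive.org_bot\n"
        "Disallow:\n"
        "\n"
        "User-agent: *\n"
        "Disallow: /\n"
    ),
    "block_images": "User-agent: *\nDisallow: /images/\n",
    "crawl_delay": "User-agent: *\nCrawl-delay: 10\n",
}

# Placeholder addresses used only for the checks.
PAGE = "https://example.org/index.html"
IMAGE = "https://example.org/images/logo.png"


def load(name: str) -> RobotFileParser:
    """Parse one of the rule sets above without touching the network."""
    parser = RobotFileParser()
    parser.parse(RULES[name].splitlines())
    return parser


def allowed(name: str, agent: str, url: str) -> bool:
    """Report whether `agent` may fetch `url` under the named rule set."""
    return load(name).can_fetch(agent, url)
```

For example, allowed("archive_it_only", "archive.org_bot", PAGE) reports that the Archive-It user agent may fetch the page while any other agent may not, and load("crawl_delay").crawl_delay("archive.org_bot") returns the delay in seconds.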
How do Archive-It crawlers handle robots.txt exclusions?
Archive-It’s Standard crawling technology respects all robots.txt exclusions and crawl delays by default.
Brozzler respects most robots.txt exclusions, but by default it ignores crawl delays and any exclusions on embedded documents (images, style sheets, etc.).
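The default behaviors described above can be summarized as a small toy model. This is not Archive-It or Brozzler code: the function names and the "standard" and "brozzler" labels are hypothetical, and the model simplifies "most exclusions" to "every exclusion except those on embedded documents".

```python
def honors_exclusion(crawler: str, embedded: bool) -> bool:
    """Return True if the crawler, by default, skips a URL that robots.txt disallows.

    `embedded` marks documents such as images or style sheets that a page pulls in.
    """
    if crawler == "standard":
        # Standard crawling respects all exclusions by default.
        return True
    if crawler == "brozzler":
        # Simplification: Brozzler ignores exclusions only on embedded documents.
        return not embedded
    raise ValueError(f"unknown crawler: {crawler!r}")


def honors_crawl_delay(crawler: str) -> bool:
    """Return True if the crawler, by default, waits for the site's crawl delay."""
    if crawler not in ("standard", "brozzler"):
        raise ValueError(f"unknown crawler: {crawler!r}")
    # Only Standard crawling observes crawl delays by default.
    return crawler == "standard"
```

Under this model, a disallowed style sheet is skipped by a Standard crawl but still collected by Brozzler, which is why the two can produce different-looking captures of the same site.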
How can I tell if my web archives are incomplete because of a robots.txt exclusion?
Robots.txt exclusions can cause missing or incomplete captures (e.g., sites missing images or styling elements) in Wayback. If you notice missing content in your web archives, your crawl reports should help determine whether or not robots.txt exclusions are to blame.
If the crawler was prevented from accessing a site entirely, the Seeds report will show a Blocked by Robots.txt status. The Blocked column on a crawl’s Hosts report lists all documents that were not collected because of a robots.txt exclusion. Look out for .css, .js, or image files in the Blocked documents list.
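If you have the Blocked documents as a list of URLs, a short script can pick out the .css, .js, and image files mentioned above. This is only a sketch: it assumes the list is a plain sequence of URL strings (the actual report format is not described here), and the extension set and the function name rendering_assets are choices made for this example.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit

# Assumed set of extensions for style sheets, scripts, and common image types.
RENDERING_EXTENSIONS = {".css", ".js", ".png", ".jpg", ".jpeg", ".gif", ".svg", ".webp"}


def rendering_assets(blocked_urls):
    """Return the blocked URLs whose path ends in a style, script, or image extension."""
    hits = []
    for url in blocked_urls:
        # Look only at the path so query strings such as ?v=3 do not hide the extension.
        suffix = PurePosixPath(urlsplit(url).path).suffix.lower()
        if suffix in RENDERING_EXTENSIONS:
            hits.append(url)
    return hits
```

An empty result suggests the blocked documents are less likely to explain missing images or styling in Wayback, though files served without a recognizable extension would not be caught by this check.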
What can I do if a site I want to crawl has a robots.txt exclusion?
If it’s an option, try contacting the site owner to let them know you’d like to archive their site. Archive-It’s crawlers are “polite”, meaning they should not impact the site's performance or security in any way. It may be helpful to provide them with the Archive-It crawler user agent (archive.org_bot) and IP range (submit a ticket for more information).
If contacting the site owner is not an option, you can add an Ignore Robots.txt scoping rule to the Seed or to the Host (collection level). This tells Archive-It crawlers to ignore that exclusion and collect in-scope documents from the site. If only a few documents are missing and you don't need to crawl the site again, you can instead identify and patch in the missing documents using your crawl's Hosts report or Wayback QA.
There are documents in the Blocked column of my crawl's Hosts report but my web archives look ok. Is this a problem?
Not necessarily! Robots.txt files may exclude directories or sections of websites that are not required to replay the site, like login pages or administrator files. If you're happy with your archived site, you probably don't need to collect the documents listed under the Blocked column of your crawl's Hosts report.
Further information
For more general information about robots exclusions, see: http://www.robotstxt.org/