On this page:
- What is a robots.txt exclusion
- How to find and read a robots exclusion request
- How to determine if your crawl is blocked by a robots.txt file
- How to remove a robots exclusion
- How to ignore robots.txt files
- Further information
What is a robots.txt exclusion?
The robots exclusion standard is a tool that webmasters use to direct web crawlers not to crawl all or specified parts of their website. The webmaster places this request in a robots.txt file that is easy to find on their website (e.g. example.com/robots.txt). Archive-It (like Google and most other search engines) uses a robot to crawl and archive web pages. By default, our crawler honors all robots.txt exclusion requests. However, on a case-by-case basis, you can set up rules to ignore robots.txt blocks for specific sites.
How to find and read a robots exclusion request
A robots.txt file is always located at the topmost level of a website and the file itself is always called robots.txt. To view any website's robots.txt file, simply add /robots.txt to the end of the site's address. For example, you can see the Internet Archive's robots.txt file at: www.archive.org/robots.txt
If you see this text on a robots exclusion page, then all robots are excluded from crawling the site:
User-agent: *
Disallow: /
If you see this text on a robots exclusion page, then all robots are allowed to crawl the site:
User-agent: *
Disallow:
Webmasters can also choose to disallow select, rather than all, robots. In the example below, Archive-It's crawler is allowed into the site, but all other crawlers are not:
User-agent: archive.org_bot
Disallow:
User-agent: *
Disallow: /
Webmasters can also block crawling robots from certain directories on their site. In the example below, all crawlers are blocked from crawling the site's /images directory:
User-agent: *
Disallow: /images
Webmasters can also set a crawl delay (in seconds) for their site. In the example below, all crawlers must wait 10 seconds between page requests:
User-agent: *
Crawl-delay: 10
How to determine if your crawl is blocked by a robots.txt file
You can determine whether your crawl will be blocked by a robots.txt file before launching it by accessing your target site's file in the manner described above. If Archive-It's crawling robot is specifically listed in this file as disallowed from all or designated sections of the site, then you can expect the crawl to be blocked from those sections. You can recognize our crawler in these files by its "user-agent" (name): archive.org_bot.
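For example, a hypothetical robots.txt file like the one below would block our crawler (and only our crawler) from pages under the site's /news directory, while leaving the rest of the site open to it:
User-agent: archive.org_bot
Disallow: /news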
You can also determine whether or not your crawl was blocked by a robots.txt file after it runs by reviewing your Seeds report. The "Seed Status" column in this report will indicate whether an entire site was blocked by a robots.txt file. The "Blocked" column in your crawl's Hosts report will likewise show you any specific parts of a host domain that have been blocked by a robots.txt file.
How to remove a robots exclusion
If a website you want to crawl excludes our crawling robot (archive.org_bot), you should first try to contact the site's webmaster, let them know why you want to archive their site, and request that they make an exception in their robots.txt file.
In these cases, it is always helpful to provide the webmaster with the following information:
- The name (user-agent) of our crawler: archive.org_bot
- Our crawler's IP range is available upon request.
You can inform the webmaster that our crawler is very "polite," meaning that allowing it to crawl their site should not impact the site's performance or security in any way.
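If it helps, you can also show the webmaster what such an exception might look like. In the hypothetical example below, a site that blocks all crawlers from its /images directory adds a rule allowing archive.org_bot to crawl the entire site:
User-agent: archive.org_bot
Disallow:
User-agent: *
Disallow: /images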
In the event that the webmaster does not respond to or rejects your request, you can use the directions provided below for ignoring robots exclusions.
How to ignore robots.txt files
Whether or not a webmaster makes an exception for our crawler in the manner described above, you can crawl material otherwise blocked by a robots.txt file by requesting that we enable this special feature for your account. To get started, please contact our Web Archivists directly and identify any specific hosts or types of material blocked by robots exclusions that you nonetheless wish to crawl.
Ignoring robots.txt by seed or by host, what's the difference?
You can choose whether to ignore robots exclusions for all hosts within a specific seed (seed-level rules) or for all instances of a specific host within a collection (collection-level rules). For more information on when to use seed-level vs. collection-level rules, please visit our What's the Difference guidance.
Ignore robots.txt by seed
Once the "Ignore robots.txt" feature has been enabled for your account, you can override robots exclusions in your crawl on a seed-by-seed basis. To ignore all robots.txt blocks on hosts captured from a specific seed (including the seed host, and any host embedded content is coming from), click on the specific seed from your collection's seed list, followed by the "Seed Scope" tab, select "Ignore Robots.txt" from the drop-down menu, and click the "Add Rule" button to apply it to your seed's future crawls:
Ignore robots.txt by host
Once the "Ignore robots.txt" feature has been enabled for your account, you can also override robots exclusions in your collection on a host-by-host basis. To ignore all robots.txt blocks on hosts that appear anywhere during the course of your crawls, navigate to the "Collection Scope" tab of your collection's management area, select "Ignore Robots.txt" from the drop-down menu, add the hosts to which you would like to apply this new rule (exactly as they appear in your Hosts report), and click the "Add Rule" button to apply it to your seed's future crawls:
Note that you can also apply this host-specific rule directly from our actionable hosts reports.
Further information
For more general information about robots exclusions, see: http://www.robotstxt.org/