URLs from a host, or an entire seed, may be listed on the Seeds or Hosts reports as "Blocked" because of robots.txt exclusions.
A robots.txt file is a way for a webmaster to direct a web crawler (also called a robot or spider) not to crawl all or specified parts of their website. By default, the Archive-It crawler honors and respects all robots.txt exclusion requests. However, institutions can set up rules on a case-by-case basis to ignore robots.txt blocks for specific sites. To enable this capability for your account, please contact us directly.
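To illustrate how a robots.txt exclusion works, below is a minimal sketch of a crawler checking a site's rules before fetching a URL. The site, rules, and user agent are hypothetical examples for illustration only; this is not Archive-It's actual crawler code.

    # Sketch of a robots.txt check using Python's standard library.
    # The rules and URLs below are hypothetical examples.
    from urllib.robotparser import RobotFileParser

    # Example robots.txt a webmaster might publish:
    #   User-agent: *
    #   Disallow: /private/
    # This asks all crawlers to skip anything under /private/.
    parser = RobotFileParser()
    parser.parse([
        "User-agent: *",
        "Disallow: /private/",
    ])

    # A crawler that honors robots.txt consults the rules before each fetch.
    print(parser.can_fetch("*", "https://example.org/private/report.html"))  # False -> blocked
    print(parser.can_fetch("*", "https://example.org/public/index.html"))    # True  -> crawlable

URLs that fail this kind of check are the ones reported as blocked by robots.txt.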
If your entire seed is blocked by robots.txt, then you will see a "Blocked (robots.txt)" message in your Seeds report's "Status" column. If only certain URLs are blocked by robots.txt, those will appear in the "Blocked" column in the Hosts report.