URLs from a host or an entire seed may be listed in the Seeds or Hosts reports as "Blocked" because of robots.txt exclusions.
A robots.txt file is a way for a webmaster to direct a web crawler (also known as a robot or spider) not to crawl all or specified parts of their website. By default, the Archive-It crawler honors all robots.txt exclusion requests.
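To illustrate how a robots.txt exclusion leads to a "Blocked" URL, here is a minimal sketch using Python's standard urllib.robotparser module. The site URL and the "archive.org_bot" user-agent string are placeholders for illustration; this is not the Archive-It crawler's actual implementation.

```python
from urllib import robotparser

# Hypothetical site, for illustration only.
rp = robotparser.RobotFileParser("https://www.example.com/robots.txt")
rp.read()  # fetch and parse the site's robots.txt

# A crawler that honors robots.txt checks each URL before fetching it.
url = "https://www.example.com/private/page.html"
if rp.can_fetch("archive.org_bot", url):
    print("Allowed to crawl:", url)
else:
    print("Blocked by robots.txt:", url)  # would be reported as "Blocked"
```

Any URL that fails this kind of check is skipped by a compliant crawler rather than captured, which is why it surfaces in your reports as blocked rather than as an error.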
If your entire seed is blocked by robots.txt, then you will see a "Blocked (robots.txt)" message in your Seeds report's "Status" column. If only certain URLs are blocked by robots.txt, those will appear in the "Blocked" column in the Hosts report.
To set up rules to ignore robots.txt blocks for specific sites, see What can I do if a site I want to crawl has a robots.txt exclusion?