URLs from a host or an entire seed may be listed in the Seeds or Hosts reports as "Blocked" because of robots.txt exclusions.
A robots.txt file is a way for a webmaster to direct a web crawler (also known as a robot or spider) not to crawl all or specified parts of their website. By default, the Archive-It crawler honors all robots.txt exclusion requests.
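To illustrate how a robots.txt exclusion leads to a "Blocked" URL, here is a minimal sketch using Python's standard urllib.robotparser module. The site URL and the "archive.org_bot" user-agent string are placeholders for illustration; this is not the Archive-It crawler's actual implementation.

```python
from urllib import robotparser

# Hypothetical site, for illustration only.
rp = robotparser.RobotFileParser("https://www.example.com/robots.txt")
rp.read()  # fetch and parse the site's robots.txt

# A crawler that honors robots.txt checks each URL before fetching it.
url = "https://www.example.com/private/page.html"
if rp.can_fetch("archive.org_bot", url):
    print("Allowed to crawl:", url)
else:
    print("Blocked by robots.txt:", url)  # would be reported as "Blocked"
```

Any URL that fails this kind of check is skipped by a compliant crawler rather than captured, which is why it surfaces in your reports as blocked rather than as an error.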
If your entire seed is blocked by robots.txt, then you will see a "Blocked (robots.txt)" message in your Seeds report's "Status" column. If only certain URLs are blocked by robots.txt, those will appear in the "Blocked" column in the Hosts report.
To set up rules to ignore robots.txt blocks for specific sites, see What can I do if a site I want to crawl has a robots.txt exclusion?