Why didn't some pages get archived?

There are a few reasons why specific pages within a seed site might not be archived:

Robots.txt: Parts of the site could be blocked from our crawler by a robots exclusion. Crawling for Archive-It is done with the user-agent archive.org_bot; you can check whether your seed has blocked this web crawler by opening the robots.txt file at the root of the seed's host, for example www.yourseed.com/robots.txt. To learn more about robots.txt, see Robots.txt exclusions and how they can impact your web archives.
Not linked: Our crawler can only find web pages to which your seed site links. If there are parts of the site that are not linked to from a page that is in scope, they will not be archived. If you know that you want to archive a specific page that is not linked to from anywhere else, please list it as a seed.
Connection Error: Sometimes our crawler cannot connect to the site that you want to archive, either because of an error on the host side or because its owner has forbidden access to it. When this occurs, an error notice is logged in your Seeds report. To learn more about these error codes and what they mean, see Understanding seed status.
Out of Scope: A URL might not have been archived because it was out of scope for the crawl. You may want to review how scoping works in order to understand why something would be considered out of scope. You can also review the 'out of scope' column of each crawl's Hosts report to see which URLs were deemed out of scope and therefore were not archived. If a URL that you want to archive is out of scope, you may want to expand the scope of your crawl. One common reason is that the desired page is on a sub-domain of your seed, which is not archived by default. A sub-domain is the part of a host name that appears to the left of the seed site's domain, e.g. crawler.archive.org (crawler is the sub-domain). If you want to be sure to crawl sub-domains, list them as separate seeds on your seed list or expand the scope of your seeds.
Time, data, or document limit: If the URL was in scope, your crawl may have stopped due to a time, data, or document limit before it could collect all the content that was in scope. The Queued column in your crawl's Hosts report lists the URLs next in line to be crawled from each host if the crawl were to continue. Note: Especially high numbers of queued URLs are typically an indication of a crawler trap.
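
Two of the checks above, the robots.txt exclusion and the sub-domain case, can be approximated locally before a crawl. The following is a minimal sketch in Python using only the standard library. It assumes the user-agent archive.org_bot named in this article; the function names, the sample robots.txt text, and the example.com URLs are hypothetical and for illustration only. The sub-domain test is a simple host-name comparison, not a reproduction of Archive-It's actual scoping rules.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

# User-agent named in this article for Archive-It crawls.
CRAWLER_USER_AGENT = "archive.org_bot"


def is_blocked_by_robots(robots_txt: str, url: str,
                         user_agent: str = CRAWLER_USER_AGENT) -> bool:
    """Return True if the given robots.txt text disallows `url` for `user_agent`."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


def is_subdomain_of_seed(url: str, seed_host: str) -> bool:
    """Return True if the URL's host sits to the left of `seed_host`.

    Simplified assumption: a plain suffix comparison on host names.
    """
    host = (urlsplit(url).hostname or "").lower()
    return host.endswith("." + seed_host.lower())


# Hypothetical robots.txt content, as it might appear at
# www.yourseed.com/robots.txt.
SAMPLE_ROBOTS_TXT = """\
User-agent: archive.org_bot
Disallow: /private/

User-agent: *
Disallow:
"""

if __name__ == "__main__":
    for page in ("https://www.example.com/private/report.html",
                 "https://www.example.com/public/index.html"):
        print(page, "blocked:", is_blocked_by_robots(SAMPLE_ROBOTS_TXT, page))

    for page in ("https://crawler.archive.org/page",
                 "https://archive.org/about"):
        print(page, "sub-domain:", is_subdomain_of_seed(page, "archive.org"))
```

To test a live site rather than pasted text, RobotFileParser can also fetch the file itself through its set_url() and read() methods. A page that passes both checks may still be missed for the other reasons listed above, so treat the result as a first diagnostic only.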
