What is "crawl scope" and how does it work?
"Crawl scope" is the extent to which our crawler will travel in order to discover and archive new materials. It is determined by two things:
- The seed URLs that you have chosen to crawl – See our guidance on selecting your seeds for more information on how your seed URLs determine your default crawl scope.
- Any advanced scoping rules entered into the Crawl Scope tab of your collection's management page within our web application.
How our crawler determines what to crawl: Default Crawl Scope
The crawler uses your seed URLs to determine the scope of your crawl. By default, the crawler will start at your seed URL, then follow links within your seed site to archive pages.
Question: What is captured by default?
Answer: All links from your seed URL host and all embedded content required to render the page as it appears on the live web.
Example Seed: www.archive.org/
Embedded image: www.ala.org/logo.jpg IS in scope
Question: What isn’t captured by default?
Answer: All links out to different hosts, including subdomains of the seed URL. A subdomain is a division of a larger site, named to the left of the host name (for example, crawler.archive.org).
Example Seed: www.archive.org/
Link: www.ca.gov is NOT in scope
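The default rules above can be sketched in a few lines of Python. This is an illustrative approximation only, not the crawler's actual implementation: the function names `host_of` and `in_default_scope` are hypothetical, the `www.archive.org/about/` link is an invented example, and the sketch ignores any path-based scoping that a seed URL may imply.

```python
from urllib.parse import urlsplit


def host_of(url: str) -> str:
    """Return the lowercased host of a URL, tolerating a missing scheme."""
    if "://" not in url:
        url = "http://" + url  # urlsplit only finds the host when a scheme or // is present
    return (urlsplit(url).hostname or "").lower()


def in_default_scope(seed: str, url: str, embedded: bool = False) -> bool:
    """Approximate the default scope described above (hypothetical helper).

    - Embedded content needed to render a page is in scope from any host.
    - Links are in scope only when their host exactly matches the seed's host,
      so subdomains and other hosts are out of scope.
    """
    if embedded:
        return True
    return host_of(url) == host_of(seed)


seed = "www.archive.org/"
print(in_default_scope(seed, "www.archive.org/about/"))               # same host
print(in_default_scope(seed, "www.ala.org/logo.jpg", embedded=True))  # embedded image
print(in_default_scope(seed, "crawler.archive.org"))                  # subdomain
print(in_default_scope(seed, "www.ca.gov"))                           # different host
```

The exact host comparison is what makes a subdomain such as crawler.archive.org fall outside the default scope even though it belongs to the same larger site.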
For more details on default crawling behavior, consult our complete guide to how our crawler determines scope.
Sometimes material that is valuable to your collection is automatically ruled "out of scope" because of the way seeds determine the default scope of your crawl. Conversely, your crawl may discover and archive more material than you intended. When either happens, you can use the advanced scoping rules below to modify and improve your scope for future crawls. (If you are unsure how much of the material you crawled was archived or deemed out of scope, you can review your crawl's report to find out.)
How to modify the default scope of your crawl
By modifying your collection's crawl scope, you can either limit (archive fewer URLs) or expand (archive more URLs) what the crawler archives during the course of a normal crawl.
You can make either change at the collection level or at the seed level.
Collection level scoping
Begin by navigating to your collection's management page in our web application, then open the Crawl Scope tab.
This section of the web application includes the tools that you will need to expand your scope to include more material, to limit your crawler from archiving too much material, or to overcome blocks to our crawler. For guidance through each of these operations, consult the specific articles in this section: Scoping Crawls.
Seed level scoping
To modify scope at the level of a particular seed, navigate to the collection's "Seeds" tab, click the hyperlinked URL of the seed you wish to modify, then open that seed's "Seed Scope" tab.
How to scope crawls for specific types of sites
We recommend specific scope expansions/limitations to best archive certain kinds of seed sites. Before crawling them, please review our step-by-step guidance on the following: Scoping crawls for specific types of sites.