What is "scope" and how does it work?
"Scope" is the extent to which the crawler will travel to discover and archive new materials. To avoid crawling the entire web, Archive-It crawls have a default scope. Sometimes, material that is valuable to your collection nonetheless gets automatically ruled "out of scope" because of the manner in which seeds determine the default scope of your crawl. Similarly, your crawl may discover and archive more materials than you had intended. When this happens, you can modify and improve your scope for future crawls.
Some platforms and sites require specific scope expansions or limitations in order to archive well. Before crawling them, please review our step-by-step guidance: Scoping crawls for specific types of sites.
Collection level scoping
By modifying your collection's scope, you can either limit (archive fewer URLs) or expand (archive more URLs) what would be captured by default, for all crawls on all seeds in that collection. To make changes, navigate to the collection and select "Collection Scope," where you can block content, add limits, ignore robots.txt, or include content.
Seed level scoping
By modifying a seed's scope, you can either limit (archive fewer URLs) or expand (archive more URLs) what would be captured by default, for all crawls on that specific seed.
To modify a seed's scope, navigate to the collection's "Seeds" tab, click the hyperlinked URL of the seed, and then open that seed's "Seed Scope" tab, where you can add limits, ignore robots.txt exclusions, and block or include specific URLs when archiving.
When should either level be used?
Collection level scope modifications instruct our crawler to archive more or less content, based on your specifications, during the course of its entire crawl, across all seeds, within a specific collection. Seed level scoping further allows you to specify precise rules to the crawls of specific seeds in a larger crawl.
For instance, if you block content from the host example-host.com at the collection level, then content from that host is blocked universally: at any time during the crawl and regardless of which seed site led our crawler to it. If you instead block example-host.com at the level of a specific seed, such as http://www.example-seed1.com/, then content from that host encountered in the course of crawling http://www.example-seed1.com/ is blocked, while content encountered in the course of crawling http://www.example-seed2.com/ is not. Seed level rules therefore free you from setting these rules at the collection level and toggling them on and off depending on which seeds you are currently crawling.
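The distinction above can be sketched as a small decision function. This is an illustrative model only, not Archive-It's implementation; the host names, seed URLs, and the `is_blocked` helper are assumptions made for the example:

```python
# Sketch of collection-level vs seed-level host blocking.
# Hosts and seed URLs are made-up; this is not Archive-It's code.
from urllib.parse import urlparse

def is_blocked(url, originating_seed, collection_blocks, seed_blocks):
    """Return True if the URL's host is blocked on this crawl path."""
    host = urlparse(url).hostname or ""
    if host in collection_blocks:
        return True  # collection level: blocked regardless of seed
    # seed level: blocked only when reached from a seed carrying the rule
    return host in seed_blocks.get(originating_seed, set())

# Collection-level rule: example-host.com is blocked from every seed.
print(is_blocked("http://example-host.com/page", "http://www.example-seed2.com/",
                 {"example-host.com"}, {}))                                   # True

# Seed-level rule: blocked only when reached from example-seed1.com.
seed_blocks = {"http://www.example-seed1.com/": {"example-host.com"}}
print(is_blocked("http://example-host.com/page", "http://www.example-seed1.com/",
                 set(), seed_blocks))                                         # True
print(is_blocked("http://example-host.com/page", "http://www.example-seed2.com/",
                 set(), seed_blocks))                                         # False
```

The seed-level rule leaves other seeds untouched, which is why it replaces the toggling described above.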
Use Seed level scoping rules...
- When you want to ignore robots.txt for all content from a specific seed
- When you want to limit the amount of data from each seed in your collection at a more granular level
- When you want to capture embedded content (ex. YouTube videos) from one seed but not from another
- When you want to block a host from being crawled in one seed but allow it in another
- When you want to include links to an external host discovered from one seed and not another
Use Collection level scoping rules...
- When you want to ignore robots.txt for a specific host universally across your collection
- When you want to limit the amount of data from a specific host across all crawls
- When you want to capture embedded content (ex. YouTube videos) from all seeds
- When you want to block a host from being crawled entirely in your collection