What is "scope" and how does it work?
"Scope" is the extent to which the crawler will travel to discover and archive new materials. To avoid crawling the entire web, Archive-It crawls have a default scope. Sometimes, material that is valuable to your collection nonetheless gets automatically ruled "out of scope" because of the manner in which seeds determine the default scope of your crawl. Similarly, your crawl may discover and archive more materials than you had intended. When this happens, you can modify and improve your scope for future crawls.
Some platforms and sites require specific scope expansions or limitations to archive well. Before crawling them, please review our step-by-step guidance: Scoping crawls for specific types of sites.
Collection level scoping
By modifying your collection's scope, you can either limit (archive fewer URLs) or expand (archive more URLs) what would be captured by default for all crawls on all seeds in that collection. To make changes, navigate to the collection and select "Collection Scope," where you can block content, add limits, ignore robots.txt, or include content.
Seed level scoping
By modifying a seed's scope, you can either limit (archive fewer URLs) or expand (archive more URLs) what would be captured by default, for all crawls on that specific seed.
To modify a seed's scope, navigate to the collection's "Seeds" tab, click the hyperlinked URL of the seed, and then open that seed's "Seed Scope" tab, where you can add limits, avoid robots.txt exclusions, and block or include specific URLs when archiving.
When should either level be used?
Collection level scope rules broadly apply to all seeds/crawls in your collection. Seed level scope rules will only apply to content discovered from the specific seed to which the rule is applied.
For instance, if you add a Block Host rule for the host example.com at the collection level, content from that host is blocked any time the crawler discovers it from any seed. If you instead add a rule blocking the host example-host.com to a specific seed, content from that host is blocked only when the crawler discovers it via that seed; if the crawler discovers content from example-host.com via another seed, it can still be captured.
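The sketch below is a minimal, hypothetical model of that difference; the rule dictionaries and function names are illustrative only, not Archive-It's API. A collection-level rule matches regardless of which seed led the crawler to a URL, while a seed-level rule matches only for URLs discovered via its own seed.

```python
# Hypothetical model of rule reach (illustrative only, not Archive-It's API).

from urllib.parse import urlparse

# Collection-level rules carry no seed: they apply to every seed in the collection.
collection_rules = [{"block_host": "example.com"}]
# Seed-level rules are tied to one seed and apply only to URLs discovered via it.
seed_rules = [{"block_host": "example-host.com", "seed": "https://seed-a.org/"}]

def is_blocked(url: str, discovered_via_seed: str) -> bool:
    host = urlparse(url).hostname
    for rule in collection_rules:
        if rule["block_host"] == host:
            return True                      # blocked no matter which seed found it
    for rule in seed_rules:
        if rule["block_host"] == host and rule["seed"] == discovered_via_seed:
            return True                      # blocked only when found via this seed
    return False

print(is_blocked("http://example.com/page", "https://seed-b.org/"))       # True: collection-wide block
print(is_blocked("http://example-host.com/page", "https://seed-a.org/"))  # True: seed-level block applies
print(is_blocked("http://example-host.com/page", "https://seed-b.org/"))  # False: other seeds can still capture it
```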
Use Seed level scoping rules...
- When you want to ignore robots.txt for all content from a specific seed
- When you want to limit the amount of data from each seed in your collection at a more granular level
- When you want to capture embedded content (ex. YouTube videos) from one seed but not from another
- When you want to block a host from being crawled in one seed but allow it in another
- When you want to include links to an external host discovered from one seed and not another
Use Collection level scoping rules...
- When you want to ignore robots.txt for a specific host universally across your collection
- When you want to limit the amount of data from a specific host across all crawls
- When you want to capture embedded content (ex. YouTube videos) from all seeds
- When you want to block a host from being crawled entirely in your collection