What is "scope" and how does it work?
"Scope" is the extent to which the crawler will travel to discover and archive new materials. To avoid crawling the entire web, Archive-It crawls have a default scope, which is determined by your seeds. Sometimes material that is valuable to your collection is automatically ruled "out of scope" under that default. At other times, a crawl discovers and archives more material than you intended. In either case, you can modify your scope to improve future crawls.
Some platforms and sites, such as YouTube and Wix, require specific scope expansions or exclusions in order to be archived well. Before crawling them, review our step-by-step guidance: Scoping crawls for specific types of sites.
Collection level scope
By modifying your collection's scope, you can either limit (archive fewer documents) or expand (archive more documents) what is collected by default, for all crawls on all seeds in that specific collection.
To modify a collection's scope:
- Go to your collection and select Collection Scope.
- For existing rules, use the controls to toggle off or delete a rule.
- To add a new rule, select the rule you want from the drop-down menu. For more information on your options, see Scope Rules and how to use them.
- Click Add Rule.
Seed level scope
By modifying a seed's scope, you can either limit (archive fewer documents) or expand (archive more documents) what is collected by default, for all crawls on that specific seed.
When should either level be used?
Collection level scope rules broadly apply to all seeds and crawls in your collection. Seed level scope rules will only apply to documents discovered from the specific seed to which the rule is applied.
For instance, if you add an Exclude Host rule for the host example.com at the collection level, then a document from that host will be excluded whenever the crawler discovers it, regardless of which seed led to it. By contrast, if you add a rule excluding the host example-host.com to a specific seed, then a document from that host will be excluded only when the crawler discovers it via that seed. If the crawler discovers a document from example-host.com via another seed, it can still be collected.
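The difference between the two levels can be sketched in a few lines of Python. This is only an illustrative model, not how the Archive-It crawler is implemented: the `Seed` and `Collection` classes, the `is_excluded` function, and the seed URLs are hypothetical names made up for this example, and the sketch assumes exact host matching, which may differ from how a real Exclude Host rule treats subdomains.

```python
from dataclasses import dataclass, field
from urllib.parse import urlsplit


@dataclass
class Seed:
    """Hypothetical model of a seed with its own (seed level) host exclusions."""
    url: str
    excluded_hosts: set[str] = field(default_factory=set)


@dataclass
class Collection:
    """Hypothetical model of a collection with collection level host exclusions."""
    seeds: list[Seed]
    excluded_hosts: set[str] = field(default_factory=set)


def is_excluded(collection: Collection, seed: Seed, document_url: str) -> bool:
    """Return True if a document discovered via `seed` would be excluded.

    Assumption: hosts are compared by exact match only.
    """
    host = urlsplit(document_url).hostname or ""
    # Collection level rules apply no matter which seed led to the document.
    if host in collection.excluded_hosts:
        return True
    # Seed level rules apply only to documents discovered via that seed.
    return host in seed.excluded_hosts


# Two made-up seeds: only seed_a carries a seed level exclusion.
seed_a = Seed("https://a.example.org/", excluded_hosts={"example-host.com"})
seed_b = Seed("https://b.example.org/")
collection = Collection(seeds=[seed_a, seed_b], excluded_hosts={"example.com"})

for seed in (seed_a, seed_b):
    for url in ("https://example.com/page", "https://example-host.com/doc"):
        print(seed.url, url, is_excluded(collection, seed, url))
```

In this model, documents from example.com are excluded for both seeds, while documents from example-host.com are excluded only when discovered via the first seed.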
Use Seed level scope rules...
- When you want to Ignore Robots.txt for all content from a specific seed
- When you want to limit the amount of data from each seed in your collection at a more granular level
- When you want to collect embedded content (for example, YouTube videos) from one seed but not from another seed
- When you want to exclude a host from being crawled in one seed but allow it in another
- When you want to include links to an external host discovered from one seed and not another
Use Collection level scope rules...
- When you want to Ignore Robots.txt for a specific host universally across your collection
- When you want to limit the amount of data from a specific host across all crawls
- When you want to collect embedded content (for example, YouTube videos) from all seeds
- When you want to exclude a host from being crawled entirely in your collection