What is "scope" and how does it work?
"Scope" is the extent to which the crawler will travel to discover and archive new materials. To avoid crawling the entire web, Archive-It crawls have a default scope. Sometimes, material that is valuable to your collection nonetheless gets automatically ruled "out of scope" because of the manner in which seeds determine the default scope of your crawl. Similarly, your crawl may discover and archive more materials than you had intended. When this happens, you can modify and improve your scope for future crawls.
Some platforms and sites require specific scope expansions or limitations to archive well. Before crawling them, please review our step-by-step guidance: Scoping crawls for specific types of sites.
Collection level scoping
By modifying your collection's scope, you can either limit (archive fewer URLs) or expand (archive more URLs) what would be captured by default for all crawls on all seeds in that collection. To make changes, navigate to the collection and select "Collection Scope," where you can block content, add limits, ignore robots.txt, or include content.
Seed level scoping
By modifying a seed's scope, you can either limit (archive fewer URLs) or expand (archive more URLs) what would be captured by default, for all crawls on that specific seed.
To modify a seed's scope, navigate to the collection's "Seeds" tab, click the hyperlinked URL of the seed, and then open that seed's "Seed Scope" tab, where you can add limits, avoid robots.txt exclusions, and block or include specific URLs when archiving.
When should either level be used?
Collection level scope rules broadly apply to all seeds/crawls in your collection. Seed level scope rules will only apply to content discovered from the specific seed to which the rule is applied.
For instance, if you add a Block Host rule for the host example.com at the collection level, content from that host is blocked any time the crawler discovers it from any seed. If you instead add a rule blocking the host example-host.com to a specific seed, content from that host is blocked only when the crawler discovers it via that seed; if the crawler discovers content from example-host.com via another seed, it can still be captured.
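The sketch below is a minimal, hypothetical model of that difference; the rule dictionaries and function names are illustrative only, not Archive-It's API. A collection-level rule matches regardless of which seed led the crawler to a URL, while a seed-level rule matches only for URLs discovered via its own seed.

```python
# Hypothetical model of rule reach (illustrative only, not Archive-It's API).

from urllib.parse import urlparse

# Collection-level rules carry no seed: they apply to every seed in the collection.
collection_rules = [{"block_host": "example.com"}]
# Seed-level rules are tied to one seed and apply only to URLs discovered via it.
seed_rules = [{"block_host": "example-host.com", "seed": "https://seed-a.org/"}]

def is_blocked(url: str, discovered_via_seed: str) -> bool:
    host = urlparse(url).hostname
    for rule in collection_rules:
        if rule["block_host"] == host:
            return True                      # blocked no matter which seed found it
    for rule in seed_rules:
        if rule["block_host"] == host and rule["seed"] == discovered_via_seed:
            return True                      # blocked only when found via this seed
    return False

print(is_blocked("http://example.com/page", "https://seed-b.org/"))       # True: collection-wide block
print(is_blocked("http://example-host.com/page", "https://seed-a.org/"))  # True: seed-level block applies
print(is_blocked("http://example-host.com/page", "https://seed-b.org/"))  # False: other seeds can still capture it
```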
Use Seed level scoping rules...
- When you want to ignore robots.txt for all content from a specific seed
- When you want to limit the amount of data from each seed in your collection at a more granular level
- When you want to capture embedded content (ex. YouTube videos) from one seed but not from another
- When you want to block a host from being crawled in one seed but allow it in another
- When you want to include links to an external host discovered from one seed and not another
Use Collection level scoping rules...
- When you want to ignore robots.txt for a specific host universally across your collection
- When you want to limit the amount of data from a specific host across all crawls
- When you want to capture embedded content (ex. YouTube videos) from all seeds
- When you want to block a host from being crawled entirely in your collection