Since the release of Archive-It 5.0, you can crawl live web content more selectively by making scope modifications at one or both of two levels: Collection and Seed. Collection-level scope modifications instruct our crawler to archive more or less content, based on your specifications, across all seeds in a specific collection for the duration of its crawl. Seed-level scoping, a feature new to Archive-It 5.0, further allows you to apply precise rules to the crawls of specific seeds within a larger crawl. That is, if you wish to expand or restrict our crawler's archiving of content from any host website based upon the seed that led it there, you can now set those rules at the seed level.
For instance, if you limit the scope of your crawl at the collection level to block content from the host example-host.com, then content from that site is blocked universally: at any time during the crawl and regardless of the seed site that led our crawler to it. If you instead block content from the same example-host.com host at the level of a specific seed, such as http://www.example-seed1.com/, then any such content encountered in a crawl originating from http://www.example-seed1.com/ is blocked, while content encountered in a crawl originating from http://www.example-seed2.com/ is not. You no longer need to set these scoping rules at the collection level and toggle them on and off depending on which seeds you are crawling at the time.
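The difference between the two levels can be pictured as a simple decision rule. The sketch below is a conceptual model only, not Archive-It's actual implementation; the hostnames, seed URLs, and the `is_blocked` helper are all illustrative:

```python
# Conceptual sketch (not Archive-It internals): how collection-level and
# seed-level block rules might combine when deciding whether to archive a
# discovered host. All names here are illustrative assumptions.

def is_blocked(url_host, originating_seed, collection_blocks, seed_blocks):
    """Return True if content from url_host should be blocked.

    collection_blocks: set of hosts blocked for every seed in the collection.
    seed_blocks: dict mapping a seed URL to the set of hosts blocked only
    when the crawler reached the content via that seed.
    """
    if url_host in collection_blocks:  # collection rules apply universally
        return True
    # seed rules apply only to crawl paths that originate from that seed
    return url_host in seed_blocks.get(originating_seed, set())

# Collection-level rule: block example-host.com for every seed.
collection_blocks = {"example-host.com"}

# Seed-level rule: block other-host.com only for crawls from seed1.
seed_blocks = {"http://www.example-seed1.com/": {"other-host.com"}}

print(is_blocked("other-host.com", "http://www.example-seed1.com/",
                 collection_blocks, seed_blocks))  # True  (seed rule applies)
print(is_blocked("other-host.com", "http://www.example-seed2.com/",
                 collection_blocks, seed_blocks))  # False (no rule for seed2)
print(is_blocked("example-host.com", "http://www.example-seed2.com/",
                 collection_blocks, seed_blocks))  # True  (collection rule)
```

The key point the sketch illustrates: a collection-level rule ignores which seed led the crawler to the host, while a seed-level rule is consulted only for crawl paths originating from that particular seed.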
Collection level scoping modifications can be made at any time from the "Crawl Scope" tab in any collection's management area:
To modify scope at the level of a particular seed, navigate to the collection's "Seeds" tab, click on the hyperlinked URL of the seed for which you wish to make a modification, followed by that seed's "Seed Scope" tab:
Here you may add data limits, ignore robots.txt exclusions, and block or include specific URLs when archiving, just as you previously could at the collection level, but apply those changes only to the specific seed and the websites to which it leads our crawler.
When should either level be used?
This new feature introduces new (and hopefully quite helpful!) opportunities to refine the focus of your crawl, but we understand that it might not yet be clear which level is the best choice in every scenario. For that reason, we make the following general recommendations:
Use Seed level scoping rules...
- When you want to ignore robots.txt for all content from a specific seed
- When you want to limit the amount of data from each seed in your collection at a more granular level
- When you want to capture embedded content (ex. YouTube videos) from one seed but not from another
- When you want to block a host from being crawled in one seed but allow it in another
- When you want to include links to an external host discovered from one seed and not another
Use Collection level scoping rules...
- When you want to ignore robots.txt for a specific host universally across your collection
- When you want to limit the amount of data from a specific host across all crawls
- When you want to capture embedded content (ex. YouTube videos) from all seeds
- When you want to block a host from being crawled entirely in your collection