The "seeds" that you select and how you format them will determine the content in your collection and the scope of your crawls. Seed selection is entirely up to your institution's collecting scope and policies.
When adding seeds to a collection, you can copy and paste large numbers of seeds at once from a text document, with one seed URL per line. In these cases, we recommend keeping a separate list of your seeds outside of the web application, such as in a text document or spreadsheet, for your own records. You may also add new seeds to an existing collection manually, one at a time, whenever you wish.
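If you keep your seed list in a text document, a short script can tidy it before you paste it in. The following is a minimal sketch, not part of the web application: the function name `prepare_seed_list` and the sample text are hypothetical, and it assumes only that seeds are listed one URL per line, as described above.

```python
def prepare_seed_list(text):
    """Return the seed URLs found in `text`, one per line.

    Blank lines and surrounding whitespace are dropped, and exact
    duplicates are removed while keeping the original order.
    """
    seen = set()
    seeds = []
    for line in text.splitlines():
        url = line.strip()
        if url and url not in seen:
            seen.add(url)
            seeds.append(url)
    return seeds


# Hypothetical contents of a locally kept seed list.
raw_text = """
http://www.whitehouse.gov/
  http://www.whitehouse.gov/issues/foreign-policy/

http://www.whitehouse.gov/
"""

# Print one seed per line, ready to copy and paste.
print("\n".join(prepare_seed_list(raw_text)))
```

The script compares URLs as plain text only, so two differently written URLs that lead to the same page are both kept.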
What exactly is a seed?
A "seed" is both a starting point for the crawlers and an access point to archived pages. A seed can be, for example:
- an entire website (example: http://www.whitehouse.gov/)
- a specific part (directory) of a website (example: http://www.whitehouse.gov/issues/foreign-policy/)
- a specific document (example: http://www.whitehouse.gov/sites/default/files/rss_viewer/national_security_strategy.pdf)
How to format your seeds
Generally, to format a seed properly, you may copy the URL of any website or web page that you wish to archive just as it appears in your web browser. There are, however, important principles to keep in mind before adding these URLs as seeds:
Do you need a / (slash) at the end of the URL? Because our crawler determines its default scope from the seed URL, omitting the ending / (slash) could result in archiving far more content than you intended.
Does your URL have a # (hashtag)? Anything that comes after a # (hashtag) in a seed URL will be ignored by our crawler, which could significantly change the scope of your crawl.
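The two questions above can be checked on your own computer before you add a seed. The following is a minimal sketch using Python's standard `urllib.parse` module. It does not reproduce our crawler's scoping logic: the function name `check_seed` is hypothetical, and the rule that a final path segment without a dot is probably a directory is an assumption made for illustration.

```python
from urllib.parse import urldefrag, urlsplit


def check_seed(url):
    """Return a list of warnings for a candidate seed URL (may be empty)."""
    warnings = []

    # Split off anything after "#", which the crawler will ignore.
    without_fragment, fragment = urldefrag(url)
    if fragment:
        warnings.append(
            "text after '#' will be ignored; the seed is effectively "
            + without_fragment
        )

    # Assumption: a last path segment with no dot (no file extension)
    # names a site or directory, so an ending "/" is probably intended.
    path = urlsplit(without_fragment).path
    last_segment = path.rsplit("/", 1)[-1]
    if not path.endswith("/") and "." not in last_segment:
        warnings.append("no ending '/'; the crawl may capture more than intended")

    return warnings


for seed in [
    "http://www.whitehouse.gov/issues/foreign-policy/",
    "http://www.whitehouse.gov/issues/foreign-policy",
    "http://www.whitehouse.gov/issues/foreign-policy/#top",
]:
    print(seed, check_seed(seed))
```

A warning here only flags a URL for a second look; whether the ending slash is needed still depends on the scope you intend.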
What to expect when crawling seeds
Your precise selection of seed URL(s) will determine how much of each website you choose will be archived. To understand how our crawler decides what content does or does not belong in your archives based upon these selections, consult our complete guidance on how our crawler determines scope. Whenever you select and crawl new seed URLs, we highly recommend that you review the post-crawl reports in order to evaluate how well your crawls matched your expectations. If necessary, you may always refine these crawls by modifying the default scope.
How to add new seeds to an existing collection
To add one or more new seeds to your collection, navigate to that collection's "Seeds" tab and click the Add Seeds button.
Enter your seed URL(s), then click the Add Seeds button, and the new seed(s) will appear in your collection.
Deleting versus deactivating seeds
It is possible to deactivate or delete seeds. Deleting a seed will not delete its archived data, but it will remove the seed as an access point from both the collection and archive-it.org. Deactivating a seed keeps it visible in the Seeds tab of the collection and, if it is set to be shared publicly, available on archive-it.org. However, a deactivated seed cannot be scheduled for a recurring crawl, and you will be alerted that the seed is inactive before you run a one-time crawl.