The seed URLs that you select and how you format them determine the scope of your crawls and the content in your collection. Seed selection depends entirely on your institution's collecting scope and policies.
When adding seeds to a collection, you can copy and paste large numbers of URLs at once from a text document, one URL per line. In these cases, we recommend keeping a separate list of your seed URLs outside of the web application, such as in a text document or spreadsheet, for your own records. You can also add new seeds to an existing collection manually, one at a time, whenever you like.
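If you keep your seed list in a text document, a short script can tidy it before you paste it in. The following is a minimal sketch, not part of the Archive-It web application: the function name `parse_seed_list` is hypothetical, and it assumes only the one-URL-per-line layout described above.

```python
def parse_seed_list(text):
    """Split pasted text into seed URLs: one per line, blanks and repeats dropped."""
    # Trim surrounding whitespace so stray spaces do not create "different" URLs.
    lines = (line.strip() for line in text.splitlines())
    # dict.fromkeys keeps the first occurrence of each URL and preserves order.
    return list(dict.fromkeys(line for line in lines if line))


if __name__ == "__main__":
    pasted = """
    http://www.whitehouse.gov/
    http://www.whitehouse.gov/issues/foreign-policy/

    http://www.whitehouse.gov/
    """
    for url in parse_seed_list(pasted):
        print(url)
```

The cleaned list can then be saved alongside your own records and pasted into the web application.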
What exactly is a seed?
A seed is an item with a unique identifier in the Archive-It backend. A seed has associated data that you cannot edit, such as the dates on which it was added or updated and its crawl history. Seeds also have data that can be edited, such as Seed Level Metadata, notes, and even the seed URL.
A seed URL is both a starting point for the crawlers and an access point to archived pages. A seed URL can be, for example:
- an entire website (example: http://www.whitehouse.gov/)
- a specific part (directory) of a website (example: http://www.whitehouse.gov/issues/foreign-policy/)
- a specific document (example: http://www.whitehouse.gov/sites/default/files/rss_viewer/national_security_strategy.pdf)
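To see which of these three kinds a given URL resembles, you can inspect its path. The sketch below is a rough heuristic of our own for illustration only; `seed_kind` is a hypothetical helper, and it does not reproduce how the crawler itself interprets a seed.

```python
from urllib.parse import urlsplit


def seed_kind(url):
    """Guess whether a seed URL looks like a site, a directory, or a document."""
    path = urlsplit(url).path
    if path in ("", "/"):
        return "entire website"
    if path.endswith("/"):
        return "directory"
    # Assumption: a final path segment containing a dot is a file name.
    last_segment = path.rsplit("/", 1)[-1]
    return "document" if "." in last_segment else "unclear"


if __name__ == "__main__":
    for url in (
        "http://www.whitehouse.gov/",
        "http://www.whitehouse.gov/issues/foreign-policy/",
        "http://www.whitehouse.gov/sites/default/files/rss_viewer/national_security_strategy.pdf",
    ):
        print(seed_kind(url), url)
```

A result of "unclear" (no ending slash and no file extension) is a cue to check the URL's formatting before adding it as a seed.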
How to format your seed URLs
Generally, to format a seed URL properly, you can copy the URL of any website or web page that you wish to archive just as it appears in your web browser. There are, however, important principles to remember before adding these URLs as seeds:
Do you need a / (slash) at the end of the URL? Because our crawler determines its default scope from the seed URL, omitting the ending / (slash) could result in archiving far more content than you intended.
Does your URL have a # (hash sign)? Anything that comes after a # in a seed URL will be ignored by our crawler, which could significantly change the scope of your crawl.
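Both questions can be checked before you add a seed. The sketch below is one possible pre-flight check using Python's standard `urllib.parse` module; `review_seed_url` is a hypothetical name, and the trailing-slash test is a simple heuristic of ours (it assumes a final path segment with a dot is a document), not the crawler's actual scoping logic.

```python
from urllib.parse import urldefrag, urlsplit


def review_seed_url(url):
    """Return (effective_url, warnings) for a candidate seed URL."""
    warnings = []
    # Everything after '#' is dropped, mirroring what the crawler ignores.
    effective_url, fragment = urldefrag(url.strip())
    if fragment:
        warnings.append(f"text after '#' is ignored: {fragment!r}")
    path = urlsplit(effective_url).path
    last_segment = path.rsplit("/", 1)[-1]
    # Heuristic: no ending slash and no file extension suggests a directory
    # URL that was copied without its trailing '/'.
    if not path.endswith("/") and "." not in last_segment:
        warnings.append("no ending '/': default scope may be broader than intended")
    return effective_url, warnings


if __name__ == "__main__":
    for candidate in (
        "http://www.whitehouse.gov/issues/foreign-policy",
        "http://www.whitehouse.gov/issues/foreign-policy/#section",
    ):
        print(review_seed_url(candidate))
```

A warning does not mean the URL is wrong; it only flags a seed whose crawl scope is worth double-checking.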
What to expect when crawling seeds
Your precise selection of seed URL(s) will determine how much of each website you choose will be archived. To understand how our crawler decides what content does or does not belong in your archives based upon these selections, consult our complete guidance on how our crawler determines scope. Whenever you select and crawl new seed URLs, we highly recommend that you review post-crawl reports to evaluate how well your crawls matched your expectations. If necessary, you can always refine these crawls by modifying the default scope.
How to add new seeds to an existing collection
To add one or more new seeds to your collection, navigate to that collection's "Seeds" tab and click the Add Seeds button.
Click the Add Seeds button and the new seed(s) will appear in your collection.
Deleting versus deactivating seeds
Deleting a seed will not delete any archived data/Wayback captures. Deleting a seed will, however, remove all associated data, including crawl history and notes, as well as any metadata. Deleted seeds will no longer appear in your collection's seed list or as an access point on Archive-It.org.
Deactivating a seed will exclude it from future scheduled and one-time crawls. Inactive seeds will continue to be listed in the Seeds tab of your collection and, if set to Public, continue to act as an access point to Wayback captures.
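The difference between the two actions can be pictured with a toy model. This is purely illustrative: the `Seed` and `Collection` classes and all of their fields are hypothetical and say nothing about how the Archive-It backend is actually built. The only behavior modeled is what is described above.

```python
from dataclasses import dataclass, field


@dataclass
class Seed:
    url: str
    active: bool = True
    public: bool = True
    notes: str = ""
    crawl_history: list = field(default_factory=list)


class Collection:
    def __init__(self):
        self.seeds = {}     # seed URL -> Seed record (notes, history, settings)
        self.captures = {}  # seed URL -> archived capture dates

    def add_seed(self, url):
        self.seeds[url] = Seed(url)

    def record_crawl(self, url, date):
        self.seeds[url].crawl_history.append(date)
        self.captures.setdefault(url, []).append(date)

    def deactivate(self, url):
        # The seed stays listed; it is only skipped by future crawls.
        self.seeds[url].active = False

    def delete(self, url):
        # The seed record (history, notes) is removed; captures are untouched.
        del self.seeds[url]

    def crawl_queue(self):
        return [s.url for s in self.seeds.values() if s.active]

    def access_points(self):
        return [s.url for s in self.seeds.values() if s.public]
```

In this model, a deactivated seed drops out of `crawl_queue()` but remains in `access_points()`, while a deleted seed disappears from both even though its entries in `captures` remain.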