How seeds, documents, and collections work together

On this page:

Seeds
Documents
- Calendar Pages
Collections
Seeds and documents as access points

What is a seed?

A seed is an item with a unique identifier in the Archive-It backend. Some of a seed's associated data cannot be edited, such as the dates on which it was added or updated and its crawl history. Other data can be edited, such as Seed Level Metadata, notes, and even the seed URL.

Seed URLs point the crawler to content on the live web and, depending on how they are formatted, help inform the crawler of how much or how little of a site to collect.

Seed URLs can be:

an entire website (ex: http://www.whitehouse.gov/)

a specific directory of a website (ex: http://www.whitehouse.gov/issues/foreign-policy/)

a specific document (ex: http://www.whitehouse.gov/sites/default/files/rss_viewer/national_security_strategy.pdf)
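
The three kinds of seed URL above can be told apart by looking at the URL path. The sketch below is only an illustration of that idea: the function name classify_seed_url and its path-based heuristic are assumptions made for this example, not the rules the Archive-It crawler actually applies when it scopes a crawl.

```python
from urllib.parse import urlsplit


def classify_seed_url(seed_url: str) -> str:
    """Guess what a seed URL targets, using only its path (hypothetical heuristic)."""
    path = urlsplit(seed_url).path
    if path in ("", "/"):
        # No path beyond the host: the seed points at the whole site.
        return "entire website"
    if path.endswith("/"):
        # A path ending in a slash is treated here as a directory.
        return "directory"
    # Anything else is treated here as a single document.
    return "document"


examples = [
    "http://www.whitehouse.gov/",
    "http://www.whitehouse.gov/issues/foreign-policy/",
    "http://www.whitehouse.gov/sites/default/files/rss_viewer/national_security_strategy.pdf",
]
for url in examples:
    print(classify_seed_url(url), "<-", url)
```

The trailing slash is what separates a directory from a document in this sketch, so it is worth checking how a seed URL ends before adding it.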

Documents are any file with a unique URL

Webpages are usually made up of many individual archived documents. The seed URLs in your collection point the crawler to the documents you want to collect on the live web. Even the unique seed URL is considered a document.

Calendar Pages

Every archived document in your Archive-It collections has its own calendar page, which lists each date and time on which it was collected. When you click the Wayback link for a seed, you are directed to the calendar page for that specific URL/document.
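
Because each calendar page belongs to one URL within one collection, its address can be built from those two values. The sketch below assumes the pattern https://wayback.archive-it.org/<collection ID>/*/<document URL>; that pattern, the function name calendar_page_url, and the collection ID 1234 are assumptions for illustration, so confirm the address against a real Wayback link in your own account.

```python
def calendar_page_url(collection_id: int, document_url: str) -> str:
    """Build a calendar page address for one document (assumed URL pattern)."""
    # Assumed pattern: Wayback host, collection ID, "*" for all capture dates,
    # then the original document URL.
    return f"https://wayback.archive-it.org/{collection_id}/*/{document_url}"


# 1234 is a made-up collection ID used only for this example.
print(calendar_page_url(1234, "http://www.whitehouse.gov/"))
```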

Collections are made up of lots of individual documents

After you run a few production crawls, your collection will be populated with archived documents, which replay as archived websites when viewed in Wayback.

The seed ID from which a document was collected is recorded in its WARC file. Aside from that, archived documents are not directly connected to the seed record in Archive-It. This means you can delete seeds or edit seed URLs without any effect on your archived content.

Editing a seed URL can, however, change which archived page the seed points to in Wayback. Read more about how seeds act as access points below.
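
One way to picture this separation is as two independent sets of records: seeds, and archived documents that only carry a seed ID. The sketch below is a simplified model under that assumption. The class names Seed, ArchivedDocument, and Collection, their fields, and the sample timestamp are hypothetical and do not reflect how Archive-It stores data internally.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Seed:
    seed_id: int
    url: str


@dataclass
class ArchivedDocument:
    url: str
    captured_at: str
    seed_id: int  # recorded at crawl time, like the seed ID in a WARC file


@dataclass
class Collection:
    seeds: Dict[int, Seed] = field(default_factory=dict)
    documents: List[ArchivedDocument] = field(default_factory=list)

    def delete_seed(self, seed_id: int) -> None:
        # Only the seed record is removed; archived documents are left alone.
        self.seeds.pop(seed_id, None)


collection = Collection()
collection.seeds[1] = Seed(1, "https://mywebsite.com")
collection.documents.append(
    ArchivedDocument("https://mywebsite.com/aboutme", "2018-04-26T15:27:00Z", 1)
)

collection.delete_seed(1)
print(len(collection.seeds), len(collection.documents))
```

In this model, deleting the seed leaves the document in place, and the document still remembers the ID of the seed it was collected from.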

Seeds and Documents can be access points

Seed URLs can be used to provide direct access points to archived websites. Seed URLs don't necessarily need to have been crawled to function as an access point, as long as they point to a URL already in a collection. Individual archived documents can also be elevated to an access point using Document Metadata.

For example, if you crawled the seed URL https://mywebsite.com and the pages https://mywebsite.com/aboutme and https://mywebsite.com/mywork were collected, you could:

- Add a new seed with the URL https://mywebsite.com/aboutme that would automatically point users directly to that page in Wayback.

- Add document metadata to the document https://mywebsite.com/mywork and surface that page as an access point in your public collection.
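
The rule behind the first option is that a seed URL works as an access point only if it matches a URL already archived in the collection. The sketch below illustrates that check with the example URLs above. The function name can_act_as_access_point is hypothetical, and exact string matching is a simplification; it is not a description of how Wayback resolves URLs.

```python
from typing import Set


def can_act_as_access_point(seed_url: str, archived_urls: Set[str]) -> bool:
    """Return True if the seed URL matches a document already in the collection."""
    # Simplification: exact match only, with no URL normalization.
    return seed_url in archived_urls


# Documents collected by crawling the seed https://mywebsite.com
archived = {
    "https://mywebsite.com",
    "https://mywebsite.com/aboutme",
    "https://mywebsite.com/mywork",
}

print(can_act_as_access_point("https://mywebsite.com/aboutme", archived))
print(can_act_as_access_point("https://mywebsite.com/contact", archived))
```

Here https://mywebsite.com/contact is a made-up page that was never collected, so a seed with that URL would have nothing to point to.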
