Seeds are the URLs where the crawler starts. For example, http://www.archive.org is a seed. A host is the server address where web content is stored. For example, www.archive.org is the host that stores the majority of the content on http://www.archive.org. Embedded content on a page, such as images, CSS styling, JavaScript, and social media widgets, may be stored on hosts other than www.archive.org. Embedded content is in scope by default, so a single seed may have content and URLs captured from one or many hosts. Note that a host is written without a preceding "http://".
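To make the distinction concrete, here is a minimal Python sketch that derives the host from a seed URL and shows how one seed can lead to several hosts. It uses only the standard library and is an illustration, not how the crawler itself works. The helper name `host_of` and the embedded URLs on example.com and example.net are hypothetical.

```python
from urllib.parse import urlsplit


def host_of(url: str) -> str:
    """Return the host part of a URL, without the scheme, port, or path."""
    # hostname is None when the string has no network location
    return urlsplit(url).hostname or ""


# The seed is a full URL, including "http://".
seed = "http://www.archive.org"

# Hypothetical embedded content found on the seed's pages (assumed for illustration).
embedded_urls = [
    "http://www.archive.org/images/logo.png",
    "http://cdn.example.com/styles/site.css",
    "http://widgets.example.net/share.js",
]

# One seed, several distinct hosts.
hosts = sorted({host_of(u) for u in [seed, *embedded_urls]})

print(host_of(seed))  # www.archive.org
print(hosts)          # ['cdn.example.com', 'widgets.example.net', 'www.archive.org']
```

In this sketch the single seed yields three hosts, which is the kind of result a Hosts report may show when embedded content lives on other servers.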