What is the difference between a seed and a host?

A seed is a starting-point URL for the crawler, for example http://www.archive.org. A host is the server where web content is stored. For example, www.archive.org is the host that stores the majority of the content on http://www.archive.org. Embedded content on a page, such as images, styling information (CSS), JavaScript, social media widgets, etc., may be stored on hosts other than www.archive.org. Embedded content is in scope by default, so a single seed may have content and URLs captured from one or many hosts. Note that a host is written without a preceding "http://".
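
The distinction can be illustrated with a short Python sketch. It is only an illustration, not part of the crawler: the function name `host_of` and the embedded URLs on example.com and example.net are hypothetical, made up to show how one seed can lead to content on several hosts.

```python
from urllib.parse import urlsplit


def host_of(url: str) -> str:
    """Return the host of a URL: no "http://", no port, no path."""
    return urlsplit(url).hostname or ""


# The seed is a full URL, including the scheme.
seed = "http://www.archive.org"

# Hypothetical URLs of content embedded on the seed page (assumed for illustration).
embedded_urls = [
    "http://www.archive.org/images/logo.png",
    "http://cdn.example.com/styles/site.css",
    "http://widgets.example.net/share.js",
]

# The host of the seed itself.
seed_host = host_of(seed)

# Every distinct host that content for this one seed would come from.
hosts = sorted({host_of(u) for u in [seed, *embedded_urls]})

print("Seed:", seed)
print("Seed host:", seed_host)
print("Hosts captured for this seed:", hosts)
```

Here one seed yields three hosts: the seed's own host plus two others that serve embedded content.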
