We are excited to help you set up and understand your trial account! Here is some introductory information to help you get started:
On this page:
- Basic Terminology
- Account Structure
- How the crawler knows what to capture
- What an archived page can and cannot do
Collection: A group of seed URLs with a similar theme, topic, or domain
Seed: The starting point URL for the crawler, and the access point for users of public collections
Document: Any file with a unique URL (HTML, image, PDF, video, etc.). Seeds are made up of multiple documents
Crawler: Software that visits websites and indexes the information on those sites
Crawl: The action that the crawler takes to go out and archive your seeds
Archive-It accounts are made up of Collections of Seed URLs. Those seeds are the starting point for the Crawler and the main access points for users of public collections. A seed site is made up of a number of individual URLs, called Documents.
We recommend browsing through some of the public collections that current users have created in order to get an idea of how collections can be structured. You can explore them all at: https://archive-it.org
How the crawler knows what to capture
For this trial, you are able to capture up to 5 Seed URLs. The crawler will start with your seed URLs and follow links within each seed site in order to archive pages. The way you format your seeds will help determine the scope of your crawl.
If, for instance, your seed URL is a site's home page, such as www.example.com, then the crawler will archive all accessible content from that host, including its subpages and directories. Any embedded content required for these pages to render properly (images, stylesheets, etc.) will be archived. Links to other sites will not (by default) be archived. *If you want linked URLs archived, then they will need to be added as additional seed URLs.
If you only want to archive a specific directory of a site, you can format your seed with a / at the end to let the crawler know to access only content within that directory. For example, to crawl only the "About" directory of a site, you would format your seed like this: www.example.com/about/.
Sub-domains, divisions of a larger site named to the left of the host name (as in blog.example.com), are not included by default. If there are sub-domains that you are interested in capturing, consider including some of them as seeds in your trial collection.
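To make the scoping rules above concrete, here is a minimal sketch in Python of how a crawler might decide whether a URL falls within a seed's default scope. The function, its logic, and the example.com URLs are illustrative assumptions, not Archive-It's actual implementation:

```python
from urllib.parse import urlparse

def in_scope(seed: str, url: str) -> bool:
    """Illustrative check: is `url` within the default crawl scope of `seed`?

    A simplified sketch, not Archive-It's actual scoping logic.
    """
    seed_parts, url_parts = urlparse(seed), urlparse(url)
    # A different host, including a sub-domain, is out of scope by default.
    if url_parts.netloc != seed_parts.netloc:
        return False
    # A seed ending in "/" limits the crawl to that directory.
    if seed_parts.path.endswith("/"):
        return url_parts.path.startswith(seed_parts.path)
    return True

# A seed scoped to one directory stays inside it:
in_scope("http://www.example.com/about/", "http://www.example.com/about/staff.html")  # True
in_scope("http://www.example.com/about/", "http://www.example.com/news.html")         # False
# Sub-domains are not included by default:
in_scope("http://www.example.com/", "http://blog.example.com/post1.html")             # False
```

In this sketch, adding blog.example.com as its own seed is what would bring the sub-domain into scope, mirroring the recommendation above.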
What an archived page can and cannot do
Archived pages should function in the same ways that they did on the live web at the time they were archived. However, some websites, or types of content on them, are designed in such a way that they are not easily archivable. The primary exceptions are:
Robots.txt exclusions
By default, our crawler respects robots.txt exclusion files, which webmasters can use to block crawlers like ours from accessing their websites. However, Archive-It offers the option to ignore robots.txt on a host-by-host basis. If you know that your trial sites are blocked by robots.txt, we can apply our Ignore Robots.txt feature to the blocked hosts so that they may be archived. Some larger sites, like Facebook and YouTube, use robots.txt blocks on their pages; in trial crawls, we will usually add Ignore Robots.txt rules for these hosts automatically. *Please let us know beforehand if you would like to respect the robots.txt blocks for these sites.
To learn more about robots.txt you can read our Robots Exclusion Protocol help documentation.
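To see the exclusion mechanism in action, here is a short sketch using Python's standard-library robots.txt parser. The robots.txt content and the example.com URLs are made-up illustrations, not taken from any real site:

```python
from urllib.robotparser import RobotFileParser

# A generic robots.txt blocking all crawlers from one directory
# (an illustrative example, not from any particular site).
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# A crawler that respects this file skips everything under /private/:
parser.can_fetch("MyCrawler", "http://www.example.com/index.html")      # True
parser.can_fetch("MyCrawler", "http://www.example.com/private/a.html")  # False
```

Ignoring robots.txt, as in the Ignore Robots.txt feature described above, simply means skipping this check before fetching a URL.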
Forms and search boxes
The functionality of forms and search boxes will not transfer to an archived page; however, the database content behind them is almost always accessible from another spot in the site, such as a site map or other direct links. Subscribing partners' public collections are full-text searchable from the public access page on archive-it.org.
Videos
Videos typically capture and play back well; however, some proprietary formats are more difficult to capture and play back, and others require additional rules to be in place before we run the crawl. YouTube videos, for example, usually capture well as embedded videos or from their respective watch or channel pages, but require additional preparation before a crawl is run. Vimeo videos are a bit more complicated: while they can be captured, they have additional requirements for capture and playback. *Please let us know if your sites have embedded videos.
If you have questions about the above concepts or anything else during your trial, please feel free to reach out directly to Archive-It's Web Archivists.