On this page:
- Default Crawling Scope
- How to scope specific directories
- How to scope specific sub-domains
- Further Information
Default Crawling Scope
The default crawling scope for each seed in your collection is determined as you add it to the collection by how you format its URL and assign it a "seed type."
- Linked content outside of your seed site will not be captured, unless specified.
You may limit or expand how much linked content from your seed site and/or external sites is archived by assigning it a specially designed seed type. For complete guidance on these types and when to choose each, see: How to assign and edit a "seed type".
The "Standard" seed type describes how our crawler operates normally, without any special rules: It directs the crawler to archive all URLs discovered as part of a given seed site. For example, if you want to archive all of https://archive.org, then you would enter https://archive.org/ as your seed URL and assign it the "Standard" seed type. All URLs then found to be part of https://archive.org will be considered "in scope," crawled, and archived. For instance:
- A link to https://archive.org/about.html IS in scope and would be archived.
- A link to a different site, such as http://ca.gov IS NOT in scope and would NOT be archived.
- An embedded image on an https://archive.org webpage, such as http://ala.org/logo.jpg, IS in scope and would be archived.
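The three cases above can be sketched in a few lines of Python. This is only an illustration of the rules as described on this page, not the crawler's actual implementation; the function name in_standard_scope and the embedded flag are hypothetical, and the sketch ignores any additional scope rules you may have set.

```python
from urllib.parse import urlsplit


def in_standard_scope(seed_url, found_url, embedded=False):
    """Rough model of the "Standard" seed type as described above.

    Hypothetical helper: embedded=True marks content (such as an image)
    that a seed-site page loads, as opposed to a link to another page.
    """
    if embedded:
        # Embedded content on a seed-site page is in scope even when
        # it is hosted on a different site.
        return True
    seed_host = urlsplit(seed_url).hostname
    found_host = urlsplit(found_url).hostname
    # Ordinary links are in scope only when they stay on the seed's host.
    return found_host == seed_host


seed = "https://archive.org/"
print(in_standard_scope(seed, "https://archive.org/about.html"))              # True
print(in_standard_scope(seed, "http://ca.gov"))                               # False
print(in_standard_scope(seed, "http://ala.org/logo.jpg", embedded=True))      # True
```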
Regardless of the seed type you choose, you may further refine your seed's or your entire collection's crawling scope by setting rules that direct our crawler to archive more or less content from specific host domains, of specific file types, according to specific patterns in their URLs, etc. For more information on how to expand or limit your scope, see: How to modify your collection's crawl scope.
Special Note: Using an ending slash ( / ) in your seed URL:
It is often (but not always) a good idea to end your seed URL with an ending slash ( / ). When adding seeds to your collection, our web application may in fact suggest that you add one. Here are a few simple rules of thumb to follow when deciding whether or not to use an ending slash with your seed URL:
Use an ending slash:
- When your seed is a path to a specific directory, such as https://www.archive.org/about/. More information about scoping seeds to specific directories can be found below.
- When your seed URL on the live web redirects to a URL that contains an ending slash. For example, if you enter https://archive.org into a browser and the browser loads https://archive.org/, then you should include the ending slash. (It is generally good practice to copy your seed directly from your web browser's address bar.)
- To put an entire site or host into scope, for example: https://archive.org/
DO NOT use an ending slash:
- If your seed is a single specific page within a larger site. For example, if your seed is https://archive-it.org/about/webarchiving.html, then adding an ending slash would result in an invalidly formed URL.
- If you want all sub-domains (e.g. crawler.archive.org or blog.archive.org) of your seed URL to be in scope. In these cases, see the specific directions provided below for crawling sub-domains.
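As a rough aid, the rules of thumb above can be written as a small Python check. This is a sketch under stated assumptions: the function name should_end_with_slash is hypothetical, the test for "a single specific page" (a dot in the last path segment, as in webarchiving.html) is only a heuristic, and the redirect rule is not covered because it requires loading the URL in a browser.

```python
from urllib.parse import urlsplit


def should_end_with_slash(seed_url, want_all_subdomains=False):
    """Heuristic version of the ending-slash rules of thumb above."""
    if want_all_subdomains:
        # Seeds meant to bring all sub-domains into scope omit the slash.
        return False
    path = urlsplit(seed_url).path
    last_segment = path.rstrip("/").rsplit("/", 1)[-1]
    if "." in last_segment:
        # Assumption: a dot in the last segment means a single page/file.
        return False
    # Otherwise treat the seed as a directory or an entire site.
    return True


print(should_end_with_slash("https://www.archive.org/about"))                    # True
print(should_end_with_slash("https://archive.org"))                              # True
print(should_end_with_slash("https://archive-it.org/about/webarchiving.html"))   # False
print(should_end_with_slash("http://ca.gov", want_all_subdomains=True))          # False
```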
How to scope specific directories
You can direct the crawler to archive only specific directories on a site by listing the desired directory as your seed URL, followed by a slash ('/').
For example, if your seed is https://archive.org/about/:
- https://archive.org/webarchive.html IS NOT in scope; it is not part of the /about/ directory.
- https://archive.org/about/bios.html IS in scope.
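One way to picture directory scoping is as a prefix match on the URL path. The following Python sketch illustrates the two examples above; it is not the crawler's own logic, and the function name in_directory_scope is hypothetical.

```python
from urllib.parse import urlsplit


def in_directory_scope(seed_url, found_url):
    """Treat the seed's path (ending in a slash) as a required prefix."""
    seed = urlsplit(seed_url)
    found = urlsplit(found_url)
    # Same host, and the discovered path must sit under the seed directory.
    return found.hostname == seed.hostname and found.path.startswith(seed.path)


seed = "https://archive.org/about/"
print(in_directory_scope(seed, "https://archive.org/webarchive.html"))   # False
print(in_directory_scope(seed, "https://archive.org/about/bios.html"))   # True
```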
How to scope specific sub-domains
Sub-domains are divisions of a larger site named to the left of the host name (e.g. crawler in crawler.archive.org). In that example, crawler.archive.org is a sub-domain of archive.org.
Sub-domains of seed URLs are NOT included in the scope of your crawl by default.
For example, if your seed URL is http://ca.gov/, then a link to one of its sub-domains is NOT in scope, and would need to be explicitly added to the scope of your crawl using one of the options listed below:
- Use Expand Scope rules as described in our guidance on expanding crawl scope.
- Format your seed URL so that only the main domain is listed, and DO NOT include a www or an ending slash. For example, listing http://ca.gov as your seed URL would allow you to archive all linked sub-domains of ca.gov. Be aware that when adding this seed to your collection, our web application may suggest that you add a slash (/) to the end of your seed. Do not follow this suggestion; keep the seed without the ending slash if you wish all sub-domains to be archived.
- Add individual seed URLs for each subdomain if there are only a handful of sub-domains that you wish to crawl and archive. For example, if you only want to capture content from http://ca.gov and http://governor.ca.gov, then add each to your collection as its own seed URL.
As always, we strongly recommend running a test crawl in order to make sure that your scoping settings are capturing what you intend, as it is very easy for more content than you intended to be discovered when using these features.
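When reviewing the results of a test crawl, it can help to check which discovered hosts are sub-domains of your seed's domain. The Python sketch below shows one way to do that; the function name is_same_or_subdomain is hypothetical, and the check reflects only the definition of a sub-domain given above, not how the crawler itself decides scope.

```python
from urllib.parse import urlsplit


def is_same_or_subdomain(seed_url, found_url):
    """True if found_url is on the seed's host or on one of its sub-domains."""
    seed_host = urlsplit(seed_url).hostname or ""
    found_host = urlsplit(found_url).hostname or ""
    # The leading dot keeps unrelated hosts such as notca.gov from matching.
    return found_host == seed_host or found_host.endswith("." + seed_host)


print(is_same_or_subdomain("http://ca.gov", "http://governor.ca.gov"))   # True
print(is_same_or_subdomain("http://ca.gov", "http://ca.gov/"))           # True
print(is_same_or_subdomain("http://ca.gov", "http://notca.gov/"))        # False
```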
The www prefix is considered a subdomain
The Archive-It crawling technology treats the www prefix as a subdomain. Depending on how your seed site is constructed, including or omitting it in your seed URL can have an effect on your crawl’s scope.
For example, if your seed is https://archive.org/about/:
- https://www.archive.org/about/bios.html may not be considered in scope
If you find that content from the www or non-www version of your seed’s host is being identified as Out of Scope in your post-crawl report, using the suggestions for scoping in sub-domains listed above can help.
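To spot this situation quickly, you could compare out-of-scope URLs against your seed and flag the ones whose host differs only by the www prefix. The Python sketch below assumes you have those URLs as plain strings; the helper names strip_www and differs_only_by_www are hypothetical and are not part of the Archive-It web application.

```python
from urllib.parse import urlsplit


def strip_www(host):
    """Drop a leading "www." from a host name, if present."""
    return host[4:] if host.startswith("www.") else host


def differs_only_by_www(seed_url, other_url):
    """True when the two hosts are different but match once www is removed."""
    seed_host = urlsplit(seed_url).hostname or ""
    other_host = urlsplit(other_url).hostname or ""
    return seed_host != other_host and strip_www(seed_host) == strip_www(other_host)


seed = "https://archive.org/about/"
print(differs_only_by_www(seed, "https://www.archive.org/about/bios.html"))   # True
print(differs_only_by_www(seed, "https://archive.org/about/bios.html"))       # False
print(differs_only_by_www(seed, "https://blog.archive.org/"))                 # False
```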
Whenever necessary, you may modify your crawl scope from the default behaviors described above to more specifically limit or expand what is archived from a site.