On this page:
- Default Crawl Scope
- Scoping specifics - Standard crawls
- Scoping specifics - Brozzler crawls
- How to scope in sub-domains
Default Crawl Scope - Standard and Brozzler
The default crawl scope for each seed in your collection is determined by how you format its URL and assign it a "seed type."
In general:
- Embedded content (images, stylesheets, JavaScript, etc.) on your seed site's pages will be archived, whether or not it is hosted on the same site as the seed.
- Linked content outside of your seed site will not be archived unless you expand your crawl's scope to include it.
You can limit or expand how much linked content from your seed site is archived by assigning it a Seed Type. For complete guidance on Seed Types and when to choose each, see: How to assign and edit a "seed type".
The "Standard" Seed Type describes how Archive-It crawlers operate by default, without any special rules: It directs the crawler to archive all URLs (up to 100 hops away from your seed URL) discovered as part of a given seed site. For example, if you want to archive all of https://archive.org, then you would enter https://archive.org/ as your seed URL and assign it the "Standard" seed type. All URLs then found to be part of https://archive.org will be considered "in scope," crawled, and archived. For instance:
- A link to https://archive.org/about.html IS in scope and would be archived.
- A link to a different site, such as http://ca.gov IS NOT in scope and would NOT be archived.
- An embedded image on an https://archive.org webpage, such as http://ala.org/logo.jpg, IS in scope and would be archived.
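One way to picture the default behavior is a simple host comparison: a discovered link is in scope when it lives on the same host as the seed. The sketch below is a minimal illustration of that idea, assuming a plain hostname match stands in for the crawler's real scoping logic; the in_default_scope helper is a made-up name and this is not Archive-It's implementation.

```python
from urllib.parse import urlparse

def in_default_scope(seed_url, candidate_url):
    # Simplified stand-in for the default "Standard" scope test:
    # a linked URL is in scope when it shares the seed's host.
    return urlparse(candidate_url).hostname == urlparse(seed_url).hostname

seed = "https://archive.org/"
print(in_default_scope(seed, "https://archive.org/about.html"))  # True: archived
print(in_default_scope(seed, "http://ca.gov/"))                  # False: not archived as a link
# Note: an image such as http://ala.org/logo.jpg also fails this host check,
# but it is still archived when it is *embedded* on an in-scope page.
```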
Scope rule order of precedence
Archive-It crawlers (both Brozzler and Standard) apply these logical steps to decide whether a URL that doesn't match the pattern of "default scope" described above is in or out of scope:
- If the number of hops from the seed URL is greater than 100 (ex. https://example.com/pg/150 in a series of pages), the URL is out of scope.
- Otherwise, if any block rule matches, the URL is out of scope.
- Otherwise, if any accept rule matches, the URL is in scope.
- Otherwise, if the seed type is One-Page+ or Standard+ and the URL is at most 1 hop from the last page that was in scope, the URL is in scope.
- Otherwise (no rules match), the URL is out of scope.
In cases of conflict, block rules take precedence over accept rules.
If a rule is specified both at the collection level and at the seed level, the results are merged. In cases of conflict, the seed-level value takes precedence.
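Put together, these checks can be summarized as a short decision function. The sketch below only restates the order of precedence described above; the rule objects, their matches() method, and the hop counters are hypothetical names, not Archive-It's actual code.

```python
def is_in_scope(url, hops_from_seed, block_rules, accept_rules,
                seed_type, hops_from_last_in_scope_page):
    if hops_from_seed > 100:
        return False   # too far from the seed URL
    if any(rule.matches(url) for rule in block_rules):
        return False   # block rules win over accept rules
    if any(rule.matches(url) for rule in accept_rules):
        return True
    if seed_type in ("One-Page+", "Standard+") and hops_from_last_in_scope_page <= 1:
        return True    # the "+" seed types allow one extra hop
    return False       # no rule matched
```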
Scoping specifics - Standard crawls
Seed Format
The Standard crawler uses the position of the final forward slash ('/') in a seed URL to help determine how much of the seed site is in scope, capturing only URLs that fall under the path that slash marks.
For example, you can tell the Standard crawler to capture only documents in a specific directory by using a seed like https://archive.org/about/:
- https://archive.org/webarchive.html IS NOT in scope; it is not part of the /about/ directory
- https://archive.org/about/bios.html IS in scope
If your seed is https://archive.org/about, the crawler will start on that page but will be able to access all other directories of the site:
- https://archive.org/webarchive.html IS in scope
- https://archive.org/about/bios.html IS in scope
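The effect of that final slash can be approximated with a prefix check. The helper below is a rough sketch under the assumption that everything up to and including the last '/' of the seed URL becomes the in-scope prefix; it ignores the HTTP/HTTPS equivalence described in the next section and is not the crawler's real logic.

```python
def standard_scope_prefix(seed_url):
    # Keep everything up to and including the final '/' of the seed URL.
    return seed_url[:seed_url.rfind("/") + 1]

def in_standard_scope(seed_url, candidate_url):
    return candidate_url.startswith(standard_scope_prefix(seed_url))

print(standard_scope_prefix("https://archive.org/about/"))  # https://archive.org/about/
print(standard_scope_prefix("https://archive.org/about"))   # https://archive.org/
print(in_standard_scope("https://archive.org/about/", "https://archive.org/webarchive.html"))  # False
print(in_standard_scope("https://archive.org/about",  "https://archive.org/webarchive.html"))  # True
```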
Protocols
When preparing to crawl a seed URL, the Standard crawler generates a SURT that puts both the HTTP and HTTPS versions of the URL in scope. You generally do not need to add any additional seeds or scoping rules to capture both versions.
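A rough way to picture this is that the SURT form reorders the host labels and drops the scheme distinction, so the HTTP and HTTPS versions of a URL reduce to the same prefix. The helper below is a hypothetical, heavily simplified rendering of that idea; real SURTs involve more normalization than shown here.

```python
from urllib.parse import urlparse

def simplified_surt(url):
    # Reverse the host labels and ignore the scheme, so the http:// and
    # https:// versions of a URL collapse into the same form.
    parts = urlparse(url)
    host = ",".join(reversed(parts.hostname.split(".")))
    return f"({host},){parts.path}"

print(simplified_surt("http://archive.org/about/"))   # (org,archive,)/about/
print(simplified_surt("https://archive.org/about/"))  # (org,archive,)/about/  (same prefix)
```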
Scoping specifics - Brozzler crawls
Seed Format
Brozzler scopes in any URL that begins with your seed URL.
For example, if your seed is https://archive.org/about:
- https://archive.org/webarchive.html IS NOT in scope; it does not begin with https://archive.org/about
- https://archive.org/about/bios.html IS in scope
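In other words, Brozzler's default test behaves roughly like a string prefix check against the seed URL. The sketch below is illustrative only (the in_brozzler_scope name is made up and embedded resources are ignored), not Brozzler's actual code.

```python
def in_brozzler_scope(seed_url, candidate_url):
    # A linked URL is in scope when it begins with the seed URL.
    return candidate_url.startswith(seed_url)

seed = "https://archive.org/about"
print(in_brozzler_scope(seed, "https://archive.org/webarchive.html"))  # False
print(in_brozzler_scope(seed, "https://archive.org/about/bios.html"))  # True
```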
If your seed URL ends with an extension like index.html, you will need to either remove that extension from your seed or add Seed Level expand scope rules to allow the crawler to access the rest of the site.
For example, if your seed is https://example.com/index.html you could edit your seed URL to read https://example.com/. If removing that extension results in an invalid URL, you might consider adding an expand scope rule at the seed level for URLs that contain https://example.com/.
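If you go the expand-scope route, a rule on URLs that contain https://example.com/ behaves roughly like a substring test. The check below is a hypothetical stand-in for that seed-level rule, shown only to illustrate the effect.

```python
def matches_contains_rule(url, needle="https://example.com/"):
    # Hypothetical stand-in for an expand scope rule that accepts
    # URLs containing the given string.
    return needle in url

print(matches_contains_rule("https://example.com/news/2024.html"))   # True
print(matches_contains_rule("https://other.example.net/page.html"))  # False
```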
Protocols
By default, Brozzler scopes in only the protocol (HTTP or HTTPS) of the seed URL. If your seed URL starts with HTTP, documents beginning with HTTPS will be out of scope and vice versa. If, after running a test crawl, you find that documents using both protocols are necessary for a complete capture, consider adding and crawling seed URLs with both (ex. https://example.com and http://example.com).
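Using the same simplified prefix check from the Seed Format sketch above, a protocol mismatch alone is enough to put a document out of scope. This remains an illustration, not Brozzler's real logic.

```python
def in_brozzler_scope(seed_url, candidate_url):
    # Same illustrative prefix check as in the Seed Format sketch above.
    return candidate_url.startswith(seed_url)

seed = "https://example.com/"
print(in_brozzler_scope(seed, "http://example.com/page.html"))   # False: protocol differs
print(in_brozzler_scope(seed, "https://example.com/page.html"))  # True
```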
How to scope in sub-domains
Sub-domains are divisions of a larger site, named to the left of the main domain name (for example, crawler in https://crawler.archive.org/ is a sub-domain of archive.org).
Sub-domains of seed URLs are NOT included in the scope of your crawl by default.
The www prefix is considered a sub-domain
All Archive-It crawling technology treats the www prefix as a sub-domain. Depending on how your seed site is constructed, including or omitting it in your seed URL can have an effect on your crawl's scope.
For example, if your seed is https://archive.org/about/:
- https://www.archive.org/about/bios.html may not be considered in scope
If you find that content from the www or non-www version of your seed's host is being identified as Out of Scope in your post-crawl report, the suggestions for scoping in sub-domains below can help.
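To see why the www prefix matters, compare the hostnames directly: a host- or prefix-based scope check treats www.archive.org and archive.org as different hosts. The comparison below is illustrative only.

```python
from urllib.parse import urlparse

seed      = "https://archive.org/about/"
candidate = "https://www.archive.org/about/bios.html"

print(urlparse(seed).hostname)       # archive.org
print(urlparse(candidate).hostname)  # www.archive.org
# Different hosts, so the candidate may be reported as out of scope.
print(urlparse(seed).hostname == urlparse(candidate).hostname)  # False
```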
With the Standard Crawler
- Format your seed URL so that only the main domain is listed and DO NOT include a www prefix or an ending slash. For example, listing http://ca.gov as your seed URL would allow you to archive all linked sub-domains of ca.gov. Be aware that when adding this seed to your collection, our web application may suggest that you add a slash (/) to the end of your seed. Do not follow this suggestion; keep the seed without the ending slash if you wish all sub-domains to be archived.
- Use an Expand Scope SURT rule that tells the crawler to capture all sub-domains of a given site. For example, if your seed is https://archive.org/, your Expand Scope SURT would be https://(org,archive, (see the sketch below this list).
- Add individual seed URLs for each sub-domain if there are only a handful of sub-domains that you wish to crawl and archive. For example, if you only want to capture content from http://ca.gov and http://governor.ca.gov, then add each to your collection as its own seed URL.
Always run a test crawl on any seeds set up to capture sub-domains to make sure that your scoping settings are capturing what you intend. It is very easy for more content than you intended to be discovered when using these features.
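To see why a SURT prefix that stops after the host labels pulls in sub-domains, note that in SURT form the sub-domain labels come after the registered domain, so every sub-domain shares the same prefix. The check below reuses the heavily simplified simplified_surt() rendering from the Protocols sketch above and is illustrative only; the real rule is the https://(org,archive, form shown in the list.

```python
from urllib.parse import urlparse

def simplified_surt(url):
    # Same heavily simplified SURT rendering as in the Protocols sketch above.
    parts = urlparse(url)
    return "(" + ",".join(reversed(parts.hostname.split("."))) + ",)" + parts.path

expand_prefix = "(org,archive,"  # simplified stand-in for https://(org,archive,

for url in ("https://archive.org/about/",
            "https://crawler.archive.org/index.html",
            "https://www.archive.org/about/bios.html"):
    print(url, simplified_surt(url).startswith(expand_prefix))
# All three print True: every sub-domain of archive.org shares the (org,archive, prefix.
```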
With Brozzler
- Use an Expand Scope SURT rule that tells the crawler to capture all sub-domains of a given site. For example, if your seed is https://archive.org/, your Expand Scope SURT would be https://(org,archive, (the sketch above shows how this prefix matches sub-domains).
- Add individual seed URLs for each sub-domain if there are only a handful of sub-domains that you wish to crawl and archive. For example, if you only want to capture content from http://ca.gov and http://governor.ca.gov, then add each to your collection as its own seed URL.