In Archive-It, scope refers to the links and embedded elements that are collected during a crawl. This article explains what a crawl using the Standard seed type can be expected to collect, how scope rules are applied, and how scope differs between crawlers.
On this page:
- Default Crawl Scope
- Scope Specifics - Standard crawls
- Scope Specifics - Brozzler crawls
- How to scope in sub-domains
Default Crawl Scope - Standard and Brozzler
The format of a seed URL and its "seed type" determine the default scope of a crawl.
In general:
- The crawler will collect all embedded elements (images, stylesheets, JavaScript, etc.) on your seed site's pages.
- The crawler will not follow links to other sites, unless specified.
For complete guidance on Seed Types and when to choose each, see: How to assign and edit a "seed type".
The Standard Seed Type describes how Archive-It crawlers operate by default. It directs the crawler to archive all URLs (up to 100 hops away from your seed URL) that match your seed URL.
To archive all pages on https://archive.org, you would use https://archive.org/ as your seed URL and assign it the "Standard" seed type. Then, all URLs that start with https://archive.org are "in scope". For instance:
- A link to https://archive.org/about.html IS in scope.
- A link to a different site, such as http://ca.gov IS NOT in scope.
- An embedded image on an https://archive.org webpage, like http://ala.org/logo.jpg, IS in scope.
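The three examples above can be summarized in a minimal Python sketch. The function name `in_default_scope` and the `embedded` flag are hypothetical, chosen only for illustration; this is a simplified model of the behavior described here, not Archive-It's actual implementation.

```python
def in_default_scope(url: str, seed: str, embedded: bool = False) -> bool:
    """Simplified model of default scope for a Standard seed (illustrative only)."""
    # Embedded elements (images, stylesheets, JavaScript) found on in-scope
    # pages are collected even when they are hosted on another site.
    if embedded:
        return True
    # Links are in scope when they start with the seed URL.
    return url == seed.rstrip("/") or url.startswith(seed)


seed = "https://archive.org/"
print(in_default_scope("https://archive.org/about.html", seed))            # True
print(in_default_scope("http://ca.gov", seed))                             # False
print(in_default_scope("http://ala.org/logo.jpg", seed, embedded=True))    # True
```

The sketch ignores the 100-hop limit and any added scope rules, both of which are covered in the next section.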
Scope rule order of precedence
You can also expand or limit what is collected by default by adding scope rules. Archive-It crawlers (both Brozzler and Standard) apply these logical steps to decide whether a URL that doesn't match the pattern of "default scope" described above is in or out of scope:
- If the number of hops from the seed URL is greater than 100 (ex. https://example.com/pg/101 in a series of pages), the URL is out of scope.
- Otherwise, if any block rule matches, the URL is out of scope.
- Otherwise, if any accept rule matches, the URL is in scope*.
- Otherwise, if the seed type is One-Page+ or Standard+ and the URL is at most 1 hop from the last page that was in scope, the URL is in scope.
- Otherwise (no rules match), the URL is out of scope.
In cases of conflict, block rules take precedence over accept rules.
If a rule is specified both at the collection level and at the seed level, the results are merged. In cases of conflict, the seed-level value takes precedence.
*Brozzler gives precedence to Seed Type over accept rules, which means it will not follow accept rules off of One-Page or One-Page+ seeds.
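The order of precedence described in this section can be sketched as a single decision function. This is only an illustration under stated assumptions: the names `decide_scope`, `hops_from_seed`, and `hops_off_scope` are hypothetical, rules are modeled as plain Python predicates rather than Archive-It's real rule types, and the Brozzler exception is interpreted as simply skipping accept rules for One-Page and One-Page+ seeds. The function applies only to URLs that do not already match default scope.

```python
from typing import Callable, Iterable

Rule = Callable[[str], bool]  # hypothetical: a rule is any predicate on a URL
MAX_HOPS = 100


def decide_scope(
    url: str,
    hops_from_seed: int,
    hops_off_scope: int,  # hops from the last page that was in scope
    block_rules: Iterable[Rule],
    accept_rules: Iterable[Rule],
    seed_type: str = "Standard",
    crawler: str = "Standard",
) -> bool:
    """Return True if a URL outside default scope should still be collected."""
    # 1. Too many hops from the seed: out of scope.
    if hops_from_seed > MAX_HOPS:
        return False
    # 2. Any block rule wins over everything that follows.
    if any(rule(url) for rule in block_rules):
        return False
    # 3. Accept rules scope the URL in, except that Brozzler ignores them
    #    for One-Page and One-Page+ seeds (assumed reading of the footnote).
    brozzler_one_page = crawler == "Brozzler" and seed_type in ("One-Page", "One-Page+")
    if not brozzler_one_page and any(rule(url) for rule in accept_rules):
        return True
    # 4. "+" seed types allow one extra hop beyond the last in-scope page.
    if seed_type in ("One-Page+", "Standard+") and hops_off_scope <= 1:
        return True
    # 5. Nothing matched: out of scope.
    return False


block = [lambda u: "/calendar/" in u]
accept = [lambda u: u.startswith("https://example.org/")]
print(decide_scope("https://example.org/calendar/1", 2, 1, block, accept))  # False
print(decide_scope("https://example.org/news", 2, 1, block, accept))        # True
```

Checking block rules before accept rules is what makes a block rule win whenever both match the same URL.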
Scope specifics - Standard crawls
Seeds
The Standard crawling technology uses all seed URLs in a given crawl to determine that crawl's scope. This means that if seeds link to one another, content from one seed can be discovered and crawled via another seed.
Seed Format
The Standard crawler uses the final forward slash ('/') in a seed URL to help determine scope. It only collects URLs that begin with the seed URL up to and including that final slash.
You can focus the Standard crawler on a specific directory by using a seed such as https://archive.org/about/. With that seed:
- https://archive.org/webarchive.html IS NOT in scope; it is not part of the /about/ directory
- https://archive.org/about/bios.html IS in scope
If your seed is https://archive.org/about (no trailing slash), the Standard crawler will start on this page but can access all other directories of the site:
- https://archive.org/webarchive.html IS in scope
- https://archive.org/about/bios.html IS in scope
Protocols
The Standard crawler puts both the HTTP and HTTPS versions of the URL in scope. You should not need to add any additional seeds or scoping rules to capture both versions.
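The seed format and protocol behavior described for the Standard crawler can be approximated with a short sketch. The function names `standard_scope_prefix` and `standard_in_scope` are hypothetical, and the logic is a simplified reading of the rules above (prefix up to the final slash, protocol ignored); it does not model sub-domains, query strings, or any scope rules.

```python
from urllib.parse import urlsplit


def standard_scope_prefix(seed: str) -> str:
    """Host plus the seed path up to and including its final '/'."""
    parts = urlsplit(seed)
    path = parts.path
    # Keep everything through the last slash; a seed with no path means "/".
    directory = path[: path.rfind("/") + 1] if "/" in path else "/"
    return parts.netloc + directory


def standard_in_scope(url: str, seed: str) -> bool:
    """Compare URLs without their protocol, so HTTP and HTTPS both match."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return False
    return (parts.netloc + (parts.path or "/")).startswith(standard_scope_prefix(seed))


# Seed with a trailing slash: only the /about/ directory is in scope.
print(standard_in_scope("https://archive.org/webarchive.html", "https://archive.org/about/"))  # False
print(standard_in_scope("https://archive.org/about/bios.html", "https://archive.org/about/"))  # True
# Seed without a trailing slash: the final slash is the one after the host.
print(standard_in_scope("https://archive.org/webarchive.html", "https://archive.org/about"))   # True
# The HTTP version of an in-scope URL is also in scope.
print(standard_in_scope("http://archive.org/about/bios.html", "https://archive.org/about/"))   # True
```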
Scope specifics - Brozzler crawls
Seed Type
Brozzler gives precedence to the Seed Type over expand scope rules. This means it will not follow links off of One-Page seeds, even if there are additional rules to scope them in.
Seed Format
Brozzler scopes in any URL that begins with the same string as your seed URL.
For example, if your seed is https://archive.org/about:
- https://archive.org/webarchive.html IS NOT in scope; it does not begin with https://archive.org/about
- https://archive.org/about/bios.html IS in scope
If your seed URL ends with a file name like index.html, you can either:
- Remove that file name from your seed URL, or
- Add Seed Level expand scope rules to allow the crawler to access the rest of the site.
Protocols
By default, Brozzler scopes in only the protocol (HTTP or HTTPS) of the seed URL. If your seed URL starts with HTTP, documents beginning with HTTPS will be out of scope and vice versa. If, after running a test crawl, you find that documents using both protocols are necessary for a complete capture, consider adding and crawling seed URLs with both (ex. https://example.com and http://example.com).
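Brozzler's seed format and protocol behavior, as described in this section, both follow from a plain string-prefix comparison. The sketch below is an illustrative approximation with a hypothetical function name, `brozzler_in_scope`; it ignores seed types and scope rules and is not Brozzler's actual code.

```python
def brozzler_in_scope(url: str, seed: str) -> bool:
    """Simplified model: a URL is in scope if it begins with the seed string."""
    # The protocol is part of the string, so an HTTPS seed excludes HTTP URLs.
    return url.startswith(seed)


seed = "https://archive.org/about"
print(brozzler_in_scope("https://archive.org/webarchive.html", seed))   # False
print(brozzler_in_scope("https://archive.org/about/bios.html", seed))   # True
print(brozzler_in_scope("http://archive.org/about/bios.html", seed))    # False

# A seed ending in a file name matches almost nothing else on the site.
print(brozzler_in_scope("https://archive.org/about/", "https://archive.org/index.html"))  # False
```

The last call shows why a seed ending in index.html usually needs either trimming or an expand scope rule.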
How to scope in sub-domains
Sub-domains are divisions of a larger site, named to the left of the site's domain name (e.g., crawler in https://crawler.archive.org/ is a sub-domain of archive.org).
Sub-domains of seed URLs are NOT included in the scope of your crawl by default.
The www prefix is considered a sub-domain.
If content from the www or non-www version of your seed’s host is Out of Scope in your post-crawl report, the suggestions for scoping in sub-domains listed below can help.
With the Standard Crawler
There are three ways to scope in sub-domains when using the Standard crawler. You can:
- Format your seed URL without www or ending slash. For example, listing http://ca.gov as your seed URL would allow you to archive all linked sub-domains of ca.gov. You can ignore the automated suggestion to add a slash (/) to the end of your seed.
- Use an Expand Scope SURT that tells the crawler to capture all sub-domains of a given site. For example, if your seed is https://archive.org/, your Expand Scope SURT would be https://(org,archive,
- Add individual seed URLs for each sub-domain. For example, if you only want to collect http://ca.gov and http://governor.ca.gov, then add each to your collection as its own seed URL.
Always run a test crawl on any seeds set up to collect sub-domains.
With Brozzler
- Use an Expand Scope SURT that tells the crawler to capture all sub-domains of a given site. For example, if your seed is https://archive.org/, your Expand Scope SURT would be https://(org,archive,
- Add individual seed URLs for each sub-domain. For example, if you only want to collect http://ca.gov and http://governor.ca.gov, then add each to your collection as its own seed URL.
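The Expand Scope SURT in the examples above is built by reversing the labels of the seed's host name and joining them with commas. A minimal sketch of that transformation follows; the helper name `subdomain_surt_prefix` is hypothetical, and the output format is modeled only on the single example given in this article, so verify the result with a test crawl before relying on it.

```python
from urllib.parse import urlsplit


def subdomain_surt_prefix(seed: str) -> str:
    """Build a SURT prefix covering the seed's host and its sub-domains."""
    parts = urlsplit(seed)
    # Reverse the host labels: archive.org -> org,archive
    labels = (parts.hostname or "").split(".")
    # The trailing comma leaves the prefix open, so further labels
    # (crawler, www, and so on) still match.
    return f"{parts.scheme}://(" + ",".join(reversed(labels)) + ","


print(subdomain_surt_prefix("https://archive.org/"))  # https://(org,archive,
```

The sketch uses the seed's host exactly as given, so a seed that starts with www would produce a prefix that only covers sub-domains of the www host.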