On this page:
- Default Crawl Scope
- Scope Specifics - Standard crawls
- Scope Specifics - Brozzler crawls
- How to scope in sub-domains
Default Crawl Scope - Standard and Brozzler
- The crawler will collect all embedded elements (images, stylesheets, JavaScript, etc.) on your seed site's pages.
- The crawler will not follow links to other sites, unless specified.
For example, if your seed is https://archive.org:
- A link to https://archive.org/about.html IS in scope.
- A link to a different site, such as http://ca.gov, IS NOT in scope.
- An embedded image on an https://archive.org webpage, like http://ala.org/logo.jpg, IS in scope.
Scope rule order of precedence
You can also expand or limit what is collected by default by adding scope rules. Archive-It crawlers (both Brozzler and Standard) apply these logical steps to decide whether a URL that doesn't match the pattern of "default scope" described above is in or out of scope:
- If the number of hops from the seed URL is greater than 100 (e.g. https://example.com/pg/101 in a series of pages), the URL is out of scope.
- Otherwise, if any block rule matches, the URL is out of scope.
- Otherwise, if any accept rule matches, the URL is in scope*.
- Otherwise, if the seed type is One-Page+ or Standard+ and the URL is at most 1 hop from the last page that was in scope, the URL is in scope.
- Otherwise (no rules match), the URL is out of scope.
In cases of conflict, block rules take precedence over accept rules.
If a rule is specified both at the collection level and at the seed level, the results are merged. In cases of conflict, the seed-level value takes precedence.
*Brozzler gives precedence to Seed Type over accept rules, which means it will not follow accept rules off of One-Page or One-Page+ seeds.
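The order of precedence above can be sketched as a single decision function. This is only an illustration of the listed steps, not the crawlers' actual code: the function name, its parameters, and the `brozzler` flag are all hypothetical, and the handling of the Brozzler footnote is one reading of that note.

```python
MAX_HOPS = 100  # hop limit quoted in the list above

def in_scope(hops_from_seed, block_match, accept_match,
             seed_type, hops_from_in_scope_page, brozzler=False):
    """Decide scope for a URL that falls outside the default scope.

    All parameters are hypothetical inputs for this sketch:
    block_match / accept_match say whether any rule of that kind matched.
    """
    # Step 1: too many hops from the seed.
    if hops_from_seed > MAX_HOPS:
        return False
    # Step 2: block rules win over everything that follows.
    if block_match:
        return False
    # Step 3: accept rules. Assumption: on Brozzler, One-Page and
    # One-Page+ seeds do not follow accept rules (per the footnote).
    one_page = seed_type in {"One-Page", "One-Page+"}
    if accept_match and not (brozzler and one_page):
        return True
    # Step 4: "+" seed types allow one extra hop from an in-scope page.
    if seed_type in {"One-Page+", "Standard+"} and hops_from_in_scope_page <= 1:
        return True
    # Step 5: nothing matched.
    return False
```

For instance, a URL matched by both a block rule and an accept rule comes out as out of scope, because the block check runs first.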
Scope specifics - Standard crawls
Seeds
The Standard crawling technology uses all seed URLs in a given crawl to determine that crawl's scope. This means that if seeds link to one another, content from one seed can be discovered and crawled via another seed.
Seed Format
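If your seed is https://archive.org/about/ (with an ending slash), the Standard crawler will stay within the /about/ directory.
- https://archive.org/webarchive.html IS NOT in scope; it is not part of the /about/ directory
- https://archive.org/about/bios.html IS in scope
If your seed is https://archive.org/about (without an ending slash), the Standard crawler will start on this page, but can access all other directories of the site.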
- https://archive.org/webarchive.html IS in scope
- https://archive.org/about/bios.html IS in scope
Protocols
The Standard crawler puts both the HTTP and HTTPS versions of the URL in scope. You should not need to add any additional seeds or scoping rules to capture both versions.
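A rough sketch of the Seed Format and Protocols behaviour described in this section, under two assumptions inferred from the examples: a seed path is cut back to its last slash to form the in-scope directory, and the protocol is ignored when comparing. The function names are hypothetical, and sub-domain handling is left out.

```python
from urllib.parse import urlsplit

def standard_scope_prefix(seed):
    """Return (host, directory prefix) for a seed, ignoring the protocol."""
    parts = urlsplit(seed)
    path = parts.path or "/"
    # Assumption: keep everything up to and including the last slash,
    # so "/about/" stays "/about/" while "/about" becomes "/".
    directory = path[: path.rfind("/") + 1]
    return parts.hostname, directory

def in_standard_scope(seed, url):
    """True if url is on the seed's host and under its directory prefix."""
    seed_host, directory = standard_scope_prefix(seed)
    parts = urlsplit(url)
    return parts.hostname == seed_host and (parts.path or "/").startswith(directory)
```

With this reading, a seed of https://archive.org/about/ leaves https://archive.org/webarchive.html out of scope, a seed of https://archive.org/about brings it in, and an http:// link to the same page is treated the same as its https:// form.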
Scope specifics - Brozzler crawls
Seed Type
Brozzler gives precedence to the Seed Type over expand scope rules. This means it will not follow links off of One-Page seeds, even if there are additional rules to scope them in.
Seed Format
Brozzler scopes in any URL that begins with the same string as your seed URL.
For example, if your seed is https://archive.org/about:
- https://archive.org/webarchive.html IS NOT in scope; it does not begin with https://archive.org/about
- https://archive.org/about/bios.html IS in scope
Protocols
By default, Brozzler scopes in only the protocol (HTTP or HTTPS) of the seed URL. If your seed URL starts with HTTP, documents beginning with HTTPS will be out of scope and vice versa. If, after running a test crawl, you find that documents using both protocols are necessary for a complete capture, consider adding and crawling seed URLs with both (ex. https://example.com and http://example.com).
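Taken literally, the Seed Format and Protocols rules above amount to a string-prefix test against each seed. The sketch below shows only that literal reading; Brozzler's real matching may normalise URLs first, and the function names are hypothetical.

```python
def in_brozzler_scope(seed, url):
    """True if url begins with the same string as the seed, protocol included."""
    return url.startswith(seed)

def in_any_seed_scope(seeds, url):
    """True if url is in scope for at least one of several seeds."""
    return any(in_brozzler_scope(seed, url) for seed in seeds)

seed = "https://archive.org/about"
print(in_brozzler_scope(seed, "https://archive.org/about/bios.html"))  # True
print(in_brozzler_scope(seed, "https://archive.org/webarchive.html"))  # False
print(in_brozzler_scope(seed, "http://archive.org/about/bios.html"))   # False
```

Passing both https://example.com and http://example.com to `in_any_seed_scope` mirrors the suggestion of adding seeds for both protocols.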
How to scope in sub-domains
Sub-domains are divisions of a larger site named to the left of the host name (e.g. crawler in https://crawler.archive.org/ is a sub-domain of archive.org).
Sub-domains of seed URLs are NOT included in the scope of your crawl by default.
The www prefix is considered a sub-domain. If content from the www or non-www version of your seed's host is Out of Scope in your post-crawl report, the suggestions for scoping in sub-domains listed below can help.
With the Standard Crawler
There are 3 different ways you can scope in sub-domains when using the Standard crawler. You can:
- Format your seed URL without www or ending slash. For example, listing http://ca.gov as your seed URL would allow you to archive all linked sub-domains of ca.gov. You can ignore the automated suggestion to add a slash (/) to the end of your seed.
- Use an Expand Scope SURT that tells the crawler to capture all subdomains of a given site. For example, if your seed is https://archive.org/ your expand scope SURT would be https://(org,archive,
- Add individual seed URLs for each sub-domain. For example, if you only want to collect http://ca.gov and http://governor.ca.gov, then add each to your collection as its own seed URL.
With Brozzler
- Use an Expand Scope SURT that tells the crawler to capture all subdomains of a given site. For example, if your seed is https://archive.org/ your expand scope SURT would be https://(org,archive,
- Add individual seed URLs for each sub-domain. For example, if you only want to collect http://ca.gov and http://governor.ca.gov, then add each to your collection as its own seed URL.
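The SURT in the example above is the seed's host with its labels reversed and joined by commas, left open at the end so that any further label (a sub-domain) still matches. A minimal sketch of building and checking such a prefix follows. The function names are hypothetical; dropping a leading www and comparing the protocol literally are simplifying assumptions, not documented crawler behaviour.

```python
from urllib.parse import urlsplit

def subdomain_surt(seed):
    """Build an open-ended SURT prefix covering a seed's host and sub-domains."""
    parts = urlsplit(seed)
    labels = parts.hostname.split(".")
    # Assumption: www counts as a sub-domain, so drop it to cover the whole site.
    if labels[0] == "www":
        labels = labels[1:]
    return f"{parts.scheme}://(" + ",".join(reversed(labels)) + ","

def covered_by(surt_prefix, url):
    """True if url's reversed host starts with the SURT prefix."""
    parts = urlsplit(url)
    reversed_host = ",".join(reversed(parts.hostname.split("."))) + ","
    return f"{parts.scheme}://({reversed_host}".startswith(surt_prefix)

surt = subdomain_surt("https://archive.org/")
print(surt)                                              # https://(org,archive,
print(covered_by(surt, "https://crawler.archive.org/"))  # True
print(covered_by(surt, "https://example.org/"))          # False
```

Reversing the labels is what makes a plain prefix comparison work: every sub-domain of archive.org begins with org,archive, once reversed.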