On this page:
- Default Crawl Scope
- Scoping specifics - Standard crawls
- Scoping specifics - Brozzler crawls
- How to scope in sub-domains
Default Crawl Scope - Standard and Brozzler
The default crawl scope for each seed in your collection is determined by how you format its URL and assign it a "seed type."
In general:
- Embedded content (images, stylesheets, JavaScript, etc.) on your seed site's pages will be archived, whether or not it is hosted on the same site as the seed.
- Linked content outside of your seed site will not be archived unless you expand your crawl's scope to include it.
You can limit or expand how much linked content from your seed site is archived by assigning it a Seed Type. For complete guidance on Seed Types and when to choose each, see: How to assign and edit a "seed type".
The "Standard" Seed Type describes how Archive-It crawlers operate by default, without any special rules: It directs the crawler to archive all URLs (up to 100 hops away from your seed URL) discovered as part of a given seed site. For example, if you want to archive all of https://archive.org, then you would enter https://archive.org/ as your seed URL and assign it the "Standard" seed type. All URLs then found to be part of https://archive.org will be considered "in scope," crawled, and archived. For instance:
- A link to https://archive.org/about.html IS in scope and would be archived.
- A link to a different site, such as http://ca.gov IS NOT in scope and would NOT be archived.
- An embedded image on an https://archive.org webpage, such as http://ala.org/logo.jpg, IS in scope and would be archived.
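One way to picture the default behavior is a simple host comparison: a discovered link is in scope when it lives on the same host as the seed. The sketch below is a minimal illustration of that idea, assuming a plain hostname match stands in for the crawler's real scoping logic; the in_default_scope helper is a made-up name and this is not Archive-It's implementation.

```python
from urllib.parse import urlparse

def in_default_scope(seed_url, candidate_url):
    # Simplified stand-in for the default "Standard" scope test:
    # a linked URL is in scope when it shares the seed's host.
    return urlparse(candidate_url).hostname == urlparse(seed_url).hostname

seed = "https://archive.org/"
print(in_default_scope(seed, "https://archive.org/about.html"))  # True: archived
print(in_default_scope(seed, "http://ca.gov/"))                  # False: not archived as a link
# Note: an image such as http://ala.org/logo.jpg also fails this host check,
# but it is still archived when it is *embedded* on an in-scope page.
```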
Scope rule order of precedence
Archive-It crawlers (both Brozzler and Standard) apply these logical steps to decide whether a URL that doesn't match the pattern of "default scope" described above is in or out of scope:
- If the number of hops from the seed URL is greater than 100 (ex. https://example.com/pg/150 in a series of pages), the URL is out of scope.
- Otherwise, if any block rule matches, the URL is out of scope.
- Otherwise, if any accept rule matches, the URL is in scope.
- Otherwise, if the seed type is One-Page+ or Standard+ and the URL is at most 1 hop from the last page that was in scope, the URL is in scope.
- Otherwise (no rules match), the URL is out of scope.
In cases of conflict, block rules take precedence over accept rules.
If a rule is specified both at the collection level and at the seed level, the results are merged. In cases of conflict, the seed-level value takes precedence.
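Put together, these checks can be summarized as a short decision function. The sketch below only restates the order of precedence described above; the rule objects, their matches() method, and the hop counters are hypothetical names, not Archive-It's actual code.

```python
def is_in_scope(url, hops_from_seed, block_rules, accept_rules,
                seed_type, hops_from_last_in_scope_page):
    if hops_from_seed > 100:
        return False   # too far from the seed URL
    if any(rule.matches(url) for rule in block_rules):
        return False   # block rules win over accept rules
    if any(rule.matches(url) for rule in accept_rules):
        return True
    if seed_type in ("One-Page+", "Standard+") and hops_from_last_in_scope_page <= 1:
        return True    # the "+" seed types allow one extra hop
    return False       # no rule matched
```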
Scoping specifics - Standard crawls
Seed Format
The Standard crawler uses the position of the final forward slash ('/') in a seed URL to help determine how much of the seed site is in scope, capturing only URLs that fall under the path that slash marks.
For example, you can tell the Standard crawler to capture only documents in a specific directory by using a seed like https://archive.org/about/:
- https://archive.org/webarchive.html IS NOT in scope; it is not part of the /about/ directory
- https://archive.org/about/bios.html IS in scope
If your seed is https://archive.org/about, the crawler will start on that page but will be able to access all other directories of the site:
- https://archive.org/webarchive.html IS in scope
- https://archive.org/about/bios.html IS in scope
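The effect of that final slash can be approximated with a prefix check. The helper below is a rough sketch under the assumption that everything up to and including the last '/' of the seed URL becomes the in-scope prefix; it ignores the HTTP/HTTPS equivalence described in the next section and is not the crawler's real logic.

```python
def standard_scope_prefix(seed_url):
    # Keep everything up to and including the final '/' of the seed URL.
    return seed_url[:seed_url.rfind("/") + 1]

def in_standard_scope(seed_url, candidate_url):
    return candidate_url.startswith(standard_scope_prefix(seed_url))

print(standard_scope_prefix("https://archive.org/about/"))  # https://archive.org/about/
print(standard_scope_prefix("https://archive.org/about"))   # https://archive.org/
print(in_standard_scope("https://archive.org/about/", "https://archive.org/webarchive.html"))  # False
print(in_standard_scope("https://archive.org/about",  "https://archive.org/webarchive.html"))  # True
```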
Protocols
When preparing to crawl a seed URL, the Standard crawler generates a SURT that puts both the HTTP and HTTPS versions of the URL in scope. You generally do not need to add any additional seeds or scoping rules to capture both versions.
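A rough way to picture this is that the SURT form reorders the host labels and drops the scheme distinction, so the HTTP and HTTPS versions of a URL reduce to the same prefix. The helper below is a hypothetical, heavily simplified rendering of that idea; real SURTs involve more normalization than shown here.

```python
from urllib.parse import urlparse

def simplified_surt(url):
    # Reverse the host labels and ignore the scheme, so the http:// and
    # https:// versions of a URL collapse into the same form.
    parts = urlparse(url)
    host = ",".join(reversed(parts.hostname.split(".")))
    return f"({host},){parts.path}"

print(simplified_surt("http://archive.org/about/"))   # (org,archive,)/about/
print(simplified_surt("https://archive.org/about/"))  # (org,archive,)/about/  (same prefix)
```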
Scoping specifics - Brozzler crawls
Seed Format
Brozzler scopes in any URL that begins with your seed URL.
For example, if your seed is https://archive.org/about:
- https://archive.org/webarchive.html IS NOT in scope; it does not begin with https://archive.org/about
- https://archive.org/about/bios.html IS in scope
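In other words, Brozzler's default test behaves roughly like a string prefix check against the seed URL. The sketch below is illustrative only (the in_brozzler_scope name is made up and embedded resources are ignored), not Brozzler's actual code.

```python
def in_brozzler_scope(seed_url, candidate_url):
    # A linked URL is in scope when it begins with the seed URL.
    return candidate_url.startswith(seed_url)

seed = "https://archive.org/about"
print(in_brozzler_scope(seed, "https://archive.org/webarchive.html"))  # False
print(in_brozzler_scope(seed, "https://archive.org/about/bios.html"))  # True
```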
If your seed URL ends with an extension like index.html, you will need to either remove that extension from your seed or add Seed Level expand scope rules to allow the crawler to access the rest of the site.
For example, if your seed is https://example.com/index.html you could edit your seed URL to read https://example.com/. If removing that extension results in an invalid URL, you might consider adding an expand scope rule at the seed level for URLs that contain https://example.com/.
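If you go the expand-scope route, a rule on URLs that contain https://example.com/ behaves roughly like a substring test. The check below is a hypothetical stand-in for that seed-level rule, shown only to illustrate the effect.

```python
def matches_contains_rule(url, needle="https://example.com/"):
    # Hypothetical stand-in for an expand scope rule that accepts
    # URLs containing the given string.
    return needle in url

print(matches_contains_rule("https://example.com/news/2024.html"))   # True
print(matches_contains_rule("https://other.example.net/page.html"))  # False
```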
Protocols
By default, Brozzler scopes in only the protocol (HTTP or HTTPS) of the seed URL. If your seed URL starts with HTTP, documents beginning with HTTPS will be out of scope and vice versa. If, after running a test crawl, you find that documents using both protocols are necessary for a complete capture, consider adding and crawling seed URLs with both (ex. https://example.com and http://example.com).
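Using the same simplified prefix check from the Seed Format sketch above, a protocol mismatch alone is enough to put a document out of scope. This remains an illustration, not Brozzler's real logic.

```python
def in_brozzler_scope(seed_url, candidate_url):
    # Same illustrative prefix check as in the Seed Format sketch above.
    return candidate_url.startswith(seed_url)

seed = "https://example.com/"
print(in_brozzler_scope(seed, "http://example.com/page.html"))   # False: protocol differs
print(in_brozzler_scope(seed, "https://example.com/page.html"))  # True
```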
How to scope in sub-domains
Sub-domains are divisions of a larger site, named to the left of the main domain name (for example, crawler in https://crawler.archive.org/ is a sub-domain of archive.org).
Sub-domains of seed URLs are NOT included in the scope of your crawl by default.
The www prefix is considered a sub-domain
All Archive-It crawling technology treats the www prefix as a sub-domain. Depending on how your seed site is constructed, including or omitting it in your seed URL can have an effect on your crawl's scope.
For example, if your seed is https://archive.org/about/:
- https://www.archive.org/about/bios.html may not be considered in scope
If you find that content from the www or non-www version of your seed's host is being identified as Out of Scope in your post-crawl report, the suggestions for scoping in sub-domains below can help.
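To see why the www prefix matters, compare the hostnames directly: a host- or prefix-based scope check treats www.archive.org and archive.org as different hosts. The comparison below is illustrative only.

```python
from urllib.parse import urlparse

seed      = "https://archive.org/about/"
candidate = "https://www.archive.org/about/bios.html"

print(urlparse(seed).hostname)       # archive.org
print(urlparse(candidate).hostname)  # www.archive.org
# Different hosts, so the candidate may be reported as out of scope.
print(urlparse(seed).hostname == urlparse(candidate).hostname)  # False
```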
With the Standard Crawler
- Format your seed URL so that only the main domain is listed and DO NOT include a www prefix or an ending slash. For example, listing http://ca.gov as your seed URL would allow you to archive all linked sub-domains of ca.gov. Be aware that when adding this seed to your collection, our web application may suggest that you add a slash (/) to the end of your seed. Do not follow this suggestion; keep the seed without the ending slash if you wish all sub-domains to be archived.
- Use an Expand Scope SURT rule that tells the crawler to capture all sub-domains of a given site. For example, if your seed is https://archive.org/, your Expand Scope SURT would be https://(org,archive, (see the sketch below this list).
- Add individual seed URLs for each sub-domain if there are only a handful of sub-domains that you wish to crawl and archive. For example, if you only want to capture content from http://ca.gov and http://governor.ca.gov, then add each to your collection as its own seed URL.
Always run a test crawl on any seeds set up to capture sub-domains to make sure that your scoping settings are capturing what you intend. It is very easy for more content than you intended to be discovered when using these features.
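To see why a SURT prefix that stops after the host labels pulls in sub-domains, note that in SURT form the sub-domain labels come after the registered domain, so every sub-domain shares the same prefix. The check below reuses the heavily simplified simplified_surt() rendering from the Protocols sketch above and is illustrative only; the real rule is the https://(org,archive, form shown in the list.

```python
from urllib.parse import urlparse

def simplified_surt(url):
    # Same heavily simplified SURT rendering as in the Protocols sketch above.
    parts = urlparse(url)
    return "(" + ",".join(reversed(parts.hostname.split("."))) + ",)" + parts.path

expand_prefix = "(org,archive,"  # simplified stand-in for https://(org,archive,

for url in ("https://archive.org/about/",
            "https://crawler.archive.org/index.html",
            "https://www.archive.org/about/bios.html"):
    print(url, simplified_surt(url).startswith(expand_prefix))
# All three print True: every sub-domain of archive.org shares the (org,archive, prefix.
```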
With Brozzler
- Use an Expand Scope SURT rule that tells the crawler to capture all sub-domains of a given site. For example, if your seed is https://archive.org/, your Expand Scope SURT would be https://(org,archive, (the sketch above shows how this prefix matches sub-domains).
- Add individual seed URLs for each sub-domain if there are only a handful of sub-domains that you wish to crawl and archive. For example, if you only want to capture content from http://ca.gov and http://governor.ca.gov, then add each to your collection as its own seed URL.