On this page:
- What are scope rules?
- Exclude Host
- Exclude Document If
- Accept Document If
- Exclude Audio and Video
- Limit Data
- Limit Documents
- Ignore robots.txt
- Ignore Crawl Delay
What are scope rules?
Scope rules let you control what Archive-It’s crawlers do and do not collect.
You can set rules at two levels:
Collection level: Applies to all crawls in the collection. Most collection-level rules are attached to a host and apply whenever the crawlers discover content from that host or any of its subdomains. For example, a rule applied to the host archive.org would also apply to blog.archive.org.
Seed level: Applies to any content discovered via that seed, on whichever host that content is found.
Here are the rules you can use and where you can use them:
Exclude Host
Available at the collection level only.
Prevents the crawler from collecting any content from a specified host, including all of its subdomains.
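As a rough illustration of what "a host and all of its subdomains" means in practice, the sketch below checks whether a URL falls under a host rule. It is written in Python for readability and is not Archive-It's implementation; the function name `host_rule_applies` is hypothetical.

```python
from urllib.parse import urlsplit


def host_rule_applies(rule_host: str, url: str) -> bool:
    """Return True if the URL's host is rule_host or one of its subdomains."""
    host = (urlsplit(url).hostname or "").lower()
    rule_host = rule_host.lower()
    # Exact match, or a subdomain that ends in ".<rule_host>".
    return host == rule_host or host.endswith("." + rule_host)
```

Under this reading, a rule on archive.org covers blog.archive.org but not an unrelated host such as notarchive.org.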
Exclude Document If…
Prevents the crawler from collecting documents that match a specified text string, regular expression (regex), or SURT.
Use this to exclude:
Specific directories or strings
Certain file types (like .pdf or .mp3)
Patterns like repeating or extra directories (with regex)
At collection level: applies to a specific host.
At seed level: applies to all content discovered through that seed.
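The three kinds of exclusion listed above can be sketched with Python's `re` module. This is only an illustration of the matching idea: the example string, the patterns, and the function name `is_excluded` are hypothetical, and the regex dialect and matching behavior in the Archive-It web application may differ (for example, an expression may need to match the entire URL), so test any pattern before relying on it.

```python
import re

# Plain text string: exclude any URL containing this text (hypothetical example).
EXCLUDE_TEXT = "/calendar/"

# Regex for file types: URLs ending in .pdf or .mp3, case-insensitive.
FILE_TYPE = re.compile(r"\.(pdf|mp3)$", re.IGNORECASE)

# Regex for a path segment repeated three or more times in a row,
# e.g. /events/events/events/, a common sign of a crawler trap.
REPEATED_DIR = re.compile(r"(/[^/]+)\1{2,}(?=/|$)")


def is_excluded(url: str) -> bool:
    """Return True if the URL matches any of the exclusion patterns."""
    return (
        EXCLUDE_TEXT in url
        or FILE_TYPE.search(url) is not None
        or REPEATED_DIR.search(url) is not None
    )
```

The lookahead at the end of `REPEATED_DIR` keeps the pattern from firing on paths that merely share a prefix, such as /news/newsletter.html.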
Accept Document If…
Overrides default scope to include documents that match a specified text string, regex, or SURT.
Common uses include:
Scoping in specific subdomains
Collecting linked documents hosted on a different host than the seed URL
At collection level: applies to all crawls.
At seed level: applies to content discovered through that seed.
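A SURT writes the host portion of a URL in reverse order, so that a host and its subdomains share a common prefix. The sketch below shows a simplified conversion and a prefix check for scoping in a subdomain. It ignores ports, user info, and URL canonicalization, and the function names `to_surt` and `accepted_by_surt` are hypothetical, so treat it as an aid to reading SURTs rather than as the crawler's own logic.

```python
from urllib.parse import urlsplit


def to_surt(url: str) -> str:
    """Simplified SURT form: reverse the host labels and keep the path."""
    parts = urlsplit(url)
    labels = (parts.hostname or "").lower().split(".")
    reversed_host = ",".join(reversed(labels))
    return f"{parts.scheme}://({reversed_host},){parts.path}"


def accepted_by_surt(url: str, surt_prefix: str) -> bool:
    """A SURT rule matches when the URL's SURT form starts with the prefix."""
    return to_surt(url).startswith(surt_prefix)
```

For example, `to_surt("http://blog.archive.org/posts/1")` gives `http://(org,archive,blog,)/posts/1`. The prefix `http://(org,archive,blog,)` then matches only that subdomain, while the shorter prefix `http://(org,archive,` matches archive.org and every subdomain.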
Exclude Audio and Video
Available at the seed level only.
Prevents Archive-It's crawlers from collecting audio and video files by:
Disabling the A/V collecting utility, yt-dlp, for a seed
Excluding all audio/video MIME types from a seed
Limit Data
Sets a cap on the amount of new data (by size) collected from a source. Once the limit is reached, the crawler stops collecting from that source but may continue elsewhere.
At collection level: limits data collected from a given host, regardless of how it was discovered.
At seed level: limits data collected via that seed (potentially spanning multiple hosts).
Limit Documents
Sets a cap on the total number of documents (by count) collected from a source. Once the limit is reached, the crawler stops collecting from that source but may continue elsewhere.
At collection level: limits documents collected from a given host, regardless of how it was discovered.
At seed level: limits documents collected via that seed (potentially spanning multiple hosts).
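Limit Data and Limit Documents differ only in what they count (bytes versus documents) and in what counts as a "source" (a host at the collection level, a seed at the seed level). The sketch below models that with one small class. The class name `SourceLimit` is hypothetical, and the handling of the document that crosses the cap is an assumption, not documented crawler behavior.

```python
from collections import defaultdict


class SourceLimit:
    """Tracks usage per source (a host or a seed) against a single cap."""

    def __init__(self, limit: int):
        self.limit = limit
        self.used = defaultdict(int)

    def allow(self, source: str, cost: int = 1) -> bool:
        """Return True if one more document may be collected from source.

        cost is 1 for a document limit, or the document size in bytes
        for a data limit.
        """
        # Assumption: once a source has reached the cap, nothing more is
        # collected from it; other sources are unaffected. Whether the
        # document that crosses the cap is kept is also an assumption.
        if self.used[source] >= self.limit:
            return False
        self.used[source] += cost
        return True
```

Keyed by host, a limit of 2 documents stops collection from example.org after two documents while other.org continues. Keyed by seed, the same counter would span every host reached through that seed.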
Ignore robots.txt
Tells the crawler to ignore robots.txt exclusions, which might otherwise block certain directories, file types, or entire sites.
At collection level: applies to that host and all content discovered from it.
At seed level: applies to all content discovered through that seed, regardless of host.
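To make the effect concrete, the sketch below uses Python's standard `urllib.robotparser` to evaluate a small robots.txt file, then shows how an "ignore" switch bypasses that check. The robots.txt content, the user agent name `example-crawler`, and the function `may_collect` are all hypothetical; this is not how Archive-It's crawlers are implemented.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_collect(url: str, ignore_robots: bool, agent: str = "example-crawler") -> bool:
    """Return True if the URL may be collected under the given setting."""
    # With the rule applied, robots.txt exclusions are not consulted at all.
    if ignore_robots:
        return True
    return parser.can_fetch(agent, url)
```

Without the rule, a URL under /private/ is blocked by the robots.txt above; with the rule, it is collected like any other in-scope document.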
Ignore Crawl Delay
Tells the crawler to ignore any crawl-delay directives in a robots.txt file that would otherwise slow crawling.
At collection level: applies to that host and all content discovered from it.
At seed level: applies to all content discovered through that seed, regardless of host.
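A crawl-delay directive asks crawlers to wait a set number of seconds between requests. The sketch below reads such a directive with Python's standard `urllib.robotparser` and shows what ignoring it amounts to. The robots.txt content, the user agent name `example-crawler`, and the function `effective_delay` are hypothetical, and a return value of 0.0 means only that robots.txt imposes no delay, not that a crawler would make requests with no pause at all.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def effective_delay(ignore_crawl_delay: bool, agent: str = "example-crawler") -> float:
    """Return the delay in seconds that robots.txt imposes on the agent."""
    if ignore_crawl_delay:
        return 0.0
    delay = parser.crawl_delay(agent)
    return float(delay) if delay is not None else 0.0
```

With the robots.txt above, the delay is 10 seconds per request unless the rule is applied.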