Scope Rules and how to use them

On this page:

What are scope rules?
Exclude Host
Exclude Document If…
Accept Document If…
Exclude Audio and Video
Limit Data
Limit Documents
Ignore robots.txt
Ignore Crawl Delay

What are scope rules?

Scope rules let you control what Archive-It’s crawlers do and do not collect.

You can set rules at two levels:

Collection level: Applies to all crawls in the collection. Most collection-level rules are attached to a host and apply whenever the crawlers discover content from that host or any of its subdomains. For example, a rule applied to the host archive.org also applies to blog.archive.org.

Seed level: Applies to any content discovered via that seed, on whatever host that content lives.
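The host-and-subdomain behavior of collection-level rules can be pictured with a short Python sketch. This is only an illustration of the matching idea, not Archive-It's actual implementation; the function name rule_applies_to is hypothetical.

```python
from urllib.parse import urlsplit


def rule_applies_to(rule_host: str, url: str) -> bool:
    """Return True if the URL's host is rule_host or one of its subdomains."""
    host = (urlsplit(url).hostname or "").lower()
    rule_host = rule_host.lower()
    # Require a leading dot so "notarchive.org" does not match "archive.org".
    return host == rule_host or host.endswith("." + rule_host)


print(rule_applies_to("archive.org", "https://archive.org/about"))      # True
print(rule_applies_to("archive.org", "https://blog.archive.org/post"))  # True
print(rule_applies_to("archive.org", "https://notarchive.org/"))        # False
```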

Here are the rules you can use and where you can use them:

Exclude Host

Available at the collection level only.
Prevents the crawler from collecting any content from a specified host, including all of its subdomains.

Exclude Document If…

Prevents the crawler from collecting documents that match a specified text string, regular expression (regex), or SURT (Sort-friendly URI Reordering Transform).
Use this to exclude:

Specific directories or strings
Certain file types (like .pdf or .mp3)
Patterns like repeating or extra directories (with regex)

At collection level: applies to a specific host.

At seed level: applies to all content discovered through that seed.
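To make the three use cases above concrete, here is a hedged Python sketch of the kinds of patterns involved. The patterns and example.org URLs are made up for illustration, and the sketch assumes that a regex has to match the whole URL; check your own expressions against real URLs from your crawl reports before saving a rule.

```python
import re

# Hypothetical patterns, one per use case listed above.
CONTAINS_STRING = "/calendar/"  # plain text: exclude a specific directory
FILE_TYPE = re.compile(r".*\.(pdf|mp3)(\?.*)?", re.IGNORECASE)  # certain file types
REPEATING_DIR = re.compile(r".*?(/[^/]+)\1{2,}(/.*)?")  # same directory 3+ times in a row


def is_excluded(url: str) -> bool:
    """Return True if the URL matches any of the illustrative exclusion patterns."""
    if CONTAINS_STRING in url:
        return True
    # fullmatch mirrors the assumption that a regex must cover the entire URL.
    return bool(FILE_TYPE.fullmatch(url) or REPEATING_DIR.fullmatch(url))


for url in [
    "https://example.org/calendar/2024/",
    "https://example.org/files/report.PDF",
    "https://example.org/a/a/a/page.html",
    "https://example.org/about.html",
]:
    print(url, is_excluded(url))
```

Only the last URL in the loop stays in scope; the first three each trip one of the patterns.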

Accept Document If…

Overrides default scope to include documents that match a specified text string, regex, or SURT.
Common uses include:

Scoping in specific subdomains
Collecting linked documents hosted on a different host than the seed URL

At collection level: applies to all crawls.

At seed level: applies to content discovered through that seed.
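A SURT writes the host of a URL in reversed order, which is why a single SURT prefix can scope in a host together with its subdomains. The following Python sketch shows the idea in a deliberately simplified form: the to_surt and in_surt_scope helpers are hypothetical, and the crawler's real canonicalization handles more cases (ports, scheme normalization, query strings) than this does.

```python
from urllib.parse import urlsplit


def to_surt(url: str) -> str:
    """Simplified SURT form: reverse the host labels and keep the path."""
    parts = urlsplit(url)
    labels = (parts.hostname or "").lower().split(".")
    reversed_host = ",".join(reversed(labels)) + ","
    return f"{parts.scheme}://({reversed_host}){parts.path}"


def in_surt_scope(url: str, surt_prefix: str) -> bool:
    """Return True if the URL's simplified SURT starts with the given prefix."""
    return to_surt(url).startswith(surt_prefix)


print(to_surt("http://blog.archive.org/post"))  # http://(org,archive,blog,)/post

# A prefix without the closing parenthesis covers the host and its subdomains.
prefix = "http://(org,archive,"
print(in_surt_scope("http://blog.archive.org/post", prefix))  # True
print(in_surt_scope("http://example.org/", prefix))           # False
```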

Exclude Audio and Video

Available at the seed level only.
Prevents Archive-It’s crawlers from collecting audio and video files by:

Disabling the A/V collecting utility, yt-dlp, for a seed
Excluding all audio/video MIME types from a seed
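As a rough picture of what excluding by MIME type means, the Python sketch below treats any type beginning with audio/ or video/ as audio/video content. That prefix test is an assumption made for illustration; the exact set of MIME types the crawler excludes is not listed here.

```python
def is_audio_or_video(mime_type: str) -> bool:
    """Return True for MIME types in the audio/* or video/* families."""
    # Drop parameters such as "; charset=..." before comparing.
    main_type = mime_type.split(";", 1)[0].strip().lower()
    return main_type.startswith(("audio/", "video/"))


print(is_audio_or_video("video/mp4"))                 # True
print(is_audio_or_video("audio/mpeg; charset=binary"))  # True
print(is_audio_or_video("text/html"))                 # False
```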

Limit Data

Sets a cap on the amount of new data (by size) collected from a source. Once the limit is reached, the crawler stops collecting from that source but may continue elsewhere.

At collection level: limits data collected from a given host, regardless of how it was discovered.

At seed level: limits data collected via that seed (potentially spanning multiple hosts).

Limit Documents

Sets a cap on the total number of documents (by count) collected from a source. Once the limit is reached, the crawler stops collecting from that source but may continue elsewhere.

At collection level: limits documents collected from a given host, regardless of how they were discovered.

At seed level: limits documents collected via that seed (potentially spanning multiple hosts).
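Both limit rules follow the same pattern: keep a running total per source and stop collecting from that source once the cap is reached. The Python sketch below models that pattern under stated assumptions. The ScopeLimit class is hypothetical, the source key can be a host (collection level) or a seed (seed level), and whether the real crawler lets the final document overshoot the cap is not something this sketch claims to know.

```python
from collections import defaultdict


class ScopeLimit:
    """Track a running total per source (a host or a seed) against a cap."""

    def __init__(self, cap: int):
        self.cap = cap
        self.used = defaultdict(int)

    def allow(self, source: str, amount: int = 1) -> bool:
        """Return False once the source has reached its cap, else record the amount."""
        if self.used[source] >= self.cap:
            return False
        self.used[source] += amount
        return True


# Limit Documents: count each document as 1.
doc_limit = ScopeLimit(cap=2)
print(doc_limit.allow("example.org"))  # True
print(doc_limit.allow("example.org"))  # True
print(doc_limit.allow("example.org"))  # False, this host has hit its cap
print(doc_limit.allow("example.com"))  # True, other hosts are unaffected

# Limit Data: pass the document size in bytes as the amount.
data_limit = ScopeLimit(cap=1_000)
print(data_limit.allow("example.org", amount=600))  # True
```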

Ignore robots.txt

Tells the crawler to ignore robots.txt exclusions, which might otherwise block certain directories, file types, or entire sites.

At collection level: applies to that host and all content discovered from it.

At seed level: applies to all content discovered through that seed, regardless of host.
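To see what a robots.txt exclusion blocks, and what ignoring it changes, you can experiment with Python's standard urllib.robotparser module. The robots.txt content, the example-crawler user agent, and the may_collect helper below are all made up for illustration; this is not how Archive-It's crawlers are implemented.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks one directory for every crawler.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def may_collect(url: str, ignore_robots: bool = False) -> bool:
    """Return True if the URL may be collected under the sample robots.txt."""
    if ignore_robots:
        return True  # the rule tells the crawler to skip the robots.txt check
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch("example-crawler", url)


blocked = "https://example.org/private/report.html"
print(may_collect(blocked))                      # False
print(may_collect(blocked, ignore_robots=True))  # True
print(may_collect("https://example.org/public/"))  # True
```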

Ignore Crawl Delay

Tells the crawler to ignore any crawl-delay directives in a robots.txt file that would otherwise slow crawling.

At collection level: applies to that host and all content discovered from it.

At seed level: applies to all content discovered through that seed, regardless of host.
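A crawl-delay directive asks crawlers to wait a number of seconds between requests. The Python sketch below reads such a directive with the standard urllib.robotparser module and shows the effect of ignoring it. The effective_delay helper, the example-crawler user agent, and the default delay of zero are assumptions for illustration only.

```python
from urllib.robotparser import RobotFileParser


def effective_delay(robots_lines, user_agent, ignore_crawl_delay=False, default_delay=0.0):
    """Return the delay in seconds a crawler would apply between requests."""
    if ignore_crawl_delay:
        return default_delay  # the directive is skipped entirely
    parser = RobotFileParser()
    parser.parse(robots_lines)
    delay = parser.crawl_delay(user_agent)  # None when no directive applies
    return float(delay) if delay is not None else default_delay


robots = ["User-agent: *", "Crawl-delay: 10"]
print(effective_delay(robots, "example-crawler"))                           # 10.0
print(effective_delay(robots, "example-crawler", ignore_crawl_delay=True))  # 0.0
```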
