What is a patch crawl?
A patch crawl targets and collects specific documents that the Wayback QA tool found to be missing from an archived site, or that appear in the Blocked column of a crawl's Hosts report.
How do patch crawls differ from other crawls?
Patch crawls collect only single documents; they do not discover and collect additional embedded documents or follow links. They also differ from other crawls in the following ways:
- Patch crawls exclusively use the Standard crawler.
- Seed and Collection scope rules do not apply to patch crawls.
- Patch crawls have a default 24-hour time limit.
- Patch crawls do not run Archive-It’s A/V collecting utility (yt-dlp) or collect additional metadata needed to replay media.
When should I use a patch crawl?
Patch crawls are helpful when:
- Embedded documents are not easily discoverable by Archive-It's crawlers.
- Embedded documents are missing and you don’t plan to crawl the entire site again.
Use patch crawls to collect individual documents like:
- embedded images
- CSS, JS, and other documents that make up a webpage
Patch crawls are not an efficient way to collect:
- missing pages or sections of a website
- embedded video or audio
Adjusting the format of your seed URL, scope rules, crawling technology, or crawl time limits can help address these types of missing content.