What is a patch crawl?
A patch crawl targets and collects specific documents that the Wayback QA tool found to be missing from an archived site, or that appear in the Blocked column of a crawl's Hosts report.
How do patch crawls differ from other crawls?
Patch crawls collect only single documents; they do not discover and collect additional embedded documents or follow links. They also differ from other crawls in the following ways:
- Patch crawls exclusively use the Standard crawler.
- Seed and Collection scope rules do not apply to patch crawls.
- Patch crawls have a default 24-hour time limit.
- Patch crawls do not run Archive-It’s A/V collecting utility (yt-dlp) or collect additional metadata needed to replay media.
When should I use a patch crawl?
Patch crawls are helpful when:
- Embedded documents are not easily discoverable by Archive-It's crawlers.
- Embedded documents are missing and you don’t plan to crawl the entire site again.
Use patch crawls to collect individual documents like:
- embedded images
- CSS, JS, and other documents that make up a webpage
Patch crawls are not an efficient way to collect:
- missing pages or sections of a website
- embedded video or audio
Adjusting the format of your seed URL, scope rules, crawling technology, or crawl time limits can help address these types of missing content.