Seed-level WARCs and data de-duplication
My institution is excited about the new procedure of writing WARC files per seed rather than per crawl as described here.
One thing we're wondering about is how seed-level WARCs work with data deduplication. I'm thinking of the scenario where several seeds within a collection use a common resource, like a CSS file. This happens pretty frequently when we collect from large domains (e.g., gc.ca) using many URLs within the domain as seeds.
Would the common resource be captured multiple times and wrapped in each seed's WARCs, so that the WARCs are a self-contained archive for that seed? Or would it only be captured once in the collection? We'd be curious to know how AIT is approaching this; there are obviously pros and cons to each.
-
Hi Russell, great question. This is currently handled differently for Brozzler and Standard (Heritrix) crawls.
- Brozzler has written per-seed WARCs since its inception, and has always recorded deduplication information per seed, meaning that each seed's content is self-contained and replay does not rely on the availability of content in other seeds' WARCs. This also means that common resources may be captured multiple times if they are referenced by different seeds.
- Standard (Heritrix) crawls shifted from writing WARCs per crawl to writing them per seed in September 2018. Deduplication for Standard crawls is still done at the collection level rather than the seed level, although this is something we may adjust in the future (see the sketch after this list for how the two scoping models differ).
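To illustrate the difference between the two models, here is a minimal Python sketch, not Archive-It's actual implementation: it assumes a simple in-memory index keyed by a payload digest plus a "scope" (either the seed or the collection), and the names `dedup_key`, `record_capture`, the example seed URLs, and the collection ID are all made up for the example.

```python
# Hedged sketch of seed-scoped vs. collection-scoped deduplication.
# The index maps a key to the first capture of a payload; later captures
# whose key is already present are written as lightweight "revisit"
# references instead of storing the payload again.

import hashlib

def dedup_key(payload: bytes, *, seed: str, collection: str, per_seed: bool) -> tuple:
    """Build the lookup key. Per-seed scoping includes the seed, so the same
    payload fetched under two different seeds produces two different keys."""
    digest = hashlib.sha1(payload).hexdigest()
    scope = seed if per_seed else collection
    return (scope, digest)

def record_capture(index: dict, url: str, payload: bytes, *, seed: str,
                   collection: str, per_seed: bool) -> str:
    key = dedup_key(payload, seed=seed, collection=collection, per_seed=per_seed)
    if key in index:
        return f"revisit -> {index[key]}"   # payload not stored again
    index[key] = url
    return "stored full payload"

# A CSS file shared by two seeds in the same collection:
css = b"body { margin: 0 }"
for per_seed in (True, False):
    index = {}
    print("per-seed scoping" if per_seed else "per-collection scoping")
    for seed in ("https://example.gc.ca/a/", "https://example.gc.ca/b/"):
        result = record_capture(index, "https://example.gc.ca/site.css", css,
                                seed=seed, collection="1234", per_seed=per_seed)
        print(f"  seed {seed}: {result}")
```

With per-seed scoping the shared stylesheet is stored once under each seed (each seed's WARCs stay self-contained); with per-collection scoping it is stored once and every later capture becomes a revisit reference.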
-
Update: As Lori suggested, we have since adjusted Standard crawl deduplication to align with the Brozzler model. Since mid-January of this year, Heritrix-based crawl jobs also deduplicate at the seed level. An added benefit is that Brozzler and Standard crawls of the same seeds can now deduplicate against each other, which avoids unnecessary data expenditure when the same seeds are collected by the two technologies over time.
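The practical effect of scoping the index by seed alone is that the crawling technology never enters the lookup key, so a capture written by one crawler is visible to the other. A tiny, hypothetical illustration (the key shape and record values below are assumptions for the example, not Archive-It's actual records):

```python
# Hedged sketch: with seed-scoped deduplication, the key is (seed, payload digest)
# and says nothing about which crawler made the capture.
index = {}

# First capture of a stylesheet during a Brozzler crawl of the seed:
key = ("https://example.gc.ca/a/", "sha1:2aae6c35c94fcfb415dbe95f408b9ce91ee846ed")
index[key] = "original capture (Brozzler), 2019-01-20"

# A later Standard (Heritrix) crawl of the same seed computes the same key,
# finds the existing entry, and records a revisit instead of the payload.
if key in index:
    print("revisit ->", index[key])
```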