Seed-level WARCs and data de-duplication

Comments

  • Lori Donovan

    Hi Russell, great question. This is currently handled differently for Brozzler and Standard (Heritrix) crawls.

    - Brozzler was designed to write per-seed WARCs from its inception and has always written deduplication information per seed. The content captured for each seed is therefore self-contained, and replay does not rely on the availability of content in other seeds' WARCs. It also means that common resources may be captured multiple times if they are referenced by different seeds.

    - Standard (Heritrix) crawls shifted from writing WARCs per crawl to writing them per seed in September 2018. Deduplication for Standard crawls is still done at the collection level rather than the seed level, although this is something we may adjust in the future.
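
To make the difference in deduplication scope concrete, here is a minimal sketch. It is not Archive-It's, Brozzler's, or Heritrix's actual code: the `crawl` function, the `"seed"` and `"collection"` labels, and the example URLs are hypothetical, and it assumes duplicates are detected by a digest of the payload.

```python
import hashlib


def crawl(seed_resources, scope):
    """Count full captures vs. duplicates for each seed.

    seed_resources: dict mapping a seed URL to the payloads fetched for it.
    scope: "seed" or "collection" (hypothetical labels for this sketch).
    """
    seen = set()
    written = {}
    for seed, payloads in seed_resources.items():
        full, duplicates = 0, 0
        for payload in payloads:
            digest = hashlib.sha1(payload).hexdigest()
            # Seed scope: a payload is only a duplicate within the same seed.
            # Collection scope: a payload seen under any seed is a duplicate.
            key = (seed, digest) if scope == "seed" else digest
            if key in seen:
                duplicates += 1
            else:
                seen.add(key)
                full += 1
        written[seed] = (full, duplicates)
    return written


resources = {
    "https://a.example/": [b"shared-logo", b"page-a"],
    "https://b.example/": [b"shared-logo", b"page-b"],
}
per_seed = crawl(resources, "seed")
per_collection = crawl(resources, "collection")
```

With seed scope, both seeds store their own copy of the shared resource, so each seed's WARCs replay on their own. With collection scope, the second seed stores only a pointer to the first seed's copy, which saves data but makes its replay depend on another seed's WARCs.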

  • Karl Blumenthal

    Update: As Lori suggested we might, we have since adjusted "standard" crawl deduplication to align with the Brozzler model. Since mid-January of this year, Heritrix-based crawl jobs also deduplicate at the seed level. An added benefit is that Brozzler and Standard crawls of the same seeds can now deduplicate against each other, which avoids unnecessary data expenditure when the same seeds are collected by both technologies over time.
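
The cross-technology benefit can be pictured as both crawlers consulting one seed-scoped index. This is only an illustrative sketch under assumptions: `SeedDedupIndex`, its `record` method, and the crawler labels are hypothetical names, not part of any Archive-It, Brozzler, or Heritrix API.

```python
import hashlib


class SeedDedupIndex:
    """Hypothetical index shared by every crawler that captures a seed."""

    def __init__(self):
        # Maps (seed, payload digest) to the crawler that first stored it.
        self._first_writer = {}

    def record(self, seed, payload, crawler):
        key = (seed, hashlib.sha1(payload).hexdigest())
        if key in self._first_writer:
            # Already stored for this seed, whichever crawler captured it.
            return "revisit"
        self._first_writer[key] = crawler
        return "response"


index = SeedDedupIndex()
seed = "https://a.example/"
first = index.record(seed, b"page-a", crawler="brozzler")
second = index.record(seed, b"page-a", crawler="standard")
other_seed = index.record("https://b.example/", b"page-a", crawler="standard")
```

Here the second capture of the same seed is treated as a duplicate even though a different crawler made it, while the same payload under a different seed is still stored in full.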
