Seed-level WARCs and data de-duplication

Comments

1 comment

  • Lori Donovan

    Hi Russell, great question. This is currently handled differently for Brozzler and Standard (Heritrix) crawls.

    - Brozzler was designed from its inception to write per-seed WARCs, and it has always written deduplication information per seed. As a result, the content for each seed is self-contained, and replay does not rely on the availability of content in other seeds' WARCs. It also means that common resources may be captured multiple times if they are referenced by different seeds.

    - Standard (Heritrix) crawls shifted from writing WARCs per crawl to writing them per seed in September 2018. Deduplication for Standard crawls is still done at the collection level rather than the seed level, although this is something we may adjust in the future.
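To make the difference in scope concrete, here is a minimal toy sketch in Python. It is not how Brozzler or Heritrix actually implement deduplication; the function name `plan_writes`, the `scope` parameter, the seed labels, and the example URL are all hypothetical. It assumes deduplication is keyed on a payload digest, and that a duplicate is recorded as a WARC `revisit` record pointing at an earlier capture instead of a full `response` record.

```python
import hashlib


def plan_writes(captures, scope):
    """Decide, for each capture, whether to store a full copy or a pointer.

    captures: iterable of (seed, url, payload_bytes) tuples
    scope: "seed" for per-seed dedup, "collection" for collection-wide dedup
    Returns a list of (seed, url, record_type) tuples.
    """
    seen = set()
    plan = []
    for seed, url, payload in captures:
        digest = hashlib.sha1(payload).hexdigest()
        # Per-seed scope: the same payload counts as new for every seed.
        # Collection scope: the payload is stored once for the whole collection.
        key = (seed, digest) if scope == "seed" else digest
        if key in seen:
            plan.append((seed, url, "revisit"))
        else:
            seen.add(key)
            plan.append((seed, url, "response"))
    return plan


# Hypothetical example: two seeds reference the same logo, and seed-a sees it twice.
captures = [
    ("seed-a", "https://example.org/logo.png", b"logo"),
    ("seed-b", "https://example.org/logo.png", b"logo"),
    ("seed-a", "https://example.org/logo.png", b"logo"),
]

per_seed = plan_writes(captures, "seed")
per_collection = plan_writes(captures, "collection")
```

With per-seed scope, seed-b gets its own full copy of the logo, so its WARC can be replayed on its own at the cost of storing the resource twice. With collection scope, seed-b's capture becomes a pointer to the copy stored under seed-a, which saves space but means replay of seed-b depends on seed-a's WARC being available.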
