My institution is excited about the new approach of writing WARC files per seed rather than per crawl, as described here.
One thing we're wondering about is how seed-level WARCs interact with data de-duplication. I'm thinking of the scenario where several seeds within a collection use a common resource, like a CSS file. This happens pretty frequently when we collect from large domains (e.g., gc.ca) using many URLs within the domain as seeds.
Would the common resource be captured multiple times and wrapped into each seed's WARCs, so that the WARCs are a self-contained archive for that seed? Or would it only be captured once per collection? We'd be curious to know how AIT is approaching this; there are obviously pros and cons to each.
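For context on the tradeoff we mean: WARC de-duplication is typically digest-based, with a later capture of an already-seen payload written as a lightweight "revisit" record pointing back to the original capture. The sketch below is purely illustrative (the function and index names are ours, not Archive-It's implementation); it shows how a crawl-wide dedup index would cause a shared CSS file fetched under a second seed to become a revisit reference into another seed's WARC, which is exactly what would break per-seed self-containment.

```python
# Illustrative sketch only: digest-based dedup with a crawl-wide index.
# Names (write_record, dedup_index) are hypothetical, not AIT's actual code.
import hashlib

def write_record(dedup_index, seed, url, payload):
    """Return a ('response', ...) tuple for a first capture, or a
    ('revisit', ...) tuple when this payload digest was already seen."""
    digest = hashlib.sha1(payload).hexdigest()
    if digest in dedup_index:
        # Payload already stored under another seed's WARC; a revisit
        # record points back to that capture instead of duplicating bytes.
        return ("revisit", seed, url, dedup_index[digest])
    dedup_index[digest] = (seed, url)
    return ("response", seed, url, digest)

# A shared CSS file fetched under two different seeds in one crawl:
index = {}
css = b"body { margin: 0 }"
r1 = write_record(index, "https://www.gc.ca/a", "https://www.gc.ca/style.css", css)
r2 = write_record(index, "https://www.gc.ca/b", "https://www.gc.ca/style.css", css)
print(r1[0])  # response  (full payload lands in seed a's WARC)
print(r2[0])  # revisit   (seed b's WARC only references seed a's copy)
```

With a per-seed index instead of a crawl-wide one, both calls would return full response records, keeping each seed's WARCs self-contained at the cost of storing the payload twice.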