What it is
Data de-duplication prevents duplicate data from being stored and counting against your data budget twice.
How it works
While a crawl is running, the de-duplication process checks against Wayback records collected from the same seed to determine whether or not a given document is new.
A document will be considered new if the digest/checksum value in its CDX entry is unique. If a document with the same digest has previously been collected via the same seed, the document will be de-duplicated. When this happens, the crawler doesn't collect the document. Instead, the crawler creates a warc/revisit record that references the last new capture of the document.
Data de-duplication occurs at the seed level. This means:
- If a live-web document originally collected from Seed A does not change in any way and is collected via Seed A again, data from the revisit of that document will not count toward your Archive-It account's data budget a second time.
- If a live-web document originally collected from Seed A does not change in any way and is later crawled in the same collection via Seed B, data from the second crawl will count toward your Archive-It account's data budget.
Data de-duplication is URL-agnostic, meaning a document does not need to have the same URL it had the last time it was crawled for the crawler to recognize it as a duplicate.
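The seed-level, URL-agnostic behavior described above can be sketched in a few lines of Python. This is only an illustration, not Archive-It's or the crawler's actual implementation: the function name `crawl`, the in-memory `seen` set, and the use of SHA-1 as the digest are all assumptions made for the example.

```python
import hashlib

def crawl(seen, seed, url, body):
    """Return the kind of record a crawler might write for this capture.

    Hypothetical sketch: duplicates are tracked per seed and per content
    digest. The URL is deliberately not part of the lookup key.
    """
    digest = hashlib.sha1(body).hexdigest()  # assumed digest algorithm
    key = (seed, digest)
    if key in seen:
        return "warc/revisit"  # duplicate: reference the earlier capture
    seen.add(key)
    return "response"  # new document: stored and counted against the budget

seen = set()
page = b"<html>unchanged</html>"

results = [
    crawl(seen, "Seed A", "http://example.org/a", page),        # first capture
    crawl(seen, "Seed A", "http://example.org/a-moved", page),  # same seed, new URL
    crawl(seen, "Seed B", "http://example.org/a", page),        # different seed
]
print(results)
```

In this sketch, the second capture is treated as a duplicate even though its URL changed, while the third is stored again because it was reached via a different seed.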
How to tell if a document was de-duplicated
Crawl reports
You can tell how much data de-duplication occurred during a crawl by comparing the Total and New columns for both data and documents. The larger the gap between Total and New, the more de-duplication occurred.
Specific documents discovered to be brand new (never previously collected via that seed) or that have changed since they were last collected will be listed as New in the crawl’s Hosts report. You can access this information by clicking on a hostname in a File Types report and looking at the New column.
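As a rough worked example of comparing the Total and New columns, the sketch below computes how much of a crawl was de-duplicated. It assumes that Total covers everything the crawler encountered and that New is the subset actually stored. The function name `deduplicated` and the figures are made up for illustration and do not come from a real crawl report.

```python
def deduplicated(total, new):
    """Return (amount de-duplicated, share of total) for one report column."""
    if new > total:
        raise ValueError("New cannot exceed Total")
    saved = total - new
    share = saved / total if total else 0.0
    return saved, share

# Hypothetical report values: 10,000 total documents, 2,500 of them new.
saved_docs, share_docs = deduplicated(total=10_000, new=2_500)
print(saved_docs, share_docs)
```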
CDX
You can see how often a document was de-duplicated by looking at the Mimetype field in a document URL's CDX index. Documents identified as duplicates by Archive-It will have a warc/revisit MIME type, indicating that they reference previous captures in another WARC. New documents will have a specific file/MIME type, for example, text/html, image/jpeg, or application/pdf. See the instructions for Accessing Archive-It's Wayback index with the CDX/C API for more information.
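Once you have CDX output for a URL, counting de-duplicated captures amounts to tallying the Mimetype field. The sketch below does this for a small, made-up sample. It assumes space-separated lines in the order urlkey, timestamp, original, mimetype, statuscode, digest, length; check the field order of your own output before relying on it. The sample lines, digests, and the `count_captures` helper are hypothetical.

```python
from collections import Counter

# Assumed field order; verify against your own CDX output.
FIELDS = ["urlkey", "timestamp", "original", "mimetype",
          "statuscode", "digest", "length"]

# Made-up sample data for illustration only.
SAMPLE_CDX = """\
org,example)/ 20200101000000 http://example.org/ text/html 200 AAAA 1200
org,example)/ 20200201000000 http://example.org/ warc/revisit - AAAA 300
org,example)/ 20200301000000 http://example.org/ warc/revisit - AAAA 300
org,example)/ 20200401000000 http://example.org/ text/html 200 BBBB 1250
"""

def count_captures(cdx_text):
    """Count new captures versus warc/revisit (de-duplicated) captures."""
    counts = Counter()
    for line in cdx_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        row = dict(zip(FIELDS, line.split()))
        kind = "revisit" if row["mimetype"] == "warc/revisit" else "new"
        counts[kind] += 1
    return counts

counts = count_captures(SAMPLE_CDX)
print(counts["new"], counts["revisit"])
```

In the sample, the document was stored twice (the original capture and one changed version) and de-duplicated twice in between.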