About data de-duplication

On this page:

What is data de-duplication
How data de-duplication works
How to tell if a document was de-duplicated
A/V de-duplication

What is data de-duplication

Data de-duplication prevents duplicate data from being stored and counted against your data budget twice.

How data de-duplication works

While a crawl is running, the de-duplication process checks against Wayback records collected from the same seed to determine whether or not a given document is new.

A document will be considered new if the digest/checksum value in its CDX entry is unique. If a document with the same digest has previously been collected via the same seed, the document will be de-duplicated. When this happens, the crawler doesn't collect the document. Instead, the crawler creates a warc/revisit record that references the last new capture of the document.

Data de-duplication occurs at the seed level. This means:

If a live-web document originally collected via Seed A does not change in any way and is collected via Seed A again, data from the revisit of that document will not count toward your Archive-It account's data budget a second time.
If a live-web document originally collected via Seed A does not change in any way and is later collected in the same collection via Seed B, data from the second crawl will count toward your Archive-It account's data budget.

Data de-duplication is URL-agnostic, meaning a document does not need to have the same URL as the last time it was crawled for the crawler to recognize it as a duplicate.
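
The rules above can be illustrated with a short sketch. It is not Archive-It's implementation: the class name `SeedDedupIndex`, its `record` method, and the choice of a Base32-encoded SHA-1 digest are assumptions made for illustration only. It shows a lookup keyed on the seed and the content digest, with the URL playing no part in the comparison.

```python
import base64
import hashlib


def content_digest(payload: bytes) -> str:
    """Return a Base32-encoded SHA-1 of the payload (assumed digest format)."""
    return base64.b32encode(hashlib.sha1(payload).digest()).decode("ascii")


class SeedDedupIndex:
    """Hypothetical index of digests already collected, kept per seed."""

    def __init__(self):
        # (seed, digest) -> URL of the new capture that first stored this content
        self._seen = {}

    def record(self, seed: str, url: str, payload: bytes):
        """Classify a fetched document as 'new' or 'warc/revisit'."""
        key = (seed, content_digest(payload))
        if key in self._seen:
            # Same seed, same digest: reference the earlier capture instead of storing again.
            return ("warc/revisit", self._seen[key])
        self._seen[key] = url
        return ("new", url)


index = SeedDedupIndex()
first = index.record("Seed A", "http://example.org/report.pdf", b"same bytes")
moved = index.record("Seed A", "http://example.org/files/report.pdf", b"same bytes")
other_seed = index.record("Seed B", "http://example.org/report.pdf", b"same bytes")
```

In this sketch `moved` is classified as a revisit even though its URL differs, while `other_seed` is classified as new because the same content was reached via a different seed.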

How to tell if a document was de-duplicated

Crawl reports

You can tell how much data de-duplication occurred during a crawl by comparing the Total and New data and documents columns.
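
As a rough worked example of that comparison, the sketch below subtracts the New figures from the Total figures to estimate how much of a crawl was de-duplicated. The function name `dedup_summary` and the sample numbers are invented for illustration, and the sketch assumes that everything counted in Total but not in New was de-duplicated.

```python
def dedup_summary(total_docs: int, new_docs: int, total_bytes: int, new_bytes: int) -> dict:
    """Estimate de-duplicated documents and data from a crawl report's Total and New values."""
    duplicate_docs = total_docs - new_docs
    duplicate_bytes = total_bytes - new_bytes
    return {
        "duplicate_docs": duplicate_docs,
        "duplicate_bytes": duplicate_bytes,
        # Guard against empty crawls so the shares never divide by zero.
        "duplicate_doc_share": duplicate_docs / total_docs if total_docs else 0.0,
        "duplicate_byte_share": duplicate_bytes / total_bytes if total_bytes else 0.0,
    }


# Made-up figures: 1,000 documents in total, 250 of them new.
summary = dedup_summary(total_docs=1000, new_docs=250, total_bytes=4000, new_bytes=1000)
print(summary["duplicate_docs"], summary["duplicate_doc_share"])
```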

Specific documents discovered to be brand new (never previously collected via that seed) or that have changed since they were last collected will be listed as New in the crawl’s Hosts report. You can access this information by clicking on a hostname in a File Types report and looking at the New column.

CDX

You can see how often a document was de-duplicated by looking at the Mimetype field in the CDX index for that document's URL. Documents identified as duplicates by Archive-It will have a warc/revisit MIME type, indicating that they reference previous captures in another WARC. New documents will have a specific file/MIME type, for example, text/html, image/jpeg, or application/pdf. See instructions for Accessing Archive-It's Wayback index with the CDX/C API for more information.
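
Once you have CDX lines for a URL, counting the revisit records can be scripted. The sketch below is a minimal illustration, assuming space-separated CDX lines with the MIME type in the fourth field; the field position, the function name `count_captures`, and the sample lines are assumptions, so adjust `mimetype_index` to match the output you actually receive.

```python
from collections import Counter


def count_captures(cdx_lines, mimetype_index=3):
    """Count new captures versus warc/revisit records in space-separated CDX lines."""
    counts = Counter()
    for line in cdx_lines:
        fields = line.split()
        if len(fields) <= mimetype_index:
            continue  # skip blank or truncated lines
        kind = "revisit" if fields[mimetype_index] == "warc/revisit" else "new"
        counts[kind] += 1
    return counts


# Invented sample lines: urlkey, timestamp, original URL, MIME type, status, digest, length.
SAMPLE = [
    "org,example)/ 20240101000000 http://example.org/ text/html 200 AAAAAAAAAAAAAAAA 1234",
    "org,example)/ 20240201000000 http://example.org/ warc/revisit - AAAAAAAAAAAAAAAA 456",
    "org,example)/ 20240301000000 http://example.org/ warc/revisit - AAAAAAAAAAAAAAAA 457",
]

counts = count_captures(SAMPLE)
print(counts["new"], counts["revisit"])
```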

A/V de-duplication

Audio and video files are currently difficult to de-duplicate. This is partly due to how crawlers and tools like yt-dlp collect A/V content, and partly because many platforms serve media dynamically or in segmented formats.
As a result, you may not see de-duplication occur for audio or video files. This is expected behavior at this time. Improving A/V de-duplication is an active area of exploration for our team, and we are continuing to investigate approaches that could improve how this content is handled in the future.
