What is it?
Data de-duplication prevents duplicate data from being stored and counted against your data budget twice. As of January 2020, de-duplication occurs at the seed level and applies across both Brozzler and Standard crawls.
How does it work?
Data de-duplication occurs at the seed level. This means:
- If a live-web document originally captured from Seed A does not change in any way and is crawled again via Seed A, data from the re-visit of that document will not count toward your Archive-It account's data budget a second time.
- If a live-web document originally captured from Seed A does not change in any way and is later crawled in the same collection by Seed B, data from the second crawl will count toward your Archive-It account's data budget.
Data de-duplication is URL-agnostic, meaning a document does not need to have the same URL as the last time it was crawled for the crawler to recognize it as a duplicate.
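The rules above can be illustrated with a minimal sketch. It is not Archive-It's actual implementation: the class name, the method name, the seed labels, and the use of a SHA-1 hash of the document content to detect duplicates are all assumptions made for illustration. The sketch tracks previously seen content separately for each seed and ignores the URL when deciding whether a document is a duplicate.

```python
import hashlib


class SeedLevelDeduplicator:
    """Hypothetical model of seed-level, URL-agnostic de-duplication."""

    def __init__(self):
        # Maps each seed to the set of content digests already captured for it.
        self._seen_by_seed = {}
        # Running total of bytes that would count toward the data budget.
        self.counted_bytes = 0

    def record(self, seed, url, payload):
        """Return True if the payload counts toward the budget, else False.

        The url argument is accepted but deliberately not used in the
        comparison, which is what makes this sketch URL-agnostic.
        """
        # Assumption: duplicates are detected by hashing the content.
        digest = hashlib.sha1(payload).hexdigest()
        seen = self._seen_by_seed.setdefault(seed, set())
        if digest in seen:
            return False  # unchanged content for this seed: not counted again
        seen.add(digest)
        self.counted_bytes += len(payload)
        return True


dedup = SeedLevelDeduplicator()
doc = b"<html>unchanged page</html>"

first = dedup.record("seed-a", "https://example.org/page", doc)
revisit = dedup.record("seed-a", "https://example.org/page", doc)
moved = dedup.record("seed-a", "https://example.org/moved", doc)
other_seed = dedup.record("seed-b", "https://example.org/page", doc)
```

In this sketch, the first capture via Seed A counts, the unchanged re-visit via Seed A does not, the same content at a different URL under Seed A does not, and the capture via Seed B counts again because each seed keeps its own record of seen content.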
How can you tell if a page in Wayback has changed?
The Total and New columns within your crawl reports show how much of the content on your seed sites has changed between crawls.