What is it?
Data de-duplication prevents duplicate data from being stored and counting against your data budget twice. As of January 2020, de-duplication occurs at the seed level and applies between Brozzler and Standard crawls.
How does it work?
Data de-duplication occurs at the seed level. This means:
- If a live-web document originally captured from Seed A does not change in any way and is crawled again via Seed A, data from the re-visit of that document will not count toward your Archive-It account's data budget a second time.
- If a live-web document originally captured from Seed A does not change in any way and is later crawled in the same collection via Seed B, data from the second crawl will count toward your Archive-It account's data budget, because de-duplication does not apply across seeds.
Data de-duplication is URL-agnostic, meaning that a document does not need to have the same URL as the last time it was crawled for the crawler to recognize it as a duplicate.
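The rules above can be illustrated with a minimal sketch. It is not Archive-It's actual implementation: it assumes that duplicates are recognized by a hash of the document's content, tracked separately for each seed, and the names `SeedDedupIndex` and `record` as well as the example URLs are hypothetical.

```python
import hashlib


class SeedDedupIndex:
    """Illustrative seed-level, URL-agnostic de-duplication (assumed design)."""

    def __init__(self):
        # Set of (seed, content digest) pairs that have already been stored.
        self._seen = set()

    def record(self, seed: str, url: str, payload: bytes) -> int:
        """Return the number of bytes that would count toward the data budget."""
        # The URL is deliberately ignored: only the seed and the content matter.
        digest = hashlib.sha1(payload).hexdigest()
        key = (seed, digest)
        if key in self._seen:
            return 0  # unchanged document re-crawled via the same seed
        self._seen.add(key)
        return len(payload)


index = SeedDedupIndex()
page = b"<html>unchanged</html>"

first = index.record("seed-a", "http://example.org/page", page)
repeat = index.record("seed-a", "http://example.org/page", page)
moved = index.record("seed-a", "http://example.org/new-location", page)
other_seed = index.record("seed-b", "http://example.org/page", page)
changed = index.record("seed-a", "http://example.org/page", page + b"<!-- edit -->")
```

In this sketch, `first` counts the full document size, `repeat` and `moved` count nothing because the same content was already captured via Seed A (even under a different URL), and `other_seed` counts the full size again because the capture came through Seed B. Any change to the content, as in `changed`, produces a new digest and is counted as new data.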
How can you tell if a page in Wayback has changed?
The Total and New columns in your crawl reports show how much of each seed site's content has changed between crawls.
On a Wayback Calendar page, an asterisk (*) next to a capture date indicates that the document has changed since it was last crawled, or that it was captured via a different seed. If there is no asterisk, the document has not changed since the previous capture.