Archive-It will not store the same content twice from the same web page. The service uses URL-agnostic data deduplication (commonly referred to as 'de-dupe') to ensure that any archived URL (a homepage, an image file, a CSS stylesheet, etc.) is not archived again if its content has not changed. Even in the initial crawl of a seed URL, the same resource may appear in multiple places across the site in duplicate form, so you may notice deduplication in effect when analyzing the number of documents and the amount of data archived.