Saving a test crawl with a mix of New Data and Duplicate Data
I have run and then stopped several Test crawls on a Collection with a single seed, adjusting the collection scope each time, for example because some of the earlier ones captured more than I wanted.
Now I have a new Test Crawl that has the scope rules that I want, and I want to save it and make it available.
Subsequent Test crawls indicate a mix of New Data and Duplicate Data. If I save only the latest Test crawl and ignore the older crawls that captured what the new crawl reports as Duplicate Data, will the latest crawl include that "Duplicate Data", or do I also need to save each of the previous Test crawls to get all of the data stored? I'd hope that I can just save the latest Test Crawl, but I obviously don't want to skip over and lose all of the "Duplicate Data".
Thanks in advance!
-
Michael,
Test crawls don't take into account any data captured in other unsaved test crawls. A test crawl will only deduplicate against permanent data in your collection (from production crawls or other saved test crawls), or occasionally against itself (the same data identified in more than one place during a single crawl).
If you're happy with the results of one test crawl and choose to save it, you will be saving all the data you see listed in that crawl's report. The amount listed in the "New" column is what will be deducted from your annual data budget.
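To make the rule concrete, here is a rough sketch of the deduplication behavior described above. It is only an illustrative model, not Archive-It's actual implementation: the function name `classify_crawl`, the digest strings, and the byte sizes are all made up for the example. Each capture is treated as a (content digest, size) pair and compared only against permanent data and against earlier captures in the same crawl.

```python
def classify_crawl(captures, permanent_digests):
    """Return (new_bytes, duplicate_bytes) for a single crawl.

    Hypothetical model: `captures` is a list of (digest, size_in_bytes)
    pairs, and `permanent_digests` holds digests of data already saved
    in the collection (production crawls or saved test crawls).
    """
    # Start from permanent data only; unsaved test crawls are never consulted.
    seen = set(permanent_digests)
    new_bytes = 0
    duplicate_bytes = 0
    for digest, size in captures:
        if digest in seen:
            # Matches permanent data or an earlier capture in this same crawl.
            duplicate_bytes += size
        else:
            new_bytes += size
            seen.add(digest)
    return new_bytes, duplicate_bytes


# Made-up example data.
permanent = {"aaa"}  # already saved in the collection
test_crawl_1 = [("aaa", 100), ("bbb", 200), ("ccc", 300)]  # never saved
test_crawl_2 = [("bbb", 200), ("ccc", 300), ("ccc", 300), ("ddd", 50)]

# test_crawl_1 was not saved, so it has no effect on test_crawl_2's report.
new_2, dup_2 = classify_crawl(test_crawl_2, permanent)
print(new_2, dup_2)  # 550 300
```

In this model, the "bbb" and "ccc" captures count as new in the second crawl even though the unsaved first crawl also captured them. The only duplicate is the second "ccc", which the crawl found twice on its own.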
Raven Germain, Archive-It Staff