I have been running test crawls on a number of seeds that contain links to YouTube videos that I'd like to capture, but I am running into a bit of a snag. By adjusting the seed type to Standard+, or by expanding the scope to include youtube or googlevideo links, I can capture these videos, but the crawl seems to capture each video three times. My question, then, is twofold:
1) If I were to save this test crawl, would the de-dupe recognize that each video was already captured once and ignore the remaining copies? (e.g. the captured video/webm files total 3.8 GB, but I only need 1.2 GB of that data; the remainder is triplicate copies of the same videos)
2) If the de-dupe doesn't filter out the extra captures, does anyone have a scoping recipe that would eliminate this duplication? I've tried to figure out whether the duplicate links share any common features (regularly occurring text that would distinguish each copy), but I cannot see any pattern.
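In case it helps anyone looking at the same thing, here is a rough sketch of one way the captured links could be compared for distinguishing features: group the URLs by host and path, then list the query parameters whose values differ within each group. The URLs, the `quality` parameter, and the function name `varying_params` are made up for illustration; they are not taken from an actual crawl report.

```python
from collections import defaultdict
from urllib.parse import urlsplit, parse_qs


def varying_params(urls):
    """Group URLs by (host, path) and report which query parameters differ within each group."""
    groups = defaultdict(list)
    for url in urls:
        parts = urlsplit(url)
        groups[(parts.netloc, parts.path)].append(parse_qs(parts.query))

    report = {}
    for key, queries in groups.items():
        # Every parameter name seen anywhere in this group.
        names = set().union(*(q.keys() for q in queries))
        differing = {}
        for name in sorted(names):
            # A missing parameter is represented by an empty tuple.
            values = {tuple(q.get(name, [])) for q in queries}
            if len(values) > 1:
                differing[name] = sorted(values)
        report[key] = differing
    return report


# Hypothetical URLs standing in for three captures of one video.
captured = [
    "https://media.example.org/videoplayback?id=abc123&quality=low",
    "https://media.example.org/videoplayback?id=abc123&quality=medium",
    "https://media.example.org/videoplayback?id=abc123&quality=high",
]

for (host, path), differing in varying_params(captured).items():
    print(host, path, differing)
```

If a parameter turns out to vary consistently between the three copies, that text might be usable in a scoping rule; if nothing varies in a predictable way, the duplication probably can't be blocked by URL text alone.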
I just wanted to make sure before potentially wasting our budget on useless data.
Thanks in advance for any advice!