YouTube videos triplicated

December 02, 2016 15:43

Greetings all,

I have been running test crawls on a number of seeds that contain links to YouTube videos that I'd like to capture, but I am running into a bit of a snag. By adjusting the seed type to Standard+, or scoping to include youtube or googlevideo links, I can capture these videos, but it seems to capture each video three times. My question, then, is twofold:

1) If I were to save this test crawl, would the de-dupe recognize that the video was captured once and ignore the remaining copies? (For example, the captured video/webm files total 3.8 GB, but I only need 1.2 GB of that data; the remainder is triplicate videos.)

2) If the de-dupe doesn't filter the extra captures, is there a particular scoping recipe anyone has that would eliminate this duplication? I've tried to figure out whether the links include any common features (regularly occurring text that would distinguish each), but I cannot see any predictors.

I just wanted to make sure before potentially wasting our budget on useless data.

Thanks in advance for any advice!

Cheers,

Jeremy

Comments

Official comment

Sylvie Rollason-Cass December 07, 2016 17:17

Hi Jeremy,

YouTube and Googlevideo content can be a bit tricky. For the most part, you should be able to capture any embedded YouTube videos by following the scoping rules on the YouTube help page. They do not usually require expand scope rules or using a + seed type. Of course the web can be messy, so while this is usually the case there can always be outliers that require a little extra scoping. This is where test crawls come in handy.

There are two things I want to point out about de-duplication that I hope will help clear a few things up:

First, it's important to note that a test crawl will de-duplicate against itself (meaning it won't capture the same data twice in one crawl) and against the data in your permanent collection, but not against other unsaved test crawls. The number under New Data in your test crawl is what will be applied to your account should you choose to save it. Second, data de-duplication is URL agnostic, which means the crawler is able to identify duplicate data even if it's being served from two distinct URLs.

I have seen file type reports for crawls that captured 3 versions of the same webm video; in my experience, however, 2 of those versions will usually have been de-duplicated. You can tell a file has been de-duplicated when you see a 0 in its New Data column. If multiple versions of a few videos are still being captured as new data, it would be because the data in each version is different enough that the crawler identified it as unique.

A video report without any de-duplication sounds like a red flag to me. If you're seeing this consistently please consider sending in a support ticket with links to the reports in question.

-Sylvie
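
To illustrate the reply above, here is a minimal sketch of URL-agnostic de-duplication within a single crawl. It assumes the crawler compares a digest of each payload; the use of SHA-1, the function name `new_data_report`, and the example URLs are all hypothetical and not confirmed by this thread.

```python
import hashlib

def new_data_report(captures, known_digests=()):
    """Return (url, size, new_data) rows; duplicate payloads get new_data == 0.

    known_digests stands in for data already saved in the collection.
    """
    seen = set(known_digests)
    rows = []
    for url, payload in captures:
        # The digest depends only on the bytes, never on the URL.
        digest = hashlib.sha1(payload).hexdigest()
        new_bytes = 0 if digest in seen else len(payload)
        seen.add(digest)
        rows.append((url, len(payload), new_bytes))
    return rows

video = b"v" * 1200  # stand-in for one webm payload

# Hypothetical URLs: the same bytes served from three distinct addresses.
captures = [
    ("https://r1.example.com/videoplayback?id=abc&sig=1", video),
    ("https://r2.example.com/videoplayback?id=abc&sig=2", video),
    ("https://r3.example.com/videoplayback?id=abc&sig=3", video),
]

rows = new_data_report(captures)
total_captured = sum(size for _, size, _ in rows)
total_new = sum(new for _, _, new in rows)
```

In this sketch the report lists three captures, but only the first contributes new data, which mirrors a file type report showing three versions with a 0 in the New Data column for two of them.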

Jeremy Heil December 08, 2016 16:08

Thank you so much for this response, Sylvie! I suspected the de-duplication would have prevented it from saving more than once, but I just wanted to be extra certain beforehand.

This does raise a secondary question, however (this is far less pressing, but I'm still interested in clarifying). The language of the de-duplication documentation in the User Guide seems to indicate that the process is run against other assets in the collection. Would this mean that a video captured in one collection (that had been created in one account) could potentially be captured a second time in another collection (created in the same account)? Or am I misreading this, and does the de-dupe run across all assets in the account, crossing the collection boundaries?

Cheers,

Jeremy

Sylvie Rollason-Cass December 14, 2016 23:52

Hi Jeremy,

De-duplication currently runs across assets in a single collection, so the answer to your question is yes: it is possible for the same data to be captured twice in separate collections within a single account, and for each capture to be counted separately against that account's data budget.

-Sylvie
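
The collection-scoped behaviour described in the reply above can be sketched as follows. This is only an illustration under stated assumptions: the `Account` class, its `save` method, and the digest-based comparison are hypothetical, not the service's actual implementation.

```python
import hashlib
from collections import defaultdict

class Account:
    """Hypothetical account: one set of seen digests per collection."""

    def __init__(self):
        self.collections = defaultdict(set)  # collection name -> payload digests
        self.budget_used = 0                 # bytes counted against the account

    def save(self, collection, payload):
        """Save a payload to a collection; return the bytes counted as new data."""
        digest = hashlib.sha1(payload).hexdigest()
        seen = self.collections[collection]
        if digest in seen:
            return 0  # duplicate within the same collection
        seen.add(digest)
        self.budget_used += len(payload)
        return len(payload)

video = b"v" * 1200  # stand-in for one video payload

account = Account()
first = account.save("Collection A", video)     # new data
repeat = account.save("Collection A", video)    # de-duplicated
other = account.save("Collection B", video)     # counted again in a different collection
```

Because each collection keeps its own record of what it has seen, the same video saved to two collections is counted twice against the shared budget in this sketch.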
