YouTube videos triplicated



  • Official comment
    Sylvie Rollason-Cass

    Hi Jeremy,

    YouTube and Googlevideo content can be a bit tricky. For the most part, you should be able to capture any embedded YouTube videos by following the scoping rules on the YouTube help page. Embedded videos do not usually require expand-scope rules or a + seed type. Of course, the web can be messy, so there can always be outliers that require a little extra scoping. This is where test crawls come in handy.

    There are two things I want to point out about de-duplication that I hope will help clear a few things up:

    First, it's important to note that a test crawl will de-duplicate against itself (meaning it won't capture the same data twice in one crawl) and against the data in your permanent collection, but not against other unsaved test crawls. The number under New Data in your test crawl is what will be applied to your account should you choose to save it.

    Second, data de-duplication is URL-agnostic, which means the crawler is able to identify duplicate data even if it's being served from two distinct URLs.
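To make the two points above concrete, here is a minimal sketch of URL-agnostic de-duplication. It is an illustration only, not Archive-It's actual implementation: the function name `run_test_crawl`, the use of a SHA-1 digest of the payload, and the shape of the report are all assumptions made for this example.

```python
import hashlib


def run_test_crawl(responses, collection_digests):
    """Simulate one test crawl (hypothetical helper, not an Archive-It API).

    responses: iterable of (url, payload_bytes) pairs fetched by the crawl.
    collection_digests: digests of data already saved in the permanent collection.
    Returns (new_data_bytes, report), where report lists (url, new_bytes) per capture.
    """
    # Copy the set so an unsaved test crawl never changes the permanent collection.
    seen = set(collection_digests)
    new_data = 0
    report = []
    for url, payload in responses:
        # Assumption: duplicates are detected by content digest, never by URL.
        digest = hashlib.sha1(payload).hexdigest()
        if digest in seen:
            report.append((url, 0))  # duplicate: shows as 0 new data
        else:
            seen.add(digest)
            new_data += len(payload)
            report.append((url, len(payload)))
    return new_data, report
```

In this sketch, the same video served from three distinct URLs counts once toward new data, and the other two captures report 0. Running a second test crawl with the same `collection_digests` yields the same total, because nothing from the first, unsaved crawl was recorded.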

    I have seen file type reports for crawls that captured 3 versions of the same webm video; however, in my experience, 2 of those versions will usually have been de-duplicated. You can tell they're being de-duplicated when you see a 0 in that file’s New Data column. If multiple versions of a few of them are being captured, it would be because the data in each version is different enough that the crawler identified it as unique.

    A video report without any de-duplication sounds like a red flag to me. If you're seeing this consistently please consider sending in a support ticket with links to the reports in question.


  • Jeremy Heil

    Thank you so much for this response, Sylvie!  I suspected the de-duplication would have prevented it from saving more than once, but I just wanted to be extra certain beforehand.

    This does raise a secondary question, however (this is far less pressing, but I'm still interested in clarifying).  The language of the de-duplication documentation in the User Guide seems to indicate that the process is run against other assets in the collection.  Would this mean that a video captured in one collection (that had been created in one account) could potentially be captured a second time in another collection (created in the same account)? Or am I misreading this, and does the de-dupe run across all assets in the account, crossing the collection boundaries?



  • Sylvie Rollason-Cass

    Hi Jeremy, 

    De-duplication currently runs across assets in a single collection, so the answer to your first question is yes. It is possible for the same data to be captured twice in separate collections within a single account, and for each capture to be applied separately to that account's data budget.
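The collection boundary described above can be sketched as follows. This is a hedged illustration, not Archive-It's implementation: the `Account` class, its `save` method, and the SHA-1 content digest are hypothetical names and choices used only to show de-duplication scoped to one collection while the data budget is tracked per account.

```python
import hashlib
from collections import defaultdict


class Account:
    """Hypothetical model of an account holding several collections."""

    def __init__(self):
        # One independent set of content digests per collection.
        self.digests_by_collection = defaultdict(set)
        # Total bytes applied to the account's data budget.
        self.data_used = 0

    def save(self, collection, payload):
        """Save payload into a collection; return the bytes of new data."""
        digest = hashlib.sha1(payload).hexdigest()
        seen = self.digests_by_collection[collection]
        if digest in seen:
            return 0  # duplicate within this collection
        seen.add(digest)
        self.data_used += len(payload)
        return len(payload)
```

Under these assumptions, saving the same video into two collections of one account counts its size twice against the budget, while saving it a second time into the same collection adds nothing.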


