How to avoid crawler traps when archiving YouTube videos

Many partners archive videos from YouTube watch pages, channels, and playlists. Occasionally, YouTube crawls run into crawler traps, which is one of the reasons why we strongly recommend that you run a test crawl when you first scope YouTube content. Left unchecked, crawler traps can eat up your data budget and divert the crawler from the content that you really want to archive.

To identify whether crawler traps are adversely affecting your crawl, check the "Docs" and "Queued" lists for the host "www.youtube.com" in your post-crawl Hosts report.

In particular, keep an eye out for especially high data or document counts and for URLs in these lists that contain “repeating directories,” as in this example:

https://www.youtube.com/channel/UCrHC0hXTvYewidE9Q5AM_8w/_/im/HTTP/www/www/HTTP/HTTP/www/HTTP/HTTP/www/www/channels

If you do notice repeating directories, you can block the affected URLs by adding a rule with the following regular expression, which matches URLs with repeating directories:

^.*?(/.+?/).*?\1.*$|^.*?/(.+?/)\2.*$
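Before adding the rule, it can help to check what the expression matches. The following is a minimal sketch in Python, assuming that the crawler's regular expression engine treats backreferences such as \1 and \2 the same way Python's re module does. The function name has_repeating_directories and the placeholder watch-page URL are hypothetical and used only for illustration.

```python
import re

# The regular expression from the article, copied verbatim.
REPEATING_DIRECTORIES = re.compile(r"^.*?(/.+?/).*?\1.*$|^.*?/(.+?/)\2.*$")


def has_repeating_directories(url: str) -> bool:
    """Return True if the expression matches the whole URL."""
    return REPEATING_DIRECTORIES.fullmatch(url) is not None


# The crawler trap URL shown in the article.
trap_url = (
    "https://www.youtube.com/channel/UCrHC0hXTvYewidE9Q5AM_8w/_/im/"
    "HTTP/www/www/HTTP/HTTP/www/HTTP/HTTP/www/www/channels"
)

# A hypothetical watch-page URL; VIDEO_ID is a placeholder, not a real video.
normal_url = "https://www.youtube.com/watch?v=VIDEO_ID"

print(has_repeating_directories(trap_url))    # True
print(has_repeating_directories(normal_url))  # False
```

The first alternative matches a URL in which a path segment such as /HTTP/ appears again anywhere later in the URL, and the second matches a segment that repeats immediately. Because the first alternative is broad, it is worth running a few URLs that you do want to archive through the same check to confirm that the rule would not block them.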

You can even add this new rule directly from your Hosts report. Select the host "www.youtube.com" from the list, click the “Edit Rules” button, and select the “Block URL if…” rule from the drop-down menu.

Visit our User Guide page on Archiving YouTube for all the latest and most comprehensive guidance on how to add YouTube content to your collections.
