Scoping googlevideo hosts
I need help figuring out a way to configure rules for googlevideo hosts. Several of our seeds have embedded video players on their sites. We are capturing hundreds of GB of data from video hosts that is not actually relevant content. Any help would be appreciated.
-
Great question! You can add data or document limits on seeds and hosts. For YouTube seeds or embedded video in particular, you might look at how much content came in via the various googlevideo hosts on a specific crawl, set a data limit below that amount for the host, and run a test crawl. Please keep in mind that limits applied at the collection level affect all seeds in that collection. You can read more here: https://support.archive-it.org/hc/en-us/articles/208332933-Limit-your-crawl#Howtolimityourcrawl-Howtolimitthedatacrawledforspecifichosts
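If it helps to see how much data the googlevideo hosts accounted for before choosing a limit, here is a minimal sketch in Python. It assumes you have exported a crawl's host-level data to a CSV with columns named "host" and "bytes". Those column names, the sample rows, and the googlevideo_bytes helper are all hypothetical and only for illustration, so adjust them to match whatever your actual export contains.

```python
import csv
import io
from collections import defaultdict


def googlevideo_bytes(rows, suffix="googlevideo.com"):
    """Sum bytes per host for hosts that equal, or are subdomains of, suffix."""
    totals = defaultdict(int)
    for row in rows:
        host = row["host"].strip().lower()
        if host == suffix or host.endswith("." + suffix):
            totals[host] += int(row["bytes"])
    return dict(totals)


# Hypothetical sample data standing in for an exported host report.
# The column names "host" and "bytes" are assumptions, not a documented format.
SAMPLE = """host,bytes
r1---sn-abc.googlevideo.com,1500000000
r2---sn-xyz.googlevideo.com,2500000000
www.example.org,40000000
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
per_host = googlevideo_bytes(rows)
total_gb = sum(per_host.values()) / 1e9
print(f"googlevideo total: {total_gb:.1f} GB across {len(per_host)} hosts")
```

The total gives you a baseline for a previous crawl. You could then pick a host data limit somewhat below it and confirm the effect with a test crawl.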
You also have options around which seed type you set for YouTube pages. For example, we recommend setting YouTube "watch pages" as "One Page" seeds to help control the amount of data captured. You can read more here: https://support.archive-it.org/hc/en-us/articles/208332843-Assign-and-edit-a-seed-type-