YouTube not capturing video content
Hi, I'm sure this is a really basic issue, but I don't use Archive-It often and couldn't find an answer in the documentation.
I just ran a test crawl of a YouTube channel with two seeds:
seed: https://www.youtube.com/@lsfoundation/
wayback: https://wayback.archive-it.org/10030-test/20231207152302/https://www.youtube.com/@lsfoundation
seed: https://www.youtube.com/@lsfoundation/videos/
wayback: https://wayback.archive-it.org/10030-test/20231207152337/https://www.youtube.com/@lsfoundation/videos
This was a standard, one-time, public, Brozzler crawl with a 1 day time limit, no data limit, and following the scoping rules suggested by the IA team (ignoring robots.txt; adding regex block to prevent crawling infinite directories). The crawl finished with the status "finished" well within the 1 day time limit (suspiciously fast, actually) on December 7th.
The issues are that a) it seems like none of the videos were actually captured. We've got thumbnails, but I can't get any of them to play back, even in a new tab and using Chrome. They just don't load, and there's an empty white space where the video should be.
And b) any video that dynamically loads as you scroll down https://www.youtube.com/@lsfoundation/videos/ was not captured. If you scroll to the bottom of the page in the wayback machine, you just get an eternal spinny circle. There should be 52 thumbnails/videos and it looks like this crawl only got 30.
Obviously I need to tweak something here, but I can't figure out what it is. Thanks for any suggestions!
-
Aha, thank you. Ok, so no, it looks like we didn't get any video/mp4 files -- but I don't know why or what I should do differently with the next crawl.
For what it's worth, I also recently ran a test crawl of a YouTube seed that we've been crawling for years, with the same rules and settings as this crawl, and that one captured all the new-since-the-last-crawl videos successfully.
-
Hi Sarah - I do think your crawl parameters are correct. Brozzler, and the seed scope rules that get added automatically, are essential. The other seed that you've been crawling for years has had years of crawls to capture all its videos. Are the two seeds in question capturing all of those videos for the first time? If so, I think it might just need more time. You can give it 3 days or even a week.
I once had a youtube channel that couldn't be captured because it was incompatible with youtube-dl: (the colon is part of it), which shows up as a host. Your seeds both have youtube-dl: as a host, so that's essential, and good. You might comb through the A-I youtube help article - I do that OFTEN to catch things I've forgotten!
My computer/internet is slow right now so I haven't been able to load your full reports, but comment here again if you try a longer crawl and it doesn't work, and I am happy to look further. I've QA'd a lot of youtube crawls! :)
-
Thanks! I set it to run for a full day and it ended 6 hours after it started, so I'm not sure that's the issue. Also, the report just said "Finished" instead of "Finished: Time Limit."
Does that still sound like it just needs more time? I'm happy to just run another test crawl, but I'm not totally convinced that's the issue given that the first one didn't run for a full 24 hours.
-
Thanks for flagging this Sarah! You did indeed configure your crawls correctly. It looks like our A/V collecting utility needs an update to help it collect those missing video files again. I will update this thread again when we see more complete results from YouTube reliably. You can track our progress to that end on our status page for social media and other platforms.
-
Thank you all for your patience on this one -- I think we're back in business! Tests since our latest upgrades are archiving these seeds and videos successfully again, so I recommend re-crawling any that gave you trouble in the last month or so, with the usual recommended scoping and Brozzler option enabled.
Sarah, I also recommend changing your seed URL slightly so that it appears exactly like this (note the lack of a trailing / slash at the end): https://www.youtube.com/@lsfoundation/videos
And please let us know here or directly if you encounter any further obstacles from YouTube.
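For anyone preparing a batch of seeds, the trailing-slash suggestion above can be sketched in Python. This is only an illustration: `normalize_seed` is a hypothetical helper name, not part of Archive-It or Brozzler, and it assumes the only change wanted is dropping trailing slashes from the URL path.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_seed(url: str) -> str:
    """Return the URL with any trailing slashes removed from its path.

    Hypothetical helper for illustration; not an Archive-It function.
    """
    parts = urlsplit(url.strip())
    # Strip trailing slashes from the path only; query and fragment are kept.
    path = parts.path.rstrip("/")
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


seeds = [
    "https://www.youtube.com/@lsfoundation/",
    "https://www.youtube.com/@lsfoundation/videos/",
]
for seed in seeds:
    print(normalize_seed(seed))
```

The cleaned URLs would still need to be entered as seeds in the Archive-It web application as usual.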
-
Hi Karl -- I'm still having the same (or at least very similar) issues on new crawls that I ran on February 5th with the same scoping and limitations, plus your suggestion about removing the end slash from the URLs. Still no mp4s showing up in the file type list.
https://wayback.archive-it.org/10030-test/20240205171128/https://www.youtube.com/@lsfoundation
Open to any and all suggestions!