Archive-It crawlers use a tool called yt-dlp, a modern fork of youtube-dl, to collect embedded audio and video content during web crawls. This article explains what yt-dlp is and how Archive-It uses it.
On this page:
- Background
- What is youtube-dl/yt-dlp
- How does Archive-It use yt-dlp
- How to recognize yt-dlp captures in your crawl reports
Background
Archive-It first introduced youtube-dl with Brozzler in February 2020. As of January 2024, yt-dlp runs on both Standard and Brozzler crawls.
What is youtube-dl/yt-dlp
Youtube-dl is open-source software for downloading video and other media content from websites, and yt-dlp is a more recent fork of the same software. Many platforms use these tools to download and manage audio and video content. Individuals can also run them directly from the command line.
Yt-dlp is actively maintained and developed by a community of volunteer contributors. Archive-It crawlers use yt-dlp, but neither the Internet Archive nor Archive-It maintains the tool directly. The Archive-It team regularly updates its crawlers to new versions of yt-dlp as they become available.
Read more about yt-dlp at https://github.com/yt-dlp/yt-dlp.
How does Archive-It use yt-dlp
Archive-It’s crawlers use yt-dlp to collect embedded audio and video content.
Collecting media
Archive-It’s crawlers run yt-dlp on each page they visit. If yt-dlp supports the page and detects embedded audio or video, it downloads those media files. Yt-dlp also creates a JSON metadata file summarizing what was collected on that page.
The media files and metadata are stored with other archived content in WARC files.
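To see the kind of information yt-dlp reports for a page, you can try its Python API on your own machine. The following is a minimal sketch, assuming yt-dlp is installed locally (pip install yt-dlp). It is not how Archive-It's crawlers invoke the tool, and the function names probe_page and summarize_info, along with the choice of fields to keep, are hypothetical and used only for illustration.

```python
def summarize_info(info):
    """Reduce a yt-dlp info dictionary to a few fields per media item.

    A page with several media items is reported as a playlist-like
    dictionary with an "entries" list; a single item has no "entries".
    """
    entries = info.get("entries") or [info]
    return [
        {"id": e.get("id"), "title": e.get("title"), "ext": e.get("ext")}
        for e in entries
        if e  # skip items yt-dlp could not extract
    ]


def probe_page(url):
    """Ask yt-dlp what media it detects on a page, without downloading it."""
    import yt_dlp  # third-party package; imported here so the helper above works without it

    options = {"quiet": True, "skip_download": True}
    with yt_dlp.YoutubeDL(options) as ydl:
        # Raises yt_dlp.utils.DownloadError if the page is unsupported
        # or no media is found.
        info = ydl.extract_info(url, download=False)
        # sanitize_info makes the result JSON-serializable.
        return summarize_info(ydl.sanitize_info(info))
```

Calling probe_page with the URL of a page that embeds video returns one small dictionary per detected media item. The metadata that Archive-It stores in WARC files may contain different fields than the three kept here.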
Replaying media
When replaying archived pages in Wayback, Archive-It uses the yt-dlp-generated JSON metadata to match each web page with the media content that was downloaded from it. This helps ensure the correct audio or video is replayed with the corresponding page.
How to recognize yt-dlp captures in your crawl reports
When yt-dlp runs during a crawl, you’ll see entries beginning with youtube-dl: in the Hosts report. The entries keep the youtube-dl prefix even though yt-dlp performs the capture. They take two main forms, depending on what was collected:
Replayable media files
Format:
youtube-dl:00001:https://players.brightcove.net/1155968404/r1WF6V0Pl_default/index.html?videoId=6195835471001
What it means:
This entry represents an actual media file (video or audio) downloaded by yt-dlp. The number (e.g., 00001) helps distinguish between multiple media items from the same page. These files are usually replayable in Wayback.
Metadata records
Format:
youtube-dl:https://www.politico.com/news/magazine/2020/09/30/trump-biden-debate-roundup-423475
What it means:
This is a JSON metadata file generated by yt-dlp for the HTML page listed in the URL. It helps Wayback associate the correct media files with the page, but it does not contain playable media itself.
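If you export your report entries and want to separate the two forms programmatically, a short script can do it. This is a minimal sketch that assumes entries follow exactly the two formats shown above: a youtube-dl: prefix, optionally followed by a numeric index and a colon, then the URL. The names YtDlpEntry and parse_entry are hypothetical and are not part of any Archive-It tool.

```python
import re
from typing import NamedTuple, Optional


class YtDlpEntry(NamedTuple):
    kind: str             # "media" or "metadata"
    index: Optional[int]  # media item number; None for metadata records
    url: str              # URL that follows the prefix


# Assumed pattern: "youtube-dl:", an optional "<digits>:" index, then the URL.
_ENTRY = re.compile(r"^youtube-dl:(?:(\d+):)?(.+)$")


def parse_entry(entry: str) -> Optional[YtDlpEntry]:
    """Classify one report entry; return None if it is not a youtube-dl entry."""
    match = _ENTRY.match(entry.strip())
    if match is None:
        return None
    index, url = match.groups()
    if index is not None:
        return YtDlpEntry("media", int(index), url)
    return YtDlpEntry("metadata", None, url)
```

For the two example entries above, parse_entry returns a "media" entry with index 1 for the Brightcove URL and a "metadata" entry with no index for the Politico URL. Any entry without the youtube-dl: prefix returns None.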