You can add YouTube videos or channels to your collection in order to crawl, archive, and replay them as you would any other seed site, provided that you format and scope them according to a few simple rules.
General scoping rules for YouTube
You can set up your crawls to archive videos from YouTube watch pages, channel pages, or embedded videos in other sites, by adding a few simple scope modifications at either the Collection or Seed level.
YouTube blocks important page styling content and some video files with robots exclusions. To make sure that you are able to capture the look and feel of a YouTube page and/or any video content, you will therefore need to add the rules listed below.
If you wish to archive YouTube videos linked or hosted by any site in the course of your crawl, you must first modify your collection's crawl scope to ignore robots.txt files from the following hosts, exactly as they appear here:
- youtube.com [note: no www.] - This is one of the hosts that serves YouTube video files.
- googlevideo.com [note: no www.] - This is the primary host that serves YouTube video files.
- OR -
If you prefer to archive only those videos and video pages from the specific YouTube seed that you added to your collection: Ignore robots.txt for each seed.
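As a rough illustration of why the hosts are entered without "www.", the sketch below checks whether a URL falls under one of the hosts listed above. It assumes that a rule on a bare host also covers that host's subdomains. The function name `under_listed_host` and the sample URLs are hypothetical, and this is not the crawler's own matching logic.

```python
from urllib.parse import urlsplit

# Hosts named in the list above, entered without a leading "www."
IGNORE_ROBOTS_HOSTS = ("youtube.com", "googlevideo.com")


def under_listed_host(url, hosts=IGNORE_ROBOTS_HOSTS):
    """Return True if the URL's hostname is a listed host or one of its subdomains.

    Assumption: a host-level rule on "youtube.com" also covers "www.youtube.com"
    and any other subdomain. Confirm this against your own test crawl.
    """
    hostname = (urlsplit(url).hostname or "").lower()
    return any(hostname == h or hostname.endswith("." + h) for h in hosts)


if __name__ == "__main__":
    for sample in (
        "https://www.youtube.com/watch?v=XXXXXXXXXX",
        "https://r5---sn-n4v7sn7s.googlevideo.com/videoplayback",
        "https://example.org/",
    ):
        print(sample, under_listed_host(sample))
```

A check like this can help when reviewing a test crawl's host list to see which hosts a rule would be expected to reach.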
Host-level data limits apply individually to each subdomain of that host. Because googlevideo files are served from a large number of individual subdomains, adding one rule to limit the amount of googlevideo content often isn't enough. For example, a 1GB data limit on the host googlevideo.com could result in 1GB from r5---sn-n4v7sn7s.googlevideo.com, 1GB from r3---sn-n4v7sn7s.googlevideo.com, and so on.
The easiest way to limit the amount of googlevideo content is to identify the seeds from which googlevideo files are being captured, and add seed-level data limits to them. The size of the data limit will vary depending on your seed, so be sure to run test crawls to make sure you're capturing what you need.
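The arithmetic behind the subdomain example can be sketched as follows. This is a simplified model, not how the crawler accounts for data internally. The function name `worst_case_capture` is hypothetical, and a gigabyte is assumed to be 10^9 bytes.

```python
GB = 10 ** 9  # assumption: decimal gigabytes; the service may count differently


def worst_case_capture(hostnames, per_host_limit_bytes):
    """Upper bound on captured bytes when a host-level limit is applied
    separately to every distinct subdomain that serves content."""
    return len(set(hostnames)) * per_host_limit_bytes


# The two subdomains from the example above; the repeat is counted only once.
subdomains = [
    "r5---sn-n4v7sn7s.googlevideo.com",
    "r3---sn-n4v7sn7s.googlevideo.com",
    "r5---sn-n4v7sn7s.googlevideo.com",
]

if __name__ == "__main__":
    print(worst_case_capture(subdomains, 1 * GB) / GB)  # 2.0
```

Under this model the upper bound grows with every additional subdomain that serves video files, whereas a seed-level limit caps the total for the seed regardless of how many subdomains are involved.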
Sometimes, YouTube crawls can run into crawler traps and archive invalid URLs with seemingly endless combinations of repeating directories. Running a test crawl on YouTube seeds will allow you to consult the "Docs" and "Queued" lists for the host www.youtube.com in your Hosts report to determine whether it is crawling URLs with repeating directories, as in the example:
If you find a large number of similar documents in your crawl, you can address this in the seed scope by adding the following repeating directories regular expression to the "Block URL if it matches the regular expression" rule:
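The sketch in this paragraph is not that regular expression and should not be pasted into a scope rule. It is only an unofficial, illustrative way to screen a list of URLs exported from a test crawl for repeating directories. The function name `has_repeating_directories`, the threshold of two repeats, and the sample URLs are all assumptions.

```python
from collections import Counter
from urllib.parse import urlsplit


def has_repeating_directories(url, max_repeats=2):
    """Return True if any single path segment occurs more than `max_repeats` times.

    This is a heuristic for spotting likely crawler-trap URLs in a report,
    not the blocking expression used in the scope rule.
    """
    segments = [s for s in urlsplit(url).path.split("/") if s]
    counts = Counter(segments)
    return any(n > max_repeats for n in counts.values())


if __name__ == "__main__":
    # Hypothetical URLs, for illustration only.
    for sample in (
        "https://www.youtube.com/a/b/a/b/a/b/",
        "https://www.youtube.com/watch?v=XXXXXXXXXX",
    ):
        print(sample, has_repeating_directories(sample))
```

Running a check like this over the "Docs" and "Queued" lists can give a quick count of suspicious URLs before you decide whether a blocking rule is needed.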
How to format YouTube seeds
Specific videos on YouTube are hosted on a "watch" page with a URL in the following format: https://www.youtube.com/watch?v=XXXXXXXXXX/. The most effective way to archive this kind of video is to use the One Page seed type.
YouTube channels are topic-specific groups of videos and related content. For example, the University of Melbourne's channel can be accessed at https://www.youtube.com/user/unimelb/. This URL can serve as your seed URL. However, when you wish to archive a user's YouTube channel in its entirety, we further recommend adding a second seed URL for its "Videos" tab. This enables our crawler to access all videos uploaded to the user's account.
The Standard seed type is best for crawling the videos linked from a channel page. However, a test crawl is strongly recommended when crawling a channel for the first time. Depending on the number of videos, these seeds can be very data heavy. Consider using the test crawl to determine how much data you should allot to these seeds, then limit your crawl by adding a data limit at the seed level.
Playlists are specific lists of videos curated by a user from their own account's videos and/or other videos on YouTube. The URL for each playlist can serve as a seed.
To archive the playlist itself and the videos to which it links, we recommend adding each playlist to your collection as its own seed and using the One Page Plus External Links (One Page+) seed type.
YouTube search pages, such as the results page for the query "william+shakespeare", are best crawled using the One Page Plus External Links (One Page+) seed type. We strongly recommend a test crawl when crawling search pages. As with channels, you may need to add a seed-level data limit in order to avoid crawling excessive additional videos.
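The seed type recommendations above can be summarized in a small lookup sketch. The `/watch?v=` and `/user/` URL shapes come from this article, while the `/playlist` and `/results` paths are assumptions about how playlist and search URLs are commonly formed. The function name `suggest_seed_type` is hypothetical and not part of any Archive-It tool.

```python
from urllib.parse import parse_qs, urlsplit


def suggest_seed_type(url):
    """Map a YouTube URL shape to the seed type recommended in this article.

    Returns None when the URL matches none of the shapes handled here.
    """
    parts = urlsplit(url)
    path = parts.path.rstrip("/")
    query = parse_qs(parts.query)

    if path == "/watch" and "v" in query:
        return "One Page"        # a single video's watch page
    if path in ("/playlist", "/results"):
        return "One Page+"       # assumed playlist and search page paths
    if path.startswith("/user/"):
        return "Standard"        # channel page or one of its tabs
    return None


if __name__ == "__main__":
    for sample in (
        "https://www.youtube.com/watch?v=XXXXXXXXXX/",
        "https://www.youtube.com/user/unimelb/",
    ):
        print(sample, suggest_seed_type(sample))
```

Whatever seed type a sketch like this suggests, a test crawl is still the way to confirm that the seed captures what you need within its data limit.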
Videos that are embedded into other sites should archive successfully as long as the "general scoping rules" above have been applied to that site's crawl.
How to replay archived YouTube videos
Individual videos archived in the manner described above should play back within the page. Any video that was captured from that page will also be accessible via the "Videos" link in the Wayback banner.
Pages with multiple embedded YouTube videos will only be able to load and replay the first video listed. All other captured videos will, however, still be accessible through the "Videos" link in the Wayback banner, and likewise from the tab listing videos archived in your publicly accessible collections.
Please note that the crawler will only access one version of a video during a crawl. If a video exists in multiple places within a seed site, it will only play back from the location where the crawler first discovered it.