Brozzler is our newest crawling technology, built at the Internet Archive.
Brozzler differs from Archive-It's "Standard" crawling technology (Heritrix and Umbra) in its reliance on an actual web browser to interact with web content before that content is indexed and archived into WARC files. Instead of following hyperlinks and downloading files, Brozzler records interactions between servers and web browsers as they occur, more closely resembling how a human user would experience the web. It also uses youtube-dl to enhance media capture capabilities (as of January 2024, both Brozzler and Standard crawls use youtube-dl).
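To illustrate why a real browser can matter, here is a minimal Python sketch. It is not Brozzler's or Heritrix's code. The page content, the URLs, and the names PAGE, LinkExtractor, and static_links are hypothetical and exist only for this example. The sketch collects hyperlinks from static HTML, as a simple link-following approach would, and shows that a URL requested only by a script never appears among them.

```python
from html.parser import HTMLParser

# Hypothetical page: one ordinary hyperlink, plus a script that would
# request a second URL only when a browser actually executes it.
PAGE = """
<html><body>
<a href="/about.html">About</a>
<script>fetch("/api/gallery.json");</script>
</body></html>
"""


class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags in static HTML."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def static_links(html):
    """Return the hyperlinks visible without running any scripts."""
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links


print(static_links(PAGE))  # ['/about.html']
```

The script's request to /api/gallery.json is absent from the result because nothing executed the script. A crawl that records traffic from a real browser would see that request when the browser makes it.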
For more information on how this process works, and the related open-source tools on which it relies, you can review Brozzler’s code and technical documentation in its GitHub repository.
Find out more: How and when to use Brozzler