Archive-It's Crawlers
Crawlers are software that identify materials on the live web that belong in your collections, based on your choice of seeds and scope. A crawler, also called a robot or spider, is the automated agent that captures live web content. In Archive-It, you can choose between two different crawling technologies: Standard and Brozzler.
Standard
Crawls run with Archive-It's Standard crawler incorporate two different technologies, the Heritrix web crawler and Umbra (both described below).
Heritrix
Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler, and it has been used by many different organizations for nearly two decades.
When you initiate a Standard crawl in Archive-It, Heritrix crawls all seeds in the crawl simultaneously. Crawls of hundreds of seeds can be slower than crawls of fewer seeds, because Heritrix cycles through all of the hosts from all of the seed sites and their embedded content. Some characteristics of a site can prevent Heritrix from crawling it. For example, a robots.txt file is a tool used to direct a web crawler (not just ours, but any crawler) not to crawl all or specified parts of a website. Other sites may block traffic from specific IP ranges or allow only specific IP ranges to access content. Some sites use technological means to tell crawlers to wait a specific number of seconds between requests to their site, which affects how quickly the site can be archived.
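As an illustration of how any polite crawler consults these directives, the sketch below uses Python's standard-library urllib.robotparser to check whether a URL may be fetched and whether the site requests a delay between requests. The URLs and the user-agent string are placeholders, and this is a minimal sketch of the general mechanism, not Heritrix's own (Java) implementation:

```python
# A minimal sketch (not Archive-It's internal code) of how a polite
# crawler consults robots.txt before fetching. URLs are placeholders.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # fetch and parse the site's robots.txt

url = "https://example.com/private/page.html"
if rp.can_fetch("MyCrawler/1.0", url):
    # Honor any Crawl-delay directive before issuing requests.
    delay = rp.crawl_delay("MyCrawler/1.0")  # None if not specified
    print(f"OK to fetch {url}; wait {delay or 0} seconds between requests")
else:
    print(f"robots.txt disallows fetching {url}")
```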
In general, Heritrix is able to capture what is available on the live web. If a site is experiencing issues on the live web, such as problems with its SSL certificate or its server, those issues will also impact Heritrix's ability to capture content.
Umbra
Websites have become increasingly reliant on client-side scripts to render pages and to ensure an optimal viewing experience for users. These changes are especially apparent in social media sites, where the content used to construct a single page is not immediately delivered to a user's browser but is dynamically rendered based on user actions. For example, as a user navigates within a Facebook page, content is delivered on demand through JavaScript when the user scrolls to an un-viewed section of their timeline. Displaying content dynamically through client-side script allows sites to optimize the user experience and reduce the load on their servers. These optimizations, however, make it difficult for Heritrix to discover resources that are necessary for optimal capture and display of archived content.
To address these issues, Archive-It has developed Umbra, which builds on the proven Heritrix crawling infrastructure by allowing Heritrix to access sites in the same way a browser would.
How does Umbra work with Heritrix?
In Archive-It, the initial pages of each seed are sent by Heritrix to Umbra, a separate process that mimics the way a browser would access the seed URLs. This allows client-side script to be executed so that previously unavailable URLs can be detected for Heritrix to crawl. Umbra also gives Heritrix abilities to imitate human interactions with websites that it previously lacked, such as triggering JavaScript by clicking or hovering the mouse over different page elements and by scrolling down a page.
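Umbra itself is a separate Archive-It tool that drives a real browser; as a simplified illustration of the general pattern of browser-assisted URL discovery, the sketch below uses the Selenium library and headless Chrome rather than Umbra's actual implementation. The seed URL, scroll count, and wait times are placeholder assumptions:

```python
# A simplified sketch of browser-assisted URL discovery, in the spirit
# of what Umbra does for Heritrix. It uses Selenium + headless Chrome
# purely for illustration; it is not Umbra's actual implementation.
import time
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

opts = Options()
opts.add_argument("--headless=new")
driver = webdriver.Chrome(options=opts)
try:
    driver.get("https://example.com/")  # placeholder seed URL
    # Scroll several times so client-side scripts can render more
    # content, the way a human reader would trigger lazy loading.
    for _ in range(5):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(2)  # give scripts time to fetch and render
    # Harvest links that exist only in the script-rendered DOM; a
    # crawler like Heritrix could then fetch and archive these URLs.
    links = {a.get_attribute("href")
             for a in driver.find_elements(By.CSS_SELECTOR, "a[href]")}
    print(f"discovered {len(links)} candidate URLs")
finally:
    driver.quit()
```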
Heritrix performs the same functions in this architecture as it did previously, including writing content to WARC files, deduplicating documents, applying the crawl scope modification rules set by the user, and generating reports.
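Heritrix itself is written in Java, but for a concrete sense of what "writing content to WARCs" produces, here is a sketch using the open-source warcio Python library to write a single WARC response record. The URL and payload are placeholders, and this is not Heritrix's own code:

```python
# A minimal sketch of writing one WARC "response" record with warcio.
# Placeholder URL and payload; not Heritrix's own WARC-writing code.
from io import BytesIO
from warcio.warcwriter import WARCWriter
from warcio.statusandheaders import StatusAndHeaders

with open("example.warc.gz", "wb") as output:
    writer = WARCWriter(output, gzip=True)
    payload = BytesIO(b"<html><body>Hello, archive!</body></html>")
    http_headers = StatusAndHeaders(
        "200 OK",
        [("Content-Type", "text/html")],
        protocol="HTTP/1.1",
    )
    record = writer.create_warc_record(
        "http://example.com/",  # placeholder URI for the captured page
        "response",
        payload=payload,
        http_headers=http_headers,
    )
    writer.write_record(record)
```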
Brozzler
While many sites can be archived without issue using the Standard crawling technology (Heritrix and Umbra), some content can still be challenging to archive. Our newest crawling technology, Brozzler, improves the capture of dynamic and multimedia web content. Brozzler gets its name from the combination of "browser" and "crawler." It differs from Heritrix and other crawling technologies in its reliance on an actual web browser to render and interact with web content before that content is indexed and archived into WARC files. Instead of following hyperlinks and downloading files, Brozzler records interactions between servers and web browsers as they occur, more closely resembling how a human user would experience the web resources you want to archive. It also uses youtube-dl to enhance its media capture capabilities. For more information on how this process works, and on the related open-source tools on which it relies, see Brozzler's code and technical documentation in its GitHub repository.
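As an illustration of the kind of media discovery youtube-dl enables, the sketch below calls youtube-dl's Python API to extract a video page's metadata and direct media URL without downloading it. The URL is a placeholder, and this is not Brozzler's own integration code, just a sketch of the underlying tool:

```python
# A minimal sketch of media discovery with youtube-dl's Python API.
# Illustrative only; not Brozzler's own integration code.
import youtube_dl

ydl_opts = {
    "quiet": True,          # suppress console output
    "skip_download": True,  # we only want metadata and media URLs
}
with youtube_dl.YoutubeDL(ydl_opts) as ydl:
    # Placeholder URL: any page youtube-dl knows how to extract from.
    info = ydl.extract_info(
        "https://example.com/some-video-page", download=False
    )

print(info.get("title"))
print(info.get("url"))  # direct media URL a crawler could archive
```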