How our crawlers work
Crawlers identify and capture materials on the live web that belong in your collections, based on your choice of seeds and scope. A "crawl" can refer both to this act of capture and to the archived content that results from it. A "crawler" is the automated agent, also called a robot or spider, that captures live web content, and "crawling technology" is the software that enables crawlers to do so.
We use the Heritrix web crawler, and our user agent is archive.org_bot. When you initiate a crawl in Archive-It, Heritrix crawls all of the seeds in that crawl simultaneously. Crawls of hundreds of seeds can be slower than crawls of fewer seeds, because Heritrix cycles through all of the hosts from all of the seed sites and their embedded content.
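As a rough mental model, the crawler keeps one queue of URLs per host and visits those queues in rotation, which is why a crawl spanning many hosts takes more passes to finish. The sketch below illustrates that behavior in Python; it is a simplified, hypothetical illustration, not Heritrix's actual (Java) frontier, and `fetch` stands in for the real download-and-extract step.

```python
from collections import deque
from urllib.parse import urlparse

def crawl_round_robin(seed_urls, fetch):
    """Visit one queue of URLs per host in rotation (a toy frontier).

    `fetch` is a hypothetical callable that downloads a URL and returns
    the in-scope links discovered in it.
    """
    host_queues = {}
    for url in seed_urls:
        host_queues.setdefault(urlparse(url).netloc, deque()).append(url)

    # One request per host per pass: polite to each individual host, but a
    # crawl spanning many hosts needs many passes to finish.
    while any(host_queues.values()):
        for queue in list(host_queues.values()):
            if not queue:
                continue
            for discovered in fetch(queue.popleft()):
                host_queues.setdefault(urlparse(discovered).netloc, deque()).append(discovered)
```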
Some characteristics of sites can prevent Heritrix from crawling them. For example, a robots.txt file is a tool used to direct a web crawler (not just ours, but any crawler) not to crawl all or specified parts of a website. Other sites may block traffic from specific IP ranges, or only allow ("whitelist") specific IP ranges to access content. Some sites use technological means to tell crawlers to wait a specific number of seconds between requests to their site, which slows the rate at which the site can be archived.
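You can check how a crawler will interpret a site's robots.txt directives using Python's standard library. In this sketch, the site and page URLs are placeholders:

```python
from urllib.robotparser import RobotFileParser

# Fetch and parse a site's robots.txt, then ask the same questions a
# crawler asks before requesting a page. example.org is a placeholder.
robots = RobotFileParser("https://example.org/robots.txt")
robots.read()

user_agent = "archive.org_bot"
url = "https://example.org/some/page.html"

if robots.can_fetch(user_agent, url):
    delay = robots.crawl_delay(user_agent)  # None if no Crawl-delay directive
    print(f"Allowed to fetch {url}; requested delay: {delay or 0} seconds")
else:
    print(f"robots.txt directs {user_agent} not to fetch {url}")
```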
In general, Heritrix is able to capture what is available on the live web. If a site is experiencing issues on the live web, such as an invalid SSL certificate or server problems, those issues will also impact Heritrix's ability to capture its content.
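When a crawl captures less than expected, it can help to confirm that the seed is healthy on the live web first. The sketch below uses the third-party requests library to make that check; the seed URL is a placeholder:

```python
import requests

# A quick pre-crawl health check: an invalid SSL certificate, a server
# error, or a timeout on the live web will affect Heritrix the same way
# it affects this request. The seed URL is a placeholder.
seed = "https://example.org/"

try:
    response = requests.get(seed, timeout=30)
    print(f"{seed} responded with HTTP {response.status_code}")
except requests.exceptions.SSLError as err:
    print(f"SSL certificate problem: {err}")
except requests.exceptions.RequestException as err:
    print(f"Live-web problem fetching {seed}: {err}")
```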
To address these issues, Archive-It developed Umbra, which builds on the proven Heritrix crawling infrastructure by allowing Heritrix to access sites in the same way that a browser would.
How does Umbra work with Heritrix?
Heritrix performs the same functions in this architecture as it did previously, including writing content to WARCs, deduplicating documents, applying the crawl scope modification rules set by the user, and generating reports. Umbra, for its part, opens selected URLs in a real browser, so that content generated by client-side scripts is requested just as it would be for a human visitor, and the URLs discovered this way are passed back to Heritrix to be captured.
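As a toy illustration of that division of labor, the sketch below uses the pika AMQP client to pass URLs between a crawler and a browser-driven worker. The queue names, message shapes, and `render_in_browser` helper are all hypothetical; this is not Umbra's actual protocol:

```python
import json
import pika

def render_in_browser(url):
    """Hypothetical placeholder: load `url` in a real browser and return
    every URL the browser requested while rendering the page."""
    return []

# The crawler publishes URLs it wants rendered; this worker loads each one
# in a browser and publishes everything the browser requested back to a
# queue the crawler consumes for capture. Queue names are hypothetical.
connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="urls-to-render")
channel.queue_declare(queue="urls-to-capture")

def on_url(ch, method, properties, body):
    for discovered in render_in_browser(json.loads(body)["url"]):
        ch.basic_publish(exchange="",
                         routing_key="urls-to-capture",
                         body=json.dumps({"url": discovered}))

channel.basic_consume(queue="urls-to-render", on_message_callback=on_url, auto_ack=True)
channel.start_consuming()
```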
While many sites can be archived by Heritrix and Umbra without issue, there is still content that can be challenging to archive. We are currently developing a technology to better capture dynamic and multimedia web content called Brozzler, which gets its name from browser + crawler. It differs from Heritrix and other crawling technologies in its reliance on an actual web browser to render and interact with web content before all of that content is indexed and archived into WARC files. Instead of following hyperlinks and downloading files, Brozzler records the interactions between servers and web browsers as they occur, more closely resembling how a human user would experience the web resources being archived. For more information on how this process works, and on the related open-source tools on which it relies, see Brozzler's code and technical documentation in its GitHub repository.
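To illustrate the browser-based approach in miniature (this is not Brozzler's own code, which pairs a real browser with other open-source tools to write WARC records), the sketch below uses the third-party Playwright library to record every server response a browser receives while rendering a page. The URL is a placeholder:

```python
from playwright.sync_api import sync_playwright

# Render a page in a real browser and record every response it receives,
# rather than following hyperlinks and downloading files one by one.
# A real archiver would write WARC records instead of printing; the URL
# is a placeholder.
captured = []

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.on("response", lambda r: captured.append((r.status, r.url)))
    page.goto("https://example.org/", wait_until="networkidle")
    browser.close()

for status, url in captured:
    print(status, url)  # every resource the browser fetched, not just the HTML
```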