Why Umbra?
Web sites have become increasingly reliant on client-side scripts to render pages and to ensure an optimal viewing experience for users. These changes are especially apparent on social media sites such as Facebook, where the content used to construct a single page is not delivered to a user's browser all at once but is rendered dynamically in response to user actions. For example, as a user navigates within a Facebook page, content is delivered on demand through JavaScript when the user scrolls to an un-viewed section of their timeline. Displaying content dynamically through client-side script allows sites to optimize the user experience and reduce the load on their servers. These optimizations, however, make it difficult for Heritrix to discover resources that are necessary for optimal capture and display of archived content.
To address these issues, Archive-It has developed Umbra, which leverages its proven Heritrix crawling infrastructure by allowing Heritrix to access sites in the same way a browser would.
How does Umbra work with Heritrix?
When a crawl is run using Umbra, designated seeds are sent by Heritrix to a separate process that mimics the way a browser would access the seed URLs. This allows client-side script to be executed so that previously unavailable URLs can be detected for Heritrix to crawl. Umbra also gives Heritrix a flexible way to imitate human interactions with Web sites that were not previously possible, such as triggering JavaScript by clicking on or hovering the mouse over different Web page elements, or by scrolling down a page.
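The hand-off described above can be pictured with a small toy model. This is only an illustrative sketch, not Umbra's actual code or API: the names SimulatedPage, browse, and discover_for_crawler are hypothetical, and the "browser" is simulated with a lookup table rather than a real browser process.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class SimulatedPage:
    """Toy stand-in for a page whose content loads in response to user actions."""
    url: str
    static_links: List[str]
    # Maps an action name ("scroll", "click", "hover") to the URLs the page's
    # script would request when that action happens (assumed data, not real).
    on_action: Dict[str, List[str]] = field(default_factory=dict)


def browse(page: SimulatedPage, actions=("scroll", "click", "hover")) -> List[str]:
    """Return every URL seen while 'visiting' the page and performing each action."""
    seen = list(page.static_links)
    for action in actions:
        seen.extend(page.on_action.get(action, []))
    return seen


def discover_for_crawler(page: SimulatedPage, already_known: Set[str]) -> List[str]:
    """Return, in discovery order, the URLs the crawler has not seen yet."""
    found: List[str] = []
    for url in browse(page):
        if url not in already_known and url not in found:
            found.append(url)
    return found


page = SimulatedPage(
    url="https://example.org/timeline",
    static_links=["https://example.org/style.css"],
    on_action={
        "scroll": ["https://example.org/timeline?page=2"],
        "hover": ["https://example.org/preview/1.jpg"],
    },
)
new_urls = discover_for_crawler(page, already_known={"https://example.org/style.css"})
```

In this sketch, new_urls holds the two URLs that only appear after scrolling and hovering, which is the kind of resource a crawler that only reads static links would miss.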
Heritrix performs the same functions in this architecture as it did previously, including writing content to WARCs, deduplicating documents, applying the Modify Crawl Scope rules set by the user, and generating reports. Crawls performed with Umbra should not take significantly longer than those executed by Heritrix alone.
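As a rough picture of what happens to URLs once they reach the crawler, the sketch below applies a scope check and a digest-based duplicate check. It is a simplified illustration under stated assumptions, not Heritrix's implementation: the function names, the host allow-list, and the blocked-substring rule are all hypothetical stand-ins for user-defined scope rules.

```python
import hashlib
from urllib.parse import urlsplit


def in_scope(url, allowed_hosts, blocked_substrings=()):
    """Hypothetical scope rule: host must be allowed and no blocked text may appear."""
    host = urlsplit(url).hostname or ""
    if host not in allowed_hosts:
        return False
    return not any(text in url for text in blocked_substrings)


def is_duplicate(payload: bytes, seen_digests: set) -> bool:
    """Report whether identical content was already recorded; record it if not."""
    digest = hashlib.sha1(payload).hexdigest()
    if digest in seen_digests:
        return True
    seen_digests.add(digest)
    return False


seen = set()
first_time = is_duplicate(b"<html>same page</html>", seen)   # content is new
second_time = is_duplicate(b"<html>same page</html>", seen)  # identical content again
```

Here a URL discovered through the browser process is treated like any other candidate: it is crawled only if it passes the scope check, and its content is flagged as a duplicate if the same bytes were already captured.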
Release Date
As of June 5, 2014, all seed URLs in Archive-It will be crawled using Umbra and Heritrix. Continued development through the 5.0 release will improve capture of specific social media sites.
For details on the best way to archive these sites, please see our documentation on Archiving Social Media Sites.