To address these issues, Archive-It has developed Umbra, which builds on its proven Heritrix crawling infrastructure by allowing Heritrix to access sites in the same way a browser would.
How does Umbra work with Heritrix?
Heritrix performs the same functions in this architecture as it did previously, including writing content to WARCs, deduplicating documents, applying the crawl scope rules set by the user, and generating reports. Crawls performed with Umbra should not take significantly longer than those executed by Heritrix alone.
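As a rough illustration of the functions listed above, the sketch below applies a scope rule to each captured URL, skips documents whose content has already been seen, and tallies a small report. It is a simplified model and not Heritrix's actual implementation: the names in_scope and process_captures, the host-based scope rule, and the use of SHA-1 digests for deduplication are all assumptions made for this example.

```python
import hashlib
from urllib.parse import urlsplit


def in_scope(url, allowed_hosts):
    """Hypothetical scope rule: keep only URLs whose host is on the allowed list."""
    return urlsplit(url).hostname in allowed_hosts


def process_captures(captures, allowed_hosts):
    """Apply the scope rule, then skip any payload whose digest was already seen.

    `captures` is an iterable of (url, payload_bytes) pairs. Returns the list of
    (url, digest) pairs that would be written, plus a summary report.
    """
    seen_digests = set()
    records = []
    report = {"written": 0, "duplicate": 0, "out_of_scope": 0}

    for url, payload in captures:
        if not in_scope(url, allowed_hosts):
            report["out_of_scope"] += 1
            continue

        # Assumption: documents are compared by a digest of their content.
        digest = hashlib.sha1(payload).hexdigest()
        if digest in seen_digests:
            report["duplicate"] += 1
            continue

        seen_digests.add(digest)
        records.append((url, digest))
        report["written"] += 1

    return records, report
```

In this model, whether a URL was discovered by a browser-style fetch or by a plain link extractor makes no difference to the steps that follow, which mirrors the point that Heritrix's role is unchanged when Umbra is added.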
As of June 5, 2014, all seed URLs in Archive-It will be crawled using Umbra and Heritrix. Continued development through the 5.0 release will improve capture of specific social media sites.
For details on the best way to archive these sites, please see our documentation on Archiving Social Media Sites.