Duplicate crawls in public-facing collection



  • Sarah Weissman

    I have noticed this too with recent test crawls: sometimes I get two capture links for a single crawl. I haven't carefully compared the two captures, but in at least one case the earlier capture appears to have some missing images. When I'm running a bunch of test crawls, it's confusing to have extra captures show up and to have to figure out which one I should be looking at. I hope this gets fixed.

  • Mary Haberle

    Great question, Meredith! In short, multiple entries appear on the same calendar page in the following circumstances:

    • When variants of the same URL are captured (e.g. http/https, www/non-www, ending slash/no ending slash)
    • When a URL is re-crawled because our Umbra technology has detected component URLs initially missed by Heritrix
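
    To illustrate the first case, here is a minimal Python sketch of how URL variants can be grouped under one key. The function name `variant_key`, the normalization rules, and the `www.` example URL are assumptions made for illustration; this is not Archive-It's actual logic.

```python
from urllib.parse import urlsplit


def variant_key(url: str) -> str:
    """Collapse scheme, a leading 'www.', and a trailing slash so that
    variants of one page share a key (hypothetical rule set)."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    # Drop the trailing slash; the query string is ignored in this sketch.
    path = parts.path.rstrip("/")
    return host + path


urls = [
    "http://theaggie.org/",
    "https://theaggie.org/",
    "https://www.theaggie.org",  # hypothetical www variant, for illustration
]
keys = {variant_key(u) for u in urls}
print(keys)
```

    All three variants collapse to a single key here, which is why they can show up as separate links for what looks like one page.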

    In the case of UC Davis and the other organizations who took part in the University of California’s Web Archiving Service (WAS), the history is a little more complicated.

    Your institution’s Archive-It crawling history for http://theaggie.org/ begins June 18, 2015. However, there are captures prior to this date and up until September 17, 2015 that were collected using WAS and later migrated to Archive-It. This is why you see some links on your calendar page that do not correspond to dates in your Archive-It crawling history. It also accounts for some of the duplicate calendar dates; for instance, one of the September 10, 2015 captures was crawled via Archive-It and the other via WAS.

    Although the calendar page does sometimes show multiple links for the same day for the aforementioned reasons, it is important to note that our "URL agnostic" de-duplication process protects your data budget. Unchanged content that has already been archived will not be archived again by our crawler; only new data collected by your crawls is applied against your annual data budget.
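
    As a rough illustration of the idea, the Python sketch below counts only previously unseen content against a budget. It assumes that "URL agnostic" means de-duplication is keyed on a digest of the content rather than on the URL; the function `bytes_charged` and the sample payload are hypothetical, and this is not a description of how Archive-It actually implements it.

```python
import hashlib


def bytes_charged(captures, seen_digests):
    """Return how many bytes count against the budget for this crawl.

    captures: iterable of (url, payload_bytes) pairs.
    seen_digests: set of content digests already archived (updated in place).
    """
    charged = 0
    for _url, payload in captures:
        digest = hashlib.sha1(payload).hexdigest()
        if digest in seen_digests:
            continue  # identical content already archived, whatever the URL
        seen_digests.add(digest)
        charged += len(payload)
    return charged


page = b"<html>front page</html>"  # made-up payload for illustration
seen = set()
first_crawl = bytes_charged(
    [("https://theaggie.org/", page), ("http://theaggie.org/", page)], seen
)
second_crawl = bytes_charged([("https://theaggie.org/", page)], seen)
print(first_crawl, second_crawl)
```

    In this sketch the same content reached through two URL variants is charged once, and an unchanged re-crawl is charged nothing.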

    Additional Resources:

    You can read more about our crawling technology, including the interaction between Heritrix and Umbra here.

    And, this page from our FAQs is also relevant to your question.

  • Mary Haberle

    Hi Sarah, thanks for sharing your feedback. Typically there aren't display discrepancies between captures of the same URL on the same day, except in the case of sites where the crawler has to log in. If you could submit a ticket with some examples, I'd be happy to look into why you see otherwise.

  • Meredith Sweet

    Hi Mary,

    Thanks for getting back to me with such a detailed answer!

    One interesting thing I've noticed is that for some of the duplicate crawls (let's take May 19, 2016, as an example), the URLs are different, which would indicate some gap in time between the crawls. However, when I click on the link for the second crawl, it redirects to the first crawl! 

    The first May 19, 2016 crawl has a URL of http://wayback.archive-it.org/5785/20160519170149/https://theaggie.org/, which indicates a timestamp of 17:01:49. However, the second URL is http://wayback.archive-it.org/5785/20160519184400/http://theaggie.org/, which should indicate a timestamp of 18:44:00 (and yes, a difference between http and https), and yet when you click on it, the timestamp is still 17:01:49... because the URL has changed to the first one!
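
    For anyone who wants to compare capture links programmatically, here is a small Python sketch that pulls the collection number, the 14-digit timestamp, and the original URL out of links shaped like the two above. The regular expression and the function name `parse_capture_link` are my own assumptions based only on those two example URLs.

```python
import re
from datetime import datetime

# Pattern inferred from the two example links: /<collection>/<YYYYMMDDhhmmss>/<original URL>
WAYBACK_RE = re.compile(r"^https?://wayback\.archive-it\.org/(\d+)/(\d{14})/(.+)$")


def parse_capture_link(link: str):
    """Split a capture link into (collection, capture time, original URL)."""
    match = WAYBACK_RE.match(link)
    if match is None:
        raise ValueError(f"not a recognized capture link: {link}")
    collection, stamp, original = match.groups()
    return collection, datetime.strptime(stamp, "%Y%m%d%H%M%S"), original


first = parse_capture_link(
    "http://wayback.archive-it.org/5785/20160519170149/https://theaggie.org/"
)
second = parse_capture_link(
    "http://wayback.archive-it.org/5785/20160519184400/http://theaggie.org/"
)
gap = second[1] - first[1]
print(first[2], second[2], gap)  # the gap between the two link timestamps is 1:42:11
```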

    This isn't always the case; one of the more recent crawls (March 9, 2017) appears three times. The first two are https crawls, while the third is http, but they all have different timestamps. Every single crawl for this collection has the * on it, so as a guide indicating that something has changed, it's not very helpful.

    So did something actually get missed by Heritrix, or not? Should the Crawl Reports indicate what was missed (or the Missing URLs)? 

  • Mary Haberle

    Meredith, excellent follow-up questions!

    In some cases, such as your May 19, 2016 example, the https://theaggie.org/ site includes at least one link to the HTTP version of that URL, so our crawler attempted to capture it, even though there is no content on the HTTP page and it exists simply to redirect site visitors to the valid HTTPS URL. Thanks to our de-duplication process, the crawler redirects to its original capture of https://theaggie.org/ because it will not re-crawl the same URL it has already captured unless missing content is detected by Umbra.

    The asterisk is most helpful when reviewing captures that happened on different days. For example, if an asterisk appeared beside your May 19, 2016 seed, but not beside the next capture date (May 26, 2016), that would tell you at a glance that the site hadn’t changed since it was crawled the prior week. The California Aggie updates frequently, as demonstrated by the fact that there is an asterisk beside every capture, so it isn’t the best seed to demonstrate the advantage of this convention.

    As I explained in response to Sarah’s comment, except in cases where the crawler is logging in to a password-protected site to access content, every capture link for the same URL on the same date should display identically in Wayback.

    The two HTTPS captures that you referenced from March 9, 2017 are indeed an example of where Umbra detected URLs missed by Heritrix’s first crawl of https://theaggie.org/. Once Umbra detected the new content, it sent the information to Heritrix, which prompted it to re-crawl the URL.

    Crawl reports don’t distinguish between URLs detected by Heritrix and those passed to Heritrix by Umbra, but everything detected was crawled unless your crawl stopped due to a data, document, or time limit. In those cases, the URLs in the Queued column of your host report were next in line to be crawled.
