Duplicate crawls in public-facing collection

March 13, 2017 23:52

If you go to https://archive-it.org/collections/5785 and check out the collection of The Aggie student newspapers from the UC Davis archives, starting May 3, 2015 (sporadically) and May 19, 2016 (every subsequent week), crawls show up twice, with the exact same date and time. The sporadic ones have slightly different crawl times, differing by up to an hour or so for the most part.

When we go to the backend, we can't find any reason why this would be: there's only one seed, and the crawling history dates don't correspond to the dates of the collection. However, most of the crawls with duplicates have a status of "Finished: Time Limit" rather than just "Finished." I couldn't find anything unique about the few that actually "finished," as the ratio of new capture data to total data captured varies. They're all weekly crawls, the only crawl limit is a time limit of 3 days, and there are a bunch of collection rules but no seed rules.

Why would these be showing up twice, only in one location?

Comments

5 comments

Sarah Weissman March 15, 2017 13:11

I have noticed this too with recent test crawls, that sometimes I get two capture links for some crawls. I haven't carefully compared the two captures, but in at least one case the earlier capture appears to have some missing images. It's confusing, especially when I'm running a bunch of test crawls, to have extra captures show up and to try to figure out which one I should be looking at. I hope this gets fixed.

Mary Haberle March 15, 2017 21:15
Great question, Meredith! In short, multiple entries appear on the same calendar page in the following circumstances:
- When variants of the same URL are captured (e.g., http/https, www/non-www, ending slash/no ending slash)
- When a URL is re-crawled because our Umbra technology has detected component URLs initially missed by Heritrix
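
As a rough illustration of the first case, the sketch below groups URL variants that differ only by scheme, a leading "www.", or a trailing slash. The `variant_key` helper is hypothetical and written just for this example; it is not how Archive-It itself matches URLs, and it ignores query strings.

```python
from urllib.parse import urlsplit

def variant_key(url):
    """Hypothetical helper: reduce a URL to a key that ignores scheme,
    a leading "www.", and a trailing slash. Query strings are ignored."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    path = parts.path.rstrip("/")
    return host + path

# Variants of the seed discussed in this thread
variants = [
    "http://theaggie.org/",
    "https://theaggie.org/",
    "https://www.theaggie.org",
]

# All three collapse to the same key, so each capture of any of them
# could plausibly appear as a separate link on the same calendar page.
keys = {variant_key(u) for u in variants}
print(keys)
```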
In the case of UC Davis and the other organizations who took part in the University of California’s Web Archiving Service (WAS), the history is a little more complicated.

Your institution’s Archive-It crawling history for http://theaggie.org/ begins June 18, 2015. However, there are captures prior to this date and up until September 17, 2015 which were collected using WAS and later migrated to Archive-It. This is why you see some links on your calendar page that do not correlate to Archive-It crawling history dates. It also accounts for some of the duplicate calendar dates; for instance, one of the September 10, 2015 captures was crawled via Archive-It and the other via WAS.

Although the calendar page does sometimes show multiple links for the same day for the aforementioned reasons, it is important to note that our "URL agnostic" de-duplication process protects your data budget. Unchanged content that has already been archived will not be archived again by our crawler; only new data collected by your crawls is applied against your annual data budget.

Additional Resources:

You can read more about our crawling technology, including the interaction between Heritrix and Umbra here.

And, this page from our FAQs is also relevant to your question.

Mary Haberle March 15, 2017 21:17

Hi Sarah, thanks for sharing your feedback. Typically there aren't display discrepancies between captures of the same URL on the same day, except in the case of sites where the crawler has to log in. If you could submit a ticket with some examples, I'd be happy to look into why you see otherwise.

Meredith Sweet March 15, 2017 22:46

Hi Mary,

Thanks for getting back to me with such a detailed answer!

One interesting thing I've noticed is that for some of the duplicate crawls (let's take May 19, 2016, as an example), the URLs are different, which would indicate some gap in time between the crawls. However, when I click on the link for the second crawl, it redirects to the first crawl!

The first May 19, 2016 crawl has a URL of http://wayback.archive-it.org/5785/20160519170149/https://theaggie.org/, with a timestamp of 17:01:49. However, the second URL is http://wayback.archive-it.org/5785/20160519184400/http://theaggie.org/, which should indicate a timestamp of 18:44:00 (and yes, a difference between http and https), and yet when you click on it, the timestamp is still 17:01:49... because the URL has changed to the first one!
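
For anyone comparing capture links like these, here is a minimal sketch that pulls the collection ID, timestamp, and original URL out of a capture link. It assumes the layout seen in the two URLs above (collection number, then a 14-digit YYYYMMDDhhmmss timestamp, then the original URL); `parse_capture_url` is a made-up name for this example.

```python
import re
from datetime import datetime

# Layout assumed from the capture links quoted in this thread:
# http://wayback.archive-it.org/<collection>/<14-digit timestamp>/<original URL>
WAYBACK_RE = re.compile(
    r"^https?://wayback\.archive-it\.org/(\d+)/(\d{14})/(.+)$"
)

def parse_capture_url(url):
    """Hypothetical helper: return (collection, capture datetime, original URL)."""
    match = WAYBACK_RE.match(url)
    if match is None:
        raise ValueError("not a recognized capture URL: " + url)
    collection, stamp, original = match.groups()
    return collection, datetime.strptime(stamp, "%Y%m%d%H%M%S"), original

first = parse_capture_url(
    "http://wayback.archive-it.org/5785/20160519170149/https://theaggie.org/"
)
second = parse_capture_url(
    "http://wayback.archive-it.org/5785/20160519184400/http://theaggie.org/"
)

print(first[1], first[2])    # capture time and original URL of the first link
print(second[1], second[2])  # same for the second link
print(second[1] - first[1])  # gap between the two timestamps: 1:42:11
```

This only reads the timestamp written in the link; it does not tell you which capture the Wayback viewer will actually display after any redirect.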

This isn't always the case; one of the more recent crawls (March 9, 2017) appears three times. The first two are https crawls, while the third is http, but they all have different timestamps. Every single crawl for this collection has the * on it, so as a guide indicating that something has changed, it's not very helpful.

So did something actually get missed by Heritrix, or not? Should the Crawl Reports indicate what was missed (or the Missing URLs)?

Mary Haberle March 16, 2017 16:33

Meredith, excellent follow-up questions!

Sometimes, as in your May 19, 2016 example, the https://theaggie.org/ site includes at least one link to the HTTP version of that URL. Our crawler attempted to capture it, even though there is no content on the HTTP page and it exists simply to redirect site visitors to the valid HTTPS URL. Thanks to our de-duplication process, the crawler redirects to its original capture of https://theaggie.org/ because it will not re-crawl the same URL it has already captured unless missing content is detected by Umbra.

The asterisk is most helpful when reviewing captures that happened on different days. For example, if an asterisk appeared beside your May 19, 2016 seed, but not beside the next capture date (May 26, 2016), that would tell you at a glance that the site hadn’t changed since it was crawled the prior week. The California Aggie updates frequently, as demonstrated by the fact that there is an asterisk beside every capture, so it isn’t the best seed to demonstrate the advantage of this convention.

As I explained in response to Sarah’s comment, except in cases where the crawler is logging in to a password-protected site to access content, every capture link for the same URL on the same date should display identically in Wayback.

The two HTTPS captures that you referenced from March 9, 2017 are indeed an example of where Umbra detected URLs missed by Heritrix’s first crawl of https://theaggie.org/. Once Umbra detected the new content, it sent the information to Heritrix, which prompted it to re-crawl the URL.

Crawl reports don’t distinguish between URLs detected by Heritrix and those passed to Heritrix by Umbra, but everything detected was crawled unless your crawl stopped due to a data, document, or time limit. In those cases, the URLs in the Queued column of your host report were next in line to be crawled.
