De-duping seeds

September 07, 2016 20:40

I'm trying to find a method for identifying all the existing URLs across several collections to avoid crawling them multiple times as individual seeds.

The 3.6 release notes (I don't see it in any more recent user guides) state that Archive-It will de-dupe newly added seeds within a collection and across all collections. In my experience, this only seems to hold within a collection.

Can someone confirm this should work across all collections? Using the Add Seeds function in both 5.0 and 4.9, I only get the duplicate seed warning within the collection where the seeds already exist.

Alternatively, is there a method to bulk download all URLs captured in a collection without having to go through each crawl log?

Thanks!

Comments

3 comments

Karl Blumenthal September 13, 2016 19:22

Hi, John.

The 5.0 web application should check for duplication across all of your account’s collections as you add new seeds to any one collection. It will not stop you from adding a seed, in case you do want it to appear in multiple collections, but it will show a warning.

See for instance what happens when I add this new seed to one of the Archive-It Demo Account’s collections:

Mouse over that little orange “X” indicator next to the seed here and the web app alerts us to the fact that this seed is already part of another collection in the same account:

However, if you would like to check for duplicates before getting to this stage, you can always download a complete list of the seeds in each collection with the “Download Seed List” link.
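Once the seed lists are downloaded, comparing them across collections can be scripted. The following is only a minimal sketch, assuming each list has been read into Python as a list of URL strings. The function names and the normalization rule (ignore the scheme, lowercase the host, drop a trailing slash) are illustrative assumptions, not how the web application itself compares seeds.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def normalize(url):
    """Loosely normalize a seed URL for comparison.

    Assumption: scheme differences, host case, and a trailing slash
    are ignored. This is a guess at a useful rule, not Archive-It's own.
    """
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    path = parts.path.rstrip("/")
    query = f"?{parts.query}" if parts.query else ""
    return f"{host}{path}{query}"


def seeds_in_multiple_collections(seed_lists):
    """Map each normalized seed to the collections that contain it,
    keeping only seeds found in more than one collection.

    seed_lists: dict of collection name -> list of seed URL strings.
    """
    seen = defaultdict(set)
    for collection, urls in seed_lists.items():
        for url in urls:
            if url.strip():  # skip blank lines from the downloaded file
                seen[normalize(url)].add(collection)
    return {seed: sorted(names) for seed, names in seen.items() if len(names) > 1}
```

Calling `seeds_in_multiple_collections({"A": [...], "B": [...]})` returns a dictionary keyed by normalized seed, so an empty result means no seed is shared between the collections passed in.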

I hope this clears things up, but let us know if you don’t see the same on your end, of course!

John Rees September 14, 2016 15:32

Thanks for the response! Seems I mis-characterized my issue. What I'm after is identifying documents/data to prevent duping.

For example: using the search function, I know https://www.nlm.nih.gov/exhibition/emotions/additional.html (not a seed, but a document) exists in Collection A. Given other known document URLs, how can I discover whether they've already been crawled, other than by entering them individually as search terms? Is there bulk access to all the crawl logs across my account that I could interrogate?

Or does AI's data de-duplication algorithm figure this out for me?

Or can't AI do this at all?

Karl Blumenthal September 29, 2016 19:12

In your example, yes, AIT would take care of the de-duplication for you in “Collection A.” Unless it changes between crawling periods, a URL will not be archived again and the data will not be added to your budget.

This de-duplication does not, however, occur *across* collections. Every document belongs to one collection or another, so if a URL hasn’t been crawled in, say, Collection B, it will not be de-duped against Collection A. It will be new to Collection B and therefore treated as unique: crawled, archived, and added to your budget.
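To make the scoping concrete, here is a toy model of per-collection de-duplication. It is purely illustrative and not Archive-It's actual implementation: the `ToyArchive` class, the use of a SHA-1 digest to detect changed content, and the in-memory dictionary are all assumptions made for the sketch.

```python
import hashlib


class ToyArchive:
    """Toy model: de-duplication is keyed on (collection, URL), so a URL
    already held in one collection is still new to every other collection."""

    def __init__(self):
        # (collection, url) -> digest of the most recently archived content
        self._digests = {}

    def capture(self, collection, url, content):
        """Return True if the content is archived (and would count against
        the budget), False if it is skipped as an unchanged duplicate."""
        digest = hashlib.sha1(content).hexdigest()
        key = (collection, url)
        if self._digests.get(key) == digest:
            return False  # unchanged since the last crawl of this collection
        self._digests[key] = digest
        return True
```

In this model, capturing the same unchanged page twice in Collection A archives it only once, while the first capture of that page in Collection B is archived again because the lookup key includes the collection.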

In other words, I think that the best way to find out which URLs have or haven’t been archived is to crawl your seeds :-) even just as a test crawl if you need a quick impression. The new URLs will be archived and the old ones won’t, but you can see full lists of both in your report’s “New” and “Total” documents columns.
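If you have a list of known document URLs to check, the comparison against those two report lists can be sketched as below. This assumes you have already copied the “Total” and “New” URL lists for a crawl into plain Python lists; `classify_known_urls` and its labels are hypothetical names for illustration, not part of any Archive-It tool.

```python
def classify_known_urls(known_urls, total_urls, new_urls):
    """Label each known URL using one crawl report's URL lists.

    Assumption: total_urls holds every document the crawl encountered and
    new_urls holds the subset archived as new, so Total minus New is what
    the collection already held.
    """
    total = set(total_urls)
    new = set(new_urls)
    previously_archived = total - new

    result = {}
    for url in known_urls:
        if url in new:
            result[url] = "new in this crawl"
        elif url in previously_archived:
            result[url] = "already archived in this collection"
        else:
            result[url] = "not encountered by this crawl"
    return result
```

A URL labeled “not encountered by this crawl” only means the crawl did not reach it; it says nothing about whether another collection holds it.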

Let me know if this still misses the target, though!
