De-duping seeds

Comments

3 comments

  • Karl Blumenthal

    Hi, John.

    The 5.0 web application checks for duplication across all of your account’s collections as you add new seeds to any one of them. It will not stop you from adding a seed that you ultimately want to appear in multiple collections, but it will give you a warning.

    See for instance what happens when I add this new seed to one of the Archive-It Demo Account’s collections:

    Mouse over that little orange “X” indicator next to the seed here and the web app alerts us to the fact that this seed is already part of another collection in the same account:



    However, if you would like to reference them before getting to this stage, you can also download a complete list of the seeds in each collection with the “Download Seed List” link:


    I hope this clears things up, but let us know if you don’t see the same on your end, of course!

  • John Rees

    Thanks for the response! It seems I mischaracterized my issue. What I'm after is identifying already-archived documents/data to prevent duplication.

    For example: Using the search function, I know https://www.nlm.nih.gov/exhibition/emotions/additional.html (not a seed, but a document) exists in Collection A. Given other known document urls, other than entering them individually as search terms, how can I discover if they've already been crawled (is there bulk access to all the crawl logs across my account that I could interrogate)?

    Or does AIT's data de-duplication algorithm figure this out for me?

    Or is this something AIT can't do at all?

  • Karl Blumenthal

    In your example, yes, AIT would take care of the de-duplication for you in “Collection A.” Unless its content changes between crawls, a URL will not be archived again and its data will not be added to your budget.

    This de-duplication does not, however, occur *across* collections. Every document belongs to one collection or another, so if a URL hasn’t been crawled in, say, Collection B, it will not be de-duped against Collection A. It will be new to Collection B and therefore unique: crawled, archived, and added to your budget.
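    To make the per-collection behavior concrete, here is a minimal Python sketch of the logic described above. The function and data structures are purely illustrative, not part of Archive-It; the point is that de-duplication only consults the target collection's own crawl history:

    ```python
    # Hypothetical model of per-collection de-duplication (illustrative only).

    def plan_crawl(candidate_urls, archived_by_collection, collection):
        """Return the URLs that would be newly archived in `collection`.

        De-duplication checks only that collection's own history, so a URL
        already archived in a *different* collection still counts as new here.
        """
        seen = archived_by_collection.get(collection, set())
        return [url for url in candidate_urls if url not in seen]

    archived = {
        "Collection A": {"https://www.nlm.nih.gov/exhibition/emotions/additional.html"},
        "Collection B": set(),
    }
    url = "https://www.nlm.nih.gov/exhibition/emotions/additional.html"

    print(plan_crawl([url], archived, "Collection A"))  # already archived -> []
    print(plan_crawl([url], archived, "Collection B"))  # new to B -> [url]
    ```

    The same URL is skipped in Collection A but crawled (and billed) in Collection B.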

    In other words, I think that the best way to find out which URLs have or haven’t been archived is to crawl your seeds :-) even just as a test if you need a quick impression. The new URLs will be archived, and the old ones won’t, but you can see full lists of both in your report’s “New” and “Total” documents columns.
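    If you export those two URL lists from a report, the already-archived documents fall out as a simple set difference. A hypothetical sketch (the list names are made up; substitute your exported data):

    ```python
    # Given the "Total" and "New" document URL lists from a crawl report,
    # the URLs that were already archived are Total minus New.

    def already_archived(total_urls, new_urls):
        return sorted(set(total_urls) - set(new_urls))

    total = ["https://example.org/a", "https://example.org/b", "https://example.org/c"]
    new = ["https://example.org/c"]

    print(already_archived(total, new))
    # -> ['https://example.org/a', 'https://example.org/b']
    ```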

    Let me know if this still misses the target, though!

