I'm looking for a way to identify all existing seed URLs across several collections so we avoid crawling them multiple times as individual seeds.
The 3.6 release notes (I don't see this mentioned in any more recent user guides) state that Archive-It will de-dupe newly added seeds both within a collection and across all collections. In my experience, this only seems to be true for de-duping within a collection.
Can someone confirm whether this should work across all collections? Using the Add Seeds function in both 5.0 and 4.9, I only get the duplicate seed warning within the collection where the seeds already exist.
Alternatively, is there a way to bulk-download all URLs captured in a collection without going through each crawl log?
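As a stopgap, I've been de-duping locally against exported seed lists before adding anything new. A rough sketch of that check (the collection names and seed URLs below are placeholders, not real data):

```python
from urllib.parse import urlsplit, urlunsplit

def normalize(url: str) -> str:
    """Normalize a seed URL so trivial variants compare equal:
    lowercase the scheme and host, drop a trailing slash on the path."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       path, parts.query, parts.fragment))

def cross_collection_duplicates(seed_lists: dict) -> dict:
    """Given {collection_name: [seed URLs]}, return only the URLs that
    appear in more than one collection, mapped to those collections."""
    seen = {}
    for collection, urls in seed_lists.items():
        for url in urls:
            seen.setdefault(normalize(url), set()).add(collection)
    return {u: sorted(c) for u, c in seen.items() if len(c) > 1}

# Placeholder example: seed lists exported from two collections
dupes = cross_collection_duplicates({
    "collection-a": ["https://Example.org/news/", "https://example.com/"],
    "collection-b": ["https://example.org/news"],
})
print(dupes)  # {'https://example.org/news': ['collection-a', 'collection-b']}
```

This catches only trivial URL variants (case, trailing slash); it doesn't know about redirects or http/https equivalence, so it's a pre-check rather than a replacement for real de-duping in the application.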