Scoping / crawling OrgSync sites

Amy Wickner

June 15, 2017 18:20

If anyone has experience crawling OrgSync (http://www.orgsync.com/) pages / sites, I'd love to hear how you made it work. Here's an example: http://orgsync.umd.edu/ and a recent test crawl that doesn't preserve functionality: https://wayback.archive-it.org/2410-test/20170601094218/https://orgsync.umd.edu/browse_student_organizations

Comments

Karl Blumenthal June 22, 2017 16:55

Hi, Amy!

Thanks for bringing OrgSync up -- I hadn't seen it before, but I bet that a great many partners either have this service in their scopes already or soon will.

In your case, the content of the OrgSync instance that you want to crawl is all embedded from a different host; you can see the original with all of its functionality here: http://umd.orgsync.com/ We might therefore be able to scope in more material from this host, but I found it easiest to just crawl it as its own seed, which produced the functioning result here: https://wayback.archive-it.org/6489-test/20170620232451/http://umd.orgsync.com/

This approach effectively puts all umd.orgsync.com-hosted documents in scope and, even better, puts our Heritrix-helper technology Umbra to work where we need it to capture that interactive functionality.
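To illustrate why the separate seed matters, here is a minimal Python sketch of a host-based scope check. It is only an assumption-laden simplification, not Archive-It's or Heritrix's actual scoping logic: the function names `host_of` and `in_seed_scope` are made up for this example, and the embedded document path on umd.orgsync.com is hypothetical.

```python
from urllib.parse import urlsplit


def host_of(url: str) -> str:
    """Return the lowercased hostname of a URL ('' if it has none)."""
    return (urlsplit(url).hostname or "").lower()


def in_seed_scope(url: str, seed: str) -> bool:
    """Simplified rule (an assumption for illustration only):
    a URL is in scope when it shares the seed's exact host."""
    return host_of(url) == host_of(seed)


original_seed = "http://orgsync.umd.edu/"
helper_seed = "http://umd.orgsync.com/"

# Hypothetical path, standing in for a document embedded from the other host.
embedded_doc = "http://umd.orgsync.com/browse_student_organizations"

print(in_seed_scope(embedded_doc, original_seed))
print(in_seed_scope(embedded_doc, helper_seed))
```

Under this simplified rule, the embedded document falls outside the scope of the orgsync.umd.edu seed but inside the scope of the umd.orgsync.com seed, which is the effect of adding the second host as its own seed.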

Like all "helper" seeds, this new one could be marked "private" in the web app if you prefer not to display it for front-end access on archive-it.org. Users would still see its results replay when they navigate to the UMD OrgSync page from the main seed in your original crawl.

Let us know if this does the trick or not, though!
