Capturing external links
Hello all, I'm fairly new to Archive-It, and so far I've mostly been archiving PDFs with no external links.
I'm currently trying to crawl this PDF, which links to other PDFs: https://www.gov.nl.ca/jps/files/prosecutions-pp-guide-book.pdf I want to capture this document along with the ones it links to, so that all of the links are archived.
I've tried One Page+ and Standard+, but neither seems to archive the other PDFs. Maybe it has something to do with the structure of the document?
Is there a way to accomplish this?
Thanks for the help.
-
Danielle,
I haven't tried this, but I suspect that the + seed types don't detect links within PDFs. They probably need the links to be on an HTML page to do that. The only way I can think of to do it would be to add each of the linked PDFs as its own seed. If you did that, you could still make just the main document public, and the system should have no problem finding the other seeds, though that would depend on the system's ability to rewrite the links to keep them within the archive. The Archive-It help desk could probably give you a definitive answer on whether or not this would work.
Best,
Skip Kendall
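If the linked PDFs do have to be added as separate seeds, the list of URLs could be pulled out of the main document with a short script rather than by clicking every link. The following is only a rough sketch using the Python standard library, not an Archive-It feature: the function names `extract_link_uris` and `pdf_seed_urls` are made up for illustration, and the approach assumes the links are stored as plain `/URI (...)` entries, either uncompressed or inside Flate-compressed streams. Encrypted PDFs or unusual encodings would need a proper PDF library.

```python
import re
import zlib
from urllib.parse import urlparse

# A link action in a PDF usually looks like: /URI (https://example.org/file.pdf)
URI_PATTERN = re.compile(rb"/URI\s*\(((?:\\.|[^\\()])*)\)")
# Raw stream bodies, which are often Flate (zlib) compressed.
STREAM_PATTERN = re.compile(rb"stream\r?\n(.*?)\r?\nendstream", re.DOTALL)


def extract_link_uris(pdf_bytes):
    """Return the unique link targets found in the PDF, in order of appearance."""
    chunks = [pdf_bytes]
    for match in STREAM_PATTERN.finditer(pdf_bytes):
        try:
            # Try to inflate each stream; skip the ones that are not zlib data.
            chunks.append(zlib.decompressobj().decompress(match.group(1)))
        except zlib.error:
            continue

    uris = []
    for chunk in chunks:
        for match in URI_PATTERN.finditer(chunk):
            # Undo the PDF escapes for parentheses and backslashes only.
            raw = re.sub(rb"\\([()\\])", rb"\1", match.group(1))
            uri = raw.decode("latin-1")
            if uri not in uris:
                uris.append(uri)
    return uris


def pdf_seed_urls(uris):
    """Keep only the links whose path ends in .pdf (candidate seeds)."""
    return [u for u in uris if urlparse(u).path.lower().endswith(".pdf")]
```

To use it, download the main document, read it with `open("prosecutions-pp-guide-book.pdf", "rb").read()` (the local filename is just an example), pass the bytes to `extract_link_uris`, and print the result of `pdf_seed_urls` one URL per line so the list can be pasted in as seeds. Whether the archived copy of the main document then points at the archived copies of those seeds still depends on how the system rewrites links inside PDFs.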