Capturing external links

Danielle Gillespie

April 14, 2023 13:24

Hello all, I'm fairly new to Archive-It, and so far I've mostly been crawling PDFs with no external links.

I'm currently trying to crawl this PDF, https://www.gov.nl.ca/jps/files/prosecutions-pp-guide-book.pdf, which links to other PDFs. I want to capture this document and the ones it links to so that all of the links are archived.

I've tried One Page+ and Standard+, but neither seems to archive the other PDFs. Maybe it has something to do with the structure of the document?

Is there a way to accomplish this?

Thanks for the help.

Comments

Skip Kendall April 14, 2023 13:57

Danielle,

I haven't tried this, but I suspect that the + seed types don't detect links within PDFs; they probably need the links to be on an HTML page. The only way I can think of to do it would be to add each of the linked PDFs as its own seed. If you did that, you could still make just the main document public, and the system should have no problem finding the other seeds, though that would depend on its ability to rewrite the links so that they stay within the archive. The Archive-It help desk could probably give you a definitive answer on whether or not this would work.
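If the document has a lot of links, pulling them out by hand gets tedious, so here is a rough, untested-in-production sketch of how the list of seed URLs might be built. It uses only the Python standard library and simply searches the raw PDF bytes for "/URI (...)" link actions. The function names (extract_link_uris, pdf_seeds) are made up for this example, and the approach assumes the link annotations are stored uncompressed; many PDFs keep them inside compressed object streams, in which case this finds nothing and a proper PDF library would be needed.

```python
import re

# Matches PDF link actions of the form "/URI (http://...)" in raw bytes.
# Backslash-escaped characters inside the parentheses are allowed.
URI_PATTERN = re.compile(rb"/URI\s*\(((?:\\.|[^\\)])*)\)")


def extract_link_uris(pdf_bytes: bytes) -> list[str]:
    """Return the unique link targets found in the raw PDF bytes, in order.

    Assumption: annotations are not inside compressed object streams.
    """
    found: list[str] = []
    for match in URI_PATTERN.finditer(pdf_bytes):
        raw = match.group(1)
        # Undo the PDF string escapes for "(", ")" and "\".
        uri = re.sub(rb"\\([()\\])", rb"\1", raw).decode("latin-1")
        if uri not in found:
            found.append(uri)
    return found


def pdf_seeds(uris: list[str]) -> list[str]:
    """Keep only the links whose path ends in .pdf (ignoring ?query and #fragment)."""
    keep = []
    for uri in uris:
        path = uri.split("#")[0].split("?")[0]
        if path.lower().endswith(".pdf"):
            keep.append(uri)
    return keep
```

To try it, download the guide book, read it with pathlib.Path("prosecutions-pp-guide-book.pdf").read_bytes(), pass the bytes to extract_link_uris, and then pass the result to pdf_seeds. Whatever comes back is the candidate list of URLs to add as seeds alongside the main document.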

Best,

Skip Kendall

Danielle Gillespie April 14, 2023 15:12

Thanks, Skip. Honestly, that's what I was afraid of. I'll check with the help desk and see what they recommend.
