Webpages at one URL, linked PDF docs on a different URL: how to capture both?
Hi there - I'm looking for some assistance with a crawl I'm running. I'm pretty new to scoping crawls, so any help is appreciated.
My primary URL for crawling is a university HR program website with a URL like www.mit.edu/hr/program/
There are PDFs and images linked from those webpages that come from www.mit.edu/groups/hr/
How can I set up a crawl so that the linked PDF and image files from www.mit.edu/groups/hr are captured and accessible from the www.mit.edu/hr/program/ pages?
Thanks for all replies.
Kari S.
Digital Archivist, MIT Institute Archives and Special Collections
-
Hi Kari,
If your crawl isn't picking up those PDFs and images, what I usually do is click on the seed you've added to your account, go to the "Seed Scope" tab at the seed level, and then add the URL that hosts the PDFs (www.mit.edu/groups/hr/) as an "expand scope to include URL if" rule. Definitely run a test crawl with that to make sure you're picking up what you need and not a ton of extra files or images. I would also click on the PDFs themselves and check what their URLs are (I tried clicking on your links, but you must have to be logged in). Sometimes I also have to add the first part of the PDF or image URLs as an additional "expand scope to include URL if" rule in order to capture all of them.
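If it helps, here's a quick way to check those link URLs outside of Archive-It - just a rough Python sketch, not anything official. It assumes you have the requests and beautifulsoup4 packages installed, and the page URL below is just the example from your question; it prints the addresses of any PDFs and images linked from a page so you can see which paths might need their own scope rules:

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# Example page from the question; swap in the actual HR program page you are crawling.
page_url = "https://www.mit.edu/hr/program/"

html = requests.get(page_url, timeout=30).text
soup = BeautifulSoup(html, "html.parser")

# Collect absolute URLs of linked PDFs and common image types so you can see
# which host/path (e.g. /groups/hr/) they actually live under.
doc_links = set()
for tag, attr in (("a", "href"), ("img", "src")):
    for el in soup.find_all(tag):
        target = el.get(attr)
        if not target:
            continue
        absolute = urljoin(page_url, target)
        if absolute.lower().endswith((".pdf", ".jpg", ".jpeg", ".png", ".gif")):
            doc_links.add(absolute)

for link in sorted(doc_links):
    print(link)
```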
Hope that gives you some ideas!
Marissa
Wisconsin Historical Society