Website is at one URL but linked PDFs and images are at a different base URL - how to capture both?
Hi there - I'm looking for some help with a crawl I'm running. I'm pretty new to scoping crawls, so I appreciate any assistance.
My primary URL for crawling is a university HR program website with a URL like: www.mit.edu/hr/program/
There are PDFs and images linked from the webpages that come from www.mit.edu/groups/hr/
How can I set up a crawl so that the linked PDF and image files from www.mit.edu/groups/hr are accessible from the www.mit.edu/hr/program/ pages in the Wayback?
Thanks in advance for any replies.
Kari S.
Digital Archivist, MIT Institute Archives and Special Collections
-
Hello Kari,
You can crawl both seeds and delete the PDF/image seed (www.mit.edu/groups/hr/) afterward so it doesn't show up on your public collections page. The content will still be there and will be accessible from the links at www.mit.edu/hr/program/. You can test this first by running a test crawl and reviewing the results using the Wayback Machine view link that appears beside each crawled seed.
I hope that helps!
Carol
-
Simpler still, I would recommend adding a little "scope expansion" to your future crawls, Kari -- one that explicitly tells our crawling tech to archive the desired PDFs from that groups/hr/ directory even when it would normally rule them "out of scope" based on the seed URL. Using this approach, there would be no need to add any additional seeds. Here's how to do it >>>
Expand the scope of your crawls to include URLs if they contain the following text: mit.edu/groups/hr/
You can use the "Crawl Scope" tab to do this at the collection level and ensure that any subsequent crawls run in your collection archive the documents in that directory.
Alternatively, you can click on the specific seed URL under the Seeds tab, then open the "Seed Scope" tab in order to effect this change only for crawls of that specific seed.
More background information on why and how to expand the scope of your crawls: https://support.archive-it.org/hc/en-us/articles/208001106-Expand-the-scope-of-your-crawl
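For the more technically inclined: the rule above is just a substring match, so any URL containing the text you enter gets pulled into scope. Here is a minimal Python sketch of that behavior -- an illustration only, not our actual crawler code, and the file names in it are made up for the example:

RULE_TEXT = "mit.edu/groups/hr/"

def matched_by_expansion(url):
    # True if the URL contains the rule text anywhere in it
    return RULE_TEXT in url

examples = [
    "http://www.mit.edu/hr/program/index.html",      # already in scope via the seed
    "http://www.mit.edu/groups/hr/handbook.pdf",     # pulled in by the expansion
    "http://www.mit.edu/groups/hr/img/photo.jpg",    # pulled in by the expansion
    "http://www.mit.edu/groups/finance/report.pdf",  # still out of scope
]

for url in examples:
    print(url, "->", "in expanded scope" if matched_by_expansion(url) else "not matched by this rule")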
Carolina's right on the money with that test crawl recommendation, too! Launching a test before adding any new data permanently to your collection is a great way to make sure we've taken the right approach before spending any of your data budget, and to revise that approach if need be. If this is your first time using a test crawl to evaluate a scoping strategy, here's a little illustrated guide: https://support.archive-it.org/hc/en-us/articles/208001226-Run-monitor-and-save-a-test-crawl
Let us know of course if that does or doesn't do the trick!
-
Thank you both very much. I did run a test crawl with the /hr/program/ URL as my seed. That's how I noticed that the PDFs and images weren't collected and that they came from a different URL.
I will try the scope expansion method, and I'm also interested in trying Carolina's suggestion of deleting the seed afterward so it doesn't show on the public page for the collection.
Best,
Kari