Website is at one URL but linked PDF and Images are at different base URL - how to capture both?

Comments

  • Carolina Roman Amigo

    Hello Kari,

    You can crawl both seeds and then delete the PDF/images seed (www.mit.edu/groups/hr/) afterwards so that it doesn't show up on your public collections page. The content will still be archived and will be accessible from the links at www.mit.edu/hr/program/. You can check this first by running a test crawl and reviewing the results through the Wayback Machine link that appears beside each crawled seed.

    I hope that helps!

    Carol

  • Karl-Rainer Blumenthal

    Simpler still, I would recommend adding a little "scope expansion" to your future crawls, Kari -- one that explicitly tells our crawling tech to archive the desired PDFs from that groups/hr/ directory even when it would normally rule them "out of scope" based on the seed URL. Using this approach, there would be no need to add any additional seeds. Here's how to do it >>>

    Expand the scope of your crawls to include URLs if they contain the following text: mit.edu/groups/hr/

    You can use the "Crawl Scope" tab to do this at the collection level, which ensures that any subsequent crawls run in your collection archive the documents in that directory.

    Alternatively, you can click on the specific Seed URL under the Seeds tab, then open the "Seed Scope" tab to make this change only for crawls of that specific seed.

    More background information on why and how to expand the scope of your crawls: https://support.archive-it.org/hc/en-us/articles/208001106-Expand-the-scope-of-your-crawl

    Carolina's right on the money with that test crawl recommendation, too! Launching a test in advance of adding any new data permanently to your collection is a great way to make sure we've taken the right approach before spending any of your data budget, and if need be, revising it. If this is your first time using a test crawl to evaluate a scoping strategy, here's a little illustrated guide: https://support.archive-it.org/hc/en-us/articles/208001226-Run-monitor-and-save-a-test-crawl

    Let us know of course if that does or doesn't do the trick!

  • Kari Smith

    Thank you both very much. I did run a test crawl with the /hr/program/ URL as my seed. That's how I noticed that the PDFs and images weren't collected and that they stemmed from a different URL.

    I will try expanding the scope of the crawl. I am also interested in trying Carolina's suggestion of deleting the seed afterwards so it doesn't show in the public access view for the collection.

    Best,

    Kari
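
For anyone who wants to see the logic of the "expand scope if the URL contains this text" rule discussed in this thread, here is a rough Python sketch. It is not Archive-It's crawler code. It assumes the default scope means "same host, path under the seed's directory", that the seed is served over https, and the file names in the usage example are made up for illustration. Only the seed path (www.mit.edu/hr/program/) and the expansion text (mit.edu/groups/hr/) come from the posts above.

```python
from urllib.parse import urlsplit

# Seed and expansion text from the thread; the https scheme is an assumption.
SEED = "https://www.mit.edu/hr/program/"
EXPANSION_TEXT = "mit.edu/groups/hr/"


def in_default_scope(url, seed=SEED):
    """Assumed default rule: same host, and the path sits under the seed's directory."""
    u, s = urlsplit(url), urlsplit(seed)
    return u.netloc == s.netloc and u.path.startswith(s.path)


def in_scope(url, seed=SEED, expansions=(EXPANSION_TEXT,)):
    """A URL is kept if it is in default scope OR contains any expansion text."""
    return in_default_scope(url, seed) or any(text in url for text in expansions)


if __name__ == "__main__":
    # Hypothetical URLs, used only to exercise the rule.
    candidates = [
        "https://www.mit.edu/hr/program/index.html",
        "https://www.mit.edu/groups/hr/example.pdf",
        "https://www.mit.edu/other/page.html",
    ]
    for url in candidates:
        print(url, "in scope" if in_scope(url) else "out of scope")
```

The point of the sketch is that the expansion is a plain substring match on the whole URL, so the PDFs and images under groups/hr/ qualify without needing a second seed, while unrelated mit.edu pages stay out of scope.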
