Workflow for downloading all PDFs from a PDF-only crawl

Comments

  • Karl Blumenthal

    Hi Adriane,

    A couple of options to start with:

    • Several browser extensions can do the kind of bulk downloading that you describe. I've mostly used DownThemAll, and it has met my needs. It should let you download all of the linked PDFs once you have your file type report's list of them open in front of you. Or,

    • If you can run wget locally, you can direct it to download that full list of PDF files. Download the list from the report, use find/replace to add the Archive-It Wayback prefix to the front of each URL,* and use wget's -i flag to point it at your list of download URLs.

    Let us know how it goes, though!

    * The "Archive-It Wayback prefix" is everything before the original live PDF's URL in an example like this one: https://wayback.archive-it.org/12264/3/http://www.adasoutheast.org/webinars/2018/disabilityhx/020818-ppt.pdf. It includes the Wayback domain, the Archive-It collection number, and a timestamp. In these cases the timestamp can be abbreviated to "3", which redirects to the most recent capture.
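
    As a rough sketch of the wget option, assuming the report's URLs have been saved one per line in a plain text file: the collection number 12264 is taken from the example above, and the function and file names are purely illustrative, not anything Archive-It provides.

```python
from pathlib import Path

# Assumed values: collection 12264 comes from the example URL above, and "3"
# is the abbreviated timestamp that redirects to the most recent capture.
WAYBACK_PREFIX = "https://wayback.archive-it.org/12264/3/"


def add_wayback_prefix(urls, prefix=WAYBACK_PREFIX):
    """Return the Wayback download URL for each non-empty live URL."""
    return [prefix + url.strip() for url in urls if url.strip()]


def write_download_list(source, destination, prefix=WAYBACK_PREFIX):
    """Read live PDF URLs (one per line) and write the prefixed list for wget."""
    live_urls = Path(source).read_text(encoding="utf-8").splitlines()
    wayback_urls = add_wayback_prefix(live_urls, prefix)
    Path(destination).write_text("\n".join(wayback_urls) + "\n", encoding="utf-8")
    return wayback_urls
```

    Calling write_download_list("pdf_urls.txt", "wayback_urls.txt") (both file names hypothetical) and then running wget -i wayback_urls.txt should fetch each file in the list.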

  • Adriane Hanson

    Thanks Karl, this was super helpful! I've written a proof of concept script (https://github.com/uga-libraries/web-aip/blob/master/scripts/download_files.py) that uses the file type report and wget to download PDFs from multiple crawls at once. Our use case involves dozens of websites which will be split into multiple crawls, so I wanted to be able to handle them all in a single batch. And the script lets us make decisions about how the files are sorted (one folder per seed) and how the files are named. Next step is to talk to the curator about adjustments she wants to the workflow.

    I can also see where DownThemAll would be the easier solution if we have a specific site or two of interest outside of our anticipated regular crawl schedule.
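
    For anyone curious about the sorting idea, here is a minimal sketch of planning one folder per site. It is not how the linked script works: it assumes the live URL's hostname can stand in for the seed, and the function name and the "pdfs" root folder are hypothetical.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def plan_downloads(live_urls, prefix, root="pdfs"):
    """Pair each Wayback download URL with a local path, one folder per host."""
    plan = []
    for url in live_urls:
        parts = urlsplit(url)
        # Use the last path segment as the file name; the fallback name is an
        # arbitrary choice for URLs that end in a slash or have no path.
        filename = PurePosixPath(parts.path).name or "index.pdf"
        plan.append((prefix + url, f"{root}/{parts.netloc}/{filename}"))
    return plan
```

    Each pair could then be handed to wget with its -O flag once the folder exists. Two PDFs with the same file name on the same host would collide, so a real workflow would need its own naming rule.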

     

