Workflow for downloading all PDFs from a PDF-only crawl

March 08, 2021 19:54

We're experimenting with using Archive-It to capture state government documents from agency websites. We ran a PDF-only crawl, hoping to download all of the PDFs so they can be saved locally and cataloged individually.

I see how to download a single PDF by navigating to the URL on the file type report and how to download the WARC for the crawl using WASAPI, but not how to get the whole set of PDFs. Does anyone have a workflow we could borrow for either downloading a batch of PDFs using the URLs in the file type report or for extracting them from the WARC?

Thanks so much!

Comments

2 comments

Karl Blumenthal March 09, 2021 23:18 (Edited March 09, 2021 23:18)
Hi Adriane,

A couple of options to start with:
- There are a number of browser extensions that perform the kind of bulk downloading you describe. I've used DownThemAll most of the time and it's met my needs. It should let you download all of the linked PDFs when you have your file type report's list of them in front of you. Or,
- If you can run wget locally, you can direct it to download that full list of PDF files. Download the list from the report, find/replace to add the Archive-It Wayback prefix onto the front of each URL,* and use wget's -i flag to point it at your list of download URLs.
Let us know how it goes, though!

* The "Archive-It Wayback prefix" is everything before the original live PDF's URL in an example like this one: https://wayback.archive-it.org/12264/3/http://www.adasoutheast.org/webinars/2018/disabilityhx/020818-ppt.pdf. It includes the Wayback domain, the Archive-It collection number, and a timestamp, the latter of which can be abbreviated to "3" in these cases in order to redirect to the most recent capture.
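
For anyone who would rather script the find/replace step than do it in a text editor, here is a minimal sketch in Python. The function names and the output file name are hypothetical, and it assumes you already have the live PDF URLs from the file type report as a plain list, one URL per item. It uses the abbreviated "3" timestamp described above.

```python
from pathlib import Path

# Wayback domain taken from the example URL in this thread.
WAYBACK_HOST = "https://wayback.archive-it.org"


def add_wayback_prefix(live_url, collection_id, timestamp="3"):
    """Put the Archive-It Wayback prefix in front of one live URL.

    The default timestamp "3" is the abbreviation that redirects to the
    most recent capture.
    """
    return f"{WAYBACK_HOST}/{collection_id}/{timestamp}/{live_url.strip()}"


def write_download_list(live_urls, collection_id, out_path):
    """Write one prefixed URL per line, skipping blank entries."""
    lines = [
        add_wayback_prefix(url, collection_id)
        for url in live_urls
        if url.strip()
    ]
    Path(out_path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return lines
```

If the list is written to a file named wayback_urls.txt (a hypothetical name), wget can then read it with: wget -i wayback_urls.txt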

Adriane Hanson March 11, 2021 21:50

Thanks Karl, this was super helpful! I've written a proof of concept script (https://github.com/uga-libraries/web-aip/blob/master/scripts/download_files.py) that uses the file type report and wget to download PDFs from multiple crawls at once. Our use case involves dozens of websites which will be split into multiple crawls, so I wanted to be able to handle them all in a single batch. And the script lets us make decisions about how the files are sorted (one folder per seed) and how the files are named. Next step is to talk to the curator about adjustments she wants to the workflow.

I can also see where DownThemAll would be the easier solution if we have a specific site or two of interest outside of our anticipated regular crawl schedule.
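
To illustrate the sorting and naming decisions mentioned above, here is a rough sketch of how a local path could be chosen for each PDF. This is not the code from the linked script. It assumes, as a simplification, that each seed corresponds to one hostname, and the function name and the "pdfs" root folder are hypothetical.

```python
from pathlib import PurePosixPath
from urllib.parse import unquote, urlparse


def local_path_for(live_url, root="pdfs"):
    """Return root/<hostname>/<file name> for one live PDF URL.

    Assumption: one seed per hostname, so the hostname stands in for
    the seed folder. Adjust if several seeds share a host.
    """
    parsed = urlparse(live_url)
    folder = parsed.netloc.lower() or "unknown-host"
    # Use the last path segment as the file name, with a fallback for
    # URLs that end in a slash or have no path at all.
    name = PurePosixPath(unquote(parsed.path)).name or "index.pdf"
    return PurePosixPath(root) / folder / name
```

Two PDFs with the same file name on the same host would map to the same path, so a real workflow would need a rule for renaming or skipping duplicates.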
