Crawling eLibrarys
I wondered if anyone had attempted to crawl an eLibrary (see link). It doesn't look exactly conducive to crawling due to the multiple levels of links, but I'm not sure. In this case, I got a 403 - Forbidden error, which suggests the error is coming from the server I'm crawling rather than Archive-It.
Note: I'm a government librarian and this is all available for public access as it is. Not trying to get protected material here.
https://www.ethicsrulings.pa.gov/WebLink/Browse.aspx?dbid=0&repo=Ethics&cr=1
Official comment
eLibrarys is a challenging platform for web archiving because of its document-tree structure and the POST requests its links rely on.
Sometimes multiple crawls with some additional seeds can help collect different parts of the site. The Standard crawler can help collect the documents (often PDFs), and Brozzler can help with some POST requests, though these often can't be replayed in Wayback.
The HTTP 403 (Forbidden) error is coming from the website owners' servers (see this list of Error Codes).
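One rough way to see what the site's server returns outside of a crawl is to request the seed URL directly and look at the status code. The sketch below is only an illustration using Python's standard library. The function names `fetch_status` and `describe_status` and the example User-Agent string are hypothetical, not part of Archive-It or any crawler.

```python
import urllib.error
import urllib.request


def fetch_status(url, user_agent="example-status-check/0.1", timeout=30):
    """Send one GET request and return the HTTP status code the server sends.

    The User-Agent value is a made-up placeholder for illustration only.
    """
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises HTTPError for 4xx/5xx responses; the code is on the error.
        return err.code


def describe_status(code):
    """Map a status code to a short, human-readable label."""
    if code == 403:
        return "forbidden: the site's server refused the request"
    if 200 <= code < 300:
        return "ok"
    if 400 <= code < 500:
        return "client error"
    if 500 <= code < 600:
        return "server error"
    return "other"
```

Calling `describe_status(fetch_status(seed_url))` with the seed URL from the question would show how the server answers a plain request from your own machine. A server can treat requests differently depending on the User-Agent or the IP address they come from, so a successful response here would not rule out the crawler being blocked.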
If you submit a support ticket, we can provide more specific strategies for collecting this instance of eLibrarys or information website owners may need for their Allow Lists.