Crawler refuses to collect?

Brian Hamilton

August 16, 2023 12:50

I'm working on a project to collect local county codes, and a couple websites are confounding my attempts to do so.

For example, I'm trying to crawl https://codelibrary.amlegal.com/codes/annearundel/latest/. I have tried crawling this page with every seed type and with both Standard and Brozzler. The crawl reports having collected data and pages, but when I try to access the archived pages via Wayback, the site only defaults to https://codelibrary.amlegal.com/codes/annearundel/latest/overview.

Any thoughts on what I'm doing wrong? Or is the site just not friendly to Archive-It's crawling methods?

Comments

4 comments

Skip Kendall August 16, 2023 12:57

Doesn't appear to be a crawling problem as I get the same behavior in my browser. Looks like they automatically redirect to the overview page if you try to land on /latest/. Seems like you should be able to click on the links in the left menu and get to what you want.

Brian Hamilton August 16, 2023 13:11 (Edited August 16, 2023 13:11)

Right?!

But the links on the left menu just redirect to the /latest page. Same when I try to reduce the seed down to https://codelibrary.amlegal.com/codes/annearundel. The crawl redirects to /latest/, or I get a 404 error.
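One way to narrow this down outside the browser is to check whether the jump to /latest/ or /overview is a real HTTP redirect or something the page does in JavaScript after it loads. Below is a minimal sketch using only the Python standard library; `trace_redirects` is a hypothetical helper name, not part of any Archive-It tooling. If it reports no hops for a URL that still changes in the browser, the redirect is probably happening client-side, which is worth mentioning in a support ticket.

```python
import urllib.error
import urllib.request


class _RecordRedirects(urllib.request.HTTPRedirectHandler):
    """Redirect handler that remembers every hop it follows."""

    def __init__(self):
        self.hops = []

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Record the status code and the absolute target of this hop.
        self.hops.append((code, newurl))
        return super().redirect_request(req, fp, code, msg, headers, newurl)


def trace_redirects(url, timeout=30):
    """Return (hops, final_status, final_url) for a plain HTTP GET of url."""
    recorder = _RecordRedirects()
    opener = urllib.request.build_opener(recorder)
    try:
        with opener.open(url, timeout=timeout) as resp:
            return recorder.hops, resp.status, resp.geturl()
    except urllib.error.HTTPError as err:
        # An error status (for example a 404) at the end of the chain.
        return recorder.hops, err.code, err.url


if __name__ == "__main__":
    seed = "https://codelibrary.amlegal.com/codes/annearundel/latest/"
    hops, status, final_url = trace_redirects(seed)
    for code, target in hops:
        print(code, "->", target)
    print("final:", status, final_url)
```

This only sees server-side redirects. It does not run JavaScript, so it will not reproduce anything a browser-based crawler does after the page loads.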

Skip Kendall August 16, 2023 13:16

Huh. 'Tis certainly odd. I'd submit a ticket to see if the Archive-It folks can see what's going haywire.

Brian Hamilton August 16, 2023 13:20

Will do. I'm having the same issue with sites from library.municode.com. Hopefully if I can sort one out, the other will fall into place as well.
