Repeating Directory

January 10, 2024 22:42

Hi,

I'm trying to capture the website www.royalroads.ca/

There is (I believe) a repeating directory at https://www.royalroads.ca/academic-regulations

I looked at my seed scope rules, which indicate that URLs matching the regular expression ^.*?(/.+?/).*?\1.*$|^.*?/(.+?/)\2.*$ will not be crawled, and this page says that expression is designed to avoid repeating directories. I don't want to crawl a repeating directory if it will return a lot of repetitive information, but I also need to capture those pages.

Any suggestions? I'm using the standard crawler and am wondering whether I should try Brozzler instead.

Thanks in advance for your help.

Kate

Comments

3 comments

Skip Kendall January 11, 2024 13:37

Kate,

I don't see any links on that page to repeating directories. What's an example of what you're seeing?

Skip

Kate Chandler January 11, 2024 15:29

Hi Skip,

Thank you for helping.

Here's an example of what I'm seeing: when I try to click the bubbled numbers below (representing extra pages of search results), the buttons don't do anything.

Skip Kendall January 11, 2024 19:55

I suspected that was what you were having problems with. That is actually not what Archive-It describes as a repeating directory. A repeating directory would be something like https://www.royalroads.ca/academic-regulations/academic-regulations/academic-regulations. Buttons for subsequent pages are frequently tricky: usually they work just fine, but sometimes they don't, for various reasons. I think you're right that Brozzler should be the next step. If that doesn't work, I'd put in a ticket. The folks on the help desk are very good at diagnosing these things, so they should be able to tell you what's going wrong.
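
For anyone who wants to check a URL against the expression quoted in the original post, here is a minimal sketch in Python. It assumes Python's regex engine treats the expression the same way the crawler's does, which may not hold in every edge case, and the function name is just an illustration, not anything from Archive-It.

```python
import re

# The scope-rule expression quoted in the original post, copied verbatim.
# Assumption: Python's `re` engine handles it the same way the crawler does.
REPEATING_DIRECTORY = re.compile(r"^.*?(/.+?/).*?\1.*$|^.*?/(.+?/)\2.*$")


def is_repeating_directory(url: str) -> bool:
    """Return True if the URL matches the repeating-directory expression.

    Hypothetical helper for local checking only.
    """
    return REPEATING_DIRECTORY.match(url) is not None


if __name__ == "__main__":
    urls = [
        # The page from the original post.
        "https://www.royalroads.ca/academic-regulations",
        # The repeating-directory example given in the reply above.
        "https://www.royalroads.ca/academic-regulations/academic-regulations/academic-regulations",
    ]
    for url in urls:
        print(is_repeating_directory(url), url)
```

Run as a script, this reports False for the academic-regulations page itself and True for the tripled path, which is consistent with the rule being aimed at repeated path segments and not at that page.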
