Repeating Directory
Hi,
I'm trying to capture the website www.royalroads.ca/
There is (I believe) a repeating directory at https://www.royalroads.ca/academic-regulations
I looked into my seed scope rules, and they indicate that URLs matching the regular expression ^.*?(/.+?/).*?\1.*$|^.*?/(.+?/)\2.*$ will not be crawled. This page says that the expression is designed to avoid repeating directories. I don't want to crawl a repeating directory if it will return a lot of repetitive information, but I also need to capture those pages.
Any suggestions? I'm using the standard crawler and am wondering whether I should try Brozzler instead.
Thanks in advance for your help.
Kate
-
I suspected that was what you were having problems with. That is actually not what Archive-It describes as a repeating directory. A repeating directory would be something like https://www.royalroads.ca/academic-regulations/academic-regulations/academic-regulations. Buttons for subsequent pages are frequently tricky. Usually they work just fine, but sometimes they don't, for various reasons. I think you're right that Brozzler should be the next step. If that doesn't work, I'd put in a ticket. The folks on the help desk are very good at diagnosing these things, so they should be able to tell you what's going wrong.
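If you want to check this yourself, here is a rough sketch that runs the expression from the question against both URLs mentioned in this thread. It uses Python's re module, which is an assumption on my part: the crawler's own regex engine may behave differently in edge cases, so treat the result as an approximation. The function name is_repeating_directory is just made up for illustration.

```python
import re

# Expression quoted in the question (assuming the trailing period there
# is sentence punctuation and not part of the pattern).
REPEATING_DIRECTORY = re.compile(r"^.*?(/.+?/).*?\1.*$|^.*?/(.+?/)\2.*$")


def is_repeating_directory(url: str) -> bool:
    """Return True if the whole URL matches the repeating-directory pattern."""
    return REPEATING_DIRECTORY.fullmatch(url) is not None


urls = [
    # The page from the question: no path segment repeats.
    "https://www.royalroads.ca/academic-regulations",
    # The example of a real repeating directory given in the reply.
    "https://www.royalroads.ca/academic-regulations"
    "/academic-regulations/academic-regulations",
]

for url in urls:
    status = "blocked by rule" if is_repeating_directory(url) else "not matched"
    print(f"{status}: {url}")
```

With Python's engine, the first URL is not matched and the second one is, which fits the point above: the expression only catches URLs where the same path segment shows up more than once, so it is probably not what is keeping the academic-regulations pages out of the crawl.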