Help setting up a crawl
Hello, I have been crawling my organization's pages for Covid-19 updates and have been able to capture most of them. However, there is one regularly updated page that I can't seem to capture.
The live page is here: https://campusadvisories.gwu.edu/university-updates
A recent archived page is here: https://wayback.archive-it.org/5184/20200426171825/https://campusadvisories.gwu.edu/university-updates
You can see the updates, but when you click on one of the articles in the archived page, it says the article is not in the archive. I have the seed type set to Standard+. Is there something I should be doing differently? Or perhaps there is a scope rule that I need to add?
Thanks in advance for any help!
-
Hi Brigette,
I recommend editing the seed URL so that there is no trailing slash ( / ) at the end of it. You can use the little pencil icon for that. Then, to make sure it doesn't collect more than you bargained for, change its seed type from Standard Plus to One Page Plus. Sorry, that's not the most intuitive strategy, but I think that an immediate redirect at the seed level might be keeping the crawler from scoping in the links to the individual update pages. A quick test on my side indicates that a new crawl with the above settings will archive those pages, but let us know if you still see otherwise.
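If it helps to check a seed outside the web application first, here is a rough sketch in Python (standard library only). It is not part of Archive-It. The helper names `strip_trailing_slash` and `final_url` are made up for illustration. The first drops the trailing slash from a URL, and the second reports where a URL ends up after any redirects, so you can see whether the seed you entered is redirected right away.

```python
from urllib.parse import urlsplit, urlunsplit
import urllib.request


def strip_trailing_slash(url: str) -> str:
    """Return url with trailing slashes removed from its path.

    A bare root path ("/") is left alone. Hypothetical helper for illustration.
    """
    parts = urlsplit(url)
    path = parts.path if parts.path == "/" else parts.path.rstrip("/")
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


def final_url(url: str, timeout: float = 10.0) -> str:
    """Fetch url, following redirects, and return the URL that finally answered.

    Hypothetical helper for illustration; it makes a live network request.
    """
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.geturl()


# Example usage (requires network access, so it is left commented out):
# seed = strip_trailing_slash("https://campusadvisories.gwu.edu/university-updates/")
# if final_url(seed) != seed:
#     print("The seed redirects immediately; consider using the final URL as the seed.")
```

If the two URLs differ, the seed is being redirected, which is the situation described above.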