URL redirect problem?
We would like to collect all of the PDF files located here: https://www.gov.nl.ca/ecc/publications/annual-reports/
However, when we run the seed, we get a "This page has not been archived here" page as a result. It has been more than 24 hours since we tried.
It turns out that the PDFs linked from this page have URLs under a different directory structure. For example, https://www.gov.nl.ca/ecc/files/ECCMAnnualReport2020-21.pdf
When we try to go to https://www.gov.nl.ca/ecc/files/, we get a 404 page, so we don't think https://www.gov.nl.ca/ecc/files/ will work as a seed either.
Can anyone explain what this problem is and how we might successfully harvest the PDFs?
Thanks,
Darren Furey
-
You need to rescope the crawl to include the gov.nl.ca/ecc/files path. Alternatively, you can patch crawl them in.
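To illustrate why the PDFs would fall outside the crawl, here is a rough sketch in Python. It assumes a simplified scoping model (same host, and a path at or below the seed's directory); this is an assumption for illustration, not the crawler's actual implementation, and the function name in_seed_scope is hypothetical.

```python
from urllib.parse import urlsplit

def in_seed_scope(seed: str, url: str) -> bool:
    """Simplified, assumed scope model: same host, path at or below the seed's directory."""
    s, u = urlsplit(seed), urlsplit(url)
    # Use the seed path itself if it ends in "/", otherwise its parent directory.
    seed_dir = s.path if s.path.endswith("/") else s.path.rsplit("/", 1)[0] + "/"
    return u.hostname == s.hostname and u.path.startswith(seed_dir)

SEED = "https://www.gov.nl.ca/ecc/publications/annual-reports/"
PDF = "https://www.gov.nl.ca/ecc/files/ECCMAnnualReport2020-21.pdf"

# The PDF shares the host but sits under /ecc/files/, not under the seed's directory.
print(in_seed_scope(SEED, PDF))
```

Under this model the PDF URL is rejected because its path does not start with /ecc/publications/annual-reports/, which is why the scope has to be widened.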
-
Thanks. So under Seed scope --> Add seed scope rule, we choose "Expand scope to include URL if" "it contains the text" and enter gov.nl.ca/ecc/files?
Darren
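As a sanity check on the rule described above, a "contains the text" rule can be modelled as a plain substring match. The sketch below is only an approximation of how such a rule might behave; the function name expand_scope_if_contains and the second sample URL are hypothetical.

```python
RULE_TEXT = "gov.nl.ca/ecc/files"

def expand_scope_if_contains(url: str, text: str = RULE_TEXT) -> bool:
    """Assumed behaviour: a URL is added to scope if it contains the rule text."""
    return text in url

candidates = [
    "https://www.gov.nl.ca/ecc/files/ECCMAnnualReport2020-21.pdf",
    "https://www.gov.nl.ca/ecc/contact/",  # hypothetical URL, for contrast only
]

# Keep only the URLs that the rule would pull into scope.
added = [u for u in candidates if expand_scope_if_contains(u)]
print(added)
```

Because the match is on text rather than on a page that must exist, it does not matter that https://www.gov.nl.ca/ecc/files/ itself returns a 404.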