I'm planning the harvest of a site which used to live at http://www.nzetc.org and now lives at http://nzetc.victoria.ac.nz/. A blanket redirect sends every URL on the old host to the same URL on the new host; only the host part of the URL changes.
There are hundreds of links to the old host in places like Wikipedia, which we really don't want to break.
Is there a way to encode this without needing the crawler to crawl each page twice, once at the old URL and once at the new?
If I do need to crawl twice, what's the easiest / lowest-impact method of doing that? My current plan is: (a) harvest the new site as normal; (b) make a list of all URLs harvested by the crawler, based on the logs; (c) search and replace the host name; (d) publish that list somewhere the crawler can see; and (e) add it as a seed.
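For what it's worth, steps (b) to (d) could be sketched roughly like this in Python. This is only a sketch under assumptions: I don't know what the crawler's logs actually look like, so it assumes the harvested URLs have already been pulled out into a plain list of strings, and the function names `to_old_host` and `build_seed_list` are made up for illustration. Any port number on the new host is dropped in the rewrite.

```python
from urllib.parse import urlsplit, urlunsplit

# Hosts taken from the question above.
NEW_HOST = "nzetc.victoria.ac.nz"
OLD_HOST = "www.nzetc.org"


def to_old_host(url, new_host=NEW_HOST, old_host=OLD_HOST):
    """Return url with new_host swapped for old_host.

    Returns None for URLs that are not on new_host, so off-site
    links found in the logs are not rewritten by accident.
    """
    parts = urlsplit(url)
    if parts.hostname != new_host:
        return None
    # Only the host changes; scheme, path, query and fragment are kept.
    return urlunsplit(parts._replace(netloc=old_host))


def build_seed_list(harvested_urls):
    """Map harvested URLs to old-host URLs, de-duplicated, order preserved."""
    seen = set()
    seeds = []
    for url in harvested_urls:
        old = to_old_host(url.strip())
        if old and old not in seen:
            seen.add(old)
            seeds.append(old)
    return seeds
```

The output of `build_seed_list` could then be written one URL per line to whatever page or file gets published for the crawler. Parsing the URL rather than doing a raw string replace avoids touching the host name where it happens to appear inside a path or query string.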