I am in charge of the European Union's web archive. Our pages often have to include external links, and many of the linked sites are later abandoned. This happens a lot. For example, this website lists European projects, each with a link to the project's own site. Here is what some of those links lead to now:
https://cordis.europa.eu/project/id/004525 -> web spam in Japanese
https://cordis.europa.eu/project/id/034990 -> domain for sale
https://cordis.europa.eu/project/id/507295 -> sports/gambling/phishing blog?
https://cordis.europa.eu/project/id/314548 -> porn...
I believe the solution is to fix this on the live website, by pointing the links to archived copies in the Internet Archive rather than to the live project sites. However, this is not always easy to implement: each website is managed differently, and linking to an archived copy means losing any content that is added to the project page later.
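To make the idea concrete, here is a minimal sketch of how an archived link could be built. It assumes the Wayback Machine URL pattern `https://web.archive.org/web/<timestamp>/<url>`, where the timestamp is `YYYYMMDDhhmmss` and the Wayback Machine serves the capture closest to that date. The function name, the example site, and the date are hypothetical.

```python
from datetime import datetime

WAYBACK_PREFIX = "https://web.archive.org/web"


def wayback_url(original_url: str, when: datetime) -> str:
    """Build a Wayback Machine link for the capture closest to `when`."""
    # The Wayback Machine expects a 14-digit timestamp: YYYYMMDDhhmmss.
    timestamp = when.strftime("%Y%m%d%H%M%S")
    return f"{WAYBACK_PREFIX}/{timestamp}/{original_url}"


# Hypothetical project site and project end date, for illustration only.
print(wayback_url("http://www.example.org/", datetime(2010, 12, 31)))
```

Choosing a date close to the end of the project would be one way to get a copy made while the site was still maintained, although this does not guarantee that a capture from that period exists.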
I wonder whether there is a more sophisticated way to detect the state of a live website so that certain content can be excluded. For the examples above, I am thinking of blocking the capture of pages whose source code does not contain the strings "funded by" and "European Union", but this is only a first idea.
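A minimal sketch of that first idea, assuming the page source has already been fetched as a string. The function name and the phrase list are hypothetical, and the match is a plain case-insensitive substring test, so it would miss pages that word the funding statement differently or in another language.

```python
import re

# Phrases that must all appear for a page to be kept (assumed list).
REQUIRED_PHRASES = ("funded by", "european union")


def looks_like_project_page(html: str) -> bool:
    """Return True if every required phrase occurs in the page source."""
    # Collapse whitespace so a phrase split across lines still matches,
    # and lowercase everything for a case-insensitive comparison.
    normalized = re.sub(r"\s+", " ", html).lower()
    return all(phrase in normalized for phrase in REQUIRED_PHRASES)


# A page that still carries the funding statement is kept.
print(looks_like_project_page("<p>Funded by the\n European Union</p>"))
# A parked domain without the statement is flagged for exclusion.
print(looks_like_project_page("<h1>This domain is for sale</h1>"))
```

A crawler could call such a check before capturing a page and skip, or flag for manual review, any page where it returns False.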
Has anyone else encountered a similar problem?