Error when crawling non-escaped & signs
We’re currently crawling and archiving our legacy site, which is generated from a legacy database and includes URLs with non-escaped ampersands, some of which read like the HTML entity for the section sign (§).
When crawling the site, the Archive-It crawler misinterprets the non-escaped & sign and converts &sect (without the ;) into §, the section sign, and thus links to non-working URLs like https://wayback.archive-it.org/30334/20250708185241/https://germanhistorydocs.ghi-dc.org/sub_doclist.cfm?sub_id=334%C2%A7ion_id=7
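The conversion described above can be reproduced with a minimal Python sketch. It assumes the original link was sub_doclist.cfm?sub_id=334&section_id=7 (inferred from the broken URL), and it uses Python's standard html.unescape only as a stand-in for whatever parser the crawler uses, which this thread does not identify.

```python
import html
from urllib.parse import quote

# Query string as the legacy site presumably emits it (unescaped "&").
raw_query = "sub_id=334&section_id=7"

# html.unescape matches the legacy "&sect" prefix even without a semicolon,
# so "&section_id" turns into "§ion_id".
decoded_query = html.unescape(raw_query)

# Percent-encoding the result as UTF-8 gives the query of the broken link.
broken_query = quote(decoded_query, safe="=&")

# Writing the ampersand as "&amp;" in the HTML source avoids the ambiguity.
escaped_query = html.escape(raw_query)
round_trip = html.unescape(escaped_query)

print(decoded_query)  # sub_id=334§ion_id=7
print(broken_query)   # sub_id=334%C2%A7ion_id=7
print(round_trip)     # sub_id=334&section_id=7
```

The last two lines show why escaping the ampersand in the generated HTML would avoid the problem at the source, if that ever becomes possible.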
Since we don't have an easy way to fix the HTML code that is currently generated, we wanted to ask whether you have encountered this problem and have any ideas on how to fix the issue so that users can navigate to the linked content directly from the archived page.
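One conceivable workaround on the link side, sketched in Python purely as an illustration: rewrite the mis-decoded sequence back to "&sect" wherever it is directly followed by "ion_id=". The helper name repair_url is hypothetical, and this is not a description of anything Archive-It does.

```python
import re

# Percent-encoded UTF-8 form of "§" (U+00A7), matched only when it is
# directly followed by "ion_id=", i.e. where "&section_id=" was mangled.
_MISDECODED_SECT = re.compile(r"%C2%A7(?=ion_id=)", re.IGNORECASE)


def repair_url(url: str) -> str:
    """Restore "&sect" in a mangled query string (hypothetical helper)."""
    return _MISDECODED_SECT.sub("&sect", url)


broken = (
    "https://wayback.archive-it.org/30334/20250708185241/"
    "https://germanhistorydocs.ghi-dc.org/sub_doclist.cfm?sub_id=334%C2%A7ion_id=7"
)
print(repair_url(broken))
```

The lookahead keeps the rewrite narrow, so a genuine section sign elsewhere in a URL would be left alone.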
Thank you!
Daniel Burckhardt and Katharina Hering
German Historical Institute Washington, DC
-
Hello Katharina and Daniel. Thanks for the detailed report - that's a weird one, but I think we've fixed it! At your next convenience, can you please replay the archived pages again and let me know if they still misinterpret the encoded characters? From the page here for instance, can you now navigate successfully among the "Part ___" links that were malformed prior?
You might need to perform a "hard refresh" (hold the Shift key while pressing your web browser's refresh button) to see the fix for the first time if you've had the page open recently. But please let me know if I may take a second look at the issue here or anywhere else that you see it on your sites.
-
Hi Karl:
Thank you so much for looking into this and for your helpful response. Many apologies for not responding sooner.
This looks fantastic -- thank you! The only remaining problem seems to be in chapter 3: From Vormärz to Prussian Dominance, 1815-1866 [https://wayback.archive-it.org/30334/20250708174528/https://germanhistorydocs.ghi-dc.org/section.cfm?section_id=9] and in chapter 9: Two Germanies (1961-1989) [https://wayback.archive-it.org/30334/20250708175238/https://germanhistorydocs.ghi-dc.org/section.cfm?section_id=15]. Here, the problem with the non-escaped & signs in the table of contents for the documents and images in sub-chapters persists.
For example:
We can access the overview of all chapter 9 sub-chapters in the Wayback machine:
But once we click on the sub-chapters with the documents and images, like “Shadow of the Wall,” the links don’t resolve: https://wayback.archive-it.org/30334/20250708183020/https://germanhistorydocs.ghi-dc.org/sub_doclist.cfm?sub_id=29%EF%BF%BDion_id=15
The same thing happens with the images in that chapter:
I tried the hard refresh, but the problem persisted in chapters 3 and 9. All other chapters work like a charm now!
We’d be grateful for any advice.
Many thanks to you and your amazing IA team!
Katja, Daniel and Insa (German Historical Institute)
-
You're right! The encoding is a little different on the pages of these two chapters, but I've applied a similar fix to accommodate them as well. These should be navigable now:
- https://wayback.archive-it.org/30334/20250708174528/https://germanhistorydocs.ghi-dc.org/section.cfm?section_id=9
- https://wayback.archive-it.org/30334/20250708175238/https://germanhistorydocs.ghi-dc.org/section.cfm?section_id=15
Thanks for catching this one. My apologies for the delay.
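For reference, the %EF%BF%BD in the chapter 3 and 9 links is the percent-encoded UTF-8 form of U+FFFD, the Unicode replacement character. The Python sketch below shows one way such a sequence can arise. It assumes, without confirmation from the pages themselves, that the section sign passed through a single-byte encoding such as Latin-1 and was then decoded as UTF-8.

```python
from urllib.parse import quote

section_sign = "\u00a7"  # "§"

# As UTF-8 the sign is two bytes, which percent-encode to %C2%A7.
utf8_form = quote(section_sign)

# As Latin-1 it is the single byte 0xA7, which is not valid UTF-8 by itself.
latin1_bytes = section_sign.encode("latin-1")

# Decoding that byte as UTF-8 with replacement yields U+FFFD.
replaced = latin1_bytes.decode("utf-8", errors="replace")
replaced_form = quote(replaced)

print(utf8_form)      # %C2%A7
print(replaced_form)  # %EF%BF%BD
```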
-
Hi Karl, hi Archive-It Team:
Thanks again for fixing the "Error when crawling non-escaped & signs" problem in our archived site, German History in Documents and Images (see thread from last year).
There is one remaining problem: Some of the internal links don’t resolve in the archived site – it might be the same issue with the non-escaped & signs.
For example, if you go to the page: Ernst Moritz Arndt, the German Fatherland: https://wayback.archive-it.org/30334/20250709063653/https://germanhistorydocs.ghi-dc.org/sub_document.cfm?document_id=237
You’ll find one internal link to the portrait of Ernst Moritz Arndt in the abstract, which resolves well: https://wayback.archive-it.org/30334/20250710134603/https://germanhistorydocs.ghi-dc.org/sub_image.cfm?image_id=572&language=english
However, there is another link to the Frankfurt National Assembly, which doesn’t resolve and shows the error message: “This page has not been archived here”:
However, when accessing the image directly through the Wayback machine, it’s been archived: https://wayback.archive-it.org/30334/20250710135425/https://germanhistorydocs.ghi-dc.org/sub_image.cfm?image_id=2223
Could you kindly look into this?
Thanks so much!
Katharina and Daniel (GH)
-
Hi, Katharina and Daniel. Please check your inbox for a response from me. The page you referenced was not collected, which is why it resolves to a 'not archived here' page. Your crawl finished due to a time limit and the page referenced was in the queue. In the future, you can check your Host report's queued documents and then resume the crawl within 7 days if you want to collect the queued documents.