Error when crawling non-escaped & signs

July 16, 2025 16:30

We’re currently crawling and archiving our legacy site, which is generated from a legacy database. Its URLs contain unescaped ampersands followed by text that begins like the HTML entity for the section sign (&sect;), as in &section_id.

When crawling the site, the Archive-It crawler misinterprets the unescaped & sign and converts &sect (without the ;) into the § sign. The archived pages therefore link to non-working URLs like https://wayback.archive-it.org/30334/20250708185241/https://germanhistorydocs.ghi-dc.org/sub_doclist.cfm?sub_id=334%C2%A7ion_id=7

(instead of https://wayback.archive-it.org/30334/20250708185253/https://germanhistorydocs.ghi-dc.org/sub_doclist.cfm?sub_id=334&section_id=7)
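The conversion can be reproduced outside the crawler. The following is a minimal sketch, assuming Python's standard html.unescape behaves like the lenient entity decoding described above; it is not Archive-It's actual code, and the variable names are only illustrative.

```python
import html
from urllib.parse import quote

# Relative link as it appears in the generated HTML, with an unescaped "&".
href = "sub_doclist.cfm?sub_id=334&section_id=7"

# html.unescape accepts some legacy entity names without the trailing ";",
# so "&sect" followed by "ion_id" becomes "§" followed by "ion_id".
decoded = html.unescape(href)

# Percent-encode the result as UTF-8, leaving the URL delimiters untouched.
broken = quote(decoded, safe="/?=&")

# If the ampersand is written as "&amp;" in the HTML, decoding restores
# the intended query string instead.
fixed = html.unescape("sub_doclist.cfm?sub_id=334&amp;section_id=7")

print(broken)
print(fixed)
```

Here the § character is encoded as the two UTF-8 bytes C2 A7, which matches the %C2%A7 in the non-working URL.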

Since we don't have an easy way to fix the HTML code that is currently generated, we wanted to ask whether you have encountered this problem before and have any ideas on how to fix it, so that users can navigate to the linked content directly from the archived page.

Thank you!

Daniel Burckhardt and Katharina Hering
German Historical Institute Washington, DC

Comments

5 comments

Karl Blumenthal August 04, 2025 13:35

Hello Katharina and Daniel. Thanks for the detailed report. That's a weird one, but I think we've fixed it! At your earliest convenience, can you please replay the archived pages again and let me know if they still misinterpret the encoded characters? From the page here, for instance, can you now navigate successfully among the "Part ___" links that were previously malformed?

https://wayback.archive-it.org/30334/20250708180708/https://germanhistorydocs.ghi-dc.org/sub_docs.cfm?section_id=7

You might need to perform a "hard refresh" (hold the Shift key while pressing your web browser's refresh button) to see the fix for the first time if you've had the page open recently. But please let me know if I may take a second look at the issue here or anywhere else that you see it on your sites.

Katharina Hering September 11, 2025 16:47

Hi Karl:

Thank you so much for looking into this and for your helpful response. Many apologies for not responding sooner.

This looks fantastic -- thank you! The only remaining problem seems to be in chapter 3: From Vormärz to Prussian Dominance, 1815-1866 [https://wayback.archive-it.org/30334/20250708174528/https://germanhistorydocs.ghi-dc.org/section.cfm?section_id=9] and in chapter 9: Two Germanies (1961-1989) [https://wayback.archive-it.org/30334/20250708175238/https://germanhistorydocs.ghi-dc.org/section.cfm?section_id=15]. Here, the problem with the non-escaped & signs in the table of contents for the documents and images in sub-chapters persists.

For example:

We can access the overview of all chapter 9 sub-chapters in the Wayback machine:

https://wayback.archive-it.org/30334/20250708183020/https://germanhistorydocs.ghi-dc.org/sub_docs.cfm?section_id=15

But once we click on the sub-chapters with the documents and images, like “Shadow of the Wall,” the links don’t resolve: https://wayback.archive-it.org/30334/20250708183020/https://germanhistorydocs.ghi-dc.org/sub_doclist.cfm?sub_id=29%EF%BF%BDion_id=15

The same thing happens with the images in that chapter:

https://wayback.archive-it.org/30334/20250708183031/https://germanhistorydocs.ghi-dc.org/sub_imglist.cfm?sub_id=106%EF%BF%BDion_id=15
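The %EF%BF%BD in these two URLs is the UTF-8 encoding of U+FFFD, the Unicode replacement character. One way such a sequence can arise is sketched below, assuming (the thread does not confirm this) that a § produced by the same &sect misreading was stored as a single Latin-1 byte and later decoded as UTF-8.

```python
from urllib.parse import quote

# Assumption for illustration: the "§" ended up as the single Latin-1 byte 0xA7.
latin1_bytes = "§ion_id=15".encode("latin-1")

# 0xA7 is not a valid UTF-8 start byte, so a tolerant decoder substitutes
# U+FFFD (the replacement character) for it.
as_utf8 = latin1_bytes.decode("utf-8", errors="replace")

# Percent-encoding U+FFFD as UTF-8 yields the three bytes EF BF BD.
encoded = quote(as_utf8, safe="=")

print(encoded)
```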

I tried the hard refresh, but the problem persisted in chapters 3 and 9.

All other chapters work like a charm now!

We’d be grateful for any advice.

Many thanks to you and your amazing IA team!

Katja, Daniel and Insa

(German Historical Institute)

Karl Blumenthal October 13, 2025 13:38
You're right! The encoding is a little different on the pages of these two chapters, but I've applied a similar fix to accommodate them as well. These should be navigable now:
1. https://wayback.archive-it.org/30334/20250708174528/https://germanhistorydocs.ghi-dc.org/section.cfm?section_id=9
2. https://wayback.archive-it.org/30334/20250708175238/https://germanhistorydocs.ghi-dc.org/section.cfm?section_id=15
Thanks for catching this one. My apologies for the delay.

Katharina Hering March 06, 2026 18:54
Hi Karl, hi Archive-It Team:
Thanks again for fixing the "Error when crawling non-escaped & signs" issue in our archived site, German History in Documents and Images (see the thread from last year):
https://wayback.archive-it.org/30334/20250708174528/https://germanhistorydocs.ghi-dc.org/section.cfm?section_id=9

There is one remaining problem: some of the internal links don’t resolve in the archived site. It might be the same issue with the non-escaped & signs.

For example, if you go to the page: Ernst Moritz Arndt, the German Fatherland: https://wayback.archive-it.org/30334/20250709063653/https://germanhistorydocs.ghi-dc.org/sub_document.cfm?document_id=237

You’ll find one internal link to the portrait of Ernst Moritz Arndt in the abstract, which resolves well: https://wayback.archive-it.org/30334/20250710134603/https://germanhistorydocs.ghi-dc.org/sub_image.cfm?image_id=572&language=english

However, there is another link to the Frankfurt National Assembly, which doesn’t resolve and shows the error message: “This page has not been archived here”:

https://wayback.archive-it.org/30334/20250709063653/http://www.germanhistorydocs.ghi-dc.org/sub_image.cfm?image_id=2223&language=english

However, the image itself has been archived and can be accessed directly through the Wayback Machine: https://wayback.archive-it.org/30334/20250710135425/https://germanhistorydocs.ghi-dc.org/sub_image.cfm?image_id=2223
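The two original URLs embedded in these Wayback links are not identical. A small sketch using only the standard library lists where they differ; it only compares the strings and makes no claim about how the Wayback Machine matches URLs.

```python
from urllib.parse import urlsplit, parse_qs

# Original-site URLs copied from the two Wayback links above.
linked = "http://www.germanhistorydocs.ghi-dc.org/sub_image.cfm?image_id=2223&language=english"
archived = "https://germanhistorydocs.ghi-dc.org/sub_image.cfm?image_id=2223"

def parts(url):
    """Split a URL into the components that are compared below."""
    s = urlsplit(url)
    return {"scheme": s.scheme, "host": s.netloc, "path": s.path, "query": parse_qs(s.query)}

a, b = parts(linked), parts(archived)

# Keep only the components that differ between the two URLs.
diff = {key: (a[key], b[key]) for key in a if a[key] != b[key]}

for key, (left, right) in diff.items():
    print(f"{key}: {left!r} vs {right!r}")
```

The scheme, the host (www. prefix), and the query string (the extra language parameter) differ; the path is the same.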

Could you kindly look into this?

Thanks so much!

Katharina and Daniel (GH)

Linda at Archive-It April 03, 2026 01:31 (Edited April 03, 2026 01:32)

Hi, Katharina and Daniel. Please check your inbox for a response from me. The page you referenced was not collected, which is why it resolves to a 'not archived here' page. Your crawl ended because it reached its time limit while the referenced page was still in the queue. In the future, you can check the queued documents in your Host report and then resume the crawl within 7 days if you want to collect them.
