Overview
While we continuously investigate and implement improvements, some websites are not created in a way that is "archive-friendly" and can be difficult to collect or replay in their entirety. These difficulties affect all web crawlers, not just Archive-It's. When selecting seed URLs and reviewing your archived content, please keep these limitations in mind. For more information on what makes sites archive-friendly, see the Library of Congress's Creating Preservable Websites.
About
The Standard crawler (Heritrix) removes any characters that follow the # symbol (also known as a pound sign, number sign, or hashtag) in a URL, making these URLs difficult to crawl and capture successfully. This matters because these URLs tend to be created dynamically in JavaScript.
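To illustrate the effect described above, here is a minimal sketch using Python's standard library. It is not Heritrix's own code; it only shows the general consequence of dropping everything after the # symbol. The example.org URLs and the strip_fragment helper are hypothetical.

```python
from urllib.parse import urldefrag


def strip_fragment(url: str) -> str:
    """Return the URL with the '#' and everything after it removed."""
    return urldefrag(url).url


# Two hypothetical live-web URLs that differ only after the '#'.
live_urls = [
    "https://example.org/gallery#/photo/1",
    "https://example.org/gallery#/photo/2",
]

# Once the part after '#' is dropped, both collapse to the same address,
# so a crawler working this way sees one page rather than two.
crawled = {strip_fragment(u) for u in live_urls}
print(crawled)  # {'https://example.org/gallery'}
```

In this sketch the two distinct pages become indistinguishable to the crawler, which is why content reachable only through the part after the # can be missed.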
Troubleshooting
Brozzler is often better at crawling these types of URLs. If the seed URL contains a #, or the site you're crawling links to pages with #s in their URLs, please try using Brozzler.
Additionally, try adding each URL with a # symbol as a seed URL to your collection and crawling it alongside your main seed URL.
If Brozzler is still unable to archive the URLs, we recommend that you look for an alternative URL to crawl.
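If you have a list of links from your site and want to find the ones worth adding as separate seeds, a short script can pick out the URLs that contain a # symbol. This is only a sketch: the fragment_urls function and the example.org links are hypothetical, and it is not part of any Archive-It tool.

```python
from urllib.parse import urlsplit


def fragment_urls(links):
    """Return the links that have text after a '#', in order, without duplicates."""
    seen = set()
    result = []
    for link in links:
        # urlsplit(...).fragment is the text after '#'; it is empty when
        # the URL has no '#' or nothing follows it.
        if urlsplit(link).fragment and link not in seen:
            seen.add(link)
            result.append(link)
    return result


# Hypothetical links gathered from a site you plan to crawl.
links = [
    "https://example.org/about",
    "https://example.org/#!/news",
    "https://example.org/gallery#/photo/1",
    "https://example.org/#!/news",
]

for url in fragment_urls(links):
    print(url)
```

Each URL printed is a candidate to add as its own seed alongside your main seed URL.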
Outcome
Even where capture of these types of URLs is successful, there is currently no way for us to replay most of them as they function on the live web. For best results, we recommend right-clicking on each URL with a # symbol in Wayback and selecting the "Open in New Tab" option.