Overview
While we continuously investigate and implement improvements, some websites are not created in a way that is "archive-friendly" and can be difficult to collect or replay in their entirety. These difficulties affect all web crawlers, not just Archive-It's. When selecting seed URLs and reviewing your archived content, please keep these limitations in mind. For more information on what makes sites archive-friendly, see the Library of Congress's Creating Preservable Websites.
About
Elements that require a user’s input, like a form or search box, will generally not work in Wayback. However, in most cases the Archive-It crawlers are still able to access that content. Adding a seed that points the crawler directly to the content can help capture it more effectively and give users a direct access point to it.
Troubleshooting
There are two workarounds for collecting database-driven content:
- If there are links to the raw content, our crawlers will be able to follow them as long as they are in scope for your crawl. Alternatively, if each result page has a unique URL, you can try adding each URL as a One Page helper seed to your collection and crawling it together with your main seed.
- If there is an XML sitemap or index to your seed site, we will be able to crawl all linked content. You can add the URL as a seed to your collection and crawl it together with your main seed.
We also recommend running a test crawl using Brozzler as it tends to work better on dynamic, database-driven content.
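If you want to check which pages an XML sitemap would expose before adding it as a seed, the sketch below lists the URLs it contains. It is only an illustration and not part of any Archive-It tooling: the function name `extract_sitemap_urls`, the sample sitemap, and the `example.org` URLs are all hypothetical, and it assumes the sitemap uses the standard sitemaps.org 0.9 namespace.

```python
import xml.etree.ElementTree as ET

# Namespace used by standard XML sitemaps (sitemaps.org protocol 0.9).
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def extract_sitemap_urls(sitemap_xml: str) -> list[str]:
    """Return every <loc> URL found in a sitemap or sitemap index document."""
    root = ET.fromstring(sitemap_xml)
    urls = []
    # <loc> appears under <url> in a sitemap and under <sitemap> in an index.
    for loc in root.iter(f"{SITEMAP_NS}loc"):
        if loc.text and loc.text.strip():
            urls.append(loc.text.strip())
    return urls


# Hypothetical sitemap for a database-driven site; the URLs are placeholders.
EXAMPLE_SITEMAP = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.org/records/1</loc></url>
  <url><loc>https://example.org/records/2</loc></url>
</urlset>
"""

if __name__ == "__main__":
    for url in extract_sitemap_urls(EXAMPLE_SITEMAP):
        print(url)
```

Each URL in the output is a page the crawler could reach through the sitemap. If a page you expect is missing from the list, it may need to be added as its own seed.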
Outcome
Without a sitemap or links to raw content, our crawlers may be unable to collect this content. Even when form- or database-driven content has been collected successfully, features that require a user to interact with a site (e.g., entering a search term or filling out a form) to access content will generally not work in Wayback. To provide Wayback access to individual pages that have been collected, consider adding them as seed URLs to your collection and organizing them together with the main seed URL on your public landing page using our Groups feature.