While we continuously investigate and implement capture improvements, some websites are not created in a way that is "archive-friendly" and can be difficult to capture or replay in their entirety. These difficulties affect all web crawlers, not just ours. When selecting seed URLs and reviewing your archived content, please keep these limitations in mind:
- Dynamic content
- Streaming & Downloadable Media
- Password protected sites
- Form and Database-driven Content
- URLs that contain a #
- POST Requests
- Images or text size that adjust dynamically to browser size
- Maps that zoom in and out
- Downloadable files
- Media that requires clicking a “play” button
- Navigation menus
Streaming & Downloadable Media
Special scoping rules are needed to facilitate archiving common streaming video services like YouTube and Vimeo, but others may require custom solutions or further technical development. If you plan to archive sites that include a large volume of downloadable media, we recommend checking the sites in Wayback to make sure the media was captured to your satisfaction. Reviewing the File Types report is the most effective way to make sure media files were archived.
Password protected sites
By default, all Archive-It crawling technology crawls the public web and not information protected behind logins/passwords. Archive-It partners can, however, crawl password protected content by entering their credentials in the Archive-It Web Application. This feature does not yet apply to two-step authentication systems or to sites with specialty certificates.
Form and Database-driven Content
Elements that require a user’s input, like a form or search box, will generally not work in Wayback. However, the Archive-It crawlers are usually still able to access the content behind them. Adding a seed that points the crawler directly to that content can help capture it more effectively and gives users a direct access point to it.
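As a rough illustration of pointing a seed directly at form-driven content, the sketch below builds the URL that a search form would request for a given set of field values. It assumes the form submits with GET (so the values appear in the URL); the function name `form_result_url`, the address `https://example.org/search`, and the field names are hypothetical and not part of Archive-It.

```python
from urllib.parse import urlencode


def form_result_url(action_url: str, fields: dict) -> str:
    """Build the URL a GET form submission would request for the given field values."""
    # Append to an existing query string if the action URL already has one.
    separator = "&" if "?" in action_url else "?"
    return action_url + separator + urlencode(fields)


# Hypothetical form action and field values, for illustration only.
seed = form_result_url("https://example.org/search", {"q": "annual report", "page": 1})
print(seed)
```

The resulting URL could then be added as an additional seed. Forms that submit with POST do not expose their values in the URL, so this approach would not apply to them.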
A webmaster can use a robots.txt exclusion to prevent certain content from being crawled. The Archive-It crawlers respect all robots.txt exclusions by default. To see if an entire site you wish to crawl is being blocked, check your seed site for a robots.txt exclusion file before you crawl, or check your seed status report after your crawl is complete. To check whether part of your website or embedded content is blocked, please check your hosts report. If you wish to crawl a site blocked by robots.txt, we encourage you to contact the webmaster of the blocked website and ask them to allow the Archive-It crawler in. The name (user-agent) of our crawler is archive.org_bot. There is also an Archive-It feature that allows users to override robots.txt blocks, which can be enabled upon request.
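One way to check a seed before crawling is to test its robots.txt rules against the crawler's user-agent. The sketch below uses Python's standard `urllib.robotparser` for this. The robots.txt text and the `example.org` URLs are hypothetical, and the helper name `is_allowed` is ours. Python's rule matching is only an approximation of how any particular crawler interprets the file, so treat the result as a hint rather than a guarantee.

```python
from urllib.robotparser import RobotFileParser

# User-agent name of the crawler, as given in the text above.
CRAWLER_USER_AGENT = "archive.org_bot"

# Hypothetical robots.txt content, for illustration only.
EXAMPLE_ROBOTS_TXT = """\
User-agent: archive.org_bot
Disallow: /private/

User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, url: str, user_agent: str = CRAWLER_USER_AGENT) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "https://example.org/index.html",
        "https://example.org/private/report.html",
    ):
        print(url, is_allowed(EXAMPLE_ROBOTS_TXT, url))
```

The example parses inline text so that it runs offline. To check a live site, you would instead download the file at the site's `/robots.txt` path and pass its contents to `is_allowed`.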
URLs that contain a #
The Standard crawler (Heritrix) removes any characters that follow # in a URL, making these URLs difficult to crawl and capture successfully. Brozzler is often better at crawling these types of URLs. If your seed URL contains a #, or the site you're crawling links to pages with #s in their URLs, please try using Brozzler.
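The effect of dropping everything after the # can be seen with a small sketch. It uses Python's standard `urllib.parse.urldefrag` to mimic the behavior described above. It is not Heritrix's actual code, and the `example.org` URLs are hypothetical.

```python
from urllib.parse import urldefrag


def strip_fragment(url: str) -> str:
    """Return url without the '#' and anything that follows it."""
    return urldefrag(url).url


# Hypothetical URLs for pages that differ only after the '#'.
urls = [
    "https://example.org/app#/gallery",
    "https://example.org/app#/contact",
    "https://example.org/app",
]

# After stripping, all three URLs collapse into the same address.
distinct_after_stripping = {strip_fragment(u) for u in urls}
print(distinct_after_stripping)
```

Because the three addresses become identical once the fragment is removed, a crawler that strips fragments sees one page where a browser user would see three.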
POST Requests
POST is an HTTP request method that is difficult for Archive-It’s Standard crawling technology (Heritrix) to capture and difficult for Wayback to replay. Our beta crawling technology, Brozzler, can sometimes capture the functionality of pages that use POST requests; however, because of Wayback limitations, those pages generally cannot be replayed. If you would like to use Brozzler on a site that uses POST requests, please get in touch with us by submitting a support ticket.
*For more information on what makes sites archive-friendly, see the in-depth guide available from Stanford University Libraries.