While we continuously investigate and implement capture improvements, some websites are not created in a way that is "archive-friendly" and can be difficult to capture or replay in their entirety. These difficulties affect all web crawlers, not just ours. When selecting seed URLs and reviewing your archived content, please keep these limitations in mind:
- Dynamic content
- Flash
- Streaming & Downloadable Media
- Password-protected sites
- Form and Database-driven Content
- Robots.txt
- URLs that contain a #
- POST Requests
Dynamic content
While many sites with dynamic content can be archived without issue, some types of dynamic content are difficult to capture or replay, particularly anything that depends heavily on human interaction (for example, content that only loads after a click) or on JavaScript (for example, a drop-down menu that appears when you mouse over a word). Each situation is different and can sometimes need special attention, but general troubleshooting recommendations are often a helpful starting point. Here are some examples of dynamic content that can be challenging to archive:
- Images or text that resize dynamically with the browser window
- Maps that zoom in and out
- Downloadable files
- Media that requires clicking a “play” button
- Navigation menus
- JavaScript-based pagination
Flash
Flash is a type of dynamic content: websites use the Adobe Flash platform for animations, graphics, and videos. With Adobe's statement that it would not support Flash beyond December 2020, we are investigating potential solutions for archival replay but do not yet have a timeline for the complex and specific engineering work that will be needed. Performing Quality Assurance now can help capture the necessary files to aid future replay.
Streaming & Downloadable Media
Special scoping rules are needed to archive common streaming video services like YouTube and Vimeo, and other services may require custom solutions or further technical development. If you plan to archive sites that include a large volume of downloadable media, we recommend checking the captures in Wayback to confirm the media was archived to your satisfaction; reviewing the File Types report is the most effective way to verify that media files were captured.
Password-protected Sites
By default, all Archive-It crawling technology crawls the public web and not information protected behind logins or passwords. Archive-It partners can, however, crawl password-protected content by entering their credentials in the Archive-It Web Application. This feature does not yet apply to two-step authentication systems or sites with specialty certificates.
Form and Database-driven Content
Elements that require a user's input, like a form or search box, will generally not work in Wayback. However, the Archive-It crawlers are usually still able to access that content. Adding an additional seed that points the crawler directly to the content can help capture it more effectively and gives users a direct access point to it.
Robots.txt Exclusions
A webmaster can use a robots.txt exclusion to prevent certain content from being crawled, and the Archive-It crawlers respect all robots.txt exclusions by default. To see whether an entire site you wish to crawl is being blocked, check the seed site for a robots.txt file before you crawl, or check your seed status report after your crawl is complete. To check whether part of your website or embedded content is blocked, check your hosts report. If you wish to crawl a site blocked by robots.txt, we encourage you to contact the webmaster of the blocked website to allow the Archive-It crawler in; the name (user-agent) of our crawler is archive.org_bot. There is also an Archive-It feature that allows users to override robots.txt blocks, which can be enabled upon request.
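For example, a webmaster who wanted to let the Archive-It crawler in while still restricting other robots could add directives like the following to the site's robots.txt file (the /private/ path here is only illustrative):

    User-agent: archive.org_bot
    Disallow:

    User-agent: *
    Disallow: /private/

An empty Disallow line means the named user-agent may crawl the entire site, while the second group still blocks other crawlers from the listed path.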
URLs that contain a #
The Standard crawler (Heritrix) removes any characters that follow # in a URL, making these URLs difficult to crawl and capture successfully. Brozzler is often better at crawling these types of URLs. If a seed URL contains a #, or a site you're crawling links to pages with #s in their URLs, please try using Brozzler.
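To illustrate why this matters, the sketch below (using a hypothetical URL) shows that everything after the # is a fragment; browsers resolve it on the client side, so a crawler that strips it only ever requests the base URL and never sees the page state behind the fragment.

    from urllib.parse import urldefrag

    # Hypothetical URL whose page state is encoded in the fragment.
    url = "https://www.example.org/catalog#/items?page=2"

    # Everything after the first "#" is the fragment. It is handled by the
    # browser rather than sent to the server, so a crawler that drops it
    # will only ever request the base URL below.
    base, fragment = urldefrag(url)
    print(base)      # https://www.example.org/catalog
    print(fragment)  # /items?page=2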
POST Requests
POST is an HTTP request method that is difficult for Archive-It's Standard crawling technology (Heritrix) to capture and difficult for Wayback to replay. Our beta crawling technology, Brozzler, can sometimes capture the functionality of pages that employ POST requests; however, because of Wayback limitations, they generally cannot be replayed. If you would like to use Brozzler on a site that uses POST requests, please get in touch with us by submitting a support ticket.
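As a rough sketch of the difference (the /search endpoint and parameters below are hypothetical), the same query can travel as a GET request, where the parameters are part of the URL and the response has its own address, or as a POST request, where the parameters are in the request body and the URL alone does not identify the response, which is what makes replay difficult:

    import urllib.parse
    import urllib.request

    # Hypothetical search endpoint, used only to illustrate the difference.
    params = urllib.parse.urlencode({"q": "annual report"})

    # GET: the parameters are part of the URL, so the response can be
    # archived and replayed at a distinct address.
    get_request = urllib.request.Request(
        "https://www.example.org/search?" + params
    )

    # POST: the same parameters are sent in the request body; the URL alone
    # ("/search") does not identify the response, which is why Wayback
    # generally cannot replay pages captured this way.
    post_request = urllib.request.Request(
        "https://www.example.org/search",
        data=params.encode("utf-8"),
        method="POST",
    )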
*For more information on what makes sites archive-friendly, an in-depth guide is available from Stanford University Libraries.