While we continuously investigate and implement capture improvements, some websites are not created in a way that is "archive-friendly" and can be difficult to capture or replay in their entirety. These difficulties affect all web crawlers, not just ours. When selecting seed URLs and reviewing your archived content, please keep these limitations in mind:
1. Dynamic Content
Dynamic elements like the following can be difficult to capture or replay:
- Images or text that resize dynamically with the browser window
- Maps that zoom in and out
- Downloadable files
- Media that requires clicking a “play” button
- Navigation menus
2. Streaming & Downloadable Media
Streaming and downloadable media can be captured, but they can be difficult to play back and to archive reliably, especially in large volumes. We have special scoping rules to facilitate archiving streaming video services like YouTube and Vimeo, but other services may require custom solutions or further technical development. If you plan to archive sites that include a large volume of downloadable media, we suggest checking the sites immediately after they have been crawled to make sure the media was captured to your satisfaction. Reviewing the File Types report and viewing your archived site in proxy mode are the most effective ways to confirm that the media was archived.
3. Password-protected Sites
By default, our technology crawls the public web and does not capture information protected behind logins or passwords. Archive-It partners can, however, crawl password-protected content by entering their credentials in the web application. This does not yet apply to two-step authentication systems or to sites with specialty certificates.
4. Form and Database-driven Content
If you need to interact with a site (for example, enter a search term or fill out a form) to get to its content, Archive-It can have difficulty crawling it, particularly if the site uses POST requests to serve up the data. If the site does not use POST requests, there are two workarounds for archiving database-driven content: if there are direct links into the raw content, the crawler can follow those; and crawling an XML-based sitemap can help the crawler reach and archive all of this “back-end” content (see the sketch below).
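As a minimal sketch of that second workaround, the sitemap below follows the standard sitemaps.org protocol; the domain and record URLs are hypothetical placeholders:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical sitemap, e.g. served at https://example.org/sitemap.xml -->
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- Each <loc> is a direct link to one database-driven record page -->
  <url>
    <loc>https://example.org/records?id=1001</loc>
  </url>
  <url>
    <loc>https://example.org/records?id=1002</loc>
  </url>
</urlset>
```

Because every record has its own URL in the sitemap, the crawler can reach each page directly instead of having to submit the site's search form.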
5. Robots.txt Exclusions
A webmaster can use a robots.txt exclusion to prevent certain content from being crawled. Our crawler respects all robots.txt exclusions by default. To see whether an entire site you wish to crawl is being blocked, check your seed site for a robots.txt file before you crawl, or check your seed status report after your crawl is complete. To check whether part of your website or embedded content is blocked, please check your hosts report. If you wish to crawl a site blocked by robots.txt, we encourage you to contact the webmaster of the blocked website to allow the Archive-It crawler in; the name (user-agent) of our crawler is archive.org_bot (see the example below). There is also an Archive-It feature that allows users to override robots.txt blocks, which can be enabled upon request.
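As a sketch, a robots.txt file (served at the root of the site, e.g. https://example.org/robots.txt) that blocks crawlers from one directory but explicitly welcomes the Archive-It crawler might look like this; the blocked path is a hypothetical placeholder:

```
# Block all crawlers from a private directory
User-agent: *
Disallow: /private/

# Allow the Archive-It crawler (archive.org_bot) to crawl the whole site;
# an empty Disallow line permits everything for this user-agent
User-agent: archive.org_bot
Disallow:
```

Crawlers follow the most specific User-agent group that matches them, so with a file like this archive.org_bot would obey its own group and ignore the general block.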
*For more information on what makes sites archive-friendly, there is an in-depth guide available from Stanford University Libraries.