Overview
While we continuously investigate and implement improvements, some websites are not created in a way that is "archive-friendly" and can be difficult to collect or replay in their entirety. These difficulties affect all web crawlers, not just Archive-It's. When selecting seed URLs and reviewing your archived content, please keep these limitations in mind. For more information on what makes sites archive-friendly, see the Library of Congress's Creating Preservable Websites.
About
By default, Archive-It's crawling technology collects only the public web, not information protected behind logins or passwords. Archive-It partners can, however, crawl password protected content by entering their credentials in the Archive-It Web Application.
Note: This feature is incompatible with log-in processes that require two-step authentication or a CAPTCHA, with log-in processes that split the username and password fields across separate web pages, and with sites that use speciality certificates.
Troubleshooting
For best results collecting password protected content, follow the steps outlined below:
- Review and follow the steps outlined in our article on Archiving password protected sites.
- Try crawling password protected content using Brozzler.
- Review your crawl’s Hosts report to check if URLs hosted behind authentication pages were collected.
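The last check above can be sketched in a few lines of Python. This is only an illustration, not an Archive-It tool: it assumes you already have a plain list of crawled URLs (how you obtain that list from your reports is not covered here), and the function name `urls_behind_login`, the host `example.org`, and the `/members/` path prefix are all hypothetical.

```python
from urllib.parse import urlsplit


def urls_behind_login(crawled_urls, protected_host, protected_prefix):
    """Return the crawled URLs on protected_host whose path starts with protected_prefix."""
    matches = []
    for url in crawled_urls:
        parts = urlsplit(url)
        # Compare the exact host name, then the start of the path.
        if parts.hostname == protected_host and parts.path.startswith(protected_prefix):
            matches.append(url)
    return matches


# Hypothetical sample data standing in for URLs taken from a crawl.
crawled = [
    "https://example.org/",
    "https://example.org/login",
    "https://example.org/members/report.pdf",
    "https://cdn.example.org/members/logo.png",
]

found = urls_behind_login(crawled, "example.org", "/members/")
print(found)
```

An empty result would suggest that the crawler never got past the login page, which is a reasonable point at which to revisit the credential settings or try Brozzler.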
Outcome
We will not be able to collect password protected content that uses anything beyond traditional username/password authentication systems. In some cases, password protected content may be collected but will not replay in Wayback.