Overview
While we continuously investigate and implement capture improvements, some websites are not created in a way that is "archive-friendly" and can be difficult to capture or replay in their entirety. These difficulties affect all web crawlers, not just ours. When selecting seed URLs and reviewing your archived content, please keep these limitations in mind.
For more information on what makes sites archive-friendly, see the in-depth guide available from Stanford University Libraries.
About
By default, Archive-It crawling technology captures only the public web, not content protected behind a login. Archive-It partners can, however, crawl password-protected content by entering their credentials in the Archive-It Web Application.
Note: This feature does not work with login processes that require two-step authentication or a CAPTCHA, with logins that split the username and password fields across separate web pages, or with sites that use specialty certificates.
Troubleshooting
For best results when collecting password-protected content, follow the steps outlined below:
- Review and follow the steps outlined in our article on Archiving password protected sites.
- Try crawling password-protected content using Brozzler.
- Review your crawl’s Hosts report to check if URLs hosted behind authentication pages were collected.
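The last check above can be tedious for a large crawl, so here is a minimal Python sketch of one way to do it offline. It assumes you have already copied the collected URLs into a plain list of strings; it does not read any Archive-It report format or call any Archive-It API. The function name `urls_behind_login`, the host `example.org`, and the `/members/` path prefix are all hypothetical placeholders for your own site's protected area.

```python
from urllib.parse import urlsplit


def urls_behind_login(collected_urls, protected_host, protected_path_prefix):
    """Return the collected URLs on protected_host whose path starts with the prefix.

    Hypothetical helper: the inputs are plain strings that you supply yourself.
    """
    matches = []
    for url in collected_urls:
        parts = urlsplit(url)
        host = (parts.hostname or "").lower()
        # Keep only URLs on the protected host and under the protected path.
        if host == protected_host.lower() and parts.path.startswith(protected_path_prefix):
            matches.append(url)
    return matches


# Example input: URLs copied by hand from a crawl report (made-up values).
collected = [
    "https://example.org/",
    "https://example.org/login",
    "https://example.org/members/newsletter-2020.html",
    "https://cdn.example.net/style.css",
]

found = urls_behind_login(collected, "example.org", "/members/")
if found:
    print("Collected URLs under the protected path:")
    for url in found:
        print(" ", url)
else:
    print("No URLs under the protected path were collected.")
```

An empty result suggests the crawler did not get past the login page. A non-empty result only shows that the URLs were collected, not that they will replay in Wayback.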
Outcome
We will not be able to collect password-protected content that uses anything beyond a traditional username/password authentication system. In some cases, password-protected content may be collected but will not replay in Wayback.