Password-protected WordPress site
Does anyone have any advice on capturing an internal, password-protected WordPress site? Our organization's intranet is a WordPress site and I have been having some trouble with playback.
In playback, when I try to navigate to a new page using the site's top menu or sidebar menus, it redirects back to the login page before navigating to the new page. Initially this structure was blocking the crawler from capturing pages beyond the homepage at all, but I adjusted the crawl settings and added some scope rules, and that seemed to solve that issue. It now captures the entire intranet site. However, in playback it still redirects to the login page before navigating to any page when you use the menu buttons (and doesn't automatically log in using the credentials in the seed settings like it does for the homepage). Sometimes when I re-enter my login info it will then continue to the page I selected, but other times it just redirects to my WordPress account profile or dashboard and I get stuck in a loop. If I enter the URL for the page directly into the address bar, it works fine, so it is clearly capturing all of the site pages successfully. I just can't easily navigate to them in playback.
Crawl settings & seed scope rules currently applied:
- Brozzler
- Standard +
- ignore robots.txt
- ignore crawl delay
- accept URL if it contains "wp-login.php"
- accept URL if it contains "wp-content"
- several "block URL if it contains..." for some specific external sites linked to the intranet that I don't want to capture
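For anyone reading along, here is a rough sketch of how I understand "accept" and "block" substring rules like the ones above to combine. It is only an illustration, not the crawler's actual code: the function name `scope_decision`, the `in_default_scope` flag, the host `external.example.com`, and the assumption that a block rule wins over an accept rule are all mine.

```python
# Illustrative only: substring-based scope rules like the ones listed above.
ACCEPT_SUBSTRINGS = ["wp-login.php", "wp-content"]
BLOCK_SUBSTRINGS = ["external.example.com"]  # hypothetical external site


def scope_decision(url, in_default_scope):
    """Return "block", "accept", or "skip" for a discovered URL.

    Assumption: block rules take precedence over accept rules, and a URL
    matching neither falls back to the seed's default scope.
    """
    if any(fragment in url for fragment in BLOCK_SUBSTRINGS):
        return "block"
    if any(fragment in url for fragment in ACCEPT_SUBSTRINGS):
        return "accept"
    return "accept" if in_default_scope else "skip"


print(scope_decision("https://intranet.example.org/wp-login.php", False))
print(scope_decision("https://external.example.com/wp-content/a.css", False))
print(scope_decision("https://other.example.net/page", False))
```

Under these assumptions the three calls give "accept", "block", and "skip" respectively; the second URL matches both kinds of rule and is blocked.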
Any advice would be greatly appreciated! Thanks!
Official comment
Collecting and replaying password-protected content can be challenging, due to a variety of authentication measures on various platforms. Each website's case can be different; for this reason, I recommend submitting a support ticket with the crawl details.
It sounds like your crawl may have been successful in collecting some of the intranet pages and this may be more of a replay issue. We can verify this if you submit a support ticket, and we might even be able to help improve replay.
If it is more of an issue of collecting the intranet pages, in addition to the general advice in "Archiving password protected sites" and "Troubleshooting password-protected sites", it can sometimes help to add an additional seed just for the login page that points to the protected page where you'd like to start collecting.
The webpage for the additional login seed should:
- Have both fields for the login credentials (username and password) on the same page.
- Not have only the username field with the password field on a second page.
- Not have additional fields.
- Not have any kind of CAPTCHA.
- Not require 2-factor authentication.
- Have its access set to private.
- Have a one page seed type (or one page+ if the host domain's different).
- Have the same login credentials as the main seed applied in its seed settings.
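If it helps to pre-check a candidate login page against the page-level items in the list above, a minimal sketch along these lines could be run on the page's saved HTML. It is a rough heuristic under stated assumptions, not an Archive-It tool: the names `LoginFormAudit` and `audit_login_page` are made up for illustration, "Remember Me" checkboxes and hidden inputs are assumed not to count as additional fields, the CAPTCHA check only looks for the word "captcha" in attribute values, and 2-factor authentication cannot be detected from the HTML at all.

```python
from html.parser import HTMLParser


class LoginFormAudit(HTMLParser):
    """Collect fillable input fields and CAPTCHA hints from a page."""

    # Assumption: these input types are not "additional fields" to fill in.
    IGNORED_TYPES = {"hidden", "submit", "button", "checkbox", "image", "reset"}

    def __init__(self):
        super().__init__()
        self.field_types = []
        self.has_captcha = False

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        # Crude heuristic: any attribute value that mentions "captcha".
        if any("captcha" in (value or "").lower() for value in attr_map.values()):
            self.has_captcha = True
        if tag == "input":
            input_type = (attr_map.get("type") or "text").lower()
            if input_type not in self.IGNORED_TYPES:
                self.field_types.append(input_type)


def audit_login_page(html):
    """Return a dict of pass/fail checks for the page-level criteria."""
    parser = LoginFormAudit()
    parser.feed(html)
    passwords = parser.field_types.count("password")
    others = len(parser.field_types) - passwords
    return {
        "username_and_password_on_same_page": passwords == 1 and others >= 1,
        "no_additional_fields": len(parser.field_types) <= 2,
        "no_captcha": not parser.has_captcha,
    }


# A simplified form resembling a standard WordPress login page.
SAMPLE = """
<form action="wp-login.php" method="post">
  <input type="text" name="log">
  <input type="password" name="pwd">
  <input type="checkbox" name="rememberme">
  <input type="submit" value="Log In">
  <input type="hidden" name="redirect_to" value="/intranet/">
</form>
"""
print(audit_login_page(SAMPLE))
```

For the sample form all three checks come back `True`. A page with only a username field, or with a CAPTCHA widget, would fail the corresponding check and is probably not a good candidate for the additional login seed.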
Then try crawling the main seed together with the additional login seed to see if it can collect more of the password-protected pages. Hope this information helps!