Sites in multiple languages
Rotary International maintains a couple of different sites that are available in multiple languages. Ideally we want to capture all language versions. And this should be easy, because each translated site is a subdirectory under the main domain:
So setting up the seed URL as rotary.org/ captures all languages.
But my most recent crawl of endpolio.org, which is set up similarly, only seems to be capturing the default English version of the site.
https://wayback.archive-it.org/2731/20200810170244/https://www.endpolio.org/
Maybe Brozzler can't access the other sites via the drop down menu - but I'm not really sure.
Anyone else encountered a similar problem? I'd like to avoid having to create separate seed URLs for each language, if I can.
-
Andy,
That dropdown is definitely what's blocking you and Brozzler won't help. The trouble is that it's not just a menu dropdown that needs to be reproduced; it's a dropdown that requires a particular selection to work. That'll never happen with a crawler. I think your only option is creating separate seeds for each language. However, the seeds themselves don't need to appear in the public interface for them to be available. As long as they're captured, the dropdown should allow users to navigate to them.
-
Believe it or not, it looks on this end like the legacy crawl technology actually had an easier time parsing this navigation, which might be presented to a crawler differently from how we see it in our browsers. Andy, check out how you were able to regularly archive these other languages with your annual crawl of endpolio.org up until the move to Brozzler:
- https://wayback.archive-it.org/2731/*/https://www.endpolio.org/de
- https://wayback.archive-it.org/2731/*/https://www.endpolio.org/es
- https://wayback.archive-it.org/2731/*/https://www.endpolio.org/it
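If separate seeds do turn out to be needed, the language paths listed above could be expanded into seed URLs with a short script. This is only a minimal sketch: the helper name `language_seeds` is made up for illustration, and the list covers just the three language codes shown in this thread, not the full set of languages on the site.

```python
# Hypothetical helper: build one seed URL per language subdirectory.
BASE = "https://www.endpolio.org/"

# Only the language codes that appear in this thread; any others are unknown here.
LANGUAGE_CODES = ["de", "es", "it"]


def language_seeds(base, codes):
    """Return the default-language seed followed by one seed per language code."""
    root = base.rstrip("/") + "/"
    return [root] + [root + code for code in codes]


seeds = language_seeds(BASE, LANGUAGE_CODES)
for seed in seeds:
    print(seed)
```

The printed URLs can then be pasted into the seed list for the collection.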
Assuming that you want to continue using Brozzler for the rest of your annual crawl job, you can write us a ticket and we can look into the possibility of training Brozzler to interact with this drop-down the way it does with the one on your main domain. In the meantime, Skip is right that you would likely be able to add each seed (and discretely) before we can make and test code changes. I see seven of them:
-
Thanks, Skip and Karl,
I hadn't put it together that the difference between this and previous crawls of endpolio.org was Brozzler.
Adding to the weirdness is the fact that I was able to capture all language sites in rotary.org with just the one seed URL using Brozzler. For example:
https://wayback.archive-it.org/2731/20200530203358/https://www.rotary.org/de
Looking at rotary.org live on the web a second time, I notice the Change Language menu there behaves differently than the dropdown menu on endpolio.org. For example, when I click “Change Language” on rotary.org and hover the cursor over one of the language selections, I see a URL in the browser status bar; whereas I don’t when I click the dropdown menu in endpolio.org. And maybe that makes all the difference.
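To illustrate why a URL in the status bar might matter, here is a minimal sketch of plain link extraction using Python's standard `html.parser`. Both markup snippets are hypothetical stand-ins, not the actual HTML of rotary.org or endpolio.org, and this is a simplified model of link discovery rather than a description of how Brozzler itself works.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags, as a simple link extractor might."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


# Hypothetical markup: a language menu built from ordinary links.
LINK_MENU = (
    '<ul><li><a href="/de">Deutsch</a></li>'
    '<li><a href="/es">Español</a></li></ul>'
)

# Hypothetical markup: a dropdown that only navigates via script on selection.
SELECT_MENU = (
    '<select id="lang"><option value="de">Deutsch</option>'
    '<option value="es">Español</option></select>'
)

link_menu_urls = extract_links(LINK_MENU)
select_menu_urls = extract_links(SELECT_MENU)
print(link_menu_urls)    # ['/de', '/es']
print(select_menu_urls)  # []
```

In this simplified model, the link-based menu exposes each language URL directly, while the dropdown exposes none until something makes a selection.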
I'll submit the support ticket you suggest.
Andy