Overview
Most websites built on the Wix platform can be crawled, archived, and replayed, and in some cases no special steps are required. This guide explains how to properly format, scope, and crawl Wix seeds.
Known issues
Some archived Wix sites have an issue where drop-down menus do not expand when hovered over. Our engineers are working to resolve this issue.
For a full list of known issues with archiving various platforms, please visit our Status of monitored platforms page.
Running your crawl
We highly recommend crawling Wix sites using Brozzler. When you use this capture technology, the only additional scoping necessary is:
- Block URL if it contains the text: blur_
If you are using the standard crawler, we recommend the scoping rules below; they are not necessary with Brozzler.
Scoping Wix seeds
Additional manual scoping options for Wix
New Wix seeds that contain wixsite.com or wix.com in their URLs will have the following default scoping rules automatically applied at the seed level when they are added to a collection. To learn more, including how you can add default scoping rules to existing seeds, please visit Sites with automated scoping rules.
We recommend that you add these same scoping rules manually to any Wix seeds that do not already contain the "wix" string in their URLs and which you intend to collect with "standard" crawling technology instead of Brozzler.
- Ignore robots.txt files on the following hosts in the Collection Scope tab:
- frog.wix.com
- static.parastorage.com
- Expand the scope of your crawl to include URLs that contain the following strings of text at the seed level:
- frog.wix.com
- static.parastorage.com
- static.wixstatic.com
- sslstatic.wix.com
- Block URL if it contains the text blur_, at the seed level
- Block the following regular expression at the seed or collection level to avoid the crawler traps generated by some Wix sites (see the sketch after this list): ^.*/[^/]{300,}$
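To illustrate how these rules combine, here is a minimal Python sketch that applies the scope-expansion hosts, the blur_ block, and the crawler-trap regular expression to candidate URLs. It assumes a simple substring-based model of scope; it is not Archive-It's actual crawler logic, and the in_scope() helper is hypothetical.

```python
import re

# Host strings that the scoping rules above pull into scope.
SCOPE_EXPANSION_STRINGS = [
    "frog.wix.com",
    "static.parastorage.com",
    "static.wixstatic.com",
    "sslstatic.wix.com",
]

BLOCK_STRING = "blur_"

# Matches any URL whose final path segment is 300 or more characters long,
# the pattern produced by the crawler traps on some Wix sites.
TRAP_PATTERN = re.compile(r"^.*/[^/]{300,}$")


def in_scope(url: str, seed_host: str) -> bool:
    """Return True if a URL would be fetched under the rules listed above (sketch only)."""
    # Block rules take precedence over scope expansion.
    if BLOCK_STRING in url:
        return False
    if TRAP_PATTERN.match(url):
        return False
    # Otherwise, in scope if the URL belongs to the seed site or contains
    # one of the Wix host strings listed above.
    if seed_host in url:
        return True
    return any(host in url for host in SCOPE_EXPANSION_STRINGS)


# A crawler-trap URL (a path segment of 350 characters) is blocked:
trap_url = "http://example.wixsite.com/site/" + "a" * 350
assert not in_scope(trap_url, "example.wixsite.com")

# Wix static assets are pulled into scope, but blurred placeholders are not:
assert in_scope("https://static.wixstatic.com/media/photo.jpg", "example.wixsite.com")
assert not in_scope("https://static.wixstatic.com/media/blur_photo.jpg", "example.wixsite.com")
```

In this sketch the block rules are checked first, so a blur_ asset on a Wix static host is still excluded even though its host is otherwise in scope.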
Adding One Page Seeds: Captures of Wix sites made with the "standard" crawling technology may be improved by also crawling sub-pages (e.g. http://www.website.com/subpage). To do this, add the sub-pages as private One Page seeds, add the recommended Wix scoping rules to each of those seeds, and include them in a crawl with the main Wix seed.
When reviewing your crawls, remember that you should also navigate through the seed URL's Wayback link to QA the sub-pages.
What to expect from archived Wix seeds
Some Wix templates can result in incomplete capture or replay. If you used Brozzler and encounter blank or missing images in replay, we recommend trying a standard crawl as a next step (remember to add the scoping rules listed above manually).
Comments
Most Wix seeds seem not to have wix.com or wixsite.com in their URL, so the scoping rules are not added by default and have to be added manually, which is very time consuming, especially given the recommendation to add One Page seeds. Given this, I have two questions:
1) Can the criterion for automated recognition of Wix-based sites, for the application of the default Wix scoping, be expanded so that any site with Wix hosts (frog.wix.com, etc.) in its source code can include files from those hosts by default?
2) If not, would it be advisable for curators to apply the recommended Wix scope expansions at the collection level rather than the seed/page level? I already ignore robots.txt on Wix hosts at the collection level.
This post is the first time I've heard of the strategy of using a "private one page seed". Can you explain this concept further and why it works for improving Wix captures? Is this a strategy that can be used in other situations?