New Wix seeds that contain wixsite.com or wix.com in their URLs will have the following default scoping rules automatically applied at the seed level when they are added to a collection. To learn more, including how you can add default scoping rules to existing seeds, please visit Sites with automated scoping rules.
It’s possible to crawl, archive, and replay most websites built on the Wix platform. Some of these sites require no special steps; however, certain Wix templates can result in incomplete capture or replay. We highly recommend crawling Wix sites using Brozzler. If you use the standard crawler instead, we recommend applying the scoping rules below; these rules are not necessary with Brozzler. Either option increases the likelihood that the website will capture and replay fully; however, some Wix sites experience further replay issues. Please see our prioritization of support tickets for more information.
Scoping rules for Wix
Capturing Necessary Content: We recommend using Brozzler, but if you use the standard crawler, we recommend modifying the scope of crawls of Wix-based sites with the following rules to ensure that all necessary scripts, stylesheets, and embedded media are archived:
- Ignore robots.txt exclusions on the following hosts:
- frog.wix.com
- static.parastorage.com
- Expand the scope of your crawl to include URLs that contain the following strings of text:
- frog.wix.com
- static.parastorage.com
- static.wixstatic.com
- sslstatic.wix.com
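Taken together, the rules above can be sketched as a simple in-scope check. This is a hypothetical illustration only; in practice, Archive-It applies these rules through its crawl-scoping interface, not through code like this:

```python
# Hypothetical sketch of the Wix scoping rules above; Archive-It applies
# these through its web interface, not through code like this.

# Hosts on which robots.txt exclusions should be ignored.
IGNORE_ROBOTS_HOSTS = {"frog.wix.com", "static.parastorage.com"}

# Substrings that expand the crawl scope when they appear in a URL.
SCOPE_EXPANSION_STRINGS = [
    "frog.wix.com",
    "static.parastorage.com",
    "static.wixstatic.com",
    "sslstatic.wix.com",
]

def expands_scope(url: str) -> bool:
    """Return True if the URL contains one of the scope-expansion strings."""
    return any(s in url for s in SCOPE_EXPANSION_STRINGS)
```

A URL such as `https://static.wixstatic.com/media/photo.png` would be pulled into scope by this check even though its host differs from the seed's domain, which is why Wix pages that load assets from these hosts fail to replay without the expansion.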
Avoiding Crawler Traps: Some templates used for Wix-based sites generate a significant number of long invalid URLs that look like the one below:
http://www.example.com/uU5dR1gpXCHX45K8aOMct11OrLtyrYJeUnw_RxaUsg.eyJpbnN0YW5jZUlkIjoiMTNkZDc
Add the following regular expression to keep the crawler from capturing unnecessary URLs generated by some Wix sites: `^.*/[^/]{300,}$`
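To illustrate how this pattern works: it matches any URL whose final path segment (the part after the last `/`) is 300 or more characters long, which is the shape of the invalid token URLs shown above. A quick sketch in Python (the actual filtering happens in the crawler's scoping configuration, not in code you write):

```python
import re

# Matches URLs whose last path segment is 300+ characters, the shape of
# the invalid token URLs some Wix templates generate.
TRAP_PATTERN = re.compile(r"^.*/[^/]{300,}$")

long_token = "uU5d" * 80  # a 320-character final path segment
assert TRAP_PATTERN.match("http://www.example.com/" + long_token)

# Ordinary sub-page URLs are left alone.
assert not TRAP_PATTERN.match("http://www.example.com/subpage")
```

The 300-character threshold is long enough that legitimate page URLs are very unlikely to be excluded.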
Adding One Page Seeds: Replay of sites built on the Wix platform may be improved by adding sub-pages (e.g. http://www.website.com/subpage) as private One Page seeds. To do this, add each sub-page as a private One Page seed, apply the recommended Wix scoping rules to it, and include it in a crawl with the main Wix seed. Keep in mind that when you review your crawls, you should also navigate through the base domain's Wayback link to QA the sub-pages.
Comments
Most Wix seeds seem not to have wix.com or wixsite.com in their URL, so the scoping rules are not added by default and have to be added manually, which is very time-consuming, especially given the recommendation to add One Page seeds. Given this, I have two questions:
1) Can the criterion for automated recognition of Wix-based sites, used to apply the default Wix scoping, be expanded so that any site with Wix hosts (frog.wix.com, etc.) in its source code can include files from those hosts by default?
2) If not, would it be advisable for curators to apply the recommended Wix scope expansions at the collection level rather than the seed/page level? I already ignore robots.txt on Wix hosts at the collection level.
This post is the first time I've heard of the strategy of using a "private one page seed". Can you explain this concept further and why it works for improving Wix captures? Is this a strategy that can be used in other situations?