How to avoid crawler traps on Wix.com sites
As Wix (or Wix.com), a web design and hosting platform like WordPress and Squarespace, gains wider popularity, our partners are increasingly archiving websites built with it. We've noticed that some, though not all, of these Wix sites generate nasty crawler traps, which can waste partners' data budgets on seemingly endless invalid URLs. You'll know this kind of trap when you see it: it manifests as thousands, if not tens of thousands, of such URLs in your post-crawl report's "Queued" column.
If you notice this happening to any of the sites in your crawl, you can add a new scoping rule to prevent it in the future. Just tell the crawler to block all URLs that match the regular expression ^.*/[^/]{300,}$, which catches any URL with 300 or more characters after its last slash.
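If you'd like to see what that expression catches before adding the rule, here is a minimal sketch that tests it locally. It assumes Python's built-in re module, which may differ in small details from the crawler's own regex engine, and the function name and example URLs are made up for illustration.

```python
import re

# The expression from the scoping rule: a slash followed by 300 or more
# non-slash characters running to the end of the URL.
TRAP_PATTERN = re.compile(r"^.*/[^/]{300,}$")


def would_be_blocked(url: str) -> bool:
    """Return True if the URL matches the trap expression (hypothetical helper)."""
    return TRAP_PATTERN.match(url) is not None


# Hypothetical URLs, for illustration only.
normal_url = "https://example.wixsite.com/mysite/about"
trap_url = "https://example.wixsite.com/mysite/" + "a" * 350

for url in (normal_url, trap_url):
    # Print only the start of each URL so the output stays readable.
    print(url[:60], "->", "blocked" if would_be_blocked(url) else "allowed")
```

Ordinary pages with short final segments pass through, while anything with 300 or more characters after the last slash is blocked. Note that this includes long query strings, so it is worth running a few of your own seed URLs through a check like this first.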

We'll add any new advice to the Archive-It User Guide entry on Wix sites. In the meantime, this one little rule should help to keep crawls manageable and accounts under budget!