On this page:
- What are 'crawler traps' and why should we avoid them?
- How to identify a crawler trap
- What to look for in queued URLs
- How to prevent traps from being crawled
What are 'crawler traps' and why should we avoid them?
A crawler trap is a set of web pages that generates an effectively infinite number of URLs (documents) for our crawler to find, meaning that a crawl could keep running and discovering "new" URLs indefinitely. Our crawlers have maximum crawl durations that stop them after a certain length of time, but a crawl can still archive quite a lot of trap content before that cut-off, so avoiding traps is a key part of effectively managing your time and your data budgets.
You can use the information below to help you identify any crawler traps that may be part of your scope and to avoid crawling them in the future.
How to identify a crawler trap
The easiest place to identify potential traps in your seeds is your Hosts report. An especially high number of "Queued" URLs from any particular host can be an indication of a crawler trap.
Click the hyperlinked number to view or download the list of queued URLs and determine whether the URLs follow any pattern that could be crawled indefinitely (such as calendar dates or endlessly repeating or excessive directories), and therefore may be worth limiting in future crawls. If you are not sure whether the URLs in the list are valid and valuable, you can copy and paste them into your browser to view them on the live web.
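If you have downloaded the queued URL list, a short script can surface common trap symptoms before you review URLs by hand. The sketch below is illustrative only (it is not part of our web application) and assumes a hypothetical file named queued-urls.txt containing one URL per line.

# Illustrative only: flag queued URLs that show common crawler trap symptoms.
# Assumes a hypothetical file "queued-urls.txt" with one URL per line.
import re
from collections import Counter
from urllib.parse import urlparse

def trap_symptom(url):
    """Return a short reason if the URL looks trap-like, otherwise None."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    # Repeated directory names within a single path are a common trap symptom.
    repeats = [seg for seg, count in Counter(segments).items() if count > 1]
    if repeats:
        return "repeated directory: " + ", ".join(repeats)
    # Unusually deep paths often indicate endlessly appended directories.
    if len(segments) > 10:
        return "unusually deep path (%d directories)" % len(segments)
    # Date-like query parameters can indicate calendar pages.
    if re.search(r"date=\d{4}-\d{2}-\d{2}", url):
        return "calendar-style date parameter"
    return None

with open("queued-urls.txt") as f:
    for line in f:
        url = line.strip()
        reason = trap_symptom(url)
        if reason:
            print(reason, "->", url)

Any URLs a script like this flags should still be checked in your browser before you block them.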
What to look for in queued URLs
There are many types of crawler traps that you may encounter, such as wikis or forums that generate endless links for endless combinations of viewing options, but we've identified a few of the more common ones here. Before blocking specific URL patterns like these, be sure to copy and paste a few examples into your browser to confirm that they are invalid or unnecessary.
Long messy strings:
http://www.example.org/-4RKgHfoN-6E11VCLhJOgPe79POuU2ZXhnTVbnUg_Zs.eyJpbnN0YW5jZUlkIjoiMTNkZDNmOTYtZTE1OC03NTU5LTEyMWMtZmM1YzE3MmQxMzk4Iiwic2lnbkRhdGUiOivZGUiOmZhbHNlLCJiaVRva2VIy…
These are indicative of a site created using the Wix platform. We have some specific instructions for Archiving Wix Sites.
Repeating Directories:
http://www.example.org/media/media/page/sites/css/html-reset.css
-and-
Extra Directories:
http://www.example.org/media/feed/pae/sites/all/themes/dev/custom/sites/all/themes/enviro-c4/css/html-reset.css
In most cases, large numbers of repeating directories or extra directories are a strong indication of a crawler trap. To be sure, copy and paste one or two of these URLs into your browser to confirm that they are invalid, meaning they do not lead you to any desired content. We have developed a Regular Expression to help mitigate these invalid URLs, which you can find on our Regular Expressions help page.
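As an illustration of what such an expression can look like (check the Regular Expressions help page for the pattern we recommend), the sketch below tests a repeated-directory pattern against the example URLs above.

# Illustrative only: a pattern that flags URLs containing a repeated "/segment/",
# including immediate repeats such as /media/media/. See the Regular Expressions
# help page for the recommended expression.
import re

REPEATED_DIR = re.compile(r"^.*?(/.+?/).*?\1.*$|^.*?/(.+?/)\2.*$")

examples = [
    "http://www.example.org/media/media/page/sites/css/html-reset.css",
    "http://www.example.org/media/feed/pae/sites/all/themes/dev/custom/sites/all/themes/enviro-c4/css/html-reset.css",
    "http://www.example.org/about/contact.html",
]
for url in examples:
    print("trap-like" if REPEATED_DIR.match(url) else "looks valid", "->", url)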
Calendars:
http://www.example.org/calendar/events?&page=1&mini=2015-09&mode=week&date=2021-12-04
Calendar pages are not always a problem, but they have the potential to generate URLs infinitely into the future and past. A regular expression for blocking all calendar content can be found on our Regular Expressions help page. If you would like to capture only a specific date range, please get in touch with us and we can help develop a regular expression for your use case.
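For illustration, the sketch below shows a broad "contains calendar" pattern alongside a hypothetical date-range variant; check the Regular Expressions help page for the published expression, and contact us for a pattern tailored to your own date range.

# Illustrative only: a broad pattern matching any URL containing "calendar",
# and a hypothetical variant that would block calendar URLs dated outside 2020-2022.
import re

BLOCK_ALL_CALENDARS = re.compile(r"^.*calendar.*$")
BLOCK_OUTSIDE_2020_2022 = re.compile(r"^.*calendar.*date=(?!202[0-2]-)\d{4}-\d{2}-\d{2}.*$")

url = "http://www.example.org/calendar/events?&page=1&mini=2015-09&mode=week&date=2021-12-04"
print(bool(BLOCK_ALL_CALENDARS.match(url)))      # True: a "block all calendars" rule would exclude it
print(bool(BLOCK_OUTSIDE_2020_2022.match(url)))  # False: date=2021-12-04 falls inside the kept range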
How to prevent traps from being crawled
If you identify any traps in your Hosts report, you can block them from future crawls by using our web application's feature for limiting the scope of your crawls. Most often, we recommend identifying a string of text that appears in all of your invalid or superfluous URLs and then applying a rule to block those specific URLs in future crawls.
However, we have also devised a few useful regular expressions for URLs that generate infinitely without a single consistent string of text to identify and block.
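If you want to check a candidate rule before adding it, you can test it locally against the queued URLs you downloaded from the Hosts report. The sketch below is illustrative only; the file name, text string, and expression are hypothetical examples rather than rules taken from our help pages.

# Illustrative only: estimate how many queued URLs a candidate block rule would
# exclude before adding it in the web application. File name, text string, and
# regular expression below are hypothetical examples.
import re

BLOCK_TEXT = ["calendar"]                          # "URL contains this text" rules
BLOCK_REGEX = [re.compile(r"^.*?/(.+?/)\1.*$")]    # e.g. immediately repeated directories

with open("queued-urls.txt") as f:
    urls = [line.strip() for line in f if line.strip()]

blocked = [
    u for u in urls
    if any(text in u for text in BLOCK_TEXT) or any(rx.match(u) for rx in BLOCK_REGEX)
]
print("%d of %d queued URLs would be blocked" % (len(blocked), len(urls)))
for u in blocked[:20]:    # spot-check a sample in your browser before applying the rule
    print(u)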