Regular expressions (RegEx) are used to recognize patterns in the URLs that our crawler encounters on the web. They can be used to tell our crawler which links from a given host you would like to include in or block from archiving.
In general, we recommend modifying the scope of your crawls with precise text strings or SURT rules, which can easily be defined by analyzing the desired or undesired URLs in your crawl's reports. However, if you notice patterns in the URLs that you wish to include or exclude from your crawls that defy these methods, consult the guidance below on using regular expressions and consider reaching out to our Web Archivists directly for assistance.
How to modify crawl scope with a regular expression
You can apply any of the regular expressions listed below or a custom RegEx to either limit or expand the scope of your crawl, using our web application's scope modification features. To do so, navigate to the "Collection Scope" tab of your chosen collection, or the "Seed Scope" tab of your chosen seed, and follow our standard guidance for expanding or limiting your scope.
How to construct a regular expression
There are several useful resources on the web that explain regular expressions, how they work, and how to construct them:
- For background information: http://en.wikipedia.org/wiki/Regex
- For example constructions: http://www.digitalamit.com/article/regular_expression.phtml (note that these examples apply to Windows files rather than URLs, and that we do not require using a / on either side of a RegEx)
- For Java RegEx syntax, which we accept: http://java.sun.com/javase/6/docs/api/java/util/regex/Pattern.html
- To test your RegEx: https://regex-testdrive.com/en/
  - Enter your RegEx into the "Regular Expressions" field.
  - Click on "Regular Expression Constructs" to view a RegEx cheat sheet.
  - Paste test document URLs into the "Target String" field.
  - Under "Flags," check the box beside "Enables multiline mode" so that the tester will evaluate content across multiple lines in the "Target String" field.
  - Click the Test button to see which URLs your RegEx matches.
If you know of another good resource, let us know and we will include it here!
Important notes on formatting
When using regular expressions with our scope modification features, be sure to always include a ^ at the beginning and a $ at the end. For example, the regular expression to block any URL containing the word "calendar" is:
^.*calendar.*$
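As a rough illustration of why the anchors and wildcards matter, here is a minimal Java sketch (the class name and sample URL are hypothetical, and it assumes full-string matching as performed by Java's Pattern.matches, consistent with the requirement to anchor the expression). The bare term on its own does not account for the entire URL, while the anchored form with the surrounding .* does:

    import java.util.regex.Pattern;

    // Minimal sketch: checking a hypothetical URL against the unanchored and
    // anchored forms of the "calendar" rule under full-string matching.
    public class AnchorExample {
        public static void main(String[] args) {
            String url = "http://www.example.com/events/calendar/2016";

            // The bare term does not cover the whole URL, so it does not match.
            System.out.println(Pattern.matches("calendar", url));       // false

            // With ^.* and .*$ the expression spans the entire URL and matches.
            System.out.println(Pattern.matches("^.*calendar.*$", url)); // true
        }
    }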
Common and useful regular expressions
While different sites represent different scoping challenges, there are a few issues that we see often, and for which we have developed effective regular expressions. The two most common issues tend to occur in sites built upon content management systems. While most such sites archive without issue, from time to time we observe them generating invalid URLs, often with .css or .js extensions. These URLs usually follow one or both of the following patterns, and can be constrained by applying RegEx.
Long Invalid URLs
Some websites, frequently those built with Wix, generate a significant number of long invalid URLs that look like the one below:
http://www.example.com/uU5dR1gpXCHX45K8aOMct11OrLtyrYJeUnw_RxaUsg.eyJpbnN0YW5jZUlkIjoiMTNkZDc1ZTQtY2E2MC00ZGJkLWU4YTUtYTgxZjMxMzIyODVjIiwic2lnbkRhdGUiOiIyMDE1LTA1LTI4VDAwOjI5OjI5LjA3OFoiLCJpcEFuZFBvcnQiOiIyMDcuMjQxLjIyNi4xMTYvNDU2NTEiLCJkZW1vTW9kZSI6ZmFsc2UsImJpVG9rZW4iOiIxZjJjNTNlYS1hMGJjLTAyYmYtNDEwYi04N2E1N2IwNWZmYmMifQ
These URLs are usually not necessary for the site to replay correctly, but they can still vastly expand the number of documents and volume of data archived, which negatively affects your subscription budget.
If you see evidence of this crawler trap in a site you've archived, please visit our page on Archiving Wix Sites and add the regex listed there to your collection scope.
Repeating Directories
Some content management systems dynamically generate URLs in which the same directory repeats itself once or more over the course of the URL string, for example: http://www.example.org/media/feed/pae/sites/all/modules/dev/custom/js/custom.js/sites/all/themes/enviro-c4/css/html-reset.css.
This generally does not happen in valid URLs, and it frequently consumes the crawler with endless possible iterations, to the detriment of your account's document and data budget. We recommend using the following regular expression to block any URL from being archived if a directory repeats itself:
^.*?(/.+?/).*?\1.*$|^.*?/(.+?/)\2.*$
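As a hedged sketch of how this expression behaves (the class name is hypothetical; the first sample URL is the repeating-directory example above), the backreferences \1 and \2 require some /directory/ segment to occur a second time before the URL is matched and blocked:

    import java.util.regex.Pattern;

    // Minimal sketch: the backreferences (\1 and \2) match only when a
    // /directory/ segment from earlier in the URL occurs again later.
    public class RepeatingDirectoryExample {
        private static final String REGEX =
            "^.*?(/.+?/).*?\\1.*$|^.*?/(.+?/)\\2.*$";

        public static void main(String[] args) {
            // "/sites/all/" appears twice, so this URL is matched and blocked.
            String repeating = "http://www.example.org/media/feed/pae/sites/all/modules"
                + "/dev/custom/js/custom.js/sites/all/themes/enviro-c4/css/html-reset.css";

            // No directory repeats here, so this URL is not matched.
            String normal = "http://www.example.org/about/contact.html";

            System.out.println(Pattern.matches(REGEX, repeating)); // true
            System.out.println(Pattern.matches(REGEX, normal));    // false
        }
    }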
If you believe that your crawl may have encountered this kind of crawler trap, and might benefit from the use of the above RegEx, be sure to consult the "Docs" and "Queued" lists in your Hosts report and confirm that one or more hosts are generating URLs with repeating directories.
Extra Directories
Like repeating directories, extra or superfluous directories can be dynamically generated by a content management system and subsequently trap our crawler. These URLs usually contain a number of directories that do not follow a valid path for URLs on the live web. Since the exact directories included in these invalid URLs vary from URL to URL within a site, and also vary among different sites, we've come up with a regular expression that you can edit to meet the needs of your specific case.
Here's an example of the regular expression, which contains many of the directories that are commonly found in these sorts of URLs:
^.*(/misc|/sites|/all|/themes|/modules|/profiles|/css|/field|/node|/theme){3}.*$
In the above example, if any three (3) of the directories listed in parentheses in the RegEx appear in a row in a URL, that URL will be blocked. To meet the needs of your specific case, you will need to look at the list of queued URLs for a particular crawl in its Hosts report and determine which precise directories need to be added to the list above. To adjust this regular expression, add the applicable directories to the end of the existing list, with a "|/" between each directory. Be sure that there are no characters between the last directory name and the final parenthesis. The list of directories may be longer or shorter depending on the site, and it may take a little tweaking, and even a series of test crawls, to make sure you have all the correct directories listed.
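To make the editing step concrete, here is a hedged Java sketch (the class name and sample URLs are hypothetical) showing the expression blocking a URL with three of the listed directories in a row and leaving a URL with only one such directory alone; the comment notes where a hypothetical additional directory such as /views would be appended:

    import java.util.regex.Pattern;

    // Minimal sketch: the {3} quantifier requires three of the listed
    // directories to appear back to back before a URL is blocked.
    public class ExtraDirectoryExample {
        // To cover another problem directory seen in your Hosts report
        // (hypothetical example: /views), append it inside the parentheses,
        // e.g. ...|/node|/theme|/views){3}...
        private static final String REGEX =
            "^.*(/misc|/sites|/all|/themes|/modules|/profiles|/css|/field|/node|/theme){3}.*$";

        public static void main(String[] args) {
            // "/sites/all/themes" is three listed directories in a row: blocked.
            System.out.println(Pattern.matches(REGEX,
                "http://www.example.org/custom.js/sites/all/themes/html-reset.css")); // true

            // Only one listed directory ("/sites") appears here, so it is not blocked.
            System.out.println(Pattern.matches(REGEX,
                "http://www.example.org/sites/default/documents/report.pdf"));        // false
        }
    }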
Calendars
The regular expression to block any URL containing the word "calendar" is:
^.*calendar.*$
In this RegEx, any URL made up of any number of any characters ('.*'), followed by the term calendar, followed by any number of any characters will be blocked. With this regular expression in place, URLs such as http://myuniversity.com/events/calendar/event1-1-2016 will not be crawled.