On this page:
- Why limit a crawl?
- How to limit an entire crawl by duration and data
- How to limit the data crawled for specific hosts
- Further information
Why limit a crawl?
Whether because of crawler traps or unexpectedly voluminous directories within your seed sites, crawling too much data can negatively impact your annual subscription. Limiting the scope of your crawl is therefore an easy way to ensure that you do not archive too much material.
These limits may apply to entire crawls in the form of duration and/or data limits. They may also be applied at the level of the specific hosts that our crawler encounters during the course of its crawl, meaning that you can specify the amount of data that you wish to crawl from your seed or any other host (e.g., facebook.com, www.youtube.com, en.wikipedia.org, etc.) to which it leads. Directions for adding and editing both kinds of limits are included below.
Please note that a data limit placed upon a crawl will apply to New Data rather than Total Data. The reason for this is that, thanks to de-duplication, only New Data counts against your data budget.
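As a rough illustration of the difference between Total Data and New Data, the Python sketch below tallies both for a small, made-up crawl. It is not our crawler's implementation: the function name, the use of SHA-1 content digests to detect duplicates, and the choice to count a repeated payload only once within the same crawl are all assumptions made for the example.

```python
import hashlib


def crawl_data_summary(documents, previously_archived_digests):
    """Return (total_bytes, new_bytes) for a hypothetical crawl.

    documents: iterable of (url, payload_bytes) pairs fetched in this crawl.
    previously_archived_digests: hex digests of payloads already archived.
    Assumption: a payload is a duplicate when its SHA-1 digest was seen before.
    """
    seen = set(previously_archived_digests)
    total_bytes = 0
    new_bytes = 0
    for _url, payload in documents:
        total_bytes += len(payload)
        digest = hashlib.sha1(payload).hexdigest()
        if digest not in seen:
            # Only content not seen before counts as New Data in this sketch.
            new_bytes += len(payload)
            seen.add(digest)
    return total_bytes, new_bytes


# Hypothetical example data: one payload was archived by an earlier crawl.
earlier = {hashlib.sha1(b"world!").hexdigest()}
docs = [
    ("https://example.org/a", b"hello"),
    ("https://example.org/b", b"world!"),
    ("https://example.org/c", b"hello"),
]
total, new = crawl_data_summary(docs, earlier)
```

In this example a data limit would be compared against `new`, not `total`.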
How to limit an entire crawl by duration and data
By default, every seed in your collection is assigned a crawl frequency when you add it, and this frequency in turn determines the default duration period (such as 1 day, 3 days, 1 week, etc.) of its crawls. You may modify the default duration time that applies to these seeds' future crawls in our web application. To do so, first navigate to any collection's "Crawls" tab, then to the "Crawl schedule" sub-tab beneath it:
Click the "Edit Limits" button next to each listed crawl frequency in order to add or edit the time and/or data limits associated with your future crawls:
How to limit the data crawled for specific hosts
In addition to any time or data limits imposed on a crawl as a whole, you may impose data limits on specific hosts in order to limit how much material is archived from each of them. We frequently recommend this kind of limit when you crawl very large sites, like Facebook and Twitter, to ensure that you do not spend your annual subscription on material far removed from what interests you on those sites. Specific guidelines are available for those sites as well as for others. In general, though, you can follow the directions below to limit the amount of material that you archive from any host you wish.
Begin by navigating in our web application to the collection management screen for your collection, then to the "Crawl Scope" tab, and finally to the "Host Rules" pane under that tab:
With this interface, you may limit the number of URLs that you archive from specific hosts, and/or block or limit URLs that match specific strings or patterns of text.
How to block the crawler from archiving specific hosts
To block our crawler from archiving any URL it encounters from a specific host, begin in the "Host Rules" pane shown above. Under the "Add Host Rule" heading, select the "Block Hosts" option from the drop-down menu and add the names of the hosts–exactly as they appear in your crawl's Hosts report–to the "Enter Host(s)" text box:
Click the "Add Rule" button to add this new block to your crawl scope. It will then appear in the list below of active scope modifications for your collection's crawls, where you may use the "Controls" toggle at the right to switch it on or off:
How to limit the amount archived from specific hosts
To limit the number of URLs or amount of total data that our crawler can archive from a specific host, begin in the "Host Rules" pane shown above. Under the "Add Host Rule" heading, select the "Add Data Limits" or "Add Document Limits" option from the drop-down menu, add the names of the hosts–exactly as they appear in your crawl's Hosts report–to the "Enter Host(s)" text box, and define a number of documents or amount of data (in gigabytes) that our crawler is permitted to archive:
Click the "Add Rule" button to add each new type of limit (data or documents) to your crawl scope. Each will then appear in the list below of active scope modifications for your collection's crawls, where you may use the "Controls" column at the right to discard it if you ever choose to do so in the future:
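To make the effect of document and data limits more concrete, here is a minimal Python sketch of how per-host limits could be applied to a list of candidate URLs. It is a simplified model rather than a description of how our crawler enforces limits: the function name `apply_host_limits`, its parameters, and the use of byte counts instead of gigabytes are assumptions for illustration.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def apply_host_limits(candidates, document_limits=None, data_limits_bytes=None):
    """Return the URLs that would be kept under hypothetical per-host limits.

    candidates: iterable of (url, size_in_bytes) pairs, in crawl order.
    document_limits: {host: maximum number of documents}.
    data_limits_bytes: {host: maximum total bytes}.
    Hosts are matched by exact name, so "youtube.com" and "www.youtube.com"
    are treated as different hosts.
    """
    document_limits = document_limits or {}
    data_limits_bytes = data_limits_bytes or {}
    docs_per_host = defaultdict(int)
    bytes_per_host = defaultdict(int)
    kept = []
    for url, size in candidates:
        host = urlsplit(url).hostname
        doc_cap = document_limits.get(host)
        data_cap = data_limits_bytes.get(host)
        if doc_cap is not None and docs_per_host[host] >= doc_cap:
            continue  # document limit for this host already reached
        if data_cap is not None and bytes_per_host[host] + size > data_cap:
            continue  # this document would exceed the host's data limit
        docs_per_host[host] += 1
        bytes_per_host[host] += size
        kept.append(url)
    return kept


# Hypothetical example: two documents from one host, 1000 bytes from another.
kept = apply_host_limits(
    [
        ("https://www.youtube.com/a", 100),
        ("https://www.youtube.com/b", 100),
        ("https://www.youtube.com/c", 100),
        ("https://example.org/x", 500),
        ("https://example.org/y", 600),
    ],
    document_limits={"www.youtube.com": 2},
    data_limits_bytes={"example.org": 1000},
)
```

The exact-name matching in the sketch is the reason to enter hosts exactly as they appear in your crawl's Hosts report.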
How to block specific URLs from archiving
To block our crawler from archiving precise URLs, or any URLs that contain specific text or patterns, begin in the "Host Rules" pane shown above. Under the "Add Host Rule" heading, select the "Block URLs if..." option from the drop-down menu. Then, in the "Enter Host(s)" text box, add the names of the hosts to which this rule will apply, exactly as they appear in your crawl's Hosts report:
Once you have done this, you may select whether you want to block URLs that contain a specific string of text, URLs that match a specific SURT, or URLs that match a specific regular expression, directions for each of which may be found below.
Block URLs containing specific text
We recommend blocking URLs that contain specific text whenever there are known areas of a site that you wish to avoid archiving, and when those areas can be identified by a string of text in all of their respective URLs.
To block our crawler from archiving any URL containing a specific string of text, select the "Block URL if it contains the text:" option from the drop-down menu, enter the text as it appears in the undesired URLs, and click the "Add Rule" button:
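Before adding a rule like this, it can help to check which URLs from your reports the text would catch. The Python sketch below does that with a plain substring test. The function name, the example host, and the example URLs are hypothetical, and the sketch assumes a simple case-sensitive match, which may not reflect every detail of how our crawler evaluates the rule.

```python
from urllib.parse import urlsplit


def is_blocked_by_text(url, host, blocked_text):
    """Return True if url is on host and contains blocked_text.

    Assumption for this sketch: the rule applies only to URLs whose host
    name matches exactly, and the text is matched case-sensitively.
    """
    return urlsplit(url).hostname == host and blocked_text in url


# Hypothetical URLs copied from a crawl report.
urls = [
    "https://www.example.org/calendar/2020/01",
    "https://www.example.org/about",
    "https://other.example.org/calendar/2020/01",
]
blocked = [u for u in urls if is_blocked_by_text(u, "www.example.org", "/calendar/")]
```

A short, distinctive string such as `/calendar/` is usually safer than a common word, which could match URLs that you do want to archive.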
Block URLs that match a regular expression
Regular expressions are rules that our crawler can follow in order to identify URLs that might not always have the same string of text in them, but which nonetheless conform to a regular pattern. Often, these manifest in the form of crawler traps like online calendars, which can dynamically generate endless possible URLs with combinations of dates and times. Before attempting to use regular expressions to control our crawler, we highly recommend reviewing our general guidance on regular expressions.
To block our crawler from archiving any URL that matches a given regular expression, select the "Block URL if it matches the regular expression:" option from the drop-down menu, write the regular expression to match the pattern of the undesired URLs, and click the "Add Rule" button:
This action will add your new rule to the list of scope modifications below. Note: If you would like assistance defining a regular expression for URLs that appear to conform to a specific pattern, please contact an Archive-It Web Archivist.
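As an example of the calendar case described above, the following Python sketch tests a candidate regular expression against a few sample URLs before it is entered as a rule. The pattern and URLs are hypothetical. The sketch also assumes that the expression must match the entire URL, and it uses Python's `re` module, whose syntax may differ in places from the regular expression dialect that our crawler uses.

```python
import re

# Hypothetical pattern: any URL under /calendar/ followed by a four-digit
# year, a month, and an optional day, e.g. .../calendar/2031/07/15
CALENDAR_TRAP = re.compile(r"^.*/calendar/\d{4}/\d{1,2}(/\d{1,2})?/?$")


def matches_rule(url, pattern=CALENDAR_TRAP):
    """Return True if the whole URL matches the pattern."""
    return pattern.fullmatch(url) is not None


# Hypothetical sample URLs: the first two are trap pages, the last two are not.
samples = [
    "https://www.example.org/calendar/2031/07/15",
    "https://www.example.org/calendar/2031/7",
    "https://www.example.org/calendar/",
    "https://www.example.org/events/summer-fair",
]
results = [matches_rule(u) for u in samples]
```

Testing against both URLs that should be blocked and URLs that should be kept helps confirm that the expression is neither too narrow nor too broad.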
How to limit at the seed level
To modify scope at the level of a particular seed, navigate to the collection's "Seeds" tab, click on the hyperlinked URL of the seed for which you wish to make a modification, followed by that seed's "Seed Scope" tab. Then, add a data limit, or a block rule:
In addition to the above options for limiting the amount of material that your crawl archives, you may also limit a crawl by the type of material archived, by restricting it to archive only PDFs.