On this page:
- Why limit a crawl?
- How to limit an entire crawl by time and/or data
- How to limit the data for specific hosts at the collection level
- How to limit at the seed level
- Further information
Why limit a crawl?
Whether because of crawler traps or unexpected hosts, crawling too much data can quickly use up your annual data budget. Limiting the scope of your crawl is an easy way to ensure that you do not archive too much material.
These limits can apply to entire crawls in the form of duration and/or data limits. They can also be applied, at the seed or collection level, to specific hosts that the crawler encounters during the course of its crawl; for example, you can specify the amount of data that you wish to crawl from facebook.com, www.youtube.com, en.wikipedia.org, etc. Directions for adding and editing both kinds of limits are included below, and you can find related information about modifying scope here.
How to limit an entire crawl by time and/or data
Time Limits for One-Time Crawls - Test or Production
1- or 3-day time limits are recommended for all crawls on new seeds
There are no hard and fast rules for determining how much time a crawler will need to capture a seed or group of seeds. You can select time limits that range from 10 minutes to 7 days, but we recommend a 1-3 day test crawl to start. With the information from your 1-3 day crawl, you will be able to adjust the time limit up or down as necessary for subsequent crawls.
Please note that time limits only determine the maximum time a crawl will run. It is possible for a crawl to complete before the time limit is hit.
Time Limits For Recurring Crawls
By default, every seed in your collection is assigned a crawl frequency when you add it, and this frequency in turn determines the default duration (such as 1 day, 3 days, 1 week, etc.) of its crawls.
| Crawl frequency | Default duration | Notes |
| --- | --- | --- |
| Every 12 Hours | 12 hours | These twice-daily crawls restart every 12 hours. We strongly recommend running test crawls before scheduling your seeds at this frequency, as these crawls can quickly use up large amounts of your data budget. |
| Daily | 24 hours | Daily crawls will repeat every day and will run up to 24 hours. We strongly recommend running test crawls before scheduling your seeds at this frequency, as these crawls can quickly use up large amounts of your data budget. |
| Weekly | 3 days | Weekly crawls will repeat every week and will run up to 3 days (72 hours) by default, but can be extended to run 5 to 7 days. |
| Monthly | 3 days | Monthly crawls will repeat every month and will run up to 3 days (72 hours) by default, but can be extended to run 5 to 7 days. |
| Every Two Months | 3 days | Bi-monthly crawls will repeat every two months and will run up to 3 days (72 hours) by default, but can be extended to run up to 7 days. |
| Quarterly | 3 days | Quarterly crawls will repeat every three months and will run up to 3 days (72 hours) by default, but can be extended to run 5 to 7 days. |
| Semi-Annual | 5 days | Semiannual crawls will repeat every six months and will run up to 5 days by default, but can be adjusted to run 5 to 7 days. |
| Annual | 5 days | Annual crawls will repeat every twelve months and will run up to 5 days by default, but can be adjusted to run 5 to 7 days. |
| One-Time | 3 days | A One-Time crawl will run exactly once and will not be scheduled for future crawls. These crawls will run up to 3 days (72 hours) by default, but can be extended to run 5 to 7 days. |
You can modify the default duration and add data/document limits that apply to these seeds' future crawls in our web application. To do so, first navigate to any collection's "Crawls" tab, then to the "Crawl schedule" sub-tab beneath it:
Click the "Edit Limits" button next to each listed crawl frequency in order to add/edit the time, data or documents limits associated to your future crawls:
How to limit the data for specific hosts at the collection level
In addition to any time or data limits imposed on a crawl as a whole, you can impose data limits on specific hosts in order to limit how much material is archived from each of them. We frequently recommend imposing this kind of limit when you crawl very large sites, like Facebook and Twitter; specific guidelines are available for those sites as well as for others. In general, though, you can limit the amount of material that you archive from any host you wish.
Begin by navigating to the specific collection, then to the "Collection Scope" tab:
Here you can limit the number of URLs archived from specific hosts, limit the amount of data archived from specific hosts, and/or block or limit URLs that match specific strings or patterns of text.
How to block the crawler from archiving specific hosts
To block our crawler from archiving any URL it encounters from a specific host, start from the "Collection Scope" tab, use the dropdown to select "Block Hosts," add the names of the hosts (exactly as they appear in your crawl's Hosts report) to the "Enter Host(s)" text box, and click "Add Rule."
The rule will then appear in the list of active scope modifications for that collection's crawls below. You can then use the "Controls" toggle at the right to switch it on or off, depending on whether you'd like to apply it to crawls:
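For example, if your Hosts report shows a large amount of unwanted material coming from a host such as ads.example.com (a hypothetical host name used here only for illustration), adding that host as a "Block Hosts" rule would prevent the crawler from archiving any URL from it in future crawls.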
How to limit the amount archived from specific hosts
Please note that a data limit placed upon a crawl will apply to New Data rather than Total Data. The reason for this is that, thanks to de-duplication, only New Data counts against your data budget.
To limit the number of documents or the amount of data archived from a specific host, start from the "Collection Scope" tab and use the dropdown to select "Add Data Limits" or "Add Document Limits." Then add the names of the hosts (exactly as they appear in your crawl's Hosts report) to the "Enter Host(s)" text box, define a number of documents or an amount of data (in gigabytes), and click "Add Rule."
The rule will then appear in the list of active scope modifications for that collection's crawls below. You can then use the "Controls" toggle at the right to switch it on or off, depending on whether you'd like to apply it to crawls:
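As an illustration, you might add www.youtube.com as a host with a 1 GB data limit so that embedded video content from that host cannot account for more than 1 GB of new data in a single crawl; the 1 GB figure is only an example, and the right limit will depend on your collection and data budget.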
How to block specific URLs from being archived
To block our crawler from archiving specific URLs, or URLs that contain particular text or patterns, start from the "Collection Scope" tab, use the dropdown to select "Block URLs if...," and add the names of the hosts (exactly as they appear in your crawl's Hosts report).
Once you have entered a host, you can use the second "Block URL If..." dropdown to block URLs that contain a specific string of text, or URLs that match a specific regular expression, using the directions below.
Block URLs containing specific text
We recommend blocking URLs that contain specific text whenever there are known areas of a site that you wish to avoid archiving, and when those areas can be identified by a string of text in all of their respective URLs.
To block our crawler from archiving any URL containing a specific string of text, select the "Block URL if it contains the text:" option from the drop-down menu, enter the text as it appears in the undesired URLs, and click the "Add Rule" button:
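For example, if every page in a site's shopping cart includes the string /cart/ in its URL (a hypothetical pattern used here for illustration), entering /cart/ would block the crawler from archiving any URL that contains it, such as http://www.example.com/store/cart/checkout.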
Block URLs that match a regular expression
Regular expressions are rules that our crawler can follow in order to identify URLs that might not always contain the same string of text, but which nonetheless conform to a regular pattern. Often these patterns appear in crawler traps like online calendars, which can dynamically generate endless possible URLs from combinations of dates and times. Before attempting to use regular expressions to control our crawler, we highly recommend reviewing our general guidance on regular expressions.
To block our crawler from archiving any URL that matches a given regular expression, select the "Block URL if it matches the regular expression:" option from the drop-down menu, write the regular expression to match the pattern of the undesired URLs, and click the "Add Rule" button:
This action will add your new rule to the list of scope modifications below. Note: If you would like assistance defining a regular expression for URLs that appear to conform to a specific pattern, please contact an Archive-It Web Archivist.
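As a hypothetical illustration, an online calendar might generate URLs such as http://www.example.com/calendar/2015/06/17; a regular expression like .*/calendar/[0-9]{4}/.* would match, and therefore block, every URL that follows that date-based pattern.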
How to limit at the seed level
To modify scope at the level of a particular seed, navigate to the collection's "Seeds" tab, click the hyperlinked URL of the seed for which you wish to make a modification, and then open that seed's "Seed Scope" tab. Then add a data limit or a block rule:
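Seed-level rules work the same way as the collection-level rules described above, but apply only to crawls of that particular seed. For example, you might block a string such as /tag/ (a hypothetical example) for a single blog seed whose tag pages generate unwanted URLs, without affecting any other seeds in the collection.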
Further information
In addition to the above options for limiting the amount of material that your crawl archives, you can also limit a crawl by the type of material it archives, for instance by limiting it to only archive PDFs.