On this page:
- Why expand a crawl's scope?
- How to expand scope at the collection level
- How to archive URLs that contain specific text
- How to archive URLs that match a SURT
- How to archive URLs that match a regular expression
- How to expand scope at the seed level
Why expand a crawl's scope?
If your crawl archives less content by default than you want, whether from your seeds or from the other hosts to which they link, you can expand its scope to include more. Assigning a seed the "Standard+" or "One Page+" seed type is a quick way to expand the amount of content that our crawler considers "in scope" for your crawl. To include more content selectively, such as additional sub-domains of your seed URL or linked documents on otherwise "out of scope" hosts, follow the directions below to direct our crawler to those materials.
Note: Be as specific as possible when identifying the text or URL patterns that you add to your crawling scope in the ways described below. These expansions apply to all hosts in the crawl, so an overly general rule can expand the scope of your crawl drastically.
How to expand scope at the collection level
To expand the scope of your crawl to archive more material of a given type, navigate to the "Collection Scope" tab of your collection's management area and select "Expand Scope to include URL if..." from the drop-down menu. From this interface, you can add rules to your collection's crawling scope that tell our crawler explicitly to archive any URLs that it encounters that contain a specific string of text or that match a pattern expressed as a SURT or regular expression.
How to archive URLs that contain specific text
To expand the scope of your crawls to include URLs that contain a specific string of text, select the option to include a URL "if it Contains the text:" from the drop-down menu and enter the string as it appears in the desired URLs, and then click Add Rule. Any URL discovered in future crawls that contains the string specified will automatically be archived.
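As a rough illustration of how a "Contains the text" rule behaves, the sketch below applies a plain substring test to a few URLs. The function name, rule strings, and URLs are hypothetical, and the case-sensitive comparison is an assumption rather than a statement about how our crawler is implemented.

```python
def contains_rule(url: str, text: str) -> bool:
    """Return True if the rule text appears anywhere in the URL.

    Assumption: a simple, case-sensitive substring test.
    """
    return text in url


# Hypothetical URLs that a crawl might discover.
discovered = [
    "https://example.org/annual-reports/2019.pdf",
    "https://cdn.example.net/files/annual-reports/summary.html",
    "https://example.org/pdf-tips/index.html",
    "https://example.org/news/",
]

# A specific rule matches only the intended material, on any host.
specific = [u for u in discovered if contains_rule(u, "/annual-reports/")]

# A short, generic rule also pulls in URLs that merely contain "pdf".
generic = [u for u in discovered if contains_rule(u, "pdf")]

print(len(specific), len(generic))
```

The generic rule matches an extra page that is not a PDF at all, which is why a longer, more specific string is usually the safer choice.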
How to archive URLs that match a SURT
A SURT ("Sort-friendly URI Reordering Transform") is a slightly different way to express locations on the web than a normal URL. We typically recommend using SURTs to scope in many different URLs from one or more sub-domains at once. By default, sub-domains are not considered in scope, so you can use a SURT rule to ensure that they are archived. To do so, select the option to include a URL "if it Matches the SURT" from the drop-down menu, enter your SURT into the text box, and click Add Rule.
This action will add your new rule to the list of scope modifications below, and future crawls will automatically archive any URLs that match your specified SURT.
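To show what a SURT looks like next to an ordinary URL, here is a minimal sketch that reverses the host labels and treats a SURT rule as a prefix test. The function names and example.org URLs are hypothetical, and the transform is deliberately simplified: it does not model ports, user info, scheme normalization, or the other canonicalization steps a real crawler may apply.

```python
from urllib.parse import urlsplit


def to_surt(url: str) -> str:
    """Simplified SURT form: reverse the host labels and join them with commas."""
    parts = urlsplit(url)
    labels = (parts.hostname or "").lower().split(".")
    reversed_host = ",".join(reversed(labels)) + ","
    return f"{parts.scheme}://({reversed_host}){parts.path}"


def surt_prefix_for_domain(domain: str, scheme: str = "http") -> str:
    """Open-ended SURT prefix covering a domain and all of its sub-domains."""
    labels = domain.lower().split(".")
    return f"{scheme}://(" + ",".join(reversed(labels)) + ","


def matches_surt(url: str, surt_prefix: str) -> bool:
    """Assumption: a SURT rule matches when the URL's SURT starts with the prefix."""
    return to_surt(url).startswith(surt_prefix)


print(to_surt("http://www.example.org/about"))  # http://(org,example,www,)/about

rule = surt_prefix_for_domain("example.org")     # http://(org,example,
print(matches_surt("http://blog.example.org/post/1", rule))  # True
print(matches_surt("http://example.com/", rule))             # False
```

Because the most general part of the host comes first, a single prefix can cover every sub-domain of a domain.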
How to archive URLs that match a regular expression
Regular expressions are rules that our crawler can follow in order to identify URLs that might not always contain the same string of text, but which nonetheless conform to a regular pattern. Before attempting to use regular expressions to control our crawler, we highly recommend reviewing our general guidance on regular expressions. We do not, however, expect all of our partners to learn how to use regular expressions themselves. If you think that your desired scope expansion might require a regular expression, please contact Archive-It's Web Archivists for assistance.
When prepared, choose the option to include a URL "if it Matches the Regular Expression:" from the drop-down menu, enter the regular expression into the text box, and click Add Rule.
Once added to your list of scope modifications, this regular expression will apply to all future crawls and tell our crawler to archive matching URLs.
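Before adding a regular expression rule, it can help to try the pattern against a handful of sample URLs. The sketch below does this with Python's `re` module as a stand-in; our crawler uses its own regular expression engine, so the syntax may differ in places. The pattern and URLs are hypothetical, and the sketch assumes that the expression must match the entire URL.

```python
import re

# Hypothetical pattern: PDF files under a four-digit year folder in /reports/,
# on any host.
pattern = re.compile(r"https?://[^/]+/reports/\d{4}/.*\.pdf")

samples = [
    "https://example.org/reports/2021/q1.pdf",
    "http://files.example.net/reports/1999/annual/summary.pdf",
    "https://example.org/reports/latest/q1.pdf",  # no four-digit year
    "https://example.org/reports/2021/q1.html",   # not a PDF
]

for url in samples:
    # fullmatch requires the pattern to cover the whole URL (an assumption here).
    verdict = "in scope" if pattern.fullmatch(url) else "unchanged"
    print(f"{verdict:10} {url}")
```

Testing against URLs that should not match is as useful as testing those that should, since the rule applies to every host in the crawl.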
How to expand scope at the seed level
To modify scope at the level of a particular seed:
- Navigate to the collection's "Seeds" tab
- Click the hyperlinked URL of the seed you want to modify
- Click the seed's "Seed Scope" tab
- Use the drop-down menu to choose how to expand the scope of your crawl, enter your rule, and click Add Rule