On this page:
- Why expand a crawl's scope?
- How to expand scope at the collection level
- How to archive URLs that contain specific text
- How to archive URLs that match a SURT
- How to archive URLs that match a regular expression
- How to expand scope at the seed level
Why expand a crawl's scope?
If your crawl archives less content by default than you intended, whether from your seeds or from the other host domains to which they link, you can expand its scope to include more. Assigning a seed the "Standard+" or "One Page+" seed type is one easy way to quickly expand the amount of content that our crawler considers "in scope" for your crawl. To include additional content more selectively, such as other sub-domains of your seed URL or documents linked from otherwise "out of scope" hosts, follow the directions below to point our crawler directly at those materials.
*Note: Be as specific as possible when identifying the text and/or URL patterns that you add to your crawling scope in the manners described below. These expansions apply to all hosts in the crawl and can therefore expand its scope drastically if you are not careful.
How to expand scope at the collection level
To expand the scope of your crawl to archive more material of a given type, navigate to the "Crawl Scope" tab of your collection's management area, followed by the "Expand Scope Rules" sub-tab:
From this interface, you may add rules to your collection's crawling scope that explicitly tell our crawler to archive any URL it encounters that contains a specific string or pattern of text. Directions for each rule type are below.
How to archive URLs that contain specific text
To expand the scope of your crawls to include URLs that contain a specific string of text, select the option to include a URL "if it Contains the text:" from the drop-down menu and enter the string as it appears in the desired URLs:
Click the "Add Rule" button, and any URL discovered in future crawls that contains the specified string will automatically be archived.
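For illustration only (this is not Archive-It's implementation), a "contains the text" rule behaves like a simple substring test applied to every URL the crawler discovers:

```python
# Hypothetical sketch of a "Contains the text" scope rule: a discovered
# URL is treated as in scope if the specified string appears anywhere in it.
def contains_rule(url: str, text: str) -> bool:
    return text in url

discovered = [
    "https://example.org/reports/annual-2023.pdf",
    "https://example.org/blog/post-1",
]
# Only URLs containing "/reports/" are added to the crawl's scope.
in_scope = [u for u in discovered if contains_rule(u, "/reports/")]
```

Because the test is a plain substring match, a short or common string (such as "news") can pull in far more URLs than expected, which is why the note above recommends being as specific as possible.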
How to archive URLs that match a SURT
A SURT ("Sort-friendly URI Reordering Transform") is a slightly different way to express locations on the web than a normal URL. We typically recommend using SURTs to scope many different URLs from one or more sub-domains at once. By default, sub-domains are not considered in scope, so you can use a SURT rule to ensure that they are archived nonetheless. To do so, select the option to include a URL "if it Matches the SURT" from the drop-down menu, enter your SURT into the text box, and click the "Add Rule" button:
This action will add your new rule to the list of scope modifications below, and future crawls will automatically archive any URLs that match your specified SURT.
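To make the idea concrete, here is a simplified sketch (not Archive-It's actual implementation) of how a SURT works: the host labels of a URL are reversed, so that a single SURT prefix can match a domain and all of its sub-domains at once. The rule string `http://(org,example,` used below is a made-up example:

```python
from urllib.parse import urlparse

def to_surt(url: str) -> str:
    # Simplified SURT form: reverse the dot-separated host labels, e.g.
    # "data.example.org" becomes "org,example,data", then append the path.
    p = urlparse(url)
    host = ",".join(reversed((p.hostname or "").split(".")))
    return f"http://({host},)" + p.path

def matches_surt(url: str, surt_prefix: str) -> bool:
    # A SURT rule is a prefix match against the URL's SURT form.
    return to_surt(url).startswith(surt_prefix)

# "http://(org,example," matches example.org and every sub-domain of it.
matches_surt("https://data.example.org/page", "http://(org,example,")
```

Because the host is reversed before matching, one prefix such as `http://(org,example,` covers `example.org`, `data.example.org`, `www.example.org`, and so on, which is why SURTs are convenient for scoping many sub-domains with a single rule.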
How to archive URLs that match a regular expression
Regular expressions are rules that our crawler can follow in order to identify URLs that might not always contain the same string of text, but which nonetheless conform to a regular pattern. Before attempting to use regular expressions to control our crawler, we highly recommend reviewing our general guidance on regular expressions. We do not, however, expect all of our partners to learn how to use regular expressions themselves. If you think that your desired scope expansion might require a regular expression, please contact Archive-It's Web Archivists for assistance.
When prepared, choose the option to include a URL if "it Matches the Regular Expression:" from the drop-down menu, enter the regular expression into the text box, and click the "Add Rule" button:
Once added to your list of scope modifications below, this regular expression will apply to all future crawls and tell our crawler to archive matching URLs.
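As a hedged illustration of the kind of pattern such a rule can express (the pattern below is an example of our own, not one supplied by Archive-It): a single regular expression can match, say, every PDF document on any sub-domain of a site, even though no fixed string of text is shared by all of those URLs:

```python
import re

# Hypothetical pattern: match PDF files on example.org or any of its
# sub-domains, over http or https.
pattern = re.compile(r"^https?://([^/]+\.)?example\.org/.*\.pdf$")

discovered = [
    "https://docs.example.org/guide.pdf",
    "https://example.org/index.html",
]
# URLs matching the pattern would be pulled into the crawl's scope.
matched = [u for u in discovered if pattern.match(u)]
```

A small mistake in a pattern like this (for instance, omitting the anchors or the escaped dots) can match far more URLs than intended across every host in the crawl, which is why we encourage partners to contact our Web Archivists before relying on one.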
How to expand scope at the seed level
To modify scope at the level of a particular seed, navigate to the collection's "Seeds" tab, click the hyperlinked URL of the seed you wish to modify, and then open that seed's "Seed Scope" tab. From there, use the drop-down menu to expand the scope of your crawl: