On this page:
- Why expand a crawl's scope?
- How to expand scope at the collection level
- How to archive URLs that contain specific text
- How to archive URLs that match a SURT
- How to archive URLs that match a regular expression
- How to expand scope at the seed level
Why expand a crawl's scope?
If your crawl archives less content by default than you want, whether from your seeds or from the other hosts to which they link, you can expand its scope to include more. Assigning a seed the "Standard+" or "One Page+" seed type is one way to expand the number of documents that our crawler considers "in scope" for your crawl. To include more content more selectively, such as additional sub-domains of your seed URL or linked documents from otherwise "out of scope" hosts, use the following directions to point our crawler at those materials.
Note: Be as specific as possible when identifying the text or URL patterns that you wish to add to your crawl's scope. These expansions apply to all hosts in the crawl and can therefore drastically increase the amount of content it collects.
How to expand scope at the collection level
To expand the scope of your crawl, go to your collection and select Collection Scope.
From the drop-down menu, select Accept Document if. You have the option to override the default scope to include document URLs that match a specified text string, regular expression, or SURT.
How to archive URLs that contain specific text
To archive URLs that contain a specific string of text, select Accept Document if > it Contains the text: and then enter the string as it appears in the desired URLs in the text box. To save, click Add Rule.
Any URL discovered in future crawls that contains the text string specified will be automatically collected.
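To illustrate why specificity matters for this kind of rule, the following is a minimal sketch of a "contains the text" check. It assumes a plain, case-sensitive substring test; the function name and the example URLs are hypothetical and are not taken from the crawler itself.

```python
def contains_text(url: str, text: str) -> bool:
    """Return True when the text appears anywhere in the URL.

    Assumption: a plain, case-sensitive substring test.
    """
    return text in url


# Hypothetical URLs that a crawl might discover.
discovered = [
    "https://example.org/reports/2021/annual.pdf",
    "https://cdn.example.net/reports/img/logo.png",
    "https://example.org/about",
]

# A short string matches URLs on every host that happens to contain it.
broad = [u for u in discovered if contains_text(u, "reports")]

# A longer, more specific string keeps the expansion on the intended host.
narrow = [u for u in discovered if contains_text(u, "example.org/reports/")]

print(len(broad), "URLs match the broad rule")
print(len(narrow), "URL matches the narrow rule")
```

In this sketch the short string "reports" also pulls in a URL from an unrelated host, while the longer string limits the match to one site.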
How to archive URLs that match a SURT
A SURT ("Sort-friendly URI Reordering Transform") is a slightly different way to express locations on the web than a normal URL. We typically recommend using SURTs to scope many different URLs from one or more sub-domains at once. By default, sub-domains are not considered in scope, so you can use a SURT rule to ensure that they are collected.
To archive URLs that match a SURT, select Accept Document if > it Matches the SURT: and then enter your SURT in the text box. To save, click Add Rule.
Any documents that match the SURT will be automatically collected.
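As a rough illustration of how a SURT groups sub-domains together, the sketch below converts a URL into SURT-style form by reversing the host labels, then tests it against a SURT prefix. This is a simplified approximation under stated assumptions: the helper names are hypothetical, ports and query strings are ignored, and the crawler's own canonicalization may differ.

```python
from urllib.parse import urlsplit


def to_surt(url: str) -> str:
    """Build a simplified SURT-style string from a URL.

    Assumption: only scheme, host, and path are used; ports and
    query strings are ignored in this sketch.
    """
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    # Reverse the host labels: www.archive.org -> org,archive,www
    labels = ",".join(reversed(host.split(".")))
    path = parts.path or "/"
    return f"{parts.scheme}://({labels},){path}"


def matches_surt_prefix(url: str, surt_prefix: str) -> bool:
    """Return True when the URL's SURT form starts with the prefix."""
    return to_surt(url).startswith(surt_prefix)


# A prefix left open after the domain covers the domain and its sub-domains.
prefix = "http://(org,archive,"

print(to_surt("http://www.archive.org/about/"))
print(matches_surt_prefix("http://blog.archive.org/post", prefix))
print(matches_surt_prefix("http://www.example.org/", prefix))
```

Because the host is written from the most general label to the most specific, a single prefix can cover a domain together with all of its sub-domains.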
How to archive URLs that match a regular expression
Regular expressions are rules that our crawler can follow to identify URLs that might not always have the same string of text in them, but which conform to a regular pattern. Before attempting to use a regular expression, we highly recommend reviewing our general guidance on regular expressions. We do not, however, expect you to learn how to use regular expressions. If your desired scope expansion might benefit from a regular expression, please contact us for assistance.
To archive URLs that match a regular expression, select Accept Document if > it Matches the Regular Expression: and then enter the regular expression into the text box. To save, click Add Rule.
Any documents that match the regular expression will be automatically collected.
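For a sense of what such a rule might look like, here is a small sketch that tests URLs against a regular expression. The pattern and URLs are hypothetical, and the sketch assumes that the expression must match the entire URL rather than only part of it; confirm the actual matching behavior with our guidance on regular expressions before relying on it.

```python
import re

# Hypothetical pattern: news pages under a four-digit year on example.org
# or any of its sub-domains.
pattern = re.compile(r"https?://([a-z0-9-]+\.)*example\.org/news/\d{4}/.*")


def matches_regex(url: str) -> bool:
    """Return True when the whole URL matches the pattern.

    Assumption: the rule is applied as a full match, not a partial one.
    """
    return pattern.fullmatch(url) is not None


print(matches_regex("https://www.example.org/news/2023/story.html"))
print(matches_regex("https://example.org/news/archive/story.html"))
```

Escaping the dots in the host and anchoring the pattern to the host name helps keep the rule from matching look-alike hosts.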
How to expand scope at the seed level
To modify scope at the level of a particular seed, go to your collection's Seeds tab and click the hyperlinked URL of the seed you want.
In the seed's Seed Scope tab, select Accept Document if and then choose whether to match a specified text string, SURT, or regular expression. Enter the value in the text box. To save, click Add Rule.
Any documents that match your seed scope rule will be automatically collected.