Some Archive-It partners crawl the news section results of specific web searches in order to enhance or replace traditional “clippings” collections. These news searches have URLs that may be used like any other web archiving seed. Here is how we recommend crawling and archiving them:
- Format your seeds to match these examples precisely, replacing only the search terms (Internet+Archive in these examples, with spaces written as +):
Google: https://www.google.com/search?q=Internet+Archive&tbm=nws&num=100
Bing: https://www.bing.com/news/search?q=Internet+Archive&qs=n
- Assign your seed/s the One Page Plus seed type. This enables the crawler to archive the linked articles. If you prefer to archive only the results page itself, use One Page.
- Add a seed-level Ignore Robots.txt scoping rule to each seed.
- When crawling Google search results specifically, also add a scoping rule to the seed/s to include URLs that contain the text: https://www.google.com/url?q=
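
If you maintain many news-search seeds, it may help to generate them from a list of search terms rather than typing each URL by hand. The following is a minimal sketch in Python, not an Archive-It tool: the function names are hypothetical, and the only things taken from this article are the two URL patterns and the Google scoping-rule text. It builds seeds in the recommended format and checks whether a given URL contains the text that the Google scoping rule looks for.

```python
from urllib.parse import urlencode

# Text used by the extra Google scoping rule described above.
GOOGLE_SCOPE_TEXT = "https://www.google.com/url?q="


def google_news_seed(terms: str) -> str:
    """Build a Google news-search seed (hypothetical helper)."""
    # urlencode writes spaces as "+", matching the example seed format.
    query = urlencode({"q": terms, "tbm": "nws", "num": 100})
    return "https://www.google.com/search?" + query


def bing_news_seed(terms: str) -> str:
    """Build a Bing news-search seed (hypothetical helper)."""
    query = urlencode({"q": terms, "qs": "n"})
    return "https://www.bing.com/news/search?" + query


def matches_google_scope_rule(url: str) -> bool:
    """Return True if the URL contains the scoping-rule text."""
    return GOOGLE_SCOPE_TEXT in url


if __name__ == "__main__":
    for terms in ["Internet Archive", "web archiving"]:
        print(google_news_seed(terms))
        print(bing_news_seed(terms))
```

Note that urlencode also percent-encodes characters such as quotation marks, so a search phrase with special characters will not look identical to what you typed. It is worth opening a generated seed in a browser to confirm it returns the expected results before adding it to a collection.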