Overview
Some Archive-It partners crawl the news section results of specific web searches in order to enhance or replace traditional “clippings” collections. These news searches have URLs that may be used like any other web archiving seed. This guide provides an overview of how to properly format, scope, and crawl news clippings from web searches.
Known Issues
There are no known issues with archiving news clippings from web searches.
You can find a full list of known issues for archiving various platforms on our Status of monitored platforms page.
How to format your seeds
- Format your seeds to match these examples precisely, replacing only the search terms (Internet+Archive in these examples) and joining multiple words with a plus sign:
  - Google: https://www.google.com/search?q=Internet+Archive&tbm=nws&num=100
  - Bing: https://www.bing.com/news/search?q=Internet+Archive&qs=n
- Assign your seeds the One Page Plus seed type. This enables the crawler to archive the linked articles. If you prefer to archive only the results page itself, use One Page.
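If you maintain many search seeds, the formats above can be generated rather than typed by hand. The following is a minimal sketch in Python, not an Archive-It tool: the function names `google_news_seed` and `bing_news_seed` are hypothetical, and the query parameters are copied directly from the examples in this guide.

```python
from urllib.parse import urlencode


def google_news_seed(terms: str) -> str:
    """Build a Google News search seed in the format shown in this guide."""
    # urlencode escapes spaces as "+", matching the example seed.
    query = urlencode({"q": terms, "tbm": "nws", "num": 100})
    return "https://www.google.com/search?" + query


def bing_news_seed(terms: str) -> str:
    """Build a Bing News search seed in the format shown in this guide."""
    query = urlencode({"q": terms, "qs": "n"})
    return "https://www.bing.com/news/search?" + query


if __name__ == "__main__":
    for build in (google_news_seed, bing_news_seed):
        print(build("Internet Archive"))
```

Called with "Internet Archive", both functions reproduce the example seeds above. Search terms that contain punctuation will be percent-encoded, so check the resulting URL in a browser before adding it as a seed.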
How to scope news clippings from web search seeds
- Add a seed-level Ignore Robots.txt scoping rule to each seed.
- When crawling Google search results specifically, add a second scoping rule to those seeds that includes URLs containing the text: https://www.google.com/url?q=
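To see which links the Google rule above is meant to catch, you can test a list of URLs against the same text. This is a rough sketch that assumes the rule behaves as a plain substring match; it does not reproduce the crawler's actual scoping logic, and the function name and sample URLs are hypothetical.

```python
# Text from the scoping rule in this guide.
GOOGLE_REDIRECT_TEXT = "https://www.google.com/url?q="


def matches_google_rule(url: str) -> bool:
    """Return True if the URL contains the rule text (assumed substring match)."""
    return GOOGLE_REDIRECT_TEXT in url


if __name__ == "__main__":
    # Hypothetical links, for illustration only.
    sample_links = [
        "https://www.google.com/url?q=https://example.com/news/story",
        "https://www.google.com/preferences",
    ]
    for link in sample_links:
        print(link, matches_google_rule(link))
```

Only the first sample link contains the rule text, so only that link would be covered by the added rule under this assumption.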