Many of our partners archive content from Wikipedia, but there are a few things you MUST do in order to crawl and archive it productively:
- Enter your seed URL correctly. A correct seed URL for Wikipedia might look something like this example: http://en.wikipedia.org/wiki/Internet_Archive. PLEASE NOTE: you should NOT include a / at the end of the URL. Wikipedia will not recognize any URL with a / at the end, so you will only end up crawling an error page.
- Because you cannot include a / at the end of your seed URL, ALL of Wikipedia will be in scope when you crawl unless you add some constraints. There are a few different options depending on how much of Wikipedia you want to capture, so please choose one of the options below:
a) To capture just the one Wikipedia article that you've entered as a seed (in the style demonstrated above), set its seed type as One Page.
b) If you want to capture links out from your Wikipedia article seed, the easiest thing to do is to set a document limit on the host wikipedia.org.
c) If you want just the first level of links from your Wikipedia seed to archive (i.e. all pages linked from your seed, but none from the subsequent pages), then you can use the One Page Plus External Links (One Page +) seed type. This will crawl your seed URL and each page linked from your seed URL, but nothing further.
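If you prepare seed lists in bulk before entering them, the trailing-slash rule above can be checked ahead of time. The following is a minimal Python sketch, not part of the web application itself; the function name normalize_wikipedia_seed is hypothetical and only illustrates removing a trailing / from a seed URL.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_wikipedia_seed(url: str) -> str:
    """Return the seed URL with any trailing "/" removed from its path.

    Hypothetical helper for illustration only; it does not contact
    Wikipedia or any crawler service.
    """
    parts = urlsplit(url.strip())
    # Drop trailing slashes from the path, leaving the rest of the URL intact.
    path = parts.path.rstrip("/")
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


if __name__ == "__main__":
    print(normalize_wikipedia_seed("http://en.wikipedia.org/wiki/Internet_Archive/"))
```

A seed that is already correct, such as http://en.wikipedia.org/wiki/Internet_Archive, passes through unchanged.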
FAQs
- Why are there so many documents from "upload.wikimedia.org" showing up in my crawl?
- "upload.wikimedia.org" is where most of the images included in Wikipedia articles are stored. Since in most cases you do want to capture the images on the page, there is no need to put any special rules in place for this host.