Overview
Wikipedia is a free online encyclopedia. This guide provides an overview of how to properly format, scope, and crawl Wikipedia seeds.
Known issues
There are currently no known issues for archiving Wikipedia. For a full list of known issues for archiving various platforms, please visit our Status of monitored platforms page.
On this page:
- How to format and scope your Wikipedia seeds
- Running your crawl
- What to expect from your archived Wikipedia seeds
How to format and scope your Wikipedia seeds
Many of our partners archive content from Wikipedia, but there are a few things you MUST do in order to crawl and archive it productively:
- Enter your seed URL correctly. A correct seed URL for Wikipedia looks like this example: http://en.wikipedia.org/wiki/Internet_Archive. PLEASE NOTE: do NOT include a / at the end of the URL. Wikipedia will not recognize a URL with a / at the end, so you will only end up crawling an error page.
- Because you cannot include a / at the end of your seed URL, ALL of Wikipedia will be in scope when you crawl unless you add some constraints. There are a few different options depending on how much of Wikipedia you want to capture, so please choose one of the options below:
a) To capture just the one Wikipedia article that you've entered as a seed (in the style demonstrated above), set its seed type to One Page.
b) If you want to capture links out from your Wikipedia article seed, the easiest thing to do is to set a document limit on the host wikipedia.org.
c) If you want to archive just the first level of links from your Wikipedia seed (i.e., all pages linked from your seed, but none from the subsequent pages), then you can use the One Page Plus External Links (One Page +) seed type. This will crawl your seed URL and each page linked from your seed URL, but nothing further.
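If you prepare seed lists outside the web application, the trailing-slash rule above can be checked before you enter your seeds. The following is a minimal sketch in Python, not part of the Archive-It service: the function name normalize_wikipedia_seed is hypothetical, and the check that the host belongs to wikipedia.org is an assumption added for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_wikipedia_seed(url: str) -> str:
    """Return the seed URL with any trailing "/" removed from its path.

    Hypothetical helper for illustration only. The host check is an
    assumption: it rejects anything that is not on wikipedia.org.
    """
    parts = urlsplit(url.strip())
    host = (parts.hostname or "").lower()
    if host != "wikipedia.org" and not host.endswith(".wikipedia.org"):
        raise ValueError(f"not a Wikipedia URL: {url!r}")

    # Drop every trailing slash so the seed points at the article itself.
    path = parts.path.rstrip("/")
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


if __name__ == "__main__":
    print(normalize_wikipedia_seed("http://en.wikipedia.org/wiki/Internet_Archive/"))
```

A seed that is already correct passes through unchanged, so the helper can be run over a whole seed list. It only fixes the URL format; the scoping choice (One Page, a document limit, or One Page +) still has to be made in the seed settings.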
Running your crawl
Crawl Wikipedia seeds using either Standard or Brozzler.
What to expect from your archived Wikipedia seeds
FAQs:
- Why are there so many documents from "upload.wikimedia.org" showing up in my crawl?
- "upload.wikimedia.org" is where most of the images included in Wikipedia articles are stored. Since in most cases you do want to capture the images on the page, there is no need to put any special rules in place for these pages.
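To see how much of a crawl comes from "upload.wikimedia.org", you can tally archived URLs by host. The sketch below assumes you already have a plain list of crawled URLs; the function name count_documents_by_host and the sample URLs are hypothetical and do not reflect any particular report format.

```python
from collections import Counter
from urllib.parse import urlsplit


def count_documents_by_host(urls):
    """Count how many URLs belong to each host (hypothetical helper)."""
    return Counter(urlsplit(u).hostname for u in urls)


if __name__ == "__main__":
    # Made-up sample data for illustration only.
    crawled = [
        "http://en.wikipedia.org/wiki/Internet_Archive",
        "http://upload.wikimedia.org/example/image-1.png",
        "http://upload.wikimedia.org/example/image-2.png",
    ]
    for host, total in count_documents_by_host(crawled).most_common():
        print(host, total)
```

A high count for "upload.wikimedia.org" is expected, because each image on an article is stored as a separate document on that host.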