Archiving Wikipedia pages

Overview

Wikipedia is a free online encyclopedia. This guide provides an overview of how to properly format, scope, and crawl Wikipedia seeds.

Known issues

There are currently no known issues for archiving Wikipedia. For a full list of known issues for archiving various platforms, please visit our Status of monitored platforms page.

On this page:

How to format and scope your Wikipedia seeds
Running your crawl
What to expect from your archived Wikipedia seeds

How to format and scope your Wikipedia seeds

Many of our partners archive content from Wikipedia, but there are a few things you MUST do in order to crawl and archive it productively:

1. Enter your seed URL correctly. A correct seed URL for Wikipedia might look something like this example: http://en.wikipedia.org/wiki/Internet_Archive. PLEASE NOTE: you should NOT include a / at the end of the URL. Wikipedia will not recognize any URL with a / at the end, so you will only end up crawling an error page.
2. Limit the scope of your crawl. Because you cannot include a / at the end of your seed URL, ALL of Wikipedia will be in scope when you crawl unless you add some constraints. There are a few different options depending on how much of Wikipedia you want to capture, so please choose one of the options below:

a) To capture just the one Wikipedia article that you've entered as a seed (in the style demonstrated above), set its seed type as One Page.

b) If you want to capture links out from your Wikipedia article seed, the easiest thing to do is to set a document limit on the host wikipedia.org.

c) If you want to archive just the first level of links from your Wikipedia seed (i.e. all pages linked from your seed, but none from the subsequent pages), then you can use the One Page Plus External Links (One Page +) seed type. This will crawl your seed URL and each page linked from your seed URL, but nothing further.
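
If you prepare seed lists outside the web application, the formatting rule above can be checked before you add your seeds. The following is a minimal sketch in Python, not part of any Archive-It tool; the function name normalize_wikipedia_seed is hypothetical, and it assumes your seeds are plain article URLs on a wikipedia.org host.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_wikipedia_seed(url: str) -> str:
    """Return the seed URL with any trailing / removed from its path.

    Hypothetical helper for illustration only. It rejects URLs whose
    host is not wikipedia.org or a subdomain such as en.wikipedia.org.
    """
    parts = urlsplit(url.strip())
    host = parts.hostname or ""
    if host != "wikipedia.org" and not host.endswith(".wikipedia.org"):
        raise ValueError(f"not a Wikipedia URL: {url!r}")
    # Drop every trailing slash so the seed points at the article itself.
    path = parts.path.rstrip("/")
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


if __name__ == "__main__":
    print(normalize_wikipedia_seed("http://en.wikipedia.org/wiki/Internet_Archive/"))
```

A seed that is already correctly formatted is returned unchanged, so the check is safe to run over a whole list of seeds.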

Running your crawl

Crawl Wikipedia seeds using either Standard or Brozzler.

What to expect from your archived Wikipedia seeds

In your crawl report, you may see several documents from the upload.wikimedia.org host. This host is where most of the images included in Wikipedia articles are stored. Because you will usually want to collect the images on a page, there is no need to create any special scope rules for Wikipedia seeds.
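
If you would like to see how the documents in a crawl are divided between hosts, a quick tally can help. This guide does not describe an export format for the crawl report, so the sketch below simply assumes you have a plain list of captured URLs; the function name documents_per_host and the sample URLs are hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit


def documents_per_host(urls):
    """Count how many URLs belong to each host (hypothetical helper)."""
    return Counter(urlsplit(u).hostname for u in urls)


if __name__ == "__main__":
    # Made-up sample list standing in for the URLs captured by a crawl.
    sample = [
        "http://en.wikipedia.org/wiki/Internet_Archive",
        "http://upload.wikimedia.org/wikipedia/commons/example-1.png",
        "http://upload.wikimedia.org/wikipedia/commons/example-2.png",
    ]
    for host, count in documents_per_host(sample).most_common():
        print(host, count)
```

Seeing upload.wikimedia.org near the top of such a tally is expected for image-heavy articles.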
