Why download my data?

Downloading the archives that you build through our service allows you to:

Access the WARC files in your collections directly and provide them to researchers

Analyze the WARCs in order to generate custom reports and/or visualizations

Provide local, restricted access to web archives not made publicly visible through Archive-It

Preserve the WARC files locally at your home institution

Preserve the WARC files using a third-party system

How to download your data

Credentialed users of the Archive-It web application can download their archives manually through a web browser or programmatically from the command line by using the Web Archiving Systems API (WASAPI).

Audience

Anyone with general knowledge of how data are moved over the Internet using web browsers and APIs can retrieve their files for download. No command-line knowledge is required to download files manually through a web browser. Automated processes, however, require downloading, installing, and running third-party command-line utilities on your computer. A general knowledge of WARC files is also helpful.

Notes on terminology

WARC and ARC files

WARC files, and their predecessor ARC files, are the files in which data crawled using Archive-It are stored. Each file may contain multiple digital objects, including HTML, images, and videos. (Note that collection data can consist of both WARC and ARC files, depending upon when they were archived through our service. Throughout these guides, the term “WARC files” refers to both WARC and ARC files.)

What is a WARC file?

WARC files are defined by ISO 28500.
The standard was created by an international body of experts in digital preservation, including people from the Internet Archive and the Library of Congress.
A WARC file is a container for web archives.
It preserves web data exactly as it was returned from the webserver. This is called "native format."
It also contains a host of relevant metadata that allows a forensic examiner to verify the integrity of all that has been captured.

For detailed information on the WARC file specification, see: https://iipc.github.io/warc-specifications/
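
To make the "container plus metadata" idea concrete, the sketch below reads a single record from an uncompressed WARC stream: a version line, named header fields, a blank line, and then a content block whose size is given by the Content-Length field. It is an illustration only, not a complete parser. The function name read_warc_record and the sample record are hypothetical, and WARC files are commonly gzip-compressed, so a dedicated WARC library is likely the more practical choice for real work.

```python
import io


def read_warc_record(stream):
    """Read one uncompressed WARC record from a binary stream.

    Returns (version, headers, block), or None at end of stream.
    Hypothetical helper for illustration; it does not handle gzip
    compression or multi-line header values.
    """
    # Skip the blank lines that separate consecutive records.
    version_line = stream.readline()
    while version_line in (b"\r\n", b"\n"):
        version_line = stream.readline()
    if not version_line:
        return None
    version = version_line.strip().decode("utf-8")

    # Header fields are "Name: value" lines, ended by a blank line.
    headers = {}
    while True:
        line = stream.readline()
        if line in (b"\r\n", b"\n", b""):
            break
        name, _, value = line.decode("utf-8").partition(":")
        headers[name.strip()] = value.strip()

    # The content block is exactly Content-Length bytes long.
    block = stream.read(int(headers["Content-Length"]))
    return version, headers, block


# A made-up record, used only to exercise the function.
sample = (
    b"WARC/1.0\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-Target-URI: http://example.com/\r\n"
    b"Content-Length: 5\r\n"
    b"\r\n"
    b"hello"
    b"\r\n\r\n"
)
version, headers, block = read_warc_record(io.BytesIO(sample))
```

Here the header fields carry the metadata about the capture, while the block holds the captured bytes unchanged.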

Checksums

During download, WARC file data may be lost or corrupted due to network or other system issues. To let you verify that a downloaded file is consistent with the file located on the Archive-It download site, both md5 and sha1 checksum values are retrieved via WASAPI. Each value is calculated by an algorithm that condenses the file's contents into a single short string of text, so any change to the file produces a different value. A tool like md5sum (on Unix, or on Windows through Cygwin) can compute the checksum of a downloaded WARC file so that it can be compared against the value for the original file stored on the Archive-It download site.
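
As an alternative to md5sum, the same check can be sketched in Python with the standard hashlib module. This is a minimal sketch, not an official Archive-It tool: the function names file_checksums and matches_expected are hypothetical, and it assumes you have already copied the expected md5 and sha1 values into a dictionary.

```python
import hashlib


def file_checksums(path, chunk_size=1024 * 1024):
    """Return the md5 and sha1 hex digests of the file at path.

    The file is read in chunks so that large WARC files do not
    have to fit in memory.
    """
    md5 = hashlib.md5()
    sha1 = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
            sha1.update(chunk)
    return {"md5": md5.hexdigest(), "sha1": sha1.hexdigest()}


def matches_expected(path, expected):
    """Compare a local file against expected checksum values.

    expected is a dict such as {"md5": "...", "sha1": "..."}
    (hypothetical shape, chosen for this example). Returns True
    only if every listed checksum matches.
    """
    actual = file_checksums(path)
    return all(
        actual.get(name) == value.strip().lower()
        for name, value in expected.items()
    )
```

For example, calling matches_expected with the path of a downloaded WARC file and the checksum values reported for it returns False if the file was truncated or altered in transit, in which case the file should be downloaded again.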
