Why download my data?
Downloading the archives that you build through our service allows you to:
- Access the WARC files in your collections directly and provide them to researchers
- Analyze the WARCs in order to generate custom reports and/or visualizations
- Provide local, restricted access to web archives not made publicly visible through Archive-It
- Preserve the WARC files locally at your home institution
- Preserve the WARC files using a third party system
How to download your data
The guides below describe the recommended ways to download your collection data, based on your technical resources and/or proficiency. If you need assistance with any of these, please contact one of our Web Archivists.
These guides are intended for partners with varying technical skill levels, but there are two primary skill-level groups:
- Anyone with knowledge of the way data is moved over the Internet using web browsers and who is comfortable using graphical programs. No command-line knowledge is required. You will need to download, install, and run browser plugins and third-party software on your computer. A general knowledge of WARC files is also necessary.
- Anyone in the first group who is also interested in using a Unix shell (or Cygwin on Windows) to execute programs that can download and verify data. A general but not detailed knowledge of HTTP is necessary. You should be comfortable navigating your file system from the command line. Use this guide if you are interested in writing a shell script to download your data.
Notes on terminology
WARC/ARC files are the files in which data crawled with Archive-It is stored. Each file may contain multiple digital objects, including HTML, images, and videos. (Note that collection data can consist of both WARC and ARC files, depending upon when the data was archived through our service. Throughout these guides, the term “WARC files” refers to both WARC and ARC files.)
What is a WARC file?
- WARC files are defined by ISO 28500
- The standard was created by an international body of experts in digital preservation, including people from the Internet Archive and the Library of Congress
- A WARC file is a container for web archives
- It preserves web data exactly as it was returned from the web server. This is called “native format”
- It also contains a host of relevant metadata that allows a forensic examiner to verify the integrity of all that has been captured
For detailed information on the WARC file specification, see: https://iipc.github.io/warc-specifications/.
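To make the "container" idea above more concrete, here is a minimal Python sketch that walks through the records in a WARC file and yields each record's metadata headers together with its stored content. It is an illustration only, not an Archive-It tool: the function names `open_warc` and `iter_warc_records` are hypothetical, the sketch assumes WARC (not ARC) files, and it ignores less common details of the format such as header lines folded across several lines.

```python
import gzip


def open_warc(path):
    """Open a WARC file for reading, decompressing it if the name ends in .gz."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rb")
    return open(path, "rb")


def iter_warc_records(stream):
    """Yield (headers, content) for each record in an uncompressed WARC byte stream.

    headers is a dict of the record's metadata fields (for example WARC-Type);
    content is the stored bytes, whose size is given by Content-Length.
    """
    while True:
        version = stream.readline()
        if not version:
            return  # end of file
        if not version.strip():
            continue  # skip the blank lines that separate records
        if not version.startswith(b"WARC/"):
            raise ValueError(f"unexpected line in WARC stream: {version!r}")

        # Read "Name: value" header lines until the blank line that ends them.
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()

        # The content block is exactly Content-Length bytes long.
        content = stream.read(int(headers["Content-Length"]))
        yield headers, content
```

A possible use would be `for headers, content in iter_warc_records(open_warc("example.warc.gz")): print(headers.get("WARC-Type"), len(content))`, where `example.warc.gz` is a placeholder file name. For anything beyond a quick inspection, a dedicated WARC library is likely a better choice than this sketch.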
During download, WARC file data may be lost or corrupted due to network or other system issues. So that you can verify that a downloaded file is consistent with the file located on the Archive-It download site, an MD5 checksum value is provided with each WARC file. This value is calculated by an algorithm that condenses the file's contents into a single short string of text, so a file that was changed or truncated in transit will almost certainly produce a different value. Using a tool like md5sum (on Unix or Cygwin), you can compute the checksum of each downloaded WARC file and confirm that it matches the value for the original file stored on the Archive-It download site.
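If md5sum is not available on your system, the same comparison can be sketched in a few lines of Python using only the standard library. This is a minimal illustration rather than an official Archive-It utility: the function names `md5_of_file` and `matches_published_md5` are hypothetical, and the published checksum is assumed to be a hexadecimal string that you have already copied from the download site.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the MD5 checksum of a file as a lowercase hexadecimal string.

    The file is read in chunks so that large WARC files do not need to fit
    in memory all at once.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_md5(path, published_md5):
    """Return True if the local file's MD5 equals the published value."""
    return md5_of_file(path) == published_md5.strip().lower()
```

For example, `matches_published_md5("example.warc.gz", value_from_download_site)` would return `False` for a file that was truncated during download, in which case the file should be downloaded again. Both names in that call are placeholders.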