Why download my data?
Downloading the archives that you build through our service allows you to:
- Access the WARC files in your collections directly and provide them to researchers
- Analyze the WARCs in order to generate custom reports and/or visualizations
- Provide local, restricted access to web archives not made publicly visible through Archive-It
- Preserve the WARC files locally at your home institution
- Preserve the WARC files using a third-party system
How to download your data
Credentialed users of the Archive-It web application can download their archives manually through a web browser or programmatically from the command line by using the Web Archiving Systems API (WASAPI). If you require assistance with the API or the data files, please contact one of our Web Archivists.
Audience
Anyone with general knowledge of the ways that data are moved over the Internet using web browsers and APIs can retrieve their files for download. No command-line knowledge is required for manual downloads through a web browser. Automated downloads require downloading, installing, and running third-party command-line utilities on your computer. A general knowledge of WARC files is also helpful.
Notes on terminology
WARC and ARC files
WARC files and their predecessor, ARC files, are the files in which data crawled using Archive-It are stored. Each file may contain multiple digital objects, including HTML, images, and videos. (Note that collection data can consist of both WARC and ARC files, depending upon when they were archived through our service. Throughout these guides, the term “WARC files” refers to both WARC and ARC files.)
What is a WARC file?
- WARC files are defined by ISO 28500
- The standard was created by an international body of experts in digital preservation, including people from the Internet Archive and the Library of Congress
- A WARC file is a container for web archives
- It preserves web data exactly as they were returned from the web server. This is called “native format”
- It also contains a host of relevant metadata that allows a forensic examiner to verify the integrity of all that has been captured
For detailed information on the WARC file specification, see: https://iipc.github.io/warc-specifications/.
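To make the "container" idea above concrete, here is a minimal Python sketch that reads one uncompressed WARC record from memory: a version line, named header fields (the record's metadata), a blank line, and then the captured content in native format. The sample record, its URI, and the `read_record` helper are hypothetical and exist only for illustration. Real WARC files are usually gzip-compressed and contain many records, so an established library such as warcio is likely a better choice for real work.

```python
import io

# A hand-written, hypothetical WARC record (not taken from a real collection).
SAMPLE_RECORD = (
    b"WARC/1.0\r\n"
    b"WARC-Type: resource\r\n"
    b"WARC-Target-URI: http://example.com/hello.txt\r\n"
    b"WARC-Date: 2020-01-01T00:00:00Z\r\n"
    b"WARC-Record-ID: <urn:uuid:00000000-0000-0000-0000-000000000000>\r\n"
    b"Content-Type: text/plain\r\n"
    b"Content-Length: 5\r\n"
    b"\r\n"
    b"hello"
    b"\r\n\r\n"
)


def read_record(stream):
    """Read one uncompressed WARC record from a binary stream.

    Returns (version, headers, payload). This sketch ignores gzip
    compression and folded (multi-line) header values.
    """
    version = stream.readline().rstrip(b"\r\n").decode("utf-8")
    if not version.startswith("WARC/"):
        raise ValueError("not positioned at the start of a WARC record")

    headers = {}
    while True:
        line = stream.readline()
        if line in (b"\r\n", b"\n", b""):
            break  # a blank line ends the header block
        # Split on the first colon only, so URIs and dates stay intact.
        name, _, value = line.decode("utf-8").partition(":")
        headers[name.strip()] = value.strip()

    # Content-Length gives the size of the captured content in bytes.
    payload = stream.read(int(headers["Content-Length"]))
    return version, headers, payload


version, headers, payload = read_record(io.BytesIO(SAMPLE_RECORD))
print(version, headers["WARC-Type"], headers["WARC-Target-URI"], payload)
```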
Checksums
During download, WARC file data may be lost due to network or other system issues. To verify that a downloaded file is consistent with the file located on the Archive-It download site, both md5 and sha1 checksum values are retrieved via WASAPI. These values are calculated using hash algorithms that condense a file's contents into a single short string of text, so a file that was altered or truncated in transit will almost certainly produce a different value. Using a tool such as md5sum (on Unix) or Cygwin (on Windows), you can verify that downloaded WARC files match the original files stored on the Archive-It download site.
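As an alternative to md5sum, the same check can be sketched in Python with the standard hashlib module. The function names, and the idea of passing in the expected values as plain strings, are assumptions made for illustration. How you obtain the md5 and sha1 values from WASAPI is not shown here.

```python
import hashlib


def file_checksums(path, chunk_size=1024 * 1024):
    """Return (md5_hex, sha1_hex) for the file at `path`.

    The file is read in chunks so that large WARC files do not need
    to fit in memory.
    """
    md5 = hashlib.md5()
    sha1 = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            md5.update(chunk)
            sha1.update(chunk)
    return md5.hexdigest(), sha1.hexdigest()


def matches_expected(path, expected_md5=None, expected_sha1=None):
    """Return True if the local file matches every expected checksum given.

    `expected_md5` and `expected_sha1` are the hexadecimal strings
    reported for the original file (hypothetical inputs in this sketch).
    """
    md5_hex, sha1_hex = file_checksums(path)
    if expected_md5 is not None and md5_hex != expected_md5.strip().lower():
        return False
    if expected_sha1 is not None and sha1_hex != expected_sha1.strip().lower():
        return False
    return True
```

If `matches_expected` returns False for a downloaded file, the download was probably incomplete or corrupted, and downloading the file again is a reasonable next step.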