Overview

Partners can use Archive-It's implementation of the Web Archiving Systems API (WASAPI) from a web browser or a command line terminal to find and download their WARC files and associated technical metadata. The API supports several advanced options for partners to find and download these files by collection, date and timespans, and other attributes described below.

Note: The data volumes returned by WASAPI represent the compressed file size, as reflected by the .warc.gz file extension. This is the final WARC size on disk, or the amount of space needed to store the WARC after it's downloaded.

This means the data volume reported by WASAPI will differ from the one shown in your Archive-It account. The data volumes displayed in your Archive-It account represent the uncompressed amount of data collected during crawling, including both the website content and the technical metadata added before the data is compressed and stored as WARCs.

Data entities

Each response to a WASAPI query includes a count of the total number of matched WARC files and the following information about each of them.

Attribute	Explanation	Example
account	Unique ID number for your Archive-It account	1036
checksums	Hexadecimal values for each file’s unique checksums	md5: 06000c146149788c092b37bc2583c889 sha1: e83e2e3d8348ce72512916c44ec0afa4386875bb
collection	Unique ID number for the Archive-It collection containing the file	6580
crawl	Unique ID number for the crawl job that produced the file	538557
crawl-start	Timestamp for the beginning of the crawl job that created the file	2018-01-07T17:00:09Z
crawl-time	Timestamp for the creation of the file	2018-01-07T17:00:16Z
store-time	Timestamp for the deposit of the WARC into storage	2018-01-07T19:05:21Z
filename	Name of the file in storage	ARCHIVEIT-6580-MONTHLY-JOB538557-20180107170016446-00000.warc.gz
filetype	Format of the file in storage	warc
locations	URLs where the file can be downloaded from storage. There are two locations to reflect our primary and backup locations.	https://warcs.archive-it.org/webdatafile/ARCHIVEIT-6580-MONTHLY-JOB538557-20180107170016446-00000.warc.gz https://archive.org/download/ARCHIVEIT-6580-MONTHLY-JOB538557-20180107-00000/ARCHIVEIT-6580-MONTHLY-JOB538557-20180107170016446-00000.warc.gz
size	The file’s data volume in bytes	52706601

WASAPI responds to queries with JSON objects by default. The JSON response record for https://archive.org/download/ARCHIVEIT-6580-MONTHLY-JOB538557-20180107-00000/ARCHIVEIT-6580-MONTHLY-JOB538557-20180107170016446-00000.warc.gz gives you the same information as the table above, as well as links to each of the matched WARCs.
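As a rough illustration, a one-file response might look something like the sketch below. It is reconstructed from the table and example URLs above rather than copied from a live query. The top-level "count" and "files" keys follow the names used elsewhere in this article. The nesting of "checksums" as an object is an assumption, and summarize_files is a hypothetical helper name. The helper requires jq.

```shell
# Illustrative response with one file record, rebuilt from the table values.
# Assumption: real responses may carry additional fields or differ in layout.
sample_response='{
  "count": 1,
  "files": [
    {
      "account": 1036,
      "checksums": {
        "md5": "06000c146149788c092b37bc2583c889",
        "sha1": "e83e2e3d8348ce72512916c44ec0afa4386875bb"
      },
      "collection": 6580,
      "crawl": 538557,
      "crawl-start": "2018-01-07T17:00:09Z",
      "crawl-time": "2018-01-07T17:00:16Z",
      "store-time": "2018-01-07T19:05:21Z",
      "filename": "ARCHIVEIT-6580-MONTHLY-JOB538557-20180107170016446-00000.warc.gz",
      "filetype": "warc",
      "locations": [
        "https://warcs.archive-it.org/webdatafile/ARCHIVEIT-6580-MONTHLY-JOB538557-20180107170016446-00000.warc.gz",
        "https://archive.org/download/ARCHIVEIT-6580-MONTHLY-JOB538557-20180107-00000/ARCHIVEIT-6580-MONTHLY-JOB538557-20180107170016446-00000.warc.gz"
      ],
      "size": 52706601
    }
  ]
}'

# summarize_files (hypothetical helper): read a response on stdin and print
# one tab-separated line per file: filename, size in bytes, primary location.
summarize_files() {
  jq -r '.files[] | [.filename, .size, .locations[0]] | @tsv'
}
```

Running printf '%s\n' "$sample_response" | summarize_files prints a single line for the example file. The same helper can be fed the output of a real query.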

Basic use

Querying

The first step is to query the database for the particular set of WARCs you are interested in downloading. In a web browser, use the general URL for all WASAPI queries to find all WARC files stored by your Archive-It account:

https://warcs.archive-it.org/wasapi/v1/webdata

Further parameters may then be appended to the above URL after a ? operator in order to limit or filter results by the data attributes in the table above.

For instance, to find WARC files from only the Archive-It collection in the example above:

https://warcs.archive-it.org/wasapi/v1/webdata?collection=6580

Or, for all WARC files produced by just the single crawl job:

https://warcs.archive-it.org/wasapi/v1/webdata?crawl=538557

Downloading WARCs

There are two ways to download WARCs via WASAPI: web browser or command line. While the outcome is the same, the processes are different, each including multiple steps.

Note: An Archive-It WARC is no bigger than 1GB, so a single crawl can generate multiple WARCs.

Web browser

Partners can use WASAPI in a web browser to manually download WARC files via the hyperlinked "locations" URLs in each response object. There are two locations, one for each copy of every WARC: a primary (starts with "https://warcs.archive-it.org/webdatafile") and a backup (starts with "https://archive.org/download"). We recommend downloading the primary files because backups may not yet be available for recently created WARCs.

To download a WARC manually, click the primary location hyperlink for each file returned in your query. If your file count is over 100, see the Pagination section below to ensure you download all files. Depending on the number and size of WARCs, this can be time-consuming, so for bulk downloads you may want to consider an external bulk download tool.

Command line

You can use the command line to download WARCs automatically, one at a time or in batches. To do either, first parse the locations from the responses with a JSON processor like jq, then download the files with a retrieval tool like wget. Ensure that your machine has jq and wget installed, then follow the steps below:

1. To create a list of storage locations for all WARC files from a WASAPI query:

curl -u <user>:<password> "https://warcs.archive-it.org/wasapi/v1/webdata?<your-parameters>" | jq -r '.files[].locations[0]' > url.list

2. To then download the files from their storage locations:

wget --http-user=<user> --http-password=<password> --accept txt,gz -i url.list

Use the advanced options below to further refine and specify the precise WARC files that you want to download either from the web browser or command line.
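Because each response record lists checksums, a downloaded file can be checked against them. The following is a minimal sketch, not an official Archive-It tool: verify_md5 is a hypothetical helper name, and it assumes the md5sum utility from GNU coreutils is installed (macOS ships a differently named md5 command instead).

```shell
# verify_md5 (hypothetical helper): compare a downloaded file's MD5 digest
# with the md5 value listed under "checksums" in the WASAPI response.
# Assumption: GNU coreutils md5sum is available on this machine.
verify_md5() {
  warc_file="$1"
  expected_md5="$2"
  # Reading from stdin keeps the file name out of md5sum's output.
  actual_md5="$(md5sum < "$warc_file" | cut -d ' ' -f 1)"
  if [ "$actual_md5" = "$expected_md5" ]; then
    printf 'OK       %s\n' "$warc_file"
  else
    printf 'MISMATCH %s\n' "$warc_file"
    return 1
  fi
}
```

For the example file in the table above, the call would be verify_md5 ARCHIVEIT-6580-MONTHLY-JOB538557-20180107170016446-00000.warc.gz 06000c146149788c092b37bc2583c889. A mismatch usually means the download was interrupted and should be repeated.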

Advanced use

Advanced filtering

WASAPI supports the use of multiple and repeatable filters. Archive-It partners can append further filters after & operators to define specific lists of WARC files across collections and/or by timespans.

For instance, to query files from multiple Archive-It collections at once:

https://warcs.archive-it.org/wasapi/v1/webdata?collection=6580&collection=6503

Date and time ranges

WASAPI supports querying files by date and time ranges by appending -before and/or -after to date and time attributes. These date and time entities conform to the RFC 3339 specification (YYYY-MM-DD or YYYY-MM-DDTHH:MM:SS), but may also be abbreviated (for instance, to include only YYYY-MM or YYYY).

For instance, to query only the files in two collections that were created before the year 2019:

https://warcs.archive-it.org/wasapi/v1/webdata?collection=7013&collection=12131&crawl-time-before=2019

Or, to query the files created by crawl jobs that ran in these collections between May 1 and May 15, 2019:

https://warcs.archive-it.org/wasapi/v1/webdata?collection=7013&collection=12131&crawl-start-after=2019-05-01&crawl-start-before=2019-05-15
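Query URLs with repeated and ranged filters like those above can also be assembled in a script. The following is a minimal sketch, not part of WASAPI itself: wasapi_url is a hypothetical helper name, and it assumes that filter values need no URL encoding (which holds for the numeric IDs and dates used in this article).

```shell
# Base endpoint for all WASAPI queries, as given in this article.
WASAPI_BASE="https://warcs.archive-it.org/wasapi/v1/webdata"

# wasapi_url (hypothetical helper): join key=value filters into a query URL.
# The first filter follows "?", and every later one follows "&".
wasapi_url() {
  query_url="$WASAPI_BASE"
  separator="?"
  for filter in "$@"; do
    query_url="${query_url}${separator}${filter}"
    separator="&"
  done
  printf '%s\n' "$query_url"
}

# Example (needs network access and Archive-It credentials, so it is left
# commented out):
#   curl -u "<user>:<password>" \
#     "$(wasapi_url collection=7013 collection=12131 crawl-time-before=2019)"
```

Calling wasapi_url collection=7013 collection=12131 crawl-start-after=2019-05-01 crawl-start-before=2019-05-15 reproduces the date range query shown above.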

Derivative datasets

You may also use WASAPI to find and download the web archive datasets that were derived by Archive-It Research Services (ARS). Query by filetype for files in the Web Archive Transformation (WAT), Web Archive Named Entity (WANE), or Longitudinal Graph Analysis (LGA) formats:

https://warcs.archive-it.org/wasapi/v1/webdata?filetype=lga

The Archives Research Compute Hub (ARCH) superseded ARS in July 2024.

Pagination

WASAPI queries can match and return a great many results, enough to crash a web browser or an automated download process, so responses are paginated (effectively capped). By default, any query like the examples above will return at most 100 results per page.

Every WASAPI query first returns a "count" value that indicates how many WARC files match the query parameters and a "next" location that indicates where to find the next page of results.

Queries that match more than 100 WARC files may be appended with a higher page_size value and/or “resumed” over multiple pages.

For instance, the example query above matches more than 100 files, and so includes a “next” URL for retrieving the 101st through 200th matches:

https://warcs.archive-it.org/wasapi/v1/webdata?collection=6580&page=2

To instead query WASAPI for a single page with all of the matching WARC files:

https://warcs.archive-it.org/wasapi/v1/webdata?collection=6580&page_size=118 [or higher]
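On the command line, paging can be automated by following each "next" location in turn. The following is a minimal sketch under stated assumptions: list_all_locations and fetch_page are hypothetical helper names, WASAPI_USER and WASAPI_PASSWORD are hypothetical environment variables, and the last page is assumed to report a null or missing "next" value. It requires curl and jq.

```shell
# fetch_page (hypothetical helper): download one page of results with curl.
# Assumption: credentials are supplied in WASAPI_USER and WASAPI_PASSWORD.
fetch_page() {
  curl -s -u "${WASAPI_USER}:${WASAPI_PASSWORD}" "$1"
}

# list_all_locations (hypothetical helper): starting from the given query
# URL, print the primary location of every matched file, then follow the
# "next" value until the response no longer provides one.
list_all_locations() {
  page_url="$1"
  while [ -n "$page_url" ]; do
    response="$(fetch_page "$page_url")" || return 1
    printf '%s\n' "$response" | jq -r '.files[].locations[0]'
    # "// empty" yields no output when "next" is null or absent.
    page_url="$(printf '%s\n' "$response" | jq -r '.next // empty')"
  done
}
```

For example, list_all_locations "https://warcs.archive-it.org/wasapi/v1/webdata?collection=6580" > url.list would write every page's primary locations into one list that wget can then read with -i.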

Troubleshooting

Decompression

Some partners have reported receiving error messages, or erroneously creating new files with the .warc.open extension, while attempting to decompress their Archive-It WARC files locally. Whether this happens depends on the type and version of the operating system and local archiving utility.

Archive-It compresses WARC files in the GZIP format recommended by the WARC standard specification. It is not necessary to decompress these files in order to validate, move, index, or replay them. If you must decompress them, you may use the following command for built-in GZIP utilities on macOS and Linux or GNU Gzip:

gunzip <warc-filename>.warc.gz

Or to decompress all WARCs in the current directory:

gunzip *.warc.gz
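If the goal is only to confirm that downloaded files are intact, the compressed stream can be tested without decompressing it to disk. The following is a minimal sketch: check_warcs is a hypothetical helper name, and the -t (test) option of gzip checks only the integrity of the GZIP compression, not the validity of the WARC records inside.

```shell
# check_warcs (hypothetical helper): test each given .warc.gz file with
# `gzip -t`, which reads the compressed stream and reports corruption
# without writing a decompressed copy to disk.
check_warcs() {
  check_status=0
  for warc in "$@"; do
    if gzip -t "$warc" 2>/dev/null; then
      printf 'OK      %s\n' "$warc"
    else
      printf 'CORRUPT %s\n' "$warc"
      check_status=1
    fi
  done
  return "$check_status"
}
```

Running check_warcs *.warc.gz in the download directory prints one line per file and returns a non-zero status if any file fails the test.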

More information

For more information about development and the data entities and querying options described in this article, see the WASAPI community's repository of specifications, documentation, and reports at https://github.com/WASAPI-Community/data-transfer-apis