Partners may use Archive-It’s implementation of the Web Archiving Systems API (WASAPI) from a web browser or a command line terminal in order to find and download their WARC files and associated technical metadata. The API supports several advanced options for partners to find and download these files by collection, date and timespans, and other attributes described below.
Each response to a WASAPI query includes a count of the total number of matched WARC files and the following information about each of them.
|account||Unique ID number for your Archive-It account||
|checksums||Hexadecimal values for each file’s unique checksums||
|collection||Unique ID number for the Archive-It collection containing the file||6580|
|crawl||Unique ID number for the crawl job that produced the file||
|crawl-start||Timestamp for the beginning of the crawl job that created the file||
|crawl-time||Timestamp for the creation of the file||2018-01-07T17:00:16Z|
|store-time||Timestamp for the deposit of the WARC into storage||
|filename||Name of the file in storage||ARCHIVEIT-6580-MONTHLY-JOB538557-20180107170016446-00000.warc.gz|
|filetype||Format of the file in storage||
|locations||URLs where the file can be downloaded from storage||
|size||The file’s data volume in bytes||52706601|
WASAPI responds to queries with JSON objects by default, so the response record for the example WARC file referenced above appears as:
To find all WARC files stored by your Archive-It account, use the general URL for all WASAPI queries:
https://warcs.archive-it.org/wasapi/v1/webdata
The above URL may then be appended with further parameters after a ? operator in order to limit or filter results by the data attributes in the table above.
For instance, to find WARC files from only the Archive-It collection in the example above:
Or, for all WARC files produced by just the single crawl job:
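As a minimal sketch, the two single-filter queries above can be assembled in the shell from the base endpoint used in this guide's curl command. The collection ID 6580 comes from the example filename. The crawl ID 538557 is an assumption, read from the JOB538557 segment of that filename, and the variable names are illustrative only.

```shell
# Base endpoint for all WASAPI queries (as used in this guide's curl command).
base="https://warcs.archive-it.org/wasapi/v1/webdata"

# Collection ID taken from the example filename (ARCHIVEIT-6580-...).
collection_url="${base}?collection=6580"

# Assumption: crawl ID 538557 is inferred from the JOB538557 part of the filename.
crawl_url="${base}?crawl=538557"

# Print both query URLs, one per line.
printf '%s\n' "$collection_url" "$crawl_url"
```

When passing such a URL to curl or wget, keep it in double quotes, since the shell otherwise treats ? and & as special characters.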
Partners may use WASAPI manually in a web browser in order to download WARC files via the hyperlinked "locations" URLs in each response object:
In order to perform downloads automatically and/or in batches from the command line, it is necessary to first parse these locations from the responses with a JSON processor like jq, then to download with a retrieval tool like wget.
1. To create a list of storage locations for all WARC files from a WASAPI query:
$ curl -u <user>:<password> "https://warcs.archive-it.org/wasapi/v1/webdata?<your-parameters>" | jq -r '.files[].locations[0]' > url.list
2. To then download the files from their storage locations:
$ wget --http-user=<user> --http-password=<password> --accept txt,gz -i url.list
Use the advanced options below to further refine and specify the precise WARC files that you wish to download either from the web browser or command line.
WASAPI supports the use of multiple and repeatable filters. Archive-It partners may append further filters after & operators in order to define more specific lists of WARC files across collections and/or by timespans.
For instance, to query files from multiple Archive-It collections at once:
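One way to sketch a repeated filter is a small shell function that joins any number of key=value pairs with &. The function name wasapi_url is hypothetical. Collection 6580 is this guide's example, and 1234 is a made-up placeholder standing in for a second collection ID.

```shell
base="https://warcs.archive-it.org/wasapi/v1/webdata"

# Hypothetical helper: join any number of key=value filters with "&"
# and append them to the base endpoint after "?".
wasapi_url() {
  query=""
  for filter in "$@"; do
    query="${query}${query:+&}${filter}"
  done
  printf '%s?%s\n' "$base" "$query"
}

# 6580 is the example collection; 1234 is a placeholder, not a real ID.
wasapi_url collection=6580 collection=1234
```

Repeating the same parameter name, as with collection here, is what lets a single query span several collections.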
Date and time ranges
WASAPI supports querying files by date and time ranges by appending -before and/or -after to date and time attributes. These date and time entities conform to the RFC 3339 specification (YYYY-MM-DD or YYYY-MM-DDTHH:MM:SS), but may also be abbreviated (for instance to include only YYYY-MM or YYYY).
For instance, to query only the files in two collections that were created before the year 2019:
Or, to query the files created by crawl jobs that ran in these collections between May 1 and May 15, 2019:
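The two date-range queries above might be written as follows. This is a sketch that assumes the -before and -after suffixes attach to the crawl-time and crawl-start attributes from the table. Collection 1234 is a made-up placeholder for the second collection, and the variable names are illustrative.

```shell
base="https://warcs.archive-it.org/wasapi/v1/webdata"

# Two collections: 6580 is the example from this guide, 1234 is a placeholder.
collections="collection=6580&collection=1234"

# Files created before 2019: crawl-time plus -before, abbreviated to a year.
before_2019="${base}?${collections}&crawl-time-before=2019"

# Files from crawl jobs that began between May 1 and May 15, 2019.
may_window="${base}?${collections}&crawl-start-after=2019-05-01&crawl-start-before=2019-05-15"

printf '%s\n' "$before_2019" "$may_window"
```

Whether the boundary dates themselves are included is not stated here, so it is worth checking the "count" value of a response before relying on an exact cutoff.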
Derivative data jobs
WASAPI may be used additionally to request, monitor, and download the results of WARC data derivations, such as to the file formats supported by Archive-It Research Services (ARS). For instructions see: Request and download web archive derivatives with WASAPI.
WASAPI queries can potentially match and return a great many responses, enough to crash a web browser or automated download process, so they are paginated (capped, effectively). By default, any query like the examples above will return at most 100 results per page.
Every WASAPI query will first return a "count" value that indicates how many WARC files match the query parameters and a "next" location that indicates where to find the next page of responses:
Queries that match more than 100 WARC files may be appended with a higher page_size value (up to 2000) and/or “resumed” over multiple pages.
For instance, the example query above matches more than 100 files, and so includes a “next” URL for retrieving the 101st through 200th matches:
To instead query WASAPI for a single page with all of the matching WARC files:
https://warcs.archive-it.org/wasapi/v1/webdata?collection=6580&page_size=118 [or higher]
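For result sets too large for one page, a script can instead follow each response's "next" URL until none remains. The sketch below assumes that curl and jq are installed, that credentials are held in the hypothetical environment variables WASAPI_USER and WASAPI_PASS, and that the final page reports "next" as null. Both function names are illustrative.

```shell
base="https://warcs.archive-it.org/wasapi/v1/webdata"

# Hypothetical helper: build a first-page URL with an explicit page_size.
first_page_url() {
  printf '%s?%s&page_size=%s\n' "$base" "$1" "$2"
}

# Hypothetical helper: print the first download location of every matched
# file, following each page's "next" URL until it is null.
# Assumes WASAPI_USER and WASAPI_PASS are set in the environment.
list_all_locations() {
  url="$1"
  while [ -n "$url" ] && [ "$url" != "null" ]; do
    page=$(curl -s -u "${WASAPI_USER}:${WASAPI_PASS}" "$url") || return 1
    printf '%s\n' "$page" | jq -r '.files[].locations[0]'
    url=$(printf '%s\n' "$page" | jq -r '.next')
  done
}

# Example call (not run here):
#   list_all_locations "$(first_page_url collection=6580 500)" > url.list
first_page_url collection=6580 500
```

The resulting url.list can then be passed to wget with -i in the same way as for a single-page query.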
For more information about development and the data entities and querying options described above, see the WASAPI community’s repository of specifications, documentation, and reports at: https://github.com/WASAPI-Community/data-transfer-apis