Partners may use Archive-It’s implementation of the Web Archiving Systems API (WASAPI) from a web browser or a command line terminal to find and download their WARC files and associated technical metadata. The API supports several advanced options for partners to find and download these files by collection, date and timespans, and other attributes described below.
Data entities
Each response to a WASAPI query includes a count of the total number of matched WARC files and the following information about each of them.
Attribute | Explanation | Example |
account | Unique ID number for your Archive-It account | 1036 |
checksums | Hexadecimal values for each file’s unique checksums | md5: 06000c146149788c092b37bc2583c889 sha1: e83e2e3d8348ce72512916c44ec0afa4386875bb |
collection | Unique ID number for the Archive-It collection containing the file | 6580 |
crawl | Unique ID number for the crawl job that produced the file | 538557 |
crawl-start | Timestamp for the beginning of the crawl job that created the file | 2018-01-07T17:00:09Z |
crawl-time | Timestamp for the creation of the file | 2018-01-07T17:00:16Z |
store-time | Timestamp for the deposit of the WARC into storage | 2018-01-07T19:05:21Z |
filename | Name of the file in storage | ARCHIVEIT-6580-MONTHLY-JOB538557-20180107170016446-00000.warc.gz |
filetype | Format of the file in storage | warc |
locations | URLs where the file can be downloaded from storage. There are two, reflecting our primary and backup storage locations. | https://warcs.archive-it.org/webdatafile/ARCHIVEIT-6580-MONTHLY-JOB538557-20180107170016446-00000.warc.gz https://archive.org/download/ARCHIVEIT-6580-MONTHLY-JOB538557-20180107-00000/ARCHIVEIT-6580-MONTHLY-JOB538557-20180107170016446-00000.warc.gz |
size | The file’s data volume in bytes | 52706601 |
WASAPI responds to queries with JSON objects by default, so the response record for https://archive.org/download/ARCHIVEIT-6580-MONTHLY-JOB538557-20180107-00000/ARCHIVEIT-6580-MONTHLY-JOB538557-20180107170016446-00000.warc.gz would appear in the browser as:
This JSON response gives the same information as the table above, including links to each of the matched WARCs.
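For partners who would rather read a response in a script than in the browser, the following is a minimal Python sketch. The sample response is assembled from the example values in the table above. Its layout is an assumption rather than a copy of a live response: the top-level "count" and "files" keys are inferred from the pagination notes and the jq command in this article, and the numeric types are a guess. The summarize function is a hypothetical helper.

```python
import json

# Sample response assembled from the example values in the table above.
# ASSUMPTION: the key layout and value types are inferred, not copied
# from a live WASAPI response.
SAMPLE_RESPONSE = """
{
  "count": 1,
  "files": [
    {
      "account": 1036,
      "checksums": {
        "md5": "06000c146149788c092b37bc2583c889",
        "sha1": "e83e2e3d8348ce72512916c44ec0afa4386875bb"
      },
      "collection": 6580,
      "crawl": 538557,
      "crawl-start": "2018-01-07T17:00:09Z",
      "crawl-time": "2018-01-07T17:00:16Z",
      "store-time": "2018-01-07T19:05:21Z",
      "filename": "ARCHIVEIT-6580-MONTHLY-JOB538557-20180107170016446-00000.warc.gz",
      "filetype": "warc",
      "locations": [
        "https://warcs.archive-it.org/webdatafile/ARCHIVEIT-6580-MONTHLY-JOB538557-20180107170016446-00000.warc.gz",
        "https://archive.org/download/ARCHIVEIT-6580-MONTHLY-JOB538557-20180107-00000/ARCHIVEIT-6580-MONTHLY-JOB538557-20180107170016446-00000.warc.gz"
      ],
      "size": 52706601
    }
  ]
}
"""


def summarize(response_text):
    """Return (filename, size in MiB, primary location) for each file."""
    response = json.loads(response_text)
    return [
        (f["filename"], round(f["size"] / 2**20, 1), f["locations"][0])
        for f in response["files"]
    ]


for name, mib, url in summarize(SAMPLE_RESPONSE):
    print(f"{name}: {mib} MiB, primary copy at {url}")
```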
Basic use
Querying
The first step is to query the database for the particular set of WARCs you are interested in downloading. In a web browser, use the general URL for all WASAPI queries to find all WARC files stored by your Archive-It account:
https://warcs.archive-it.org/wasapi/v1/webdata
Further parameters may then be appended to the above URL after a ? separator in order to limit or filter results by the data attributes in the table above.
For instance, to find WARC files from only the Archive-It collection in the example above:
https://warcs.archive-it.org/wasapi/v1/webdata?collection=6580
Or, to find all WARC files produced by a single crawl job:
https://warcs.archive-it.org/wasapi/v1/webdata?crawl=538557
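The same query URLs can be assembled in a script rather than typed by hand. The following is a minimal Python sketch using only the standard library; build_query is a hypothetical helper name, not part of WASAPI.

```python
from urllib.parse import urlencode

BASE_URL = "https://warcs.archive-it.org/wasapi/v1/webdata"


def build_query(**filters):
    """Append the given filters to the base URL after a '?' separator."""
    if not filters:
        return BASE_URL
    return BASE_URL + "?" + urlencode(filters)


print(build_query())                   # all WARC files for the account
print(build_query(collection=6580))    # one collection
print(build_query(crawl=538557))       # one crawl job
```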
Downloading
An Archive-It WARC is no bigger than 1 GB, so a crawl of even one seed can generate multiple WARCs. There are two ways to download WARCs via WASAPI: from a web browser or from the command line. The outcome is the same, but the processes differ, and each involves multiple steps.
Web browser
Partners may use WASAPI in a web browser to manually download WARC files via the hyperlinked "locations" URLs in each response object. There are two locations to reflect the two copies of every WARC: a primary and a backup. Either location can be used, but each link must be clicked individually to download the specific WARC (please see the notes on pagination below). Depending on the number of WARCs, this can be time-consuming, so partners looking to download WARCs in bulk may consider using an external bulk download tool.
In order to download all of the WARCs in a specific collection, use the URL https://warcs.archive-it.org/wasapi/v1/webdata?collection=6580. Then, either click on each link to download each WARC individually, or use an external bulk download tool.
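However a file is downloaded, the "checksums" values in its response record offer a way to confirm that the local copy is intact. The following is a minimal Python sketch of that check. It assumes the checksums are available as a dictionary keyed by algorithm name ("md5", "sha1"), as the table above suggests; file_digest and matches_record are hypothetical helper names.

```python
import hashlib


def file_digest(path, algorithm="md5", chunk_size=1024 * 1024):
    """Hash a file in chunks so that a WARC of up to 1 GB need not fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_record(path, checksums):
    """Compare a local file against every checksum listed for it.

    ASSUMPTION: `checksums` looks like {"md5": "...", "sha1": "..."}.
    """
    return all(
        file_digest(path, algorithm) == expected.lower()
        for algorithm, expected in checksums.items()
    )
```

For example, matches_record("ARCHIVEIT-6580-MONTHLY-JOB538557-20180107170016446-00000.warc.gz", record["checksums"]) would return True only if both the md5 and sha1 values match the downloaded file.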
Command line
The command line can be used to download WARCs automatically and/or in batches. Either approach requires first parsing the locations from the responses with a JSON processor like jq, then downloading them with a retrieval tool like wget. Ensure that your machine has curl, jq, and wget installed, and then follow the steps below:
1. To create a list of storage locations for all WARC files from a WASAPI query:
curl -u <user>:<password> "https://warcs.archive-it.org/wasapi/v1/webdata?<your-parameters>" | jq -r '.files[].locations[0]' > url.list
2. To then download the files from their storage locations:
wget --http-user=<user> --http-password=<password> --accept txt,gz -i url.list
Use the advanced options below to further refine and specify the precise WARC files that you wish to download either from the web browser or command line.
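The two command line steps above can also be sketched in Python for partners who prefer a single script. This is a rough equivalent under stated assumptions, not an official client: it assumes HTTP Basic authentication (suggested by the curl -u and wget --http-user options above) and a response with a "files" list (as the jq filter implies). All function names are hypothetical.

```python
import base64
import shutil
import urllib.request


def first_locations(response):
    """Mirror `jq -r '.files[].locations[0]'`: the primary URL of each file."""
    return [record["locations"][0] for record in response["files"]]


def basic_auth_header(user, password):
    """Build an HTTP Basic Authorization header (ASSUMED auth scheme)."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return {"Authorization": "Basic " + token}


def download(url, user, password, dest=None):
    """Stream one file to disk, naming it after the last URL segment by default."""
    dest = dest or url.rsplit("/", 1)[-1]
    request = urllib.request.Request(url, headers=basic_auth_header(user, password))
    with urllib.request.urlopen(request) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest
```

Calling download(url, user, password) for each URL returned by first_locations performs real network requests, so it is worth trying on a small query first.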
Advanced use
Advanced filtering
WASAPI supports the use of multiple and repeatable filters. Archive-It partners may add further filters after & separators in order to define specific lists of WARC files across collections and/or by timespans.
For instance, to query files from multiple Archive-It collections at once:
https://warcs.archive-it.org/wasapi/v1/webdata?collection=6580&collection=6503
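A repeated filter can be generated in a script by passing a list of values for one parameter name. The following is a minimal Python sketch using the standard library's urlencode with doseq=True, which emits one name=value pair per list item; build_query is a hypothetical helper.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

BASE_URL = "https://warcs.archive-it.org/wasapi/v1/webdata"


def build_query(filters):
    """filters maps each parameter name to a single value or a list of values."""
    return BASE_URL + "?" + urlencode(filters, doseq=True)


url = build_query({"collection": [6580, 6503]})
print(url)

# Reading the URL back confirms that both collection values are present.
print(parse_qs(urlsplit(url).query))
```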
Date and time ranges
WASAPI supports querying files by date and time ranges by appending -before and/or -after to date and time attributes. These date and time values conform to the RFC 3339 specification (YYYY-MM-DD or YYYY-MM-DDTHH:MM:SS), but may also be abbreviated (for instance, to include only YYYY-MM or YYYY).
For instance, to query only the files in two collections that were created before the year 2019:
https://warcs.archive-it.org/wasapi/v1/webdata?collection=7013&collection=12131&crawl-time-before=2019
Or, to query the files created by crawl jobs that ran in these collections between May 1 and May 15, 2019:
https://warcs.archive-it.org/wasapi/v1/webdata?collection=7013&collection=12131&crawl-start-after=2019-05-01&crawl-start-before=2019-05-15
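The range query above can be built from date objects rather than hand-typed strings, which helps avoid formatting slips. The following is a minimal Python sketch; date_range_filters is a hypothetical helper that simply appends the -after and -before suffixes described in this section.

```python
from datetime import date
from urllib.parse import urlencode

BASE_URL = "https://warcs.archive-it.org/wasapi/v1/webdata"


def date_range_filters(attribute, after=None, before=None):
    """Return (name, value) pairs such as ("crawl-start-after", "2019-05-01")."""
    pairs = []
    if after is not None:
        pairs.append((f"{attribute}-after", after.isoformat()))
    if before is not None:
        pairs.append((f"{attribute}-before", before.isoformat()))
    return pairs


filters = [("collection", 7013), ("collection", 12131)]
filters += date_range_filters(
    "crawl-start", after=date(2019, 5, 1), before=date(2019, 5, 15)
)
url = BASE_URL + "?" + urlencode(filters)
print(url)
```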
Derivative datasets
You may also use WASAPI to find and download the web archive datasets that were derived by Archive-It Research Services (ARS). Query by filetype for files in the Web Archive Transformation (WAT), Web Archive Named Entity (WANE), or Longitudinal Graph Analysis (LGA) formats:
https://warcs.archive-it.org/wasapi/v1/webdata?filetype=lga
The Archives Research Compute Hub (ARCH) superseded ARS in July 2024.
Pagination
WASAPI queries can potentially match and return a great many responses, enough to crash a web browser or automated download process, so they are paginated (capped, effectively). By default, any query like the examples above will return at most 100 results per page.
Every WASAPI query will first return a "count" value that indicates how many WARC files match the query parameters and a "next" location that indicates where to find the next page of responses:
Queries that match more than 100 WARC files may either set a higher page_size value or be “resumed” over multiple pages.
For instance, the example query above matches more than 100 files, and so includes a “next” URL for retrieving the 101st through 200th matches:
https://warcs.archive-it.org/wasapi/v1/webdata?collection=6580&page=2
To instead query WASAPI for a single page with all of the matching WARC files:
https://warcs.archive-it.org/wasapi/v1/webdata?collection=6580&page_size=118 [or higher]
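A script can also walk through the pages one at a time by following each "next" URL. The following is a minimal Python sketch of that loop. It assumes that the last page carries a null or missing "next" value, which this article does not state. The fetch argument is a hypothetical callable that returns the parsed JSON for a URL; here it is replaced by canned pages with made-up filenames so that the example runs without network access.

```python
def iter_files(first_url, fetch):
    """Yield every file record, following "next" links until none remains.

    ASSUMPTION: the final page has "next" set to null or omitted.
    """
    url = first_url
    while url:
        page = fetch(url)
        yield from page["files"]
        url = page.get("next")


FIRST = "https://warcs.archive-it.org/wasapi/v1/webdata?collection=6580"
SECOND = FIRST + "&page=2"

# Canned stand-in for real HTTP responses; the filenames are invented.
FAKE_PAGES = {
    FIRST: {
        "count": 3,
        "next": SECOND,
        "files": [{"filename": "a.warc.gz"}, {"filename": "b.warc.gz"}],
    },
    SECOND: {"count": 3, "next": None, "files": [{"filename": "c.warc.gz"}]},
}

names = [record["filename"] for record in iter_files(FIRST, FAKE_PAGES.__getitem__)]
print(names)
```

In real use, fetch would perform an authenticated request and parse the JSON body. Following "next" keeps each response small, whereas a large page_size returns everything in one response.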
More information
For more information about development and the data entities and querying options described above, see the WASAPI community’s repository of specifications, documentation, and reports at: https://github.com/WASAPI-Community/data-transfer-apis