Archive-It Research Services (ARS) expands the ways that Archive-It partners can enable access to their archives by deriving additional and specialized datasets from their collections. The add-on service enables partners to provide datasets that contain the key metadata elements, link graphs, named entities, and/or other data derived from the resources within their collections. By supporting access in aggregate to partner archives, ARS facilitates new types of use, research, and analysis of the significant historical records from the web that Archive-It partners collect, preserve, and make accessible.
The supporting documentation below describes the types of datasets currently available, how to request and retrieve them, and example use cases and types of analyses that these datasets enable.
ARS currently provides the following types of derivative web archive datasets. Follow the links for more information about each, including use cases:
- Web Archive Transformation (WAT): Key WARC file metadata header elements from every crawled resource in a web archive collection.
- Web Archive Named Entities (WANE): People, places, and organizations mentioned in each timestamped URI within a collection.
- Longitudinal Graph Analysis (LGA): Complete and timestamped list of URIs' links to other URIs within an entire collection.
How to get ARS datasets
ARS dataset derivation is offered as an add-on subscription for current Archive-It partner institutions and is also available to independent researchers and unaffiliated patrons upon request. Partners may use WASAPI or the "ARS" section of the Archive-It web application to request an initial dataset from their web archive collection of choice at no cost.
After the initial dataset, subscription costs for regular ARS cover the machine, engineering, and labor expenses of generating, storing, and delivering the datasets. Like existing web archiving partnerships, ARS subscriptions renew annually and include Archive-It storage and preservation. There is no limit on the number of collections that may be included for data derivation in an ARS subscription.
Independent researchers interested in a combined Archive-It account for web archiving that includes the creation of ARS datasets are encouraged to contact us.
Once the service has begun, WATs and WANEs take approximately one to two weeks to first become available for download from an Archive-It collection. Subscribers will be emailed when WATs and/or WANEs first become available, with directions for accessing them directly. Thereafter, they are generated in parallel with WARC files for the term of the subscription. LGA files take four weeks to first become available and are generated quarterly thereafter. Subscribers will be emailed when each quarterly LGA dataset is available. Monthly derivation of LGA datasets is possible at an additional cost.
How to use ARS datasets
As structured JSON files, ARS datasets may be parsed and analyzed by myriad programs and scripts. For examples and to practice accessing and analyzing ARS datasets, follow the ARS Workshop here.
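As a rough illustration of the kind of script that can parse these files, the Python sketch below tallies named entities from WANE-style records. It assumes one JSON object per line and uses illustrative field names ("url", "timestamp", "named_entities") and made-up sample records; check the actual layout of your own dataset before relying on these names.

```python
import json
from collections import Counter


def count_entities(lines, entity_key="named_entities"):
    """Tally entity mentions across JSON-lines records.

    Assumption: each non-blank line is one JSON object, and the value under
    `entity_key` maps an entity type (e.g. "persons") to a list of names.
    These field names are illustrative, not a confirmed schema.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        for entity_type, names in record.get(entity_key, {}).items():
            for name in names:
                counts[(entity_type, name)] += 1
    return counts


# Hypothetical sample records standing in for lines read from a dataset file.
sample = [
    '{"url": "http://example.org/", "timestamp": "20140101000000", '
    '"named_entities": {"persons": ["Ada Lovelace"], '
    '"organizations": ["Example Org"], "locations": []}}',
    '{"url": "http://example.org/about", "timestamp": "20140102000000", '
    '"named_entities": {"persons": ["Ada Lovelace"], '
    '"organizations": [], "locations": ["London"]}}',
]

if __name__ == "__main__":
    for (entity_type, name), n in count_entities(sample).most_common():
        print(entity_type, name, n)
```

To run the same tally over a downloaded file, pass an open file handle instead of the sample list, since the function only needs an iterable of lines.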