Derivative datasets are specific types of files built from the metadata, provenance information, entities, links, and other key elements of archived web resources. The benefit of working with derivative datasets is that they are much smaller than Web Archive (WARC) files, because they contain only specific pieces of metadata and not the archived web content itself. Currently, we offer the ability to request and download WAT, WANE, and LGA files.
WAT
WAT stands for Web Archive Transformation. WAT files are composed of key metadata such as provenance/capture information, essential text and link data, and other information. They are extracted from WARCs for every resource; because WAT files map one-to-one to WARC files, each of a collection's WARC files has a corresponding WAT file. WAT files format this metadata as JavaScript Object Notation (JSON). WATs are around 5%-20% the size of their corresponding WARCs.
LGA
Longitudinal Graph Analysis (LGA) files are archival web graph files that include a complete list of which URIs link to which URIs, along with a timestamp, from a collection's origin through the present. They are ~1% the size of a collection's aggregate WARC files and are delivered as a ZIP container of two files:
- ID-Map:
  - Contains one line per URL in a collection and assigns a UID (unique identifier) to each URL.
  - Each line contains a JSON object with three fields: the URL's UID ("id"), the URL ("url"), and the URL in SURT form ("surt_url").
- ID-Graph:
  - Each line contains a JSON object with three fields: the URL's UID ("id"), the timestamp associated with the capture of this URL ("timestamp"), and the set of UIDs of the URLs linked to by this URL at that timestamp ("outlink_ids").
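As a rough illustration of how the two LGA files fit together, the following Python sketch resolves the UIDs in ID-Graph records back to URLs using the ID-Map. The sample lines are hypothetical and only mirror the field names described above; the integer UIDs, the 14-digit timestamp format, and the function names are assumptions made for this example, so check them against your own files.

```python
import json

# Hypothetical sample lines shaped like the ID-Map and ID-Graph files.
# Real UIDs may be strings, and real timestamps may use a different format.
ID_MAP_LINES = [
    '{"id": 1, "url": "http://example.org/", "surt_url": "org,example)/"}',
    '{"id": 2, "url": "http://example.org/about", "surt_url": "org,example)/about"}',
    '{"id": 3, "url": "http://example.com/", "surt_url": "com,example)/"}',
]
ID_GRAPH_LINES = [
    '{"id": 1, "timestamp": "20150101000000", "outlink_ids": [2, 3]}',
    '{"id": 2, "timestamp": "20150102000000", "outlink_ids": [1]}',
]


def load_id_map(lines):
    """Return a dict mapping each UID to its URL."""
    id_to_url = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        id_to_url[record["id"]] = record["url"]
    return id_to_url


def resolve_edges(graph_lines, id_to_url):
    """Yield (timestamp, source_url, target_url) for every link in the graph."""
    for line in graph_lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        source_url = id_to_url.get(record["id"])
        for target_id in record["outlink_ids"]:
            # .get() returns None for a UID that is missing from the ID-Map
            yield record["timestamp"], source_url, id_to_url.get(target_id)


id_to_url = load_id_map(ID_MAP_LINES)
edges = list(resolve_edges(ID_GRAPH_LINES, id_to_url))
for edge in edges:
    print(edge)
```

With real data, the lists of sample lines would be replaced by the open file handles for the two files extracted from the ZIP container, since both functions only need an iterable of lines.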
WANE
Web Archive Named Entities (WANE) files use named-entity recognition tools to generate a list of all the people, places, and organizations mentioned in each URI in a web archive, with a timestamp of when the URI was captured. The purpose is to link people, places, and organizations to time. A WANE dataset is generated using the Stanford Named Entity Recognizer software (http://nlp.stanford.edu/software/CRF-NER.shtml) to extract named entities from each textual resource in a collection. The analyzer uses an English 3-class model to extract names that correspond to recognized Persons, Organizations, and Locations. WANE files are less than 1% the size of their corresponding WARC files and are structured as one JSON object per line with four fields: the URL ("url"), the timestamp ("timestamp"), the content digest ("digest"), and the named entities ("named_entities"), which contains the arrays "persons", "organizations", and "locations".
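To show how the one-object-per-line structure of a WANE file can be read, here is a minimal Python sketch that counts how often each entity of a given type appears. The sample records, the digest placeholders, the timestamp format, and the function name count_entities are hypothetical; only the field names come from the description above.

```python
import json
from collections import Counter

# Hypothetical sample lines using the field names described above.
# The digest values are placeholders, not real content digests.
WANE_LINES = [
    '{"url": "http://example.org/", "timestamp": "20150101000000", '
    '"digest": "sha1:EXAMPLE1", "named_entities": {"persons": ["Ada Lovelace"], '
    '"organizations": ["Example Org"], "locations": ["London"]}}',
    '{"url": "http://example.org/news", "timestamp": "20150102000000", '
    '"digest": "sha1:EXAMPLE2", "named_entities": {"persons": [], '
    '"organizations": ["Example Org"], "locations": ["London", "Paris"]}}',
]


def count_entities(lines, kind):
    """Count mentions of one entity type: "persons", "organizations", or "locations"."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # Tolerate records that lack the entity block or the requested array
        counts.update(record.get("named_entities", {}).get(kind, []))
    return counts


location_counts = count_entities(WANE_LINES, "locations")
print(location_counts.most_common(2))
```

Because each record also carries a timestamp, the same loop could be extended to group counts by capture date, which is one way to relate entities to time.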