Web Archive Named Entities (WANE) files contain the named entities from each text resource in a web archive collection, organized by originating URL and timestamp. They enable researchers to track the people, places, and organizations mentioned in a collection over time. Extracting named entities from a web archive collection also enhances opportunities for discovery and sharing through linked data.
Use cases
Named entities in the Human Rights collection
For the example above, WANE data was derived from one month (October 2014) of crawls in Columbia University's Human Rights collection, representing more than 300 GB of web data.
Top entities in the Ferguson collection
WANE data was generated from four months of crawls in the Internet Archive's collaborative collection of URLs related to events in Ferguson, MO. The top ten person names, among over 650,000 total in the collection, highlighted both expected and unexpected results.
Top entities in the Chicago Architecture Biennial collection
Web archivist Karl Blumenthal used WANE data to determine the most discussed designers in press and social media coverage of the first Chicago Architecture Biennial. In this blog post he describes how the results were achieved and what counter-narratives they offer to contemporaneous press coverage.
Technical details
The WANE dataset encodes entities from any textual document in a collection with a 200 HTTP response code. WANE datasets are generated using a Hadoop-based production pipeline that includes the Stanford Named Entity Recognizer (NER), Apache Pig, Java, and Python scripts. Downloaded WANE datasets map one-to-one to the original W/ARC files and are similarly packed as concatenated, compressed records. 0-byte WANE files are therefore possible when a corresponding W/ARC file has no recognizable named entities.
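As a rough illustration of reading such a file, the following Python sketch iterates over the records in a downloaded WANE file. It rests on two assumptions that are not stated above: that the records are gzip-compressed (as is typical for W/ARC files), and that each record is a single JSON object on its own line. The function name iter_wane_records is hypothetical.

```python
import gzip
import json


def iter_wane_records(source):
    """Yield one dict per record from a WANE file.

    `source` may be a file path or a binary file object.
    Assumption: records are concatenated gzip members, each holding
    one JSON object per line. Python's gzip module reads concatenated
    members as a single continuous stream.
    """
    with gzip.open(source, "rt", encoding="utf-8") as stream:
        for line in stream:
            line = line.strip()
            if line:  # skip blank lines between records
                yield json.loads(line)
```

Under these assumptions, a 0-byte WANE file simply yields no records rather than raising an error.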
An example WANE record, corresponding to this archived webpage:
{"url":"http://dissonantwinstonsmith.wordpress.com/2014/08/24/im-sick-of/?like_comment=79&_wpnonce=0fc57aa499&replytocom=93","timestamp":"20141019212346","named_entities":{"locations":["North County","America","St. Louis County St. Louis County Police St. Louis County","St. Louis","WordPress.com","Middle East"],"organizations":["Dissonant Winston Smith Dissonant Winston Smith Menu Skip","Twitter Facebook Google","Google","Facebook","Wal-Mart","CNN","Bearcats"],"persons":["Stell","Tom Jackson","Smith","Pamela Fillingim","Darren Wilson Eric Fowler Eric Vickers Ferguson Ferguson","Ferguson","Rob Crawford","Kley","Erin Miller","darren wilson","Mike","Daniel Garrelts","Darren Wilson","Rath","Ellis Wyatt","Nick","Wilson","Mike Browns","Trayvon","Jane Jacoby","Kley Potter","Mike Brown","Michael","Michael Brown","Angela","Pablo","Jon Stewart","George Zimmerman Jamilah Nasheed KTVI","mike brown","Heather","Pamela fillingim","pamela fillingim","Susan"]},"digest":"sha1:747IKFWUCVQVXY7TX2NMYFL422T4TRQX"}
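Records like the one above can be tallied to find the most frequently mentioned entities in a collection. The sketch below is one possible approach, not the method used for the use cases described earlier. The function names count_entities and capture_month are hypothetical. Folding case so that variants such as "Darren Wilson" and "darren wilson" count as one name is a simplifying choice made here, and the timestamp is assumed to always use the 14-digit YYYYMMDDhhmmss form seen in the example.

```python
from collections import Counter
from datetime import datetime


def count_entities(records, entity_type="persons"):
    """Count how many records mention each entity of the given type.

    Names are case-folded (an assumption of this sketch) and counted
    at most once per record.
    """
    counts = Counter()
    for record in records:
        names = record.get("named_entities", {}).get(entity_type, [])
        counts.update({name.casefold() for name in names})
    return counts


def capture_month(record):
    """Return the capture month of a record as 'YYYY-MM'."""
    captured = datetime.strptime(record["timestamp"], "%Y%m%d%H%M%S")
    return captured.strftime("%Y-%m")
```

Calling count_entities(records).most_common(10) would then give the ten most widely mentioned person names, and grouping records by capture_month before counting gives a simple view of mentions over time. Note that run-together entities in the data, such as "Twitter Facebook Google" in the example record, are counted as written; this sketch does not attempt to split them.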