The full text of a web collection can be invaluable for a researcher. ARCH provides access to the full plain text found on websites. These files can be used with a range of text analysis methods and tools, including sentiment analysis, named entity recognition (NER), the Natural Language Toolkit (NLTK), word frequency, collocation, n-grams (word and phrase frequency over time), topic modelling, geoparsing, and word differences.
In this file, each website within a web archive collection will have its full text presented on one line, along with accompanying metadata.
- Queer Webcomics Archives Project: This was an Archives Unleashed Datathon project (Kritika Garg, Francis Kayiwa, Kae Bara Kratcha, and Wei Yin) which explored the representation of queer identities within the Global Web Comics Archive. A variety of methods and visualizations were produced while analyzing text and domains datasets.
- Autism Discourse in the U.S: An Exploratory Analysis: This Archives Unleashed datathon project explored the discourse among autism bloggers using two primary modes of analysis: sentiment analysis (NLTK) and network analysis (Gephi).
- #teamwildfyre: This Archives Unleashed datathon project explored the impact and severity of forest fires, as well as how information spreads and is broadcast by media outlets. A variety of methods were used to analyze and visualize the full text dataset files, with a focus on NER and geocoding.
- Creating Collection Growth Curves with AUT and Hypercane: These blog posts describe the process of using various tools to explore web archive collection growth curves. This can ultimately be "used to gain a better understanding of seed curation and the crawling behavior."
Example text sentiment analysis from the autism blogs web archive dataset
Plain text extractions are provided as CSV data with the following attributes for each page: crawl date, domain, URL, type (MIME and Tika), language, and content. See example.
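As a rough illustration of working with this kind of file, the sketch below reads a small in-memory CSV and counts word frequencies for pages in one language. The column names (`crawl_date`, `domain`, `url`, `mime_type_web_server`, `mime_type_tika`, `language`, `content`) and the sample rows are assumptions made for this example; check the header row of your own download before reusing them.

```python
import csv
import io
import re
from collections import Counter

# Hypothetical sample data. The column names are assumed for illustration
# and may differ from the header in an actual plain text extraction file.
SAMPLE_CSV = """crawl_date,domain,url,mime_type_web_server,mime_type_tika,language,content
20200101,example.org,http://example.org/a,text/html,text/html,en,"Wildfire season started early. Wildfire crews responded."
20200102,example.org,http://example.org/b,text/html,text/html,en,"Crews contained the wildfire near the town."
20200103,example.com,http://example.com/,text/html,text/html,fr,"Bonjour le monde"
"""


def word_counts(rows, language="en"):
    """Count lowercase word tokens in the content of rows matching a language."""
    counts = Counter()
    for row in rows:
        if row["language"] != language:
            continue
        # Very simple tokenizer: runs of ASCII letters and apostrophes.
        counts.update(re.findall(r"[a-z']+", row["content"].lower()))
    return counts


# Each CSV row is one page: its metadata plus the full text in `content`.
rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
counts = word_counts(rows)
print(counts.most_common(3))
```

To try this on a downloaded file, replace `io.StringIO(SAMPLE_CSV)` with a file opened via `open(path, newline="", encoding="utf-8")`. Because the content field can be very long, you may also need to raise the parser's limit with `csv.field_size_limit` before reading.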