The full text of a web archive collection can be invaluable for a researcher. ARCH provides access to the full plain text found on websites. These files can be used with a range of text analysis methods and tools, including sentiment analysis, named entity recognition (NER), NLTK, word frequency, collocation, n-grams (word and phrase frequency over time), topic modelling, geoparsing, and word differences.
In this file, each website within a web archive collection will have its full text presented on one line, along with accompanying metadata.
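As a rough illustration of the word frequency and n-gram methods mentioned above, here is a minimal Python sketch that uses only the standard library. The function name `ngram_counts`, the simple tokenizing pattern, and the sample sentences are assumptions made for this example and are not part of ARCH. In practice the texts would come from the content column of the full-text file.

```python
import re
from collections import Counter

# Very simple tokenizer: lowercase runs of letters, digits, and apostrophes.
# This is an illustrative choice; real projects often use NLTK or similar.
TOKEN_RE = re.compile(r"[a-z0-9']+")


def ngram_counts(texts, n=1, stopwords=frozenset()):
    """Count n-grams (as tuples of words) across an iterable of page texts."""
    counts = Counter()
    for text in texts:
        # Stopwords are dropped before n-grams are built, so with n > 1 the
        # words on either side of a removed stopword become neighbours.
        tokens = [t for t in TOKEN_RE.findall(text.lower()) if t not in stopwords]
        # Zipping n shifted copies of the token list yields consecutive n-grams.
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts


# Hypothetical sample text standing in for the content column.
pages = [
    "Wildfire crews and helicopter crews worked overnight.",
    "Helicopter crews returned at dawn.",
]

print(ngram_counts(pages, n=1, stopwords={"and", "at"}).most_common(3))
print(ngram_counts(pages, n=2).most_common(1))
```

Counting per crawl date instead of across the whole collection would give the "frequency over time" view, at the cost of one counter per date.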
- Queer Webcomics Archives Project: This Archives Unleashed datathon project by Kritika Garg, Francis Kayiwa, Kae Bara Kratcha, and Wei Yin explored the representation of queer identities within the Global Web Comics Archive. A variety of methods and visualizations were produced while analyzing text and domains datasets.
- Autism Discourse in the U.S: An Exploratory Analysis: This Archives Unleashed datathon project explored the discourse among autism bloggers using two primary modes of analysis: sentiment analysis (NLTK) and network analysis (Gephi).
- #teamwildfyre: This Archives Unleashed datathon project explored the impact and severity of forest fires, as well as how information spreads and is broadcast by media outlets. A variety of methods were used to analyze and visualize the full text dataset files, with a focus on NER and geocoding.
- Creating Collection Growth Curves with AUT and Hypercane: These blog posts describe the process of using various tools to explore web archives collection growth curves. This can ultimately be "used to gain a better understanding of seed curation and the crawling behavior."
Example text sentiment analysis from the autism blogs web archive dataset
These jobs produce files that allow the user to explore text components of a web archive, including extracted plain text as well as raw HTML and other web elements.
- Extract plain text of webpages: creates a CSV with the following columns: crawl date, web domain, URL, MIME type as provided by the web server, MIME type as detected by Apache Tika, and content.
- Extract text files (html, text, css, js, json, xml) information: creates a CSV with the following columns: crawl date, URL of the text file, filename, text extension, MIME type as provided by the web server, MIME type as detected by Apache Tika, text file MD5 hash, text file SHA1 hash, and text file content. Six CSV files are available, one for each of the following types: css, json, xml, plain text, js, and html.
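Because the content column can hold the entire text of a page, a single field may exceed the default per-field limit of Python's `csv` module (131,072 characters). The sketch below shows one way to stream such a file row by row with the standard library. The function names `iter_pages` and `summarize` are made up for this example, and the column name `content` is an assumption, so check the header row of your own file.

```python
import csv

# Raise the csv module's per-field limit so very long page text can be read.
# 2**31 - 1 is used instead of sys.maxsize because it fits in a C long on
# every platform.
csv.field_size_limit(2**31 - 1)


def iter_pages(csv_path):
    """Yield one dict per CSV row without loading the whole file into memory."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        yield from csv.DictReader(handle)


def summarize(csv_path):
    """Return (number of rows, total characters in the assumed 'content' column)."""
    rows = 0
    chars = 0
    for page in iter_pages(csv_path):
        rows += 1
        chars += len(page.get("content") or "")
    return rows, chars


# Example call, assuming a file named web-pages.csv in the working directory:
# print(summarize("web-pages.csv"))
```

Streaming keeps memory use flat regardless of file size, which makes it a reasonable first pass before deciding how to filter.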
Working with the volume of text that you will receive from a web archive will be challenging. There is no easy way around this.
Often, you will need to filter your dataset down to a manageable size. There are several ways to do this. We recommend grep on the command line or pandas in Python as two ways to reduce a large CSV file to the rows that matter for your research question.
Using grep, for example, you could use the following commands:
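To return a text file that only has full-text rows from May 22, 2016:
grep '^20160522' web-pages.csv > 20160522-text.csv
To return only the rows from the rmwb.ca domain:
grep ',rmwb.ca,' web-pages.csv > RMWB-text.csv
To return only the rows that mention "helicopter", ignoring case:
grep -i 'helicopter' web-pages.csv > helicopter-text.csv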
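The same three filters can be sketched in pandas. This is a minimal example, not an official ARCH recipe: the function name `filter_pages` is made up, and the column names `crawl_date`, `domain`, and `content` are assumptions, so compare them with the header row of your own file. The file is read in chunks so that it does not have to fit in memory all at once.

```python
import pandas as pd


def filter_pages(csv_path, date_prefix=None, domain=None, keyword=None,
                 chunksize=50_000):
    """Return the rows of a full-text CSV that match every filter supplied."""
    matches = []
    # dtype=str keeps crawl dates such as 20160522 as text, not integers.
    for chunk in pd.read_csv(csv_path, dtype=str, chunksize=chunksize):
        mask = pd.Series(True, index=chunk.index)
        if date_prefix is not None:
            # Assumed column name: crawl_date
            mask &= chunk["crawl_date"].str.startswith(date_prefix, na=False)
        if domain is not None:
            # Assumed column name: domain (exact match on that column only)
            mask &= chunk["domain"] == domain
        if keyword is not None:
            # Assumed column name: content (case-insensitive literal match)
            mask &= chunk["content"].str.contains(
                keyword, case=False, regex=False, na=False
            )
        matches.append(chunk[mask])
    if not matches:
        return pd.DataFrame()
    return pd.concat(matches, ignore_index=True)


# Example calls, assuming a file named web-pages.csv in the working directory:
# filter_pages("web-pages.csv", date_prefix="20160522").to_csv("20160522-text.csv", index=False)
# filter_pages("web-pages.csv", domain="rmwb.ca").to_csv("RMWB-text.csv", index=False)
# filter_pages("web-pages.csv", keyword="helicopter").to_csv("helicopter-text.csv", index=False)
```

Two differences from the grep commands are worth noting. The pandas version keeps the header row in its output, and each filter looks only at its own column, so a domain name that merely appears inside page text does not produce a match.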
- Melanie Walsh, Introduction to Cultural Analytics and Python
- Voyant is a free web-based text analysis platform, which can quickly and easily visualize data and export the visualizations for further use.
- Programming Historian, Corpus Analysis with Antconc: Corpus analysis is a form of text analysis that allows you to make comparisons between textual objects at a large scale (so-called 'distant reading').
- Programming Historian, Getting Started with Topic Modeling and MALLET: This lesson provides an overview of topic modeling and why you might want to employ it in your research. The tutorial walks through how to install and work with the MALLET natural language processing toolkit.