ARCH (Archive Research Compute Hub) provides expanded analysis functionality for Archive-It collections. This add-on service is intended for research use cases that go beyond the standard Wayback replay and keyword search tools provided for all collections. NB: ARCH is currently in beta and is not yet widely available. Please read our blog post to learn more about this service, and keep an eye out for future announcements via the Archive-It newsletter.
ARCH transforms web archival collections into "datasets" for further research. These include domain frequency statistics, hyperlink network graphs, extracted full-text, and metadata about binary objects within a collection. Through this process, Archive-It partners' WARC files are transformed into accessible scholarly objects. Users can also explore several in-browser visualizations.
ARCH has been supported by a generous grant from The Andrew W. Mellon Foundation and represents the product of a collaborative partnership among the Internet Archive, the University of Waterloo, and York University.
ARCH currently provides thirteen datasets for analysis, broken down into four functional categories. Descriptions are also provided in the user interface for reference.
ARCH collection files provide a glimpse of the domains present in a web archive collection. These may be especially useful in conjunction with other ARCH datasets. Learn more...
ARCH network files represent how websites or domains have linked to each other over time, as well as how hyperlinked images have been used. Learn more...
ARCH full text datasets present the full text of the sites in web archives on one line each, along with accompanying metadata and other web elements including html, css, js, json, and xml. Learn more...
File formats ⇣
ARCH binary information datasets contain information on certain types of binary files found within web archive collections, including PDFs, audio, images, PowerPoints, spreadsheets, video, and Word documents. Learn more...
How to create and view datasets
ARCH is offered as an add-on subscription for current Archive-It partners and is also available to independent researchers upon request. Once activated within an Archive-It account's web application, the "ARCH" tab will appear in the top navigational menu bar:
Collection Analysis Page
The ARCH tab presents a list of collections, along with basic information about the most recent analysis conducted and other collection-based metadata. Table headers can be used to sort columns.
Click on the collection name to access its summary page.
This will bring you to the core of ARCH, where datasets can be generated, downloaded, and their status monitored. The Job Summary tab offers an overview of jobs being processed and those that have completed.
The collection overview provides basic metadata about the collection including its size and whether it is a public or private collection.
The Jobs in Process table identifies the stage and queue of any current jobs being run. If a job is in the queue, the table also notes its position.
The Completed Jobs table provides a summary of all the datasets that have been generated, noting an accompanying date/time stamp. Click on the job name to access an overview of the dataset along with metadata, download options, a preview, and an option to re-run the job.
Create datasets ⇣
To start a new job, either select “Start new job” or click on the Generate Datasets tab. You can then expand and browse each category and begin generating datasets.
NB: Each dataset presents two options. We strongly recommend running an Example Job first, especially when working with large collections (100GB+). This generates a small dataset that you can review to confirm that the analysis produces the kind of data you want. It also finishes far more quickly than the Generate Dataset option.
- Generate Example Dataset: Generates a small dataset using the first 100 relevant records from the collection.
- Generate Dataset: Generates a large dataset using all of the records in your collection.
Once a job is initiated, the button will change to Running. When complete, click the View Example Dataset or View Dataset button. You can generate multiple datasets at a time and can navigate away from the page while the job is processed.
Remember that analysis is not instant. The larger the collection, the longer its datasets take to create: some example jobs may complete in a few minutes, while some full jobs may take days. Additionally, keep in mind that multiple ARCH users can be generating datasets simultaneously.
You will receive an email when your job is complete and the dataset is ready to use. The email will be sent to the address tied to your Archive-It account.
View the results ⇣
Many dataset pages in the web application include visualizations to represent the contents of their downloadable files:
For example, a domain graph page (above) presents an interactive network graph in which you can see nodes (domains) and edges (links) among the 100 top domain nodes with the greatest number of inbound and outbound links in a collection.
Note that these are designed to provide a quick overview of the collection. For in-depth examination, you will need to download the datasets and work with additional analysis tools.
Some dataset visualizations may have limited data to graph, so don't be alarmed if your network graph only shows one or two nodes. Graphs provided through ARCH depend on the data available in the web archive collection and in the sample analyzed. Unfortunately, not all web collections lend themselves to charts and graphs.
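Once a domain graph file has been downloaded, a similar ranking can be approximated outside the browser. The sketch below is a minimal illustration and not the method ARCH itself uses: it assumes the edges have already been read into `(source, target, count)` tuples, and that column layout, the `top_domains` name, and the sample domains are all hypothetical.

```python
from collections import Counter


def top_domains(edges, n=100):
    """Rank domains by combined inbound and outbound link counts.

    `edges` is an iterable of (source_domain, target_domain, link_count)
    tuples. This layout is an assumption made for illustration only.
    """
    degree = Counter()
    for source, target, count in edges:
        degree[source] += count  # outbound links from the source domain
        degree[target] += count  # inbound links to the target domain
    # most_common returns (domain, total) pairs, largest total first
    return degree.most_common(n)


# Hypothetical sample edges, not taken from a real collection
sample_edges = [
    ("a.example", "b.example", 3),
    ("b.example", "c.example", 1),
    ("a.example", "c.example", 2),
]
top_two = top_domains(sample_edges, n=2)
```

A collection with very few distinct domains will produce a correspondingly short ranking, which matches the one- or two-node graphs described above.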
Download datasets ⇣
The most important part of this page is the ability to download your dataset.
The ARCH interface provides basic information about the generated dataset, including the file size and the result count, which is the number of lines found within the dataset file.
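After downloading, the result count can be sanity-checked by counting lines locally. This is a minimal sketch: the `count_lines` helper is hypothetical, the gzip handling assumes that a `.gz` suffix signals compression, and whether the reported count includes a header line is not stated here, so a difference of one is not necessarily a problem.

```python
import gzip


def count_lines(path):
    """Count the lines in a text file, reading it one line at a time.

    Files whose name ends in ".gz" are opened with gzip; treating the
    suffix as a sign of compression is an assumption.
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as handle:
        return sum(1 for _ in handle)
```

Because the file is read line by line, the check works on files too large to open in a spreadsheet application.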
Most users will download files directly through the web browser. This method is best for a few, small files (less than 2GB), and/or one-time downloads. Simply click on the download icon.
However, larger datasets and computational workflows may instead require a command line application such as wget or curl. To use one, right-click on the download icon, click copy link, and paste the link into a wget or curl command, or into the network transfer utility of your choosing.
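For scripted workflows, a copied link can also be fetched from Python. The following is a minimal sketch using only the standard library; the `download` helper is hypothetical, and it assumes the link can be retrieved without additional authentication, which is not confirmed here.

```python
import shutil
import urllib.request
from pathlib import Path


def download(url, dest, chunk_size=1024 * 1024):
    """Stream `url` to the file `dest` and return the bytes written.

    Copying in chunks keeps memory use flat for multi-gigabyte files.
    No authentication is handled here (an assumption of this sketch).
    """
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out, chunk_size)
    return Path(dest).stat().st_size
```

Calling `download(copied_link, "dataset.csv")`, where `copied_link` holds the link copied from the download icon and the file name is a placeholder, writes the file in 1 MB chunks rather than loading it into memory.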
NB: There is a known bug in more recent versions of macOS that interferes with opening some of our datasets downloaded through the browser. If you are on macOS, you will see a warning asking you to use The Unarchiver to open the ZIP files.
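In scripted workflows, a ZIP archive can also be inspected and unpacked with Python's standard `zipfile` module. This is only a sketch: the `extract_zip` helper is hypothetical, and whether this route avoids the macOS issue described above has not been verified here.

```python
import zipfile
from pathlib import Path


def extract_zip(archive_path, target_dir):
    """Extract every member of a ZIP archive and return the member names."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive_path) as archive:
        names = archive.namelist()
        archive.extractall(target)
    return names
```

The returned list of names gives a quick check of what the archive contained before any of the extracted files are opened.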
Additional features ⇣
ARCH also provides a preview of up to 100 lines of data from the CSV file. The preview offers researchers a glimpse of the content found in the CSV without having to download and open it.
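A comparable glimpse can be taken from a file that has already been downloaded, without opening all of it. The sketch below is a minimal illustration with a hypothetical `preview_rows` helper; it assumes an uncompressed, UTF-8 encoded CSV, and it counts parsed rows, which can differ from raw lines when a quoted field contains a line break.

```python
import csv
from itertools import islice


def preview_rows(path, limit=100):
    """Return at most `limit` parsed rows from the start of a CSV file."""
    with open(path, newline="", encoding="utf-8") as handle:
        # islice stops reading after `limit` rows, so large files stay cheap
        return list(islice(csv.reader(handle), limit))
```

Each row comes back as a list of strings, so `preview_rows(path)[0]` is the first line of the file, which is a header row if the dataset has one.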
The example below shows the preview of the extract domain graph dataset.
In the event that a collection has run additional crawls, you can re-run any jobs to include any new records. When a dataset job is re-run, the existing derivatives will be permanently deleted and replaced with the most recently run job.
The Help widget, located at the bottom of ARCH, can be used to search terms and concepts throughout Archive-It documentation, including information related directly to ARCH.
General Tool Recommendations
Working with web archives at scale can be challenging. While ARCH provides users with a starting point through generating datasets, external tools will be needed to further explore and investigate web archive collections.
The following tools provide the foundational knowledge, skills, and processes necessary for working with large datasets computationally.
The command-line interface can seem daunting, but it is especially helpful for conducting analysis. The Programming Historian has published peer-reviewed tutorials which focus on digital tools and techniques to help support digital humanities research.
Each dataset page listed in the menu provides, where applicable, a tools section with further details on suggested tools and techniques.
For command line information, we recommend the following two tutorials:
- Introduction to the Windows Command Line with PowerShell (for Windows)
- Introduction to the Bash Command Line (for Mac/Linux)
Data cleaning is an important aspect of the research process, and there are several ways to approach it. OpenRefine is an open-source tool for working with messy data: from cleaning to transforming formats to extending it with web services and external data.
- Programming Historian, Cleaning Data with OpenRefine
Widely adopted in many fields, the Python programming language is foundational in data science and particularly helpful for data mining, processing, analytics, and visualization. We have created an example Python notebook for working with ARCH derivatives, and we suggest the following resources for getting started with entry-level Python uses: