ARCH (Archives Research Compute Hub) provides expanded analysis functionality for Archive-It collections. This add-on service is intended for research use cases that go beyond the standard Wayback replay and keyword search tools provided for all collections.
The ARCH platform transforms web archival collections into "datasets" for further research. These include domain frequency statistics, hyperlink network graphs, extracted full-text, and metadata about binary objects within a collection. Through this process, Archive-It partners' WARC files are transformed into accessible scholarly objects.
- Transform Archive-It collections into research datasets for analysis
- In-browser preview of files to ensure suitability for projects
- Direct browser downloads of datasets
- Standardized dataset format as CSVs
- Explore in-browser visualizations, graphs, and charts to understand web archival collections
ARCH has been supported by a generous grant from The Andrew W. Mellon Foundation and represents the product of a collaborative partnership among the Internet Archive, the University of Waterloo, and York University.
NB: ARCH is currently in beta and is not yet widely available. Please read our blog post to learn more about this service, and keep an eye out for future announcements via the Archive-It newsletter.
ARCH currently provides datasets for analysis, broken down into four functional categories. Detailed descriptions for each dataset are provided in the user interface for reference.
ARCH collection files provide a glimpse of the domains present in a web archive collection. These may be especially useful in conjunction with other ARCH datasets.
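To make the idea of domain statistics concrete, here is a minimal Python sketch of what counting domains in a list of URLs can look like. It is not how ARCH computes its datasets: the function name `domain_frequency`, the sample URLs, and the choice to drop a leading "www." are all assumptions made for illustration.

```python
from collections import Counter
from urllib.parse import urlsplit


def domain_frequency(urls):
    """Count how many URLs fall under each host name, most frequent first."""
    counts = Counter()
    for url in urls:
        host = urlsplit(url).hostname
        if not host:
            continue  # skip strings that have no host component
        if host.startswith("www."):
            host = host[4:]  # illustrative choice: treat www.example.org as example.org
        counts[host] += 1
    return counts.most_common()


# Hypothetical URLs, used only to demonstrate the function.
sample_urls = [
    "https://www.example.org/about",
    "https://example.org/news/1",
    "http://blog.example.com/post",
]
for domain, count in domain_frequency(sample_urls):
    print(domain, count)
```

A frequency list like this is often a useful first check on whether a collection contains the sites you expect before moving on to larger datasets.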
ARCH network files represent how websites or domains have linked to each other over time, as well as how hyperlinked images have been used.
ARCH full text datasets present the full text of the sites in web archives on one line each, along with accompanying metadata and other web elements including HTML, CSS, JS, JSON, and XML.
File formats ⇣
ARCH file format datasets contain information on certain types of binary files found within web archive collections, including PDFs, audio, images, PowerPoints, spreadsheets, video, and Word documents.
How to create and view datasets
ARCH is offered as an add-on subscription for current Archive-It partners and is also available to independent researchers upon request. Once activated within an Archive-It account, the "ARCH" tab will appear in the top navigational menu bar:
Collection Analysis Page
Once logged in, ARCH presents a list of collections and provides basic information about the most recent analysis conducted and other collection-based metadata. Table headers can be used to sort columns.
Click on the collection name to access its summary page.
This will bring you to the core of ARCH, where you can generate and download datasets and monitor their status. The Job Summary tab offers an overview of datasets being processed and those that have been generated and are ready for review and download.
The collection overview provides basic metadata about the collection, including the number of seeds, last crawl date, size, and whether it is a public or private collection.
The Jobs in Process table only appears when datasets are being processed. It identifies the type of each dataset being generated and its status.
The Completed Jobs table provides a summary of all the datasets that have been generated, noting an accompanying date/time stamp. Click on the job name to access an overview of the dataset along with metadata, download options, a preview, and an option to re-run the job.
Create datasets ⇣
To create a dataset, you can either select “Generate a new dataset,” which appears below the collection overview dashboard, or click on the Generate Datasets tab. Each category can be expanded to browse and select the datasets you wish to generate.
NB: Each dataset presents two options, listed below. We strongly recommend running an example job first, especially when working with large collections (100GB+). An example job produces a small dataset far more quickly than the Generate Dataset option, and that dataset should be reviewed to confirm that the analysis produces the desired kind of data.
- Generate Example Dataset: Generates a small dataset using 100 relevant records from the collection.
- Generate Dataset: Generates a large dataset using all of the records in your collection.
Once Generate Dataset is initiated, the button will change to Running. You can generate multiple datasets at a time and can navigate away from ARCH while the datasets are being processed. When the dataset is ready, click either the View Example Dataset or View Dataset button. You can also view the datasets' status by returning to the Job Summary page.
Remember that analysis is not instant. The larger the collection, the more time needed to process and create the dataset. Some example jobs may complete in a few minutes, while some full jobs may take days to complete. Additionally, keep in mind that multiple ARCH users can be generating datasets simultaneously, which can impact processing times.
You will receive an email when your job is complete and the dataset is ready to use. The email will be sent to the address tied to your Archive-It account.
Occasionally, a dataset may fail to generate. The failure is indicated directly on the dataset button, and a message is displayed at the top of the “Generate Datasets” page. No action is required on your end. Our team will investigate, and a notification will be sent to the email address tied to your Archive-It account once the issue is resolved.
View the results ⇣
Each generated dataset has an accompanying results page, which provides a variety of features that allow for further interaction with the datasets. Where possible, visualizations, charts, and graphs are present to visually summarize the contents of their downloadable files:
For example, a domain graph dataset (above) presents an interactive network graph in which you can see nodes (domains) and edges (links) among the top 100 domain nodes, ranked by the number of inbound and outbound links in a collection.
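If you want to reproduce a similar ranking from a downloaded edge list, the following Python sketch shows one way to do it. It is an illustration under stated assumptions, not ARCH's own method: the column names `source`, `target`, and `count`, the function name `top_domains`, and the decision to weight each link by its count are all hypothetical.

```python
import csv
import io
from collections import Counter

# Hypothetical edge list; the column names are assumptions for illustration.
SAMPLE_CSV = (
    "source,target,count\n"
    "a.org,b.org,3\n"
    "b.org,c.org,1\n"
    "c.org,a.org,2\n"
)


def top_domains(edge_rows, limit=100):
    """Rank domains by total degree (inbound plus outbound links)."""
    degree = Counter()
    for row in edge_rows:
        weight = int(row["count"])
        degree[row["source"]] += weight  # outbound links from the source domain
        degree[row["target"]] += weight  # inbound links to the target domain
    return degree.most_common(limit)


rows = csv.DictReader(io.StringIO(SAMPLE_CSV))
for domain, links in top_domains(rows, limit=2):
    print(domain, links)
```

With a real file, the `io.StringIO` wrapper would be replaced by an opened file handle, and the column names adjusted to match the dataset's actual header.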
Note that these are designed to provide a quick overview of the collection. For in-depth examination, you will need to download the datasets and work with additional analysis tools.
Some dataset visualizations may have limited data to graph, so don't be alarmed if your network graph only shows one or two nodes. Graphs provided through ARCH are determined by the data available in the web archive collection, and not all web collections lend themselves to charts and graphs.
Download datasets ⇣
The most important part of this page is the ability to download your dataset.
The ARCH interface provides basic information about the generated dataset, including the file size and the result count, which is the number of lines found within the dataset file.
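After downloading, you may want to check that a file is complete by counting its lines and comparing the total with the result count shown in the interface. The sketch below is a minimal example in Python. The function name `count_lines` and the temporary demo file are illustrative, and it assumes the file has already been decompressed to plain text. Whether the result count includes a header row is not stated here, so allow for a difference of one.

```python
import os
import tempfile


def count_lines(path):
    """Return the number of lines in a file without loading it all into memory."""
    with open(path, "rb") as handle:
        return sum(1 for _ in handle)


# Demo on a small throwaway file standing in for a downloaded dataset.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write("domain,count\nexample.org,2\nexample.com,1\n")
print(count_lines(tmp.name))  # prints 3
os.remove(tmp.name)
```

Reading in binary mode and iterating line by line keeps memory use flat, which matters for datasets that run to several gigabytes.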
Most users will download files directly through the web browser. This method is best for a few small files (less than 2GB) and/or one-time downloads. Simply click on the download icon. Note that you may need to adjust your browser settings to allow the download.
Downloading the WAT, WANE, and LGA datasets is different as they require use of WASAPI Data Transfer APIs. Follow the download instructions as found on those dataset pages.
For larger datasets and computational workflows, a command-line application such as wget or curl may be required instead of the browser. To use one, right-click the download icon, copy the link, and pass it to wget, curl, or the network transfer utility of your choosing.
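For scripted workflows, the same copied link can also be fetched from Python. The sketch below is a rough counterpart to the wget/curl approach using only the standard library. The function name `download`, the chunk size, and the placeholder URL in the comment are illustrative assumptions. It also assumes the copied link can be fetched without additional authentication, which may not hold in every case.

```python
import shutil
import urllib.request


def download(url, destination, chunk_size=1024 * 1024):
    """Stream a URL to disk in fixed-size chunks so large files never sit in memory."""
    with urllib.request.urlopen(url) as response, open(destination, "wb") as out:
        shutil.copyfileobj(response, out, chunk_size)
    return destination


# Example call (placeholder URL; paste the link copied from the download icon):
# download("https://example.org/path/to/dataset.csv.gz", "dataset.csv.gz")
```

Streaming in chunks, rather than reading the whole response at once, is the main reason to prefer this pattern for files larger than a browser handles comfortably.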
NB: There is a known bug in more recent versions of macOS that interferes with opening some of our datasets downloaded through the browser. If you are on macOS, you will see a warning asking you to use The Unarchiver to open the ZIP files.
Additional features ⇣
ARCH also provides a preview of up to 100 lines of data from the CSV file. The preview offers researchers a glimpse of the content found in the CSV without having to download and open it.
The example below shows the preview of the extract domain graph dataset.
If a collection has run additional crawls, you can re-run the dataset job to include the new records. When a dataset job is re-run, the existing derivatives are permanently deleted and replaced with the output of the most recent job.
The Help widget, located at the bottom of the ARCH interface, can be used to search terms and concepts throughout Archive-It documentation, including information related directly to ARCH.
General Tool Recommendations ⇣
Working with web archives at scale can be challenging. While ARCH provides users with a starting point by generating datasets, external tools will be needed to further explore and investigate web archive collections.
The following tools provide foundational knowledge, skills, and processes necessary for working with large datasets computationally. A dedicated tools section is provided on each dataset information page in this documentation to share further details around suggested tools and techniques.
The command-line interface can seem daunting, but it is especially helpful for conducting analysis. The Programming Historian has published peer-reviewed tutorials (in four languages) which focus on digital tools and techniques to help support digital humanities research.
For command line information, we recommend the following two tutorials:
- Introduction to the Windows Command Line with PowerShell (for Windows)
- Introduction to the Bash Command Line (for Mac/Linux)
Data cleaning is an important aspect of the research process, and there are several ways to approach it. OpenRefine is an open-source tool for working with messy data: cleaning it, transforming it from one format to another, and extending it with web services and external data.
- Programming Historian, Cleaning Data with OpenRefine
Widely adopted in many fields, the Python programming language is foundational in data science and is particularly helpful for data mining, processing, analytics, and visualization. We have created an example Python notebook for working with ARCH datasets, and we suggest the following resources for getting started with entry-level Python uses: