ARCH provides expanded analysis functionality for Archive-It collections. This add-on service is intended for research use cases that go beyond the standard Wayback replay and keyword search tools provided for all collections. NB: ARCH is currently in beta and is not yet widely available. Please read our blog post to learn more about this service, and keep an eye out for future announcements via the Archive-It newsletter.
ARCH transforms web archival collections into "datasets" for further research. These include domain frequency statistics, hyperlink network graphs, extracted full-text, and metadata about binary objects within a collection. Through this process, Archive-It partners' WARC files are transformed into accessible scholarly objects. Users can also explore several in-browser visualizations.
ARCH has been supported by a generous grant from The Andrew W. Mellon Foundation and represents the product of a collaborative partnership among the Internet Archive, the University of Waterloo, and York University.
ARCH currently provides twelve datasets for analysis, broken down into four functional categories. Descriptions are also provided in the user interface for reference.
ARCH collection files provide a glimpse of the domains present in a web archive collection. These may be especially useful in conjunction with other ARCH datasets. Learn more...
ARCH network files represent how websites or domains have linked to each other over time, as well as how hyperlinked images have been used. Learn more...
ARCH full text datasets present the full text of the sites in web archives on one line each, along with accompanying metadata. Learn more...
File formats ⇣
ARCH binary information datasets contain information on certain types of binary files found within web archive collections, including PDFs, audio, images, PowerPoints, spreadsheets, video, and Word documents. Learn more...
How to create and view datasets
ARCH is offered as an add-on subscription for current Archive-It partners and is also available to independent researchers upon request. Once activated within an Archive-It account's web application, the "Datasets" tab will appear in the top navigational menu bar:
Collection Analysis Page
The Datasets tab lists your collections, along with basic information about the most recent analysis conducted and other collection-based metadata.
Click on the collection name to access its collection page.
This will bring you to the core of ARCH, where you can generate and download datasets and monitor their status. The Summary tab offers an overview of running and finished jobs.
The Jobs in Process table identifies the stage and state of any current jobs being run. If a job is in the queue, the table also notes its position.
The Completed Jobs table provides a summary of all jobs completed and notes an accompanying date/time stamp. Click on the job name to access an overview of the dataset along with download options.
Create datasets ⇣
To start a new job, either select “Start new job” or click on the Jobs tab. You can then expand and browse each category and begin generating datasets.
NB: Each dataset presents two options. We strongly recommend running a Sample Job first, especially when working with large collections (100GB+). This generates a small downloadable dataset to ensure that the analysis produces the desired kind of data. Accordingly, it also generates data far more quickly than does the Run Job option.
- Run Sample Job: Generates a small dataset, using the first 100 records from the collection.
- Run Job: Generates a large dataset, using all of the records in your collection.
Once a job is activated, the button will change to Running. When complete, click the View Sample Results or View Results option.
Remember that analysis is not instant. The larger the collection, the longer its datasets take to create: some sample jobs may complete in a few minutes, while some full jobs may take days.
You will receive an email notification when your dataset is ready to view.
View the results ⇣
Many dataset pages in the web application include visualizations to represent the contents of their downloadable files:
For example, a domain graph page (above) presents an interactive network graph in which you can see nodes (domains) and edges (links) among the 100 top domain nodes with the greatest number of inbound and outbound links in a collection.
Note that these are designed to provide a quick overview of the collection. For in-depth examination, you will need to download the datasets and work with additional analysis tools.
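As one illustration of that kind of follow-up analysis, the sketch below ranks domains by their combined inbound and outbound link counts in a downloaded domain graph file. It assumes a CSV with `source`, `target`, and `count` columns; those column names and the sample rows are assumptions made for illustration, so check them against the header of your own file. This is not necessarily how ARCH selects the nodes shown in its own visualization.

```python
import csv
import io
from collections import Counter


def top_domains(csv_file, n=100):
    """Return the n domains with the most inbound plus outbound links."""
    totals = Counter()
    for row in csv.DictReader(csv_file):
        weight = int(row["count"])
        totals[row["source"]] += weight  # links going out of this domain
        totals[row["target"]] += weight  # links coming into this domain
    return totals.most_common(n)


# Hypothetical sample rows; a real file would be opened with open(path, newline="").
sample = io.StringIO(
    "source,target,count\n"
    "example.org,example.com,5\n"
    "example.com,example.net,2\n"
    "example.org,example.net,1\n"
)
ranking = top_domains(sample, n=2)
print(ranking)
```

Because the function reads one row at a time, it should also cope with files that are too large to open comfortably in a spreadsheet.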
Download datasets ⇣
The most important part of the dataset page is the ability to download the dataset.
There are two main ways to download files:
- Web browser: Download the file directly through the web browser. This method is best for few, small (less than 2GB), and/or one-time downloads. Larger datasets and computational workflows may instead require:
- Command line: The web application provides the necessary curl and wget commands to download the datasets automatically. To use, click on the copy icon, which will copy the given command to your clipboard. This can then be pasted in your command line interface.
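For scripted workflows, the same kind of download can be sketched in Python. The example below is a minimal illustration, not part of ARCH: the `download` helper is hypothetical, and it is demonstrated against a local `file://` URL standing in for a real dataset link. Any authentication that the copied curl or wget command includes is not handled here.

```python
import shutil
import tempfile
import urllib.request
from pathlib import Path


def download(url, dest, chunk_size=1024 * 1024):
    """Stream url to dest in chunks so a large file never sits fully in memory."""
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out, chunk_size)
    return Path(dest)


# Demonstration only: a local file:// URL stands in for a real dataset link.
workdir = Path(tempfile.mkdtemp())
source = workdir / "dataset.csv"
source.write_text("source,target,count\nexample.org,example.com,5\n")
saved = download(source.as_uri(), workdir / "copy.csv")
print(saved.stat().st_size)
```

Streaming in chunks matters most for the larger datasets that the browser method handles poorly.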
NB: There is a known bug in more recent versions of macOS that interferes with opening some of our datasets downloaded through the browser. If you are on macOS, you will see a warning asking you to use The Unarchiver to open the ZIP files.
Note that the size of the download is displayed in the download bar.
ARCH also provides a preview of up to 25 lines of data from the CSV file. The preview offers researchers a glimpse of the content found in the CSV without having to download and open it.
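A similar preview can be produced locally once a file has been downloaded. The sketch below is an illustration rather than ARCH's own implementation: the `preview` helper and the sample data are hypothetical, and it counts the header row as one of the 25 lines, which may differ from how the web application counts them.

```python
import csv
import io
from itertools import islice


def preview(csv_file, max_lines=25):
    """Return at most max_lines rows from an open CSV file without reading the rest."""
    return list(islice(csv.reader(csv_file), max_lines))


# Hypothetical data: a header plus 40 rows; a real file would be opened with open(path, newline="").
data = io.StringIO("id,value\n" + "".join(f"{i},{i * i}\n" for i in range(40)))
rows = preview(data)
print(len(rows))
```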
The example below shows the preview of the extract domain graph dataset.