The link structure of archived websites can show you which websites have the most inbound or outbound links, which paths can be taken through the networks that connect pages, and which communities exist within the same link structure.
ARCH generates four types of network files with columnar data. These can then be loaded into network analysis programs such as Gephi or NodeXL. You could also use these files with various R or Python-based software packages. Through these files, you can understand how websites or domains have linked to each other over time, as well as how hyperlinked images have been used.
- Network Analysis of the UK Government Web Archives. A team of researchers at a National Archives workshop asked the research question, “how is the government web linked together at different points in time, and how might this have changed over the last 10 years?” This blog post explores their findings and the process of using Gephi for analysis.
- Contemporary Composers Web Archives. An Archives Unleashed Datathon project team used the domain graph dataset and Gephi to explore connection strength, community, and dominant nodes among websites.
Example domain graph visualization from the Contemporary Composers web archive dataset
Networks are composed of "edges" (the hyperlinks between pages) and "nodes" (the webpages, images, or domains). ARCH provides four network datasets, the first three of which are CSV files:
- Extract Domain Graph: A CSV with the following columns: crawl date, source domain, destination domain, and count. See example. (Note that these are domain-to-domain links).
- Extract Image Graph: A CSV file with the following columns: crawl date, source (where each image was hosted), the URL of the image, and the alternative text of the image. See example. (Note that these files do not include the archived image files themselves).
- Extract Web Graph: A CSV file with the following columns: crawl date, source, destination, and anchor text. See example. (Note that this file contains all links and is not aggregated into domains).
- Extract Longitudinal Graph: Creates Longitudinal Graph Analysis (LGA) files, which contain a complete list of which URLs link to which URLs, along with a timestamp.
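As a rough illustration of how the domain graph columns map onto a network, the sketch below sums link counts per domain to find the most linked-to and most linking domains. It uses only the Python standard library. The header names (`crawl_date`, `source`, `target`, `count`) and the sample rows are assumptions made for this example; check the header row of your own ARCH download and adjust the keys accordingly.

```python
import csv
import io
from collections import Counter

# Hypothetical sample data. The header names are assumed for illustration
# and may differ from the header row in an actual ARCH domain graph file.
SAMPLE = """crawl_date,source,target,count
20200101,a.example,b.example,5
20200101,a.example,c.example,3
20200101,b.example,c.example,7
20210101,c.example,a.example,1
"""

def degree_totals(csv_file):
    """Sum link counts per domain: outbound by source, inbound by target."""
    inbound, outbound = Counter(), Counter()
    for row in csv.DictReader(csv_file):
        weight = int(row["count"])
        outbound[row["source"]] += weight
        inbound[row["target"]] += weight
    return inbound, outbound

inbound, outbound = degree_totals(io.StringIO(SAMPLE))
print("Most linked-to domain:", inbound.most_common(1))
print("Most linking domain:", outbound.most_common(1))
```

To run this against a real file, replace `io.StringIO(SAMPLE)` with an opened file handle, for example `open(path, newline="", encoding="utf-8")`, where `path` points to your downloaded CSV.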
Gephi is a well-established open-source application for visualizing and exploring graphs and networks. Using the datasets from ARCH, a researcher can explore and conduct link and social network analyses. The Gephi project offers a variety of guides for using its features.
In addition, the Archives Unleashed Project has created an Introduction to Gephi tutorial, which provides a quick introduction to transforming the ARCH Domain Graph dataset into a network visualization through Gephi.
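If you prefer to prepare the data before opening Gephi, a minimal sketch along the following lines can reshape a domain graph CSV into an edge table. Gephi's spreadsheet importer recognizes `Source`, `Target`, and `Weight` columns in an edges table. The input header names (`crawl_date`, `source`, `target`, `count`) and the function name `to_gephi_edges` are assumptions for this example, and summing counts across crawl dates is one possible design choice rather than a required step.

```python
import csv
import io
from collections import Counter

def to_gephi_edges(domain_graph_file, edges_file):
    """Write a Source,Target,Weight edge table, summing counts across crawl dates."""
    totals = Counter()
    for row in csv.DictReader(domain_graph_file):
        # Input column names are assumed; adjust them to match your file's header.
        totals[(row["source"], row["target"])] += int(row["count"])

    writer = csv.writer(edges_file, lineterminator="\n")
    writer.writerow(["Source", "Target", "Weight"])
    for (source, target), weight in sorted(totals.items()):
        writer.writerow([source, target, weight])
    return len(totals)

# Hypothetical sample rows, not real ARCH output.
sample = io.StringIO(
    "crawl_date,source,target,count\n"
    "20200101,a.example,b.example,5\n"
    "20210101,a.example,b.example,3\n"
    "20210101,b.example,c.example,7\n"
)
edges = io.StringIO()
edge_count = to_gephi_edges(sample, edges)
print(edges.getvalue())
```

Collapsing the crawl dates gives a single weighted network. To compare the network at different points in time, you could instead filter rows by crawl date before aggregating and write one edge table per period.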