Site link structures can reveal a good deal about a web archive collection: which websites have the most inbound or outbound links, which paths can be taken through the networks that connect pages, and which communities exist within the same link structure.
ARCH generates three types of network files with columnar data that can be loaded into network analysis programs such as Gephi and NodeXL, or into software packages for R or Python. With these files you can examine how websites or domains have linked to each other over time, as well as how hyperlinked images have been used.
- Network Analysis of the UK Government Web Archives. A team of researchers at a National Archives workshop asked the research question, “how is the government web linked together at different points in time, and how might this have changed over the last 10 years?” This blog post explores their findings and the process of using Gephi for analysis.
- Contemporary Composers Web Archives. An Archives Unleashed Datathon project team used the domain graph dataset and Gephi to explore connection strength, community, and dominant nodes among websites.
Example domain graph visualization from the Contemporary Composers web archive dataset
Networks are composed of "edges" (the hyperlinks between pages) and "nodes" (the webpages, images, or domains). Data is provided in the CSV file format for three network file types:
- Extract Domain Graph: A CSV file with the following columns: crawl date, source domain, destination domain, and count. See example. (Note that these are domain-to-domain links).
- Extract Image Graph: A CSV file with the following columns: crawl date, source (where each image was hosted), the URL of the image, and the alternative text of the image. See example. (Note that these files do not include the archived image files themselves).
- Extract Web Graph: A CSV file with the following columns: crawl date, source, destination, and anchor text. See example. (Note that this file contains all links and is not aggregated into domains).
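As a rough illustration of how the domain graph columns described above might be used, the following Python sketch totals the inbound and outbound link counts for each domain using only the standard library. The header names (`crawl_date`, `source`, `target`, `count`) and the sample rows are assumptions made for this example; check the header row of your own file and adjust the keys to match.

```python
import csv
import io
from collections import Counter

# Hypothetical sample data. The header names below are assumed for
# illustration and may differ from the headers in an actual ARCH file.
SAMPLE_CSV = """crawl_date,source,target,count
20200101,example.org,archive.org,12
20200101,example.org,wikipedia.org,3
20210101,archive.org,wikipedia.org,7
20210101,example.org,archive.org,5
"""


def link_totals(csv_file):
    """Sum the link counts leaving (outbound) and reaching (inbound) each domain."""
    outbound = Counter()
    inbound = Counter()
    for row in csv.DictReader(csv_file):
        n = int(row["count"])
        outbound[row["source"]] += n  # edge weight leaving the source node
        inbound[row["target"]] += n   # edge weight arriving at the target node
    return outbound, inbound


outbound, inbound = link_totals(io.StringIO(SAMPLE_CSV))

# Domains ranked by total outbound and inbound links across all crawl dates.
print("Outbound:", outbound.most_common())
print("Inbound:", inbound.most_common())
```

To try this on a downloaded file, the `io.StringIO(SAMPLE_CSV)` argument could be replaced with a file object such as `open(path, newline="", encoding="utf-8")`, where `path` is wherever you saved the CSV. Grouping rows by the crawl date column before counting would give one set of totals per crawl, which is one way to look at change over time.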