Longitudinal Graph Analysis (LGA) files contain a complete list of which URLs link to which other URLs, along with a timestamp, across an entire web archive collection. They are web graph files that record the linking behavior of all resources in a collection over time. By studying the longitudinal details of which webpages link to which other webpages, one can trace networks of influence, follow the formation and decline of web communities, identify important hosts or domains within a collection, or otherwise observe how websites interact with one another through their linking activity.
Visualizing a network graph of top-level websites in the Human Rights collection
The above example is a Gephi timeline visualization of the top-level websites in Columbia University's Human Rights collection, generated from the LGA dataset. The visualization shows the dynamic linking behavior of over 17,000 websites over a period of 6 years. The edges between these websites are weighted by the number of unique links between them (only websites that share more than 100 unique links are represented). There are over 25,000 weighted edges representing an aggregate of over 4 billion unique links between these websites over time. The Gephi visualization allows one to explore community formation and website associations in the collection over time.
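The edge weighting described above can be approximated with a short Python sketch. It is not the pipeline used to produce the visualization; it only illustrates the idea of collapsing URL-level links into host-level edges and keeping edges above a threshold. The function name host_edges, the use of the hostname as a stand-in for a "top-level website", the treatment of edges as directed, and the choice to skip links within a single host are all assumptions made for illustration, and the sample URLs are invented.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def host_edges(url_links, min_unique_links=100):
    """Collapse (source_url, target_url) pairs into weighted host-to-host edges.

    The weight of an edge is the number of distinct URL pairs behind it.
    Only edges whose weight exceeds min_unique_links are returned.
    """
    unique_links = defaultdict(set)
    for source_url, target_url in url_links:
        source_host = urlsplit(source_url).hostname
        target_host = urlsplit(target_url).hostname
        # Assumption: links inside one host are not interesting for this graph.
        if not source_host or not target_host or source_host == target_host:
            continue
        unique_links[(source_host, target_host)].add((source_url, target_url))
    return {
        edge: len(pairs)
        for edge, pairs in unique_links.items()
        if len(pairs) > min_unique_links
    }


# Invented sample data; a tiny threshold is used so the filter is visible.
sample_links = [
    ("http://a.example/1", "http://b.example/x"),
    ("http://a.example/2", "http://b.example/x"),
    ("http://a.example/1", "http://b.example/x"),  # duplicate, counted once
    ("http://a.example/1", "http://c.example/"),
]
edges = host_edges(sample_links, min_unique_links=1)
```

The resulting dictionary maps (source host, target host) pairs to weights and could be written out as an edge list for a tool such as Gephi.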
Plotting top image links within the Fashion Blogs collection
In this example, an LGA file from LIM College's Fashion Blogs collection is used to determine the 500 most-linked-to image URLs within the collection for four different time periods over a year, roughly corresponding to Spring, Summer, Fall, and Winter. Each set of 500 top images is then plotted according to brightness (x-axis) and hue (y-axis) using the ImagePlot visualization tool, in the hope that each season's plot would correspond to common assumptions about seasonal fashion choices. The example demonstrates how LGA files can support a variety of analytical methods beyond network graph analysis by revealing the most-linked resources within a collection.
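One possible way to rank the most-linked-to image URLs for a time period is sketched below in Python. It assumes the links have already been resolved into (source URL, timestamp, target URL) triples, that timestamps are 14-digit strings of the form YYYYMMDDhhmmss (so they can be compared as text), and that images can be recognized by file extension. None of these details is confirmed by the description above, and the function name top_image_urls and the sample data are hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit

# Assumption: an image is any URL whose path ends with one of these extensions.
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif")


def top_image_urls(links, start, end, n=500):
    """Return the n most-linked-to image URLs with start <= timestamp < end.

    links is an iterable of (source_url, timestamp, target_url) triples.
    """
    counts = Counter()
    for _source_url, timestamp, target_url in links:
        if not (start <= timestamp < end):
            continue
        path = urlsplit(target_url).path.lower()
        if path.endswith(IMAGE_EXTENSIONS):
            counts[target_url] += 1
    return counts.most_common(n)


# Invented sample data for a "Spring" window.
sample = [
    ("http://blog.example/a", "20140401000000", "http://img.example/red.jpg"),
    ("http://blog.example/b", "20140415000000", "http://img.example/red.jpg"),
    ("http://blog.example/c", "20140420000000", "http://img.example/blue.png"),
    ("http://blog.example/d", "20140801000000", "http://img.example/red.jpg"),  # outside window
    ("http://blog.example/e", "20140402000000", "http://blog.example/about"),  # not an image
]
spring = top_image_urls(sample, "20140301000000", "20140601000000", n=2)
```

Running the same function once per season would give the four sets of top images, which could then be downloaded and passed to a plotting tool.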
The LGA dataset encodes linking activity from every textual document in a collection that was captured with an HTTP 200 response code. LGA datasets are generated using a Hadoop-based production pipeline together with Apache Pig, Java, and Python scripts. The LGA dataset downloads as a compressed .tgz file containing two kinds of compressed .gz files:
ID-Map: Contains one line per URL in a collection and assigns a UID (unique identifier) to each URL. Each line contains a JSON object with three fields: the URL's UID ("id"), the URL ("url"), and the URL in SURT form ("surt_url"). Example ID-Map for five archived URLs:
ID-Graph: Each line contains a JSON object with three fields: the URL's UID ("id"), the timestamp associated with the capture of this URL ("timestamp"), and the set of UIDs of the URLs linked to by this URL at that timestamp ("outlink_ids"). Example ID-Graph for five archived URLs:
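As a rough illustration of how the two file types fit together, the Python sketch below reads ID-Map and ID-Graph lines and resolves each outlink UID back to a URL. The sample records are invented for illustration and are not taken from any real collection. The integer UIDs, the 14-digit timestamp format, and the function names load_id_map and resolve_links are assumptions; only the field names ("id", "url", "surt_url", "timestamp", "outlink_ids") come from the descriptions above.

```python
import json

# Hypothetical sample lines, invented for illustration only.
ID_MAP_LINES = [
    '{"id": 1, "url": "http://example.org/", "surt_url": "org,example)/"}',
    '{"id": 2, "url": "http://example.org/about", "surt_url": "org,example)/about"}',
    '{"id": 3, "url": "http://example.com/logo.png", "surt_url": "com,example)/logo.png"}',
]
ID_GRAPH_LINES = [
    '{"id": 1, "timestamp": "20150101000000", "outlink_ids": [2, 3]}',
    '{"id": 2, "timestamp": "20150102000000", "outlink_ids": [1]}',
]


def load_id_map(lines):
    """Return a dict mapping each UID to its URL."""
    id_to_url = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        id_to_url[record["id"]] = record["url"]
    return id_to_url


def resolve_links(graph_lines, id_to_url):
    """Yield (source_url, timestamp, target_url) for every outlink.

    UIDs missing from the map resolve to None rather than raising an error.
    """
    for line in graph_lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        source_url = id_to_url.get(record["id"])
        for target_id in record["outlink_ids"]:
            yield source_url, record["timestamp"], id_to_url.get(target_id)


links = list(resolve_links(ID_GRAPH_LINES, load_id_map(ID_MAP_LINES)))
```

For a real dataset, the same functions could be fed lines read from the decompressed .gz files instead of the in-memory lists. For a large collection, the ID-Map dictionary may not fit in memory, in which case a disk-backed lookup would be a more suitable choice.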