ARCH collection files provide a glimpse of the domains present in a web archive collection. These may be especially useful in conjunction with other ARCH datasets.
#teamwildfyre. This Archives Unleashed datathon project used domain collection frequency files to identify and understand the origins of the data within a web archive collection.
Example domain count visualization from the UBC BC Wildfires web archive dataset
These jobs produce files that allow the user to explore domain-related information and patterns:
- Domain frequency: The ARCH collection file is a CSV with the following columns: domain and count. See example.
- Web archive transformation (WAT) files include a brief header that identifies the corresponding URL via "WARC-Target-URI," the corresponding W/ARC file via "WARC-Refers-To," and additional mapping information.
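As a minimal sketch of how a domain frequency file might be explored, the Python snippet below reads a CSV with the two columns described above (domain and count) and lists the most frequent domains. The header spelling, the function name `top_domains`, and the sample rows are assumptions made for illustration; they are not taken from a real ARCH dataset.

```python
import csv
import io


def top_domains(csv_text, n=5):
    """Return the n most frequent (domain, count) pairs from a domain frequency CSV.

    Assumes a header row with the column names "domain" and "count".
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = [(row["domain"], int(row["count"])) for row in reader]
    # Sort by count, largest first.
    rows.sort(key=lambda pair: pair[1], reverse=True)
    return rows[:n]


# Hypothetical sample data, standing in for a downloaded domain frequency file.
SAMPLE_CSV = """domain,count
example.org,120
example.com,45
news.example.net,300
"""

for domain, count in top_domains(SAMPLE_CSV, n=2):
    print(f"{domain}\t{count}")
```

For a file on disk, the same function could be given the contents of the file, for example `top_domains(open(path, newline="").read())`, where `path` is wherever the downloaded CSV was saved.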