Introduction
ARCH collection files provide an overview of the domains present in a web archive collection. They can be especially useful when combined with other ARCH datasets.
Use Case
#teamwildfyre. This Archives Unleashed datathon project used domain collection frequency files to identify and understand the origins of the data within a web archive collection.
Example domain count visualization from the UBC BC Wildfires web archive dataset
Technical Details
These jobs produce files that allow the user to explore domain-related information and patterns:
- Domain frequency: The ARCH collection file is a CSV with the following columns: domain and count. See example.
- Web archive transformation (WAT) files: Each record includes a brief header that identifies the corresponding URL via "WARC-Target-URI," the corresponding W/ARC record via "WARC-Refers-To," and additional mapping information.
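As a rough illustration of how these two file types might be explored, the Python sketch below ranks domains from a domain frequency CSV and pulls named fields out of a WAT-style record header. It assumes the CSV has a header row with the columns domain and count, and that header fields appear one per line as "Name: value". The function names (top_domains, parse_record_header) and all sample data are hypothetical and are not part of ARCH.

```python
import csv
import io


def top_domains(csv_text, n=10):
    """Return the n most frequent (domain, count) pairs from a domain frequency CSV."""
    # Assumes a header row naming the columns "domain" and "count".
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = [(row["domain"], int(row["count"])) for row in reader]
    # Sort by count, highest first; ties keep their original file order.
    rows.sort(key=lambda pair: pair[1], reverse=True)
    return rows[:n]


def parse_record_header(header_text):
    """Map 'Name: value' header lines to a dict, skipping lines without a colon."""
    fields = {}
    for line in header_text.splitlines():
        name, sep, value = line.partition(":")
        if sep and name.strip():
            fields[name.strip()] = value.strip()
    return fields


# Made-up sample data for illustration only; not taken from a real collection.
sample_csv = "domain,count\nexample.org,120\nexample.com,45\nexample.net,300\n"
sample_header = (
    "WARC/1.0\n"
    "WARC-Type: metadata\n"
    "WARC-Target-URI: http://example.org/page\n"
    "WARC-Refers-To: <urn:uuid:00000000-0000-0000-0000-000000000000>\n"
)

print(top_domains(sample_csv, n=2))
print(parse_record_header(sample_header)["WARC-Target-URI"])
```

For a real collection, the text of a downloaded domain frequency file could be passed to top_domains in place of sample_csv. A dedicated WARC-reading library would be a more robust choice than this line-based header parser for full WAT files.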