Introduction
ARCH file format datasets contain information on certain types of binary files found within web archive collections, including PDFs, audio, images, PowerPoints, spreadsheets, video, and Word documents.
Use Case
Non-textual content in the DC Punk web archive. This Archives Unleashed Datathon project explored non-textual elements, specifically audio and video objects. Project members used data to download images and explore file type frequencies.
Example gallery projection of images extracted from the DC Punk web archive dataset
Technical Details
Dataset files are divided among the main types of binary file:
- Extract audio information: Create a CSV with the following columns: crawl date, URL of the audio file, filename, audio extension, MIME type (as provided by the web server), MIME type (as detected by Apache Tika), audio MD5 hash, and audio SHA1 hash. See example.
- Extract image information: Create a CSV with the following columns: crawl date, URL of the image, filename, image extension, MIME type (as provided by the web server), MIME type (as detected by Apache Tika), image width, image height, image MD5 hash, and image SHA1 hash. See example. (This dataset does not extract the images themselves from the web archive collection; it provides metadata about them for further analysis. You could, for instance, use the list of image URLs to download the images from the command line.)
- Extract PDF information: Create a CSV with the following columns: crawl date, URL of the PDF file, filename, PDF extension, MIME type (as provided by the web server), MIME type (as detected by Apache Tika), PDF MD5 hash, and PDF SHA1 hash. See example.
- Extract PowerPoint information: Create a CSV with the following columns: crawl date, URL of a PowerPoint or similar file, filename, PowerPoint or similar file extension, MIME type (as provided by the web server), MIME type (as detected by Apache Tika), PowerPoint or similar file MD5 hash, and PowerPoint or similar file SHA1 hash. See example. This dataset covers a variety of presentation file types, including .ppt, .odp, and .key.
- Extract spreadsheet information: Create a CSV with the following columns: crawl date, URL of the spreadsheet file, filename, spreadsheet extension, MIME type (as provided by the web server), MIME type (as detected by Apache Tika), spreadsheet MD5 hash, and spreadsheet SHA1 hash.
- Extract video information: Create a CSV with the following columns: crawl date, URL of the video file, filename, video extension, MIME type (as provided by the web server), MIME type (as detected by Apache Tika), video MD5 hash, and video SHA1 hash. See example.
- Extract Word document information: Create a CSV with the following columns: crawl date, URL of the Word document or similar file, filename, word processor file extension, MIME type (as provided by the web server), MIME type (as detected by Apache Tika), word processor file MD5 hash, and word processor file SHA1 hash. See example. This dataset covers a variety of word processing document types, including .doc, .odt, .rtf, and .wpd.
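Because every file format dataset shares the same basic columns, a short script can summarize any of them. The following is a minimal Python sketch, not an official ARCH tool: the header names (`crawl_date`, `url`, `filename`, `extension`, `mime_type_web_server`, `mime_type_tika`, `md5`, `sha1`) are assumptions that mirror the columns described above, and the sample rows are invented for illustration. It counts file extensions and lists URLs where the web server and Apache Tika disagree about the MIME type.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows. The header names are assumptions that mirror the
# columns described above; check them against your own downloaded CSV.
SAMPLE_CSV = """crawl_date,url,filename,extension,mime_type_web_server,mime_type_tika,md5,sha1
20130113,http://example.org/a.pdf,a.pdf,pdf,application/pdf,application/pdf,md5-a,sha1-a
20130113,http://example.org/b.pdf,b.pdf,pdf,text/html,application/pdf,md5-b,sha1-b
20130114,http://example.org/c.PDF,c.PDF,PDF,application/pdf,application/pdf,md5-c,sha1-c
"""


def summarize(rows):
    """Count extensions and collect URLs whose two MIME types disagree."""
    extension_counts = Counter()
    mime_mismatches = []
    for row in rows:
        # Normalize case so "PDF" and "pdf" are counted together.
        extension_counts[row["extension"].lower()] += 1
        if row["mime_type_web_server"] != row["mime_type_tika"]:
            mime_mismatches.append(row["url"])
    return extension_counts, mime_mismatches


# For a real dataset, replace io.StringIO(SAMPLE_CSV) with
# open("your-dataset.csv", newline="", encoding="utf-8").
rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
extension_counts, mime_mismatches = summarize(rows)
print(extension_counts.most_common())
print(mime_mismatches)
```

Rows where the two MIME types differ are often worth a closer look, since the web server's declared type can be wrong or generic.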
Tool Recommendations
These datasets do not give you the binary files themselves, but rather contextual metadata about them. However, you can use the dataset to download the individual binary files.
You can combine the crawl_date and the URL, along with either the Global Wayback prefix or the Archive-It prefix, to set up a list of files for download.
For example, given the data from the extracted PDF information datasets:
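| crawl_date | url |
|---|---|
| 20130113 | http://www.wired.com/images_blogs/threatlevel/2012/09/swartzsuperseding.pdf |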
from the collection ARCHIVEIT-03492, you could construct a URL to directly retrieve the identified PDF document, such as:
https://wayback.archive-it.org/3492/20130113/http://www.wired.com/images_blogs/threatlevel/2012/09/swartzsuperseding.pdf
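The same construction can be scripted for a whole dataset. The sketch below is one possible approach in Python, under stated assumptions: the function names are hypothetical, the Archive-It prefix is taken from the example URL above, and the rule for turning a collection identifier such as ARCHIVEIT-03492 into the number 3492 (dropping the leading zero) is inferred from that single example.

```python
# Prefix taken from the example URL above.
ARCHIVE_IT_PREFIX = "https://wayback.archive-it.org"


def collection_number(collection_id):
    """Turn an identifier like 'ARCHIVEIT-03492' into '3492'.

    Assumption: the numeric part after the last hyphen, without leading
    zeros, is the collection number used in the Wayback URL.
    """
    return str(int(collection_id.rsplit("-", 1)[1]))


def wayback_url(collection_id, crawl_date, url):
    """Combine the prefix, collection number, crawl date, and original URL."""
    number = collection_number(collection_id)
    return f"{ARCHIVE_IT_PREFIX}/{number}/{crawl_date}/{url}"


example = wayback_url(
    "ARCHIVEIT-03492",
    "20130113",
    "http://www.wired.com/images_blogs/threatlevel/2012/09/swartzsuperseding.pdf",
)
print(example)
```

Applying `wayback_url` to each row of a dataset and writing the results to a text file, one URL per line, gives a download list that a command-line downloader can work through.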
Programming Historian also has a tutorial for Working with batches of PDF files. It covers how to perform OCR and text extraction with free command-line tools like Tesseract and Poppler and how to get an overview of large numbers of PDF documents using topic modelling.
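After downloading files, the MD5 and SHA1 hash columns offer a way to check that what you retrieved matches what the dataset recorded. The following Python sketch uses only the standard library; the function name and the lowercase hexadecimal hash format are assumptions, and a replayed copy from a Wayback service may not always be byte-identical to the archived payload, so treat a mismatch as a prompt to investigate, not as proof of corruption.

```python
import hashlib


def matches_recorded_hashes(data, recorded_md5, recorded_sha1):
    """Return True if the bytes hash to both recorded values.

    Assumption: the dataset stores hashes as hexadecimal strings;
    comparison here is case-insensitive.
    """
    md5_ok = hashlib.md5(data).hexdigest() == recorded_md5.lower()
    sha1_ok = hashlib.sha1(data).hexdigest() == recorded_sha1.lower()
    return md5_ok and sha1_ok


# Illustration with in-memory bytes; for a downloaded file, read it with
# open(path, "rb").read() and pass the md5 and sha1 values from its CSV row.
sample = b"hello"
ok = matches_recorded_hashes(
    sample,
    "5d41402abc4b2a76b9719d911017c592",
    "aaf4c61ddcc5e8a2dabede0f3b482cd9aea9434d",
)
print(ok)
```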