Overview
You can upload and integrate WARC or ARC files into your Archive-It collections if the files were created with external capture technologies, including inherited or legacy web archiving systems and/or donations. Like normal Archive-It web crawls, uploaded files count towards your account's data budget.
Upload and store WARC or ARC files
The Upload WARCs tab is accessible in each of your collections. If you do not see the tab, submit a request to enable this feature.
To upload a file to your collection, select Browse to find your file(s) in local or networked storage, and then select Upload to initiate the process.
Note: To prevent upload errors, do not use special characters in filenames, such as: ! " # $ % & '
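If you are preparing many files, it may help to check filenames before uploading. The following is a minimal Python sketch, not an Archive-It tool. It checks only the characters listed in the note above; the note gives these as examples, so other characters may also cause errors. The function name `find_special_chars` and the sample filenames are hypothetical.

```python
# Characters listed as examples in the note above. The complete set that
# can cause upload errors is not documented here, so treat this as a minimum.
LISTED_SPECIAL_CHARS = set("!\"#$%&'")


def find_special_chars(filename: str) -> list:
    """Return the listed special characters found in filename,
    in order of first appearance, without duplicates."""
    found = []
    for ch in filename:
        if ch in LISTED_SPECIAL_CHARS and ch not in found:
            found.append(ch)
    return found


if __name__ == "__main__":
    # Hypothetical filenames, for illustration only.
    for name in ["crawl-2015-11.warc.gz", "donor's files #2.warc.gz"]:
        problems = find_special_chars(name)
        if problems:
            print(f"{name}: rename before upload, contains {problems}")
        else:
            print(f"{name}: no listed special characters")
```

Renaming a flagged file before upload is simpler than troubleshooting a failed upload afterwards.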
Do not refresh, close, or navigate away from this view until it refreshes automatically to display your uploaded file(s) in the WARC Upload List. Upload speeds will vary depending upon your local bandwidth and the size of your file(s).
The WARC Upload List includes the following information about your uploaded files:
- WARC Filename: The name of each uploaded file as it appears in Archive-It storage and the Archive-It Wayback index, following the naming convention: ARCHIVEIT-[Collection number]-EXTERNAL-[UPLOAD TIMESTAMP]-[ORIGINAL FILENAME].
- File size: The volume of the uploaded file in storage.
- MD Hash: An MD5 checksum value generated to uniquely “fingerprint” the contents of each uploaded file. You can compare it to the original file’s checksum to verify integrity.
- Status: Updated according to each file’s stage in the process of being permanently added to Archive-It. Immediately upon upload this status will be Processing. After completing file format validation and depositing in storage, the status will change to Stored.
- Date: The day on which each file was uploaded.
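To compare the MD Hash value in the list against your original file, you can compute the file's MD5 checksum locally. The following is a minimal Python sketch using the standard `hashlib` module. It assumes the value shown in the WARC Upload List is the usual hexadecimal MD5 digest; the function names and the example path are hypothetical.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Compute the hexadecimal MD5 digest of a file, reading it in
    chunks so that large WARC files do not need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_hash(path: str, listed_hash: str) -> bool:
    """Compare a local file's MD5 with a hash copied from the upload list.
    Assumes the listed value is a hex digest; case and surrounding
    whitespace are ignored."""
    return md5_of_file(path) == listed_hash.strip().lower()


# Example usage (hypothetical path; paste the hash from the upload list):
# print(matches_listed_hash("example.warc.gz", "<hash from WARC Upload List>"))
```

If the two values differ, the stored file is not byte-for-byte identical to your local copy.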
Replay external WARC or ARC files in Wayback
To replay the contents of your uploaded WARC or ARC files in Wayback, it is necessary to have an access point in the form of a seed URL in your collection's Seeds tab. The seed URL can be the starting point or any other page in your externally-collected WARC or ARC files.
Once the seed URL is added to your collection's Seeds tab, there is no need to crawl this or any other seed URL in order to replay the contents of your external files. After a file is uploaded and stored, it may take up to 24 hours for it to be indexed and available for replay in Wayback.
To replay in Wayback, select the Wayback calendar link in your collection's Seeds tab. If your seed access is public, you can also select the seed URL on archive-it.org.
In Wayback, each uploaded document displays its capture date. For example, a document that was crawled in November 2015 and added to an Archive-It collection in August 2017 displays its November 2015 capture date.
Outcome
External WARC or ARC files added to your Archive-It account can replay in Wayback, are discoverable by full-text search and your own descriptive metadata, and are stored and preserved like any files collected using Archive-It.
Quality assurance: What to expect
Archive-It can ingest and verify the integrity of WARC and ARC files created by external capture technologies. However, this does not guarantee that the indexed contents of these files, especially videos, will replay in Archive-It's Wayback replay mode precisely as they do in other separate and/or legacy replay systems. Please feel free to report significant differences in the appearance of your archived documents, but understand that our ability to provide support will be limited without control of the original capture technologies.