Overview
You may upload WARC or ARC files created with external capture technologies, such as inherited or legacy web archiving systems, or received as donations, and integrate them into your Archive-It collections. Like normal Archive-It web crawls, uploaded files count towards your account's data budget.
Contact the Internet Archive's web archivists to enable this feature in your Archive-It account. They can also advise on whether this tool is the best, most efficient way to add your external files to Archive-It.
Instructions
Upload and store WARC or ARC files
Once enabled in your account, the Upload W/ARCs tab is accessible in each of your collections:
To add to your collection, use the "Browse…" function to find your file(s) in local or networked storage, then click the "Upload" button to initiate the process of adding these files to your collection:
NB: To avoid uploading errors, avoid using special characters in filenames, such as: ! " # $ % & '
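If you have many files to prepare, a short script can flag problem filenames before you upload. The following is a minimal sketch, not an Archive-It tool: the function names are hypothetical, and the character set contains only the examples listed in the note above, which says "such as" and so may not be exhaustive.

```python
from pathlib import Path

# Characters named in the note above. The note says "such as", so this
# set is an assumption and may need to be extended.
RISKY_CHARS = set("!\"#$%&'")


def risky_characters(filename: str) -> list[str]:
    """Return the risky characters in a filename, in order of first appearance."""
    found = []
    for ch in Path(filename).name:
        if ch in RISKY_CHARS and ch not in found:
            found.append(ch)
    return found


def safe_name(filename: str, replacement: str = "_") -> str:
    """Suggest a filename with each risky character replaced."""
    name = Path(filename).name
    return "".join(replacement if ch in RISKY_CHARS else ch for ch in name)


if __name__ == "__main__":
    for candidate in ["legacy-crawl.warc.gz", "crawl#1&2.warc.gz"]:
        bad = risky_characters(candidate)
        if bad:
            print(f"{candidate}: contains {bad}, consider {safe_name(candidate)}")
        else:
            print(f"{candidate}: ok")
```

Renaming a file before upload changes only its name, not its contents, so its checksum stays the same.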
Upload speeds will vary depending upon your local bandwidth and the size of your file(s), so please do not refresh, close, or navigate away from this view until it refreshes automatically to display your uploaded file(s) in the table below:
This table represents the following information about your uploaded files:
- WARC Filename: The name of each uploaded file as it appears in Archive-It storage and the Archive-It Wayback index, following the naming convention: ARCHIVEIT-[Collection number]-EXTERNAL-[UPLOAD TIMESTAMP]-[ORIGINAL FILENAME].
- File size: The volume of the uploaded file in storage.
- MD Hash: An MD5 checksum value generated to “fingerprint” the contents of each uploaded file. Compare it to the checksum of your original file to verify that the upload arrived intact.
- Status: Updated according to each file’s stage in the process of being permanently added to Archive-It. Immediately upon upload this status will be Processing. After completing file format validation and depositing in storage, the status will change to Stored.
- Date: The day on which each file was uploaded.
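To compare the MD Hash value against your original file, you can compute the file's MD5 checksum locally. The sketch below uses Python's standard hashlib module and also shows one way to split a stored filename into the parts of the naming convention. The function names are hypothetical, and the parsing pattern assumes that the collection number is numeric and that the upload timestamp contains no hyphens; neither detail is confirmed by this article.

```python
import hashlib
import re


def md5_of_file(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Compute the MD5 checksum of a file, reading it in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_reported_hash(path: str, reported: str) -> bool:
    """Compare a local file's MD5 with the value copied from the MD Hash column."""
    return md5_of_file(path) == reported.strip().lower()


# Assumed shape of a stored name:
# ARCHIVEIT-[Collection number]-EXTERNAL-[UPLOAD TIMESTAMP]-[ORIGINAL FILENAME]
# Assumption: numeric collection number, timestamp without hyphens.
STORED_NAME = re.compile(r"^ARCHIVEIT-(\d+)-EXTERNAL-([^-]+)-(.+)$")


def parse_stored_name(name: str):
    """Return (collection, timestamp, original filename), or None if the name does not fit."""
    match = STORED_NAME.match(name)
    return match.groups() if match else None
```

For example, matches_reported_hash("legacy-crawl.warc.gz", value_from_table) returns True only when the local file and the uploaded copy have identical contents; both the filename and value_from_table are placeholders here.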
Replay external WARC or ARC files in Wayback mode
NB: To browse the contents of these files in Wayback mode, you must first add an access point to your collection in the form of a seed URL. Choose the URL of the starting point, or of any other page in your externally captured WARC or ARC files, to serve as the access point in your collection's Seeds list and for public users on archive-it.org.
There is no need to crawl this or any other seed URL in order to replay the contents of your external files. Like WARCs created with Archive-It, these uploaded files may require up to 24 hours after storing to appear for web browsing in Wayback replay mode. Once indexed, you can access the results from the "Wayback" calendar links under the collection's Seeds tab or, if public, on archive-it.org:
In Wayback mode, each uploaded document displays its capture date, such as this uploaded document that was crawled in November 2015 and added to an Archive-It collection in August 2017:
Outcome
External WARC or ARC files added to your Archive-It account can replay in Wayback mode, are discoverable by full-text search and your own descriptive metadata, and may be stored and preserved like any files collected using Archive-It.
Quality assurance: What to expect
Archive-It can ingest and verify the integrity of WARC and ARC files created by external capture technologies. However, this does not guarantee that the indexed contents of these files, especially videos, will replay in Archive-It's Wayback replay mode precisely as they do in other separate and/or legacy replay systems. Please feel free to report significant differences in the appearance of your archived documents, but understand that Internet Archive staff can provide only limited support without control of the original capture technologies.