- The contents of partner crawls are stored in the (W)ARC file format and hosted on servers within one of Internet Archive's digital repositories in the greater San Francisco Bay Area.
- The primary copy of Archive-It partner data is located in Richmond, California. The data is stored and hosted in a controlled-access, alarmed, fire-protected building. Data integrity and system availability are assured using a combination of internal and external systems and processes. A mirror copy of the archived data is kept on a separate set of machines within a different Internet Archive repository. Most partner organizations have a third copy stored at a third Internet Archive data center in the San Francisco area.
- Additionally, we have a dark copy of Archive-It data stored for preservation at a university in the Eastern United States. A mirror of the Archive-It repository was sent to this location in 2013 and an update is planned for 2017. We are currently researching additional partnerships with non-profit research data centers for added data redundancy in offline storage. We are also exploring emerging multi-copy storage systems within our own repositories.
- Archive-It also has alliances with both LOCKSS and DuraCloud to support our partners' preservation strategies, and provides tools for partners to download their data for local storage and preservation.
Secure, quality-controlled website
We provide a secure website for browsing and access to the partners' content. Security measures include:
- Physical security of the data repository, which has controlled access and is alarmed and fire-protected.
- Security and monitoring of the harvested data are handled by a mix of internal and external systems: internal routine tests verify data integrity, and a commercial web service’s monitoring capabilities track system availability.
- Data is refreshed on the physical media, and integrity is maintained by computing a hash (a digital fingerprint) of the content, comparing it with the previously recorded hash, and rewriting the content to new blocks on disk.
- Successes and failures to match expected results are logged, and appropriate individuals are notified in case of failure. Any reported problems are individually investigated, and our team repairs them by replacing failed hardware or restoring content from alternate copies.
- Incidents such as service outages or service performance parameters exceeding operating tolerances are detected, tracked in system support tools, and addressed promptly.
- Partners are notified in advance of any routine maintenance or system reconfiguration with the potential of service interruption.
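The hash comparison mentioned in the list above can be illustrated with a minimal sketch. It is not Archive-It's actual tooling: the choice of SHA-256, the function names `sha256_of` and `check_fixity`, and the idea of passing in a previously recorded digest are all assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks.

    SHA-256 is an assumption; the source only says "a hash".
    """
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read fixed-size chunks so large (W)ARC files never sit fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_fixity(path: Path, recorded_digest: str) -> bool:
    """Compare the file's current digest with a previously recorded one."""
    return sha256_of(path) == recorded_digest
```

A partner who downloads their data for local storage could run the same kind of check: record a digest when the file is first stored, then call `check_fixity` on a schedule and restore from another copy whenever it returns `False`.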
Best practices and standards
- The Internet Archive has been archiving the web since 1996, has adapted to changing file formats throughout that time, and has kept the archived content accessible for viewing through the Wayback Machine. The Wayback Machine software and our primary crawler, Heritrix, are open source tools which have global institutional buy-in and a large invested community committed to ensuring their continued success and viability. In addition, we continue to develop new capture and replay technologies and release them under open-source licenses.
- Internet Archive is a founding member of the IIPC (http://netpreserve.org/), which acknowledges the importance of international collaboration for preserving Internet content for future generations, and works toward common global formats, including the development of the (W)ARC file.
- The WARC file format is an ISO standard (ISO 28500).
- The WARC format is a revision of the ARC File Format [ARC_IA] developed in the mid-1990s, which was first used to store Web crawls as sequences of content blocks harvested from the World Wide Web.
- We do not currently plan to migrate legacy ARCs into WARCs, as ARC continues to be a widely supported format. The open source access tools we use support both file types.
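To make the "sequence of content blocks" idea concrete, here is a minimal sketch of how one uncompressed WARC record is laid out: a version line, named header fields, a blank line, and then a content block whose size is given by `Content-Length`. The sample record, its field values, and the function name `parse_warc_record` are hypothetical; real files are usually gzip-compressed and are better handled with an established WARC library.

```python
# A hand-built, uncompressed WARC record (all values are illustrative).
SAMPLE_RECORD = (
    b"WARC/1.0\r\n"
    b"WARC-Type: resource\r\n"
    b"WARC-Target-URI: http://example.org/\r\n"
    b"WARC-Date: 2016-01-01T00:00:00Z\r\n"
    b"WARC-Record-ID: <urn:uuid:00000000-0000-0000-0000-000000000000>\r\n"
    b"Content-Type: text/plain\r\n"
    b"Content-Length: 9\r\n"
    b"\r\n"
    b"hello web"
    b"\r\n\r\n"
)


def parse_warc_record(raw: bytes) -> tuple[str, dict[str, str], bytes]:
    """Split one uncompressed WARC record into version, headers, and block."""
    # The header section ends at the first blank line (CRLF CRLF).
    head, _, rest = raw.partition(b"\r\n\r\n")
    lines = head.decode("utf-8").split("\r\n")
    version = lines[0]
    headers: dict[str, str] = {}
    for line in lines[1:]:
        # Split on the first colon only; values such as URIs contain colons.
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    # Content-Length gives the exact size of the content block in bytes.
    length = int(headers["Content-Length"])
    return version, headers, rest[:length]


version, headers, block = parse_warc_record(SAMPLE_RECORD)
print(version, headers["WARC-Type"], block)
```

A WARC file is a concatenation of such records, so a reader can walk the file by parsing one header section, skipping `Content-Length` bytes plus the trailing blank lines, and repeating.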