Storage & Preservation

Archive-It partners’ web archive data is stored in the (W)ARC file format, an ISO standard based on the prior ARC format that was designed by the Internet Archive. Other non-archival data related to partner collections, such as descriptive metadata, crawl reports, et cetera, is stored in multiple, replicated databases and is generally available to partners in common structured formats such as JSON and XML.
Archive-It partners’ web archive data is hosted on servers within at least two of Internet Archive's self-owned and self-operated data centers in separate locations.
The primary Internet Archive data centers are located in multiple locations within the greater San Francisco Bay area in California, USA. The Internet Archive owns additional data centers both in other geographic regions of the United States and in other countries. Mirror copies of some Archive-It partners' archival data may also be held in these other locations for further preservation replication or as part of contractual obligations, including in dark archives.

The Internet Archive maintains a minimum of two verified copies of all Archive-It partners’ web archive data and often four or more separate copies of this data according to various contractual or partner requirements or additive services.
Archive-It partners’ web archive data is stored and preserved across diverse repository systems and architectures, so that the data is not dependent on any single technology for its management.
Periodic integrity checks are performed on all Archive-It web archive data to ensure its fixity.
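As an illustration of what a fixity check involves, partners who hold local copies of their data could run something like the following. This is a minimal sketch and not a description of the Internet Archive's internal tooling: the function names, the choice of SHA-256, and the manifest structure (a mapping of file name to recorded hex digest) are all assumptions made for the example.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_fixity(manifest: dict[str, str], root: Path) -> dict[str, str]:
    """Compare each file under `root` against its recorded digest.

    `manifest` is a hypothetical mapping of file name -> expected hex digest.
    Returns a mapping of file name -> "ok", "mismatch", or "missing".
    """
    results = {}
    for name, expected in manifest.items():
        target = root / name
        if not target.is_file():
            results[name] = "missing"
        elif sha256_of(target) == expected.lower():
            results[name] = "ok"
        else:
            results[name] = "mismatch"
    return results
```

A file reported as "mismatch" or "missing" would then be restored from another verified copy. Reading in chunks keeps memory use flat, which matters because (W)ARC files are often large.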
All Archive-It partner data is stored and hosted in a controlled-access, alarmed, fire-protected building. Data integrity and system availability are assured using a combination of internal and external systems and processes.
Archive-It also allies with multiple other preservation systems, including LOCKSS and DuraCloud, to facilitate automated replication of Archive-It partners' archival data into other hosted and local preservation systems. Archive-It also provides multiple ways for partners to download their data for local storage and preservation or ingest into other systems.
Security and monitoring of the harvested data is accomplished through a mix of internal and external systems; data integrity through internal routine tests; and system availability through the use of internal and commercial web monitoring services.
Archived data is periodically migrated onto new physical media to account proactively for physical media reliability. Monitoring, logging and notification systems escalate any hardware issues to an on-call team responsible for infrastructure maintenance.
Incidents such as service outages, networking issues, or other irregular performance parameters exceeding operating tolerances are detected, tracked in system support tools, and addressed promptly.
Partners are notified in advance of any routine maintenance or system reconfiguration with the potential of service interruption.

Best practices and standards

The Internet Archive has been archiving the web since 1996 and has decades of expertise in archiving web-published materials of all types and in making this archived content accessible for viewing through the Wayback Machine and through collaborative partnerships and other discovery and access points.
As the first institution archiving the public web at scale for historical preservation purposes, the Internet Archive is the creator of the majority of the technologies, systems, formats, and processes used by hundreds of institutions worldwide. The main web archiving technologies created by the Internet Archive are available under open-source licenses and have a large, invested community of institutional users committed to ensuring their ongoing success and improvement. In addition, we continue to develop new capture, replay, indexing, and other archiving technologies and release them under open-source licenses for community use.
The Internet Archive is a founding member of the IIPC, a global coalition that advances the preservation of Internet and web content for future generations through international collaboration.
The (W)ARC file format is an ISO standard (ISO 28500), and the Internet Archive remains actively involved in the advancement of the WARC format specification.
The (W)ARC format is a revision of the ARC File Format [ARC_IA] developed in the mid-1990s, which was first used to store Web crawls as sequences of content blocks harvested from the World Wide Web.
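To make the "sequence of content blocks" idea concrete, the sketch below reads records from an uncompressed WARC stream using only the Python standard library. Each record starts with a version line, followed by named header fields, a blank line, and a content block whose size is given by the Content-Length field. The function name and the sample record are hypothetical. The sample is deliberately minimal and omits fields that a complete record would carry, such as WARC-Record-ID and WARC-Date. Folded header lines and gzip-compressed records are not handled here.

```python
import io


def read_warc_record(stream):
    """Read one uncompressed WARC record from a binary stream.

    Returns (version, headers, payload), or None at end of stream.
    """
    line = stream.readline()
    # Skip the blank lines that separate consecutive records.
    while line in (b"\r\n", b"\n"):
        line = stream.readline()
    if not line:
        return None
    version = line.strip().decode("utf-8")
    if not version.startswith("WARC/"):
        raise ValueError(f"not a WARC record: {version!r}")

    headers = {}
    while True:
        line = stream.readline()
        if line in (b"\r\n", b"\n", b""):
            break  # a blank line ends the header section
        name, _, value = line.decode("utf-8").partition(":")
        headers[name.strip()] = value.strip()

    # The content block is exactly Content-Length bytes long.
    payload = stream.read(int(headers["Content-Length"]))
    return version, headers, payload


# Hypothetical, minimal record used only to exercise the parser.
sample = (
    b"WARC/1.0\r\n"
    b"WARC-Type: resource\r\n"
    b"WARC-Target-URI: http://example.org/\r\n"
    b"Content-Length: 5\r\n"
    b"\r\n"
    b"hello"
    b"\r\n\r\n"
)

stream = io.BytesIO(sample)
record = read_warc_record(stream)
```

For production use, an established open-source WARC library is a better choice than a hand-written parser; this sketch only shows the record layout.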
