Vault Storage, Preservation & Sustainability
Internet Archive is a 501(c)(3) non-profit, public charity organization founded in 1996 and headquartered in San Francisco, California, USA. Internet Archive is an online digital library, and an official research library as designated by the State of California, whose mission is “Universal Access to All Knowledge.” Internet Archive owns and operates its own data centers, and also owns and operates infrastructure co-located in the data centers of mission-aligned partners, such as research universities. Internet Archive currently stewards over 99 petabytes of unique data, with multiple copies of this collection preserved, amounting to a total archive of hundreds of petabytes of data. Over 1 million users visit archive.org every day, making it a top 200 most-visited site on the web. Internet Archive works with thousands of partner organizations around the world to provide services, products, and infrastructure, and to pursue collaborative initiatives and technology development ensuring the preservation of, and perpetual access to, the cultural, social, and scientific record.
Storage
Storage Locations
The standard Vault service guarantee is that all data deposited into Vault will be stored with a minimum of three copies (often four copies) in at least two physical locations in Internet Archive's self-owned and self-operated data centers, thus providing a mission-aligned, non-profit technical infrastructure in support of Vault and other Internet Archive digital library products. The primary Internet Archive data centers are located in California, USA and Internet Archive operates additional data centers both in other geographic regions of the United States and in other countries, especially Canada and the Netherlands.
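The guarantee above (at least three copies in at least two physical locations) can be expressed as a simple check. The following is a minimal sketch only: the `Replica` type, the location labels, and the `meets_storage_guarantee` function are hypothetical names introduced for illustration and are not part of any Vault API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    """One stored copy of a deposited object (hypothetical model)."""
    location: str  # illustrative data center label, e.g. "datacenter-a"


def meets_storage_guarantee(replicas, min_copies=3, min_locations=2):
    """Return True if the replicas satisfy the stated minimums.

    The defaults mirror the guarantee described in the text:
    a minimum of three copies in at least two physical locations.
    """
    distinct_locations = {r.location for r in replicas}
    return len(replicas) >= min_copies and len(distinct_locations) >= min_locations


# Example: three copies spread over two locations satisfy the check,
# while three copies in a single location do not.
spread = [Replica("datacenter-a"), Replica("datacenter-a"), Replica("datacenter-b")]
single_site = [Replica("datacenter-a")] * 3
print(meets_storage_guarantee(spread))       # True
print(meets_storage_guarantee(single_site))  # False
```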
Storage Features
Vault features allow users to store additional replicas, beyond the minimum of three copies, for specifically designated collections. Other features allow users to designate that specific collections be stored in data centers outside the United States, currently either in Canada or in Europe. Data stored outside the United States may be stored in Internet Archive-owned and operated data centers or on Internet Archive-owned and operated hardware that is co-located in the data centers of non-profit partner organizations (usually within the data center of a research university). All Vault user data is stored and hosted in controlled-access, alarmed, fire-protected buildings. Data integrity and system availability are assured using a combination of internal and external systems and processes. See our “Data & Data Center Security & Procedures” for more information on data center procedures. All data centers aim to operate as efficiently as possible with respect to power and climate monitoring, in order to take as eco-friendly an approach as possible to core infrastructure operations.
Storage Management
Data in Vault is periodically migrated onto new physical media to account proactively for physical media reliability. Monitoring, logging, and notification systems escalate any hardware issues to an on-call team responsible for infrastructure maintenance. Vault users are notified in advance of any routine maintenance or system reconfiguration with the potential of service interruption. Incidents such as service outages, networking issues, or other irregular performance parameters exceeding operating tolerances are detected, tracked in system support tools, and addressed promptly. All Vault data is stored in multiple repository systems and architectures, including object storage and block storage, thus ensuring a diversity of technical architecture. Data deposited into Vault is stored in the same form and format as deposited: data is not transformed, decompressed, unpackaged, format-migrated, or post-processed in any way. Other non-archival data related to partner collections, such as metadata, reports, analytics, et cetera, is stored in multiple, replicated databases and is generally available to partners in common structured formats such as JSON and XML via download and/or API.
Preservation
Preservation & Fixity
When a digital object is deposited in Vault, checksums (also known as hashes) are generated for it using the MD5, SHA-1, and SHA-256 algorithms. A checksum is the equivalent of a “digital fingerprint” for a digital object. A fixity audit and repair procedure is the act of ensuring that a digital object has not changed or become corrupted (audit) and, if it has, replacing the corrupted digital object with a valid, uncorrupted copy (repair). The audit procedure generates a new checksum and compares it to the digital object’s original checksum to verify its fixity. If the newly-created checksum does not match the original checksum, the audit procedure deletes the altered replica and replaces it with a verified, unchanged copy. The resulting fixity report details the activities and outcomes of the fixity audit and repair procedure and is available in the Vault web application and for download. See this blog post from Library of Congress or this guide from the Digital Preservation Coalition for more information on file fixity and digital preservation.
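To make the audit and repair steps concrete, here is a minimal sketch in Python using the standard `hashlib` and `shutil` modules. It is an illustration of the general technique under simplifying assumptions (replicas are local files, and the checksums recorded at deposit are passed in as a dictionary); it is not Vault's actual implementation, and the function names `compute_checksums` and `audit_and_repair` are hypothetical.

```python
import hashlib
import shutil

# The three algorithms named in the text, using hashlib's identifiers.
ALGORITHMS = ("md5", "sha1", "sha256")


def compute_checksums(path, chunk_size=1024 * 1024):
    """Read a file once, in chunks, and return its hex digest per algorithm."""
    hashers = {name: hashlib.new(name) for name in ALGORITHMS}
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            for hasher in hashers.values():
                hasher.update(chunk)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}


def audit_and_repair(replicas, recorded):
    """Audit each replica against the recorded checksums and repair mismatches.

    replicas: list of file paths holding copies of the same object (assumption).
    recorded: checksums generated at deposit, as returned by compute_checksums.
    Returns a simple report: a list of (path, status) tuples.
    """
    # Audit: recompute checksums and compare with the originals.
    is_valid = {path: compute_checksums(path) == recorded for path in replicas}
    good_copy = next((p for p in replicas if is_valid[p]), None)

    report = []
    for path in replicas:
        if is_valid[path]:
            report.append((str(path), "ok"))
        elif good_copy is not None:
            # Repair: overwrite the altered replica with a verified copy.
            shutil.copyfile(good_copy, path)
            report.append((str(path), "repaired"))
        else:
            # No verified copy remains, so nothing can be restored here.
            report.append((str(path), "unrecoverable"))
    return report
```

A production system would also re-verify each repaired replica and persist the report; those steps are omitted to keep the sketch short.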
Preservation Features
The standard Vault service guarantee is that a user’s data will have its fixity checked at least twice a year with a corresponding fixity report generated and made available to users for download and/or via API. Operationally, data in Vault may be audited and repaired more than twice a year as part of general infrastructure operations, but these additional procedures will not generate a report. Additional Vault features allow users to designate specific collections that can receive additional fixity audit and repair procedures beyond the standard two per year. See the help center page on “Understanding Fixity Reports” for more information on fixity reporting.
Preservation Integrations
Internet Archive also collaborates with multiple other preservation systems, including services such as LOCKSS and DuraCloud, to facilitate integrations for automated replication of users' archival data to these systems. If you are interested in external integration or replication options, please contact our support team or consult the help center documentation. The Vault development roadmap also includes plans for additional integrations with other preservation and collection management systems as well as creating additional preservation features for Vault. Finally, Vault provides multiple ways for partners to download their data from Vault for local storage and preservation or ingest into other preservation systems.
Preservation Best Practices
The creators of Vault were co-creators of the widely-used NDSA Levels of Digital Preservation, and Vault is designed to be an affordable, extensible solution that embeds the goals and spirit of that guidance into product development: namely, to make digital preservation possible for any organization, regardless of size, staff, expertise, or budget. Vault enables any user to meet the level of digital preservation that is most appropriate for their organization. This means the product can be used both for basic object storage that complements in-house digital preservation efforts and as a highly-replicated, highly-geodistributed, all-in-one solution for preservation management. Vault meets these needs and others by allowing various add-on product features to be assigned at the collection level rather than the account level. Vault also takes a unique approach to pricing, with costs based on a one-time, per-gigabyte or per-terabyte “forever” fee that eschews the traditional annual storage costs of many other digital preservation services. This makes budgeting and budget planning easier for many organizations and is possible because Vault uses Internet Archive’s non-profit, self-owned infrastructure and avoids the commercial or third-party cloud storage that underpins many other services. See “Vault Pricing Guide” for more information. Vault plans to pursue CoreTrustSeal certification in 2024.
Preservation Stewardship
No data in Vault is made available to any third parties without the express written consent of the organization that originally deposited the data into the system.
Sustainability
Internet Archive is a US-based 501(c)(3) non-profit, public charity organization founded in 1996. It is sustained through a mix of endowments, donations, philanthropic support, earned income through paid services and contracts, government support, and general fundraising. Internet Archive works closely with a number of affiliated, but independent, organizations based in other countries, including Internet Archive Canada (headquartered in Vancouver), Internet Archive UK (headquartered in London), and Stichting Internet Archive (headquartered in Amsterdam). Each of these organizations is a registered, legally independent non-profit in its respective country, with its own staff, leadership, Board of Directors (or equivalent), finances, infrastructure, et cetera. Internet Archive has a Memorandum of Understanding (MOU) with Internet Archive Canada guaranteeing that in the event that Internet Archive plans to shut down or otherwise ceases to exist, all Vault data (including replicas) will be transferred to, and will continue to be preserved by, Internet Archive Canada. In this event, ample notice will be given to all impacted users, and Internet Archive Canada guarantees that all relevant service requirements related to the storage, preservation, and access (either public or private, as determined by the user) of data in Vault will continue to be met.