Document Type

Article

Publication Date

2023

DOI

10.1371/journal.pone.0286879

Publication Title

PLoS One

Volume

18

Issue

6

Pages

e0286879 (1-49)

Abstract

Web archives, such as the Internet Archive, preserve the web and allow access to prior states of web pages. We implicitly trust their versions of archived pages, but as their role moves from preserving curios of the past to facilitating present day adjudication, we are concerned with verifying the fixity of archived web pages, or mementos, to ensure they have always remained unaltered. A widely used technique in digital preservation to verify the fixity of an archived resource is to periodically compute a cryptographic hash value on a resource and then compare it with a previous hash value. If the hash values generated on the same resource are identical, then the fixity of the resource is verified. We tested this process by conducting a study on 16,627 mementos from 17 public web archives. We replayed and downloaded the mementos 39 times using a headless browser over a period of 442 days and generated a hash for each memento after each download, resulting in 39 hashes per memento. The hash is calculated by including not only the content of the base HTML of a memento but also all embedded resources, such as images and style sheets. We expected to always observe the same hash for a memento regardless of the number of downloads. However, our results indicate that 88.45% of mementos produce more than one unique hash value, and about 16% (or one in six) of those mementos always produce different hash values. We identify and quantify the types of changes that cause the same memento to produce different hashes. These results point to the need for defining an archive-aware hashing function, as conventional hashing functions are not suitable for replayed archived web pages.
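The fixity check described in the abstract (hash a memento after each download, then compare the hashes) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `memento_hash` and `fixity_verified`, the use of SHA-256, and the way the base HTML and embedded resources are combined into one root hash are all assumptions made for illustration. The published data files use Merkle trees, and this sketch approximates that with a simple two-level combination.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def memento_hash(base_html: bytes, embedded: dict[str, bytes]) -> str:
    """Hypothetical composite hash over the base HTML and all embedded resources.

    Each embedded resource is hashed together with its URI, and the leaf
    hashes are combined in sorted-URI order so that the result does not
    depend on the order in which resources were downloaded.
    """
    leaves = [sha256_hex(base_html)]
    for uri in sorted(embedded):
        leaves.append(sha256_hex(uri.encode("utf-8") + b"\n" + embedded[uri]))
    # Root hash: digest of the concatenated leaf digests (assumed scheme).
    return sha256_hex("".join(leaves).encode("ascii"))


def fixity_verified(hashes: list[str]) -> bool:
    """Fixity holds only if every download produced the same hash."""
    return len(set(hashes)) == 1


# Two replays with identical content, then one where an embedded
# resource differs (for example, an image that failed to load).
resources = {"style.css": b"p { color: black; }", "logo.png": b"\x89PNG..."}
first = memento_hash(b"<html>archived page</html>", resources)
second = memento_hash(b"<html>archived page</html>", resources)
changed = dict(resources, **{"logo.png": b""})
third = memento_hash(b"<html>archived page</html>", changed)

print(fixity_verified([first, second]))         # True
print(fixity_verified([first, second, third]))  # False
```

Because any byte-level difference in the base HTML or in a single embedded resource changes the root hash, a check of this kind reports a fixity failure even when the archived content itself is unaltered and only the replay differs, which is the behavior the study measures.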

Rights

© 2023 Aturban et al.

This is an open access article distributed under the terms of the Creative Commons Attribution 4.0 International (CC BY 4.0) License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability

Article states: All Merkle tree (hash) data files are available via Zenodo (https://zenodo.org/record/7082486, DOI: 10.5281/zenodo.7082486).

Original Publication Citation

Aturban, M., Klein, M., Van de Sompel, H., Alam, S., Nelson, M. L., & Weigle, M. C. (2023). Hashes are not suitable to verify fixity of the public archived web. PLoS One, 18(6), 1-49, Article e0286879. https://doi.org/10.1371/journal.pone.0286879

ORCID

0000-0003-3749-8116 (Nelson), 0000-0002-2787-7166 (Weigle)
