Factors Affecting Website Reconstruction From the Web Infrastructure

Document Type

Conference Paper

Publication Date

6-2007

DOI

10.1145/1255175.1255182

Publication Title

Proceedings of the 7th ACM/IEEE-CS Joint Conference on Digital Libraries

Pages

39-48

Conference Name

JCDL’07, June 17–22, 2007, Vancouver, British Columbia, Canada

Abstract

When a website is suddenly lost without a backup, it may be reconstituted by probing web archives and search engine caches for missing content. In this paper we describe an experiment in which we crawled and reconstructed 300 randomly selected websites on a weekly basis for 14 weeks. The reconstructions were performed using our web-repository crawler, Warrick, which recovers missing resources from the Web Infrastructure (WI), the collective preservation effort of web archives and search engine caches. We examine several characteristics of the websites over time, including birth rate, decay, and age of resources. We evaluate the reconstructions against the crawled sites and develop a statistical model for predicting reconstruction success from the WI. On average, we were able to recover 61% of each website’s resources. We found that Google’s PageRank, number of hops, and resource age were the three most significant factors in determining whether a resource would be recovered from the WI.

Original Publication Citation

McCown, F., Diawara, N., & Nelson, M. L. (2007). Factors affecting Website reconstruction from the Web infrastructure. In JCDL '07: Proceedings of the 7th ACM/IEEE-CS Joint Conference on Digital Libraries 2007 (pp. 39–48). ACM Digital Library. https://doi.org/10.1145/1255175.1255182

ORCID

0000-0002-8403-6793 (Diawara), 0000-0003-3749-8116 (Nelson)

Computer Science Faculty Publications
