Date of Award
Fall 2007
Document Type
Dissertation
Degree Name
Doctor of Philosophy (PhD)
Department
Computer Science
Committee Director
Michael L. Nelson
Committee Director
William Y. Arms
Committee Member
Johan Bollen
Committee Member
Kurt Maly
Committee Member
Ravi Mukkamala
Committee Member
Mohammad Zubair
Abstract
Backup or preservation of websites is often not considered until after a catastrophic event has occurred. In the face of complete website loss, webmasters or concerned third parties have attempted to recover some of their websites from the Internet Archive. Still others have sought to retrieve missing resources from the caches of commercial search engines. Inspired by these post hoc reconstruction attempts, this dissertation introduces the concept of lazy preservation{ digital preservation performed as a result of the normal operations of the Web Infrastructure (web archives, search engines and caches). First, the Web Infrastructure (WI) is characterized by its preservation capacity and behavior. Methods for reconstructing websites from the WI are then investigated, and a new type of crawler is introduced: the web-repository crawler. Several experiments are used to measure and evaluate the effectiveness of lazy preservation for a variety of websites, and various web-repository crawler strategies are introduced and evaluated. The implementation of the web-repository crawler Warrick is presented, and real usage data from the public is analyzed. Finally, a novel technique for recovering the generative functionality (i.e., CGI programs, databases, etc.) of websites is presented, and its effectiveness is demonstrated by recovering an entire Eprints digital library from the WI.
Rights
In Copyright. URI: http://rightsstatements.org/vocab/InC/1.0/ This Item is protected by copyright and/or related rights. You are free to use this Item in any way that is permitted by the copyright and related rights legislation that applies to your use. For other uses you need to obtain permission from the rights-holder(s).
DOI
10.25777/ys8r-nj25
ISBN
9780549320395
Recommended Citation
McCown, Frank.
"Lazy Preservation: Reconstructing Websites from the Web Infrastructure"
(2007). Doctor of Philosophy (PhD), Dissertation, Computer Science, Old Dominion University, DOI: 10.25777/ys8r-nj25
https://digitalcommons.odu.edu/computerscience_etds/21