Date of Award

Fall 2007

Document Type

Dissertation

Degree Name

Doctor of Philosophy (PhD)

Department

Computer Science

Committee Director

Michael L. Nelson

Committee Director

William Y. Arms

Committee Member

Johan Bollen

Committee Member

Kurt Maly

Committee Member

Ravi Mukkamala

Committee Member

Mohammad Zubair

Abstract

Backup or preservation of websites is often not considered until after a catastrophic event has occurred. In the face of complete website loss, webmasters or concerned third parties have attempted to recover some of their websites from the Internet Archive. Still others have sought to retrieve missing resources from the caches of commercial search engines. Inspired by these post hoc reconstruction attempts, this dissertation introduces the concept of lazy preservation: digital preservation performed as a result of the normal operations of the Web Infrastructure (web archives, search engines, and caches). First, the Web Infrastructure (WI) is characterized by its preservation capacity and behavior. Methods for reconstructing websites from the WI are then investigated, and a new type of crawler is introduced: the web-repository crawler. Several experiments are used to measure and evaluate the effectiveness of lazy preservation for a variety of websites, and various web-repository crawler strategies are introduced and evaluated. The implementation of the web-repository crawler Warrick is presented, and real usage data from the public is analyzed. Finally, a novel technique for recovering the generative functionality (i.e., CGI programs, databases, etc.) of websites is presented, and its effectiveness is demonstrated by recovering an entire EPrints digital library from the WI.
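The abstract's core idea, a web-repository crawler, can be sketched as a breadth-first crawl that asks each member of the Web Infrastructure for a cached copy of a resource instead of fetching the live (lost) site. This is only an illustrative sketch, not Warrick's actual implementation: the `repositories` callables stand in for queries to web archives and search-engine caches, and all names here are hypothetical.

```python
from urllib.parse import urljoin
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects href targets so each recovered page can seed further recovery."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def recover_site(start_url, repositories, limit=100):
    """Breadth-first web-repository crawl.

    `repositories` is a list of callables, each mapping a URL to cached
    HTML (or None if that repository holds no copy).  The first repository
    with a hit wins; links in the recovered page are queued for recovery.
    """
    frontier = [start_url]
    recovered = {}
    while frontier and len(recovered) < limit:
        url = frontier.pop(0)
        if url in recovered:
            continue
        for repo in repositories:
            html = repo(url)                  # cached copy, or None
            if html is not None:
                recovered[url] = html
                parser = LinkExtractor()
                parser.feed(html)
                for link in parser.links:
                    frontier.append(urljoin(url, link))
                break
        else:
            recovered[url] = None             # lost: no repository holds a copy
    return recovered
```

A real crawler of this kind must also rank repositories (e.g., prefer an archived canonical copy over a search-engine cache) and respect each repository's query limits; those strategies are among the ones the dissertation evaluates.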

ISBN

9780549320395
