Date of Award
Summer 2011
Document Type
Dissertation
Degree Name
Doctor of Philosophy (PhD)
Department
Computer Science
Committee Director
Michael L. Nelson
Committee Director
Yaohang Li
Committee Member
Michele C. Weigle
Committee Member
Mohammad Zubair
Committee Member
Robert Sanderson
Committee Member
Herbert Van de Sompel
Abstract
Given the dynamic nature of the World Wide Web, missing web pages, or "404 Page not Found" responses, are part of our web browsing experience. It is our intuition that information on the web is rarely completely lost; it is just missing. In whole or in part, content often moves from one URI to another and hence it just needs to be (re-)discovered. We evaluate several methods for a "just-in-time" approach to web page preservation. We investigate the suitability of lexical signatures and web page titles to rediscover missing content. It is understood that web pages change over time, which implies that the performance of these two methods depends on the age of the content. We therefore conduct a temporal study of the decay of lexical signatures and titles and estimate their half-life. We further propose the use of tags that users have created to annotate pages as well as the most salient terms derived from a page's link neighborhood. We utilize the Memento framework to discover previous versions of web pages and to execute the above methods. We provide a workflow including a set of parameters that is most promising for the (re-)discovery of missing web pages. We introduce Synchronicity, a web browser add-on that implements this workflow. It works while the user is browsing and detects the occurrence of 404 errors automatically. When activated by the user, Synchronicity offers a total of six methods to either rediscover the missing page at its new URI or discover an alternative page that satisfies the user's information need. Synchronicity depends on user interaction, which enables it to provide results in real time.
Rights
In Copyright. URI: http://rightsstatements.org/vocab/InC/1.0/ This Item is protected by copyright and/or related rights. You are free to use this Item in any way that is permitted by the copyright and related rights legislation that applies to your use. For other uses you need to obtain permission from the rights-holder(s).
DOI
10.25777/jdht-6564
ISBN
9781124990699
Recommended Citation
Klein, Martin. "Using the Web Infrastructure for Real Time Recovery of Missing Web Pages" (2011). Doctor of Philosophy (PhD), Dissertation, Computer Science, Old Dominion University, DOI: 10.25777/jdht-6564
https://digitalcommons.odu.edu/computerscience_etds/20
ORCID
0000-0003-0130-2097