Document Type
Article
Publication Date
2014
DOI
10.1007/s00799-014-0108-0
Publication Title
International Journal on Digital Libraries
Volume
14
Issue
1-2
Pages
17-38
Abstract
Inaccessible Web pages and 404 “Page Not Found” responses are a common Web phenomenon and a detriment to the user’s browsing experience. The rediscovery of missing Web pages is, therefore, a relevant research topic in both the digital preservation and the information retrieval realms. In this article, we bring these two areas together by analyzing four content- and link-based methods to rediscover missing Web pages. We investigate the retrieval performance of the methods individually as well as in combination and give insight into how effective these methods are over time. As the main result of this work, we are able to recommend not only the best-performing methods but also the sequence in which they should be applied, based on their performance, the complexity required to generate them, and their evolution over time. Our least complex single method results in a rediscovery rate of almost 70% of the Web pages in our sample dataset, which is based on URIs sampled from the Open Directory Project (DMOZ). By increasing the complexity level and combining three different methods, we increase the success rate to as much as 77%. The results, based on our sample dataset, indicate that Web pages are often not completely lost but have moved to a different location and “just” need to be rediscovered.
Rights
© 2014 The Authors.
This article is distributed under the terms of the Creative Commons Attribution 4.0 International (CC BY 4.0) License.
Original Publication Citation
Klein, M., & Nelson, M. (2014). Moved but not gone: An evaluation of real-time methods for discovering replacement web pages. International Journal on Digital Libraries, 14(1/2), 17-38. https://doi.org/10.1007/s00799-014-0108-0
ORCID
0000-0003-3749-8116 (Nelson)