Document Type

Article

Publication Date

2014

DOI

10.1007/s00799-014-0108-0

Publication Title

International Journal on Digital Libraries

Volume

14

Issue

1-2

Pages

17-38

Abstract

Inaccessible Web pages and 404 “Page Not Found” responses are a common Web phenomenon and a detriment to the user’s browsing experience. The rediscovery of missing Web pages is, therefore, a relevant research topic in the digital preservation as well as in the Information Retrieval realm. In this article, we bring these two areas together by analyzing four content- and link-based methods to rediscover missing Web pages. We investigate the retrieval performance of the methods individually as well as their combinations and give an insight into how effective these methods are over time. As the main result of this work, we are able to recommend not only the best performing methods but also the sequence in which they should be applied, based on their performance, complexity required to generate them, and evolution over time. Our least complex single method results in a rediscovery rate of almost 70% of Web pages of our sample dataset based on URIs sampled from the Open Directory Project (DMOZ). By increasing the complexity level and combining three different methods, our results show an increase of the success rate up to 77%. The results, based on our sample dataset, indicate that Web pages are often not completely lost but have moved to a different location and “just” need to be rediscovered.

Rights

© 2014 The Authors.

This article is distributed under the terms of the Creative Commons Attribution 4.0 International (CC BY 4.0) License.

Original Publication Citation

Klein, M., & Nelson, M. (2014). Moved but not gone: An evaluation of real-time methods for discovering replacement web pages. International Journal on Digital Libraries, 14(1/2), 17-38. https://doi.org/10.1007/s00799-014-0108-0

ORCID

0000-0003-3749-8116 (Nelson)
