Date of Award

Spring 2019

Document Type

Dissertation

Degree Name

Doctor of Philosophy (PhD)

Department

Computer Science

Committee Director

Michele C. Weigle

Committee Member

Michael L. Nelson

Committee Member

Jian Wu

Committee Member

Sampath Jayarathna

Committee Member

Christian Zemlin

Abstract

Web archives are a window to view past versions of webpages. When a user requests a webpage on the live Web, such as http://tripadvisor.com/where_to_travel/, the webpage may not be found, which results in a HyperText Transfer Protocol (HTTP) 404 response. The user may then search for the webpage in a Web archive, such as the Internet Archive. Unfortunately, if the page was never archived, the user will not be able to view it, nor will the user learn about other webpages in the archive that have similar content, such as the archived webpage http://classy-travel.net. Similarly, if the user requests the webpage http://hokiesports.com/football/ from the Internet Archive, the user will find only the requested webpage and will not learn about other archived webpages with similar content, such as http://techsideline.com. In this research, we build a model for selecting and ranking possible recommended webpages at a Web archive. The goal is to enhance both HTTP 404 responses and HTTP 200 responses by surfacing webpages in the archive that the user may not know existed. First, we detect semantics in the requested Uniform Resource Identifier (URI). Next, we classify the URI using an ontology, such as DMOZ or any website directory. Finally, we filter and rank candidates based on several features, such as archival quality, webpage popularity, temporal similarity, and content similarity. We measure the performance of each step using different techniques, including calculating the F1 measure of the different tokenization methods and of the classification. We tested the model using human evaluation to determine whether we could classify and find recommendations for a sample of requests from the Internet Archive’s Wayback Machine access log. Overall, when selecting the full categorization, reviewers agreed with 80.3% of the recommendations, which is much higher than the shares for “do not agree” and “I do not know”. This indicates that reviewers are more likely to agree with the recommendations when the full categorization is used. When selecting the first level only, reviewers agreed with only 25.5% of the recommendations. This indicates that deeper categorization improves the performance of finding relevant recommendations.
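The first step in the abstract (detecting semantics in the URI) and the F1 measure used to evaluate tokenization can be illustrated with a minimal Python sketch. This is not the dissertation's method: the function names, the small stop list, and the reference token set are all assumptions chosen for illustration, and the tokenizer is a naive split on non-letter characters.

```python
import re
from urllib.parse import urlparse

# Hypothetical stop list of URI parts that carry little meaning (an assumption).
STOP_TOKENS = {"www", "com", "net", "org", "html", "htm"}

def tokenize_uri(uri):
    """Split a URI's host and path into lowercase alphabetic tokens."""
    parsed = urlparse(uri)
    text = parsed.netloc + " " + parsed.path
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_TOKENS]

def f1_score(predicted, expected):
    """F1 = harmonic mean of precision and recall over token sets."""
    pred, exp = set(predicted), set(expected)
    true_positives = len(pred & exp)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(exp)
    return 2 * precision * recall / (precision + recall)

tokens = tokenize_uri("http://hokiesports.com/football/")
print(tokens)  # ['hokiesports', 'football']

# Hypothetical reference tokenization, written by hand for this example.
reference = ["hokie", "sports", "football"]
print(f1_score(tokens, reference))  # approximately 0.4
```

The naive split leaves "hokiesports" as one token, so only "football" matches the hand-written reference. That gives a precision of 1/2, a recall of 1/3, and an F1 of about 0.4, which shows how the measure penalizes a tokenizer that fails to segment concatenated words in a hostname.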

Rights

In Copyright. URI: http://rightsstatements.org/vocab/InC/1.0/ This Item is protected by copyright and/or related rights. You are free to use this Item in any way that is permitted by the copyright and related rights legislation that applies to your use. For other uses you need to obtain permission from the rights-holder(s).

DOI

10.25777/yk35-dd38

ISBN

9781392268124

Recommended Citation

Alkwai, Lulwah M. "Expanding the Usage of Web Archives by Recommending Archived Webpages Using Only the URI" (2019). Doctor of Philosophy (PhD), Dissertation, Computer Science, Old Dominion University, DOI: 10.25777/yk35-dd38
https://digitalcommons.odu.edu/computerscience_etds/90

ORCID

0000-0002-6424-961X

Included in

Computer Sciences Commons

Computer Science Theses & Dissertations

Expanding the Usage of Web Archives by Recommending Archived Webpages Using Only the URI
