Date of Award

Summer 2024

Document Type

Thesis

Degree Name

Master of Science (MS)

Department

Computer Science

Program/Concentration

Computer Science

Committee Director

Michele C. Weigle

Committee Member

Michael L. Nelson

Committee Member

Sampath Jayarathna

Abstract

Webpages change over time, and web archives hold copies of historical versions of webpages. Users of web archives, such as journalists, want to find and view changes on webpages over time. However, the current search interfaces for web archives do not adequately support this task. For the web archives that include a full-text search feature, multiple versions of the same webpage that match the search query are shown individually without enumerating changes, or are grouped together in a way that hides changes. We present a change text search engine that allows users to find changes in webpages. We describe the implementation of the search engine backend and frontend, including a tool that allows users to view the changes between two webpage versions in context as an animation. We also propose changes to the Internet Archive’s Wayback Machine replay navigation banner to further support users viewing change over time. We evaluate the search engine with U.S. federal environmental webpages that changed between 2016 and 2020. The change text search results page can clearly show when terms and phrases were added or removed from webpages. The inverted index can also be queried to identify salient and frequently deleted terms in a corpus. We align the dataset with a real-world click dataset, showing that users were searching for the same environmental terms that were ultimately deleted.

Rights

In Copyright. URI: http://rightsstatements.org/vocab/InC/1.0/ This Item is protected by copyright and/or related rights. You are free to use this Item in any way that is permitted by the copyright and related rights legislation that applies to your use. For other uses you need to obtain permission from the rights-holder(s).

DOI

10.25777/zaq9-sb74

ISBN

9798384444374

ORCID

0000-0003-0929-049X
