Date of Award

Summer 8-2020

Document Type

Dissertation

Degree Name

Doctor of Philosophy (PhD)

Department

Computer Science

Committee Director

Michael L. Nelson

Committee Member

Michele C. Weigle

Committee Member

Jian Wu

Committee Member

Sampath Jayarathna

Committee Member

Ross Gore

Abstract

In a Web plagued by disappearing resources, Web archive collections provide a valuable means of preserving Web resources important to the study of past events. These archived collections start with seed URIs (Uniform Resource Identifiers) hand-selected by curators. Curators produce high-quality seeds by removing non-relevant URIs and adding URIs from credible and authoritative sources, but this ability comes at a cost: collecting these seeds is time-consuming. The result is a shortage of curators, a lack of Web archive collections for various important news events, and a need for an automatic system for generating seeds.

We investigate the problem of generating seed URIs automatically and explore the state of the art in collection building and seed selection. Attempts toward generating seeds automatically have mostly relied on scraping Web or social media Search Engine Result Pages (SERPs). In this work, we introduce a novel source for generating seeds: URIs in the threaded conversations of social media posts created by single or multiple users. Users on social media sites routinely create and share narratives about news events consisting of hand-selected URIs of news stories, tweets, videos, etc. We call these posts Micro-collections, whether shared on Reddit or Twitter, and we consider them an important source for seeds, because the effort taken to create Micro-collections is an indication of editorial activity and a demonstration of domain expertise. Therefore, we propose a model for generating seeds from Micro-collections. We begin by introducing a simple vocabulary, called post class, for describing social media posts across different platforms, and extract seeds from the Micro-collections post class. We further propose Quality Proxies for seeds by extending the idea of collection comparison to evaluation, and present our Micro-collection/Quality Proxy (MCQP) framework for bootstrapping Web archive collections from Micro-collections in social media.

Rights

In Copyright. URI: http://rightsstatements.org/vocab/InC/1.0/ This Item is protected by copyright and/or related rights. You are free to use this Item in any way that is permitted by the copyright and related rights legislation that applies to your use. For other uses you need to obtain permission from the rights-holder(s).

DOI

10.25777/ez78-cb43

ISBN

9798672170343

Recommended Citation

Nwala, Alexander C. "Bootstrapping Web Archive Collections From Micro-Collections in Social Media" (2020). Doctor of Philosophy (PhD), Dissertation, Computer Science, Old Dominion University, DOI: 10.25777/ez78-cb43
https://digitalcommons.odu.edu/computerscience_etds/124

ORCID

0000-0003-3408-791X

Included in

Databases and Information Systems Commons, Library and Information Science Commons, Social Media Commons

Computer Science Theses & Dissertations

Bootstrapping Web Archive Collections From Micro-Collections in Social Media
