Date of Award
Fall 2024
Document Type
Dissertation
Degree Name
Doctor of Philosophy (PhD)
Department
Computer Science
Program/Concentration
Computer Science
Committee Director
Jian Wu
Committee Member
Michael L. Nelson
Committee Member
Michele C. Weigle
Committee Member
Sampath Jayarathna
Committee Member
Edward A. Fox
Abstract
In the past decades, there has been growing interest in mining scientific documents to obtain domain knowledge automatically. Electronic Theses and Dissertations (ETDs) are one of the understudied types of scientific documents, and they have distinct features compared with conference proceedings and journal articles. ETDs usually serve as partial requirements for academic degrees for students pursuing higher education. They are book-length documents (i.e., 100 to 400 pages long) whose topics may shift across chapters; they exhibit the significant contributions of a student’s research over the entire period of degree study; and they have unique metadata schemas and page layouts. However, digital libraries of ETDs lack computational models and services for accessing and discovering the knowledge buried in ETDs. Moreover, library-provided metadata often contains incomplete, inconsistent, and incorrect values, which harms the discoverability of ETDs. Although many mining frameworks have been developed to support document segmentation, metadata extraction, metadata quality improvement, and reference string parsing for journal articles and conference proceedings, they usually do not generalize well to ETDs. One obstacle that makes ETD mining tasks challenging is the lack of training samples. There is also a lack of models that generate high-fidelity synthesized ETD data for training deep learning-based models. To address these research gaps, we developed a toolkit called ETDSuite, containing a range of machine learning-based methods to process ETDs and their structured components, including page-level segmentation, metadata extraction, citation parsing, and metadata enhancement, leveraging natural language processing and computer vision models. In this dissertation, we describe the machine learning models of the four frameworks, evaluate them on newly contributed benchmarks, and compare their performance against the state of the art.
Rights
In Copyright. URI: http://rightsstatements.org/vocab/InC/1.0/ This Item is protected by copyright and/or related rights. You are free to use this Item in any way that is permitted by the copyright and related rights legislation that applies to your use. For other uses you need to obtain permission from the rights-holder(s).
DOI
10.25777/h6qt-1p64
ISBN
9798302863447
Recommended Citation
Choudhury, Muntabir H.
"ETDSuite: A Toolkit to Mine Electronic Theses and Dissertations to Enrich Scholarly Big Data Using Natural Language Processing and Computer Vision"
(2024). Doctor of Philosophy (PhD), Dissertation, Computer Science, Old Dominion University, DOI: 10.25777/h6qt-1p64
https://digitalcommons.odu.edu/computerscience_etds/184
ORCID
0000-0002-9318-8844