Towards Aiding Research by Improving Access to Electronic Theses and Dissertations from Multiple Domains

Document Type

Presentation

Publication Date

2021

Conference Name

2021 Fall CNI Virtual Membership Meeting, December 7-9, 2021, Virtual, Online

Abstract

Funded by the Institute of Museum and Library Services, Virginia Tech and Old Dominion University are collaborating on a project that aims to bring computational access to book-length documents and to demonstrate that process with electronic theses and dissertations (ETDs). Since the project launch, the team has made substantial progress on several tasks, including data acquisition, information extraction, and classification. The team has collected the largest corpus of ETDs, containing about 500,000 full-text documents and their metadata. The corpus was built by crawling the institutional ETD repositories of university libraries in the United States while honoring the crawling policies of the target websites. To support robust text representations for downstream tasks, we investigated building a language model specific to ETDs. This model, called ETDBERT, was built by fine-tuning Bidirectional Encoder Representations from Transformers (BERT) on a corpus of 300 million tokens extracted from a subset of the collected ETDs spanning 195 disciplines. ETDBERT was evaluated with intrinsic and extrinsic metrics and outperformed traditional text representations on a subject domain classification task. Compared with SciBERT, which was trained on a single Tensor Processing Unit (TPU) for seven days, ETDBERT requires far fewer training resources while achieving comparable performance on the same task. We attribute this to the multi-disciplinary sampling of our training corpus. We plan to improve the language model further so that it better supports tasks such as novelty measurement, automatic subject categorization, and long text summarization, helping us to better understand the nuances of knowledge in ETDs and to provide robust and scalable related services.
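
The abstract notes that the crawl honored the crawling policies of target websites. The project's actual crawler is not described here, so the following is only a minimal sketch of one common way to respect such policies, using Python's standard `urllib.robotparser`. The robots.txt content, the host `repository.example.edu`, the user agent `etd-crawler`, and the helper names are all hypothetical and chosen for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for an illustrative repository host.
# A real crawler would fetch this file from the target site instead.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Crawl-delay: 5
"""


def build_policy(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a policy object (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(policy: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the policy allows this user agent to fetch the URL."""
    return policy.can_fetch(user_agent, url)


policy = build_policy(ROBOTS_TXT)
USER_AGENT = "etd-crawler"  # hypothetical crawler name

pdf_allowed = may_fetch(
    policy, USER_AGENT, "https://repository.example.edu/etd/123.pdf"
)
admin_allowed = may_fetch(
    policy, USER_AGENT, "https://repository.example.edu/admin/login"
)
# Seconds to wait between requests, or None if the site sets no delay.
delay_seconds = policy.crawl_delay(USER_AGENT)
```

In a sketch like this, a crawler would check `may_fetch` before each request and sleep for `delay_seconds` between requests to the same host when a delay is specified.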

Comments

Project workers: Bipasha Banerjee, Muntabir Choudhury, Himarsha Jayanetti, Md Sami Uddin, Lamia Salsabil, Neel C. Kawitkar, Richard Pates, Pooja Sonmale, Adheesh Sunil Juvekar, Eman Abdelrahman, Fatimah Alotaibi, Palakh Mignonne Jude, Sampanna Kahu, John Aromando, Gunnar Reiske, Winston Shields.

Original Publication Citation

Wu, J., Ingram, W. A., & Fox, E. A. (2021, December 7). Towards aiding research by improving access to electronic theses and dissertations from multiple domains [Video]. YouTube. https://www.youtube.com/watch?v=Gt4ks8fOZtE

ORCID

0000-0003-0173-4463 (Wu)