A Heuristic Baseline Method for Metadata Extraction from Scanned Electronic Theses and Dissertations
Document Type
Conference Paper
Publication Date
2020
DOI
10.1145/3383583.3398590
Publication Title
Proceedings of the ACM/IEEE Joint Conference on Digital Libraries in 2020, August 1-5, 2020, Virtual Event, China
Pages
2 pp.
Conference Name
ACM/IEEE Joint Conference on Digital Libraries in 2020, Virtual Event, August 1-5, 2020
Abstract
Extracting metadata from scholarly papers is an important text mining problem. Widely used open-source tools such as GROBID are designed for born-digital scholarly papers but often fail on scanned documents, such as Electronic Theses and Dissertations (ETDs). Here we present preliminary baseline work that uses a heuristic model to extract metadata from the cover pages of scanned ETDs. The process starts by converting scanned pages into images and then into text files using OCR tools. A series of carefully designed regular expressions for each field is then applied, capturing patterns for seven metadata fields: titles, authors, years, degrees, academic programs, institutions, and advisors. The method is evaluated on a ground truth dataset composed of rectified metadata provided by the Virginia Tech and MIT libraries. Our heuristic method achieves an accuracy of up to 97% on the fields of the ETD text files. Our method provides a strong baseline for machine learning based methods. To the best of our knowledge, this is the first work attempting to extract metadata from non-born-digital ETDs.
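The regular-expression step described in the abstract can be illustrated with a minimal sketch. It assumes the cover page has already been converted to plain text by an OCR tool. The patterns, the `PATTERNS` table, the `extract_metadata` function, and the sample cover page are all hypothetical and cover only four of the seven fields; they are not the expressions used in the paper.

```python
import re

# Hypothetical patterns for illustration only; these are NOT the
# regular expressions from the paper, and they cover four fields.
PATTERNS = {
    # A four-digit year starting with 19 or 20.
    "year": re.compile(r"\b(?:19|20)\d{2}\b"),
    # A few common degree names.
    "degree": re.compile(
        r"\b(Doctor of Philosophy|Master of (?:Science|Arts))\b",
        re.IGNORECASE,
    ),
    # Any whole line that mentions a university or institute.
    "institution": re.compile(
        r"^(.*\b(?:University|Institute of Technology)\b.*)$",
        re.IGNORECASE | re.MULTILINE,
    ),
    # A line that starts with an advisor-like label, then the name.
    "advisor": re.compile(
        r"^(?:Advisor|Supervisor|Chair(?:man)?)[ \t]*[:,]?[ \t]*(.+)$",
        re.IGNORECASE | re.MULTILINE,
    ),
}


def extract_metadata(ocr_text):
    """Return the first match per field, or None when nothing matches."""
    result = {}
    for field, pattern in PATTERNS.items():
        match = pattern.search(ocr_text)
        if match is None:
            result[field] = None
        else:
            # Use the capture group when the pattern defines one,
            # otherwise the whole match.
            value = match.group(1) if pattern.groups else match.group(0)
            result[field] = value.strip()
    return result


# Made-up cover page text standing in for OCR output.
SAMPLE_COVER = """A Study of Widget Dynamics
by
Jane Q. Example
Thesis submitted to the faculty of
Example State University
in partial fulfillment of the requirements for the degree of
Doctor of Philosophy
in
Computer Science
Advisor: John R. Sample
May 1987
"""

if __name__ == "__main__":
    for field, value in extract_metadata(SAMPLE_COVER).items():
        print(f"{field}: {value}")
```

Taking the first match per field keeps the sketch short. Real OCR output contains recognition errors and layout variation, so a practical rule set would likely need more patterns per field and some tolerance for misread characters.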
Original Publication Citation
Choudhury, M. H., Wu, J., Ingram, W. A., & Fox, E. A. (2020). A heuristic baseline method for metadata extraction from scanned electronic theses and dissertations. Paper presented at the ACM/IEEE Joint Conference on Digital Libraries (JCDL 2020), Virtual, August 1-5, 2020.
ORCID
0000-0003-0173-4463 (Wu), 0000-0002-9318-8844 (Choudhury)
Comments
© 2020 held by the owner/authors.