A Heuristic Baseline Method for Metadata Extraction from Scanned Electronic Theses and Dissertations

Document Type

Conference Paper

Publication Date

2020

DOI

10.1145/3383583.3398590

Publication Title

Proceedings of the ACM/IEEE Joint Conference on Digital Libraries in 2020, August 1-5, 2020, Virtual Event, China

Pages

2 pp.

Conference Name

ACM/IEEE Joint Conference on Digital Libraries in 2020, Virtual Event, August 1-5, 2020

Abstract

Extracting metadata from scholarly papers is an important text mining problem. Widely used open-source tools such as GROBID are designed for born-digital scholarly papers but often fail on scanned documents, such as Electronic Theses and Dissertations (ETDs). Here we present preliminary baseline work that uses a heuristic model to extract metadata from the cover pages of scanned ETDs. The process starts by converting scanned pages into images and then into text files using OCR tools. A series of carefully designed regular expressions is then applied to capture patterns for seven metadata fields: titles, authors, years, degrees, academic programs, institutions, and advisors. The method is evaluated on a ground truth dataset composed of rectified metadata provided by the Virginia Tech and MIT libraries. Our heuristic method achieves an accuracy of up to 97% on the fields of the ETD text files and provides a strong baseline for machine learning based methods. To the best of our knowledge, this is the first work attempting to extract metadata from non-born-digital ETDs.
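
The regular-expression stage described in the abstract can be illustrated with a minimal sketch. It assumes the OCR step has already produced plain text for a cover page, and it covers only a subset of the seven fields. Every pattern, function name, and the sample cover page below is hypothetical and chosen for illustration; none of it reproduces the expressions or data used in the paper.

```python
import re

# Hypothetical patterns for illustration only; not the paper's expressions.
YEAR_RE = re.compile(r"\b(?:19|20)\d{2}\b")
DEGREE_RE = re.compile(
    r"\b(?:Doctor of Philosophy|Master of Science|Master of Arts)\b",
    re.IGNORECASE,
)
ADVISOR_RE = re.compile(
    r"^(?P<name>[^,\n]+),[ \t]*(?:Chair|Chairman|Advisor)[ \t]*$",
    re.MULTILINE,
)
INSTITUTION_RE = re.compile(
    r"^.*\b(?:University|Institute of Technology)\b.*$",
    re.MULTILINE,
)


def extract_fields(text):
    """Extract a few metadata fields from OCR text of an ETD cover page."""
    fields = {}
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]

    # Assumption: the title is everything above a standalone "by" line,
    # and the author is the first non-empty line below it.
    for i, line in enumerate(lines):
        if line.lower() == "by":
            fields["title"] = " ".join(lines[:i])
            if i + 1 < len(lines):
                fields["author"] = lines[i + 1]
            break

    degree = DEGREE_RE.search(text)
    if degree:
        fields["degree"] = degree.group(0)

    # Assumption: the last four-digit year on the page is the degree year.
    years = YEAR_RE.findall(text)
    if years:
        fields["year"] = years[-1]

    advisor = ADVISOR_RE.search(text)
    if advisor:
        fields["advisor"] = advisor.group("name").strip()

    institution = INSTITUTION_RE.search(text)
    if institution:
        fields["institution"] = institution.group(0).strip()

    return fields


def normalize(value):
    """Lowercase and collapse whitespace before comparing field values."""
    return " ".join(value.lower().split())


def field_accuracy(predictions, ground_truth, field):
    """Fraction of documents whose predicted field matches the ground truth."""
    correct = sum(
        normalize(pred.get(field, "")) == normalize(truth.get(field, ""))
        for pred, truth in zip(predictions, ground_truth)
    )
    return correct / len(ground_truth)


# Invented cover page text standing in for OCR output.
SAMPLE = """A Study of Widget
Dynamics in Example Systems

by

Jane Q. Example

Dissertation submitted to the faculty of the
Example State University
in partial fulfillment of the requirements for the degree of

Doctor of Philosophy
in
Computer Science

John R. Sample, Chair

May 1987
"""

if __name__ == "__main__":
    print(extract_fields(SAMPLE))
```

Real OCR output is noisier than this sample, so a practical set of patterns would likely need to tolerate misrecognized characters and varied page layouts. The exact-match comparison in field_accuracy is likewise one possible scoring choice, not necessarily the one used in the evaluation.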

Comments

Original Publication Citation

Choudhury, M. H., Wu, J., Ingram, W. A., & Fox, E. A. (2020). A heuristic baseline method for metadata extraction from scanned electronic theses and dissertations. Paper presented at the ACM/IEEE Joint Conference on Digital Libraries (JCDL 2020), Virtual, August 1-5, 2020.

Repository Citation

Choudhury, M. H., Wu, J., Ingram, W. A., & Fox, E. A. (2020). A heuristic baseline method for metadata extraction from scanned electronic theses and dissertations. Paper presented at the ACM/IEEE Joint Conference on Digital Libraries (JCDL 2020), Virtual, August 1-5, 2020.

ORCID

0000-0003-0173-4463 (Wu), 0000-0002-9318-8844 (Choudhury)

Computer Science Faculty Publications
