Document Type

Conference Paper

Publication Date

2022

DOI

10.1145/3558100.3563855

Publication Title

DocEng '22: Proceedings of the 22nd ACM Symposium on Document Engineering

Pages

7 (1-4)

Conference Name

DocEng '22, September 20-23, 2022, Virtual, California.

Abstract

Theories and models, which are common in scientific papers in almost all domains, usually provide the foundations of theoretical analysis and experiments. Understanding the use of theories and models can shed light on the credibility and reproducibility of research works. Compared with the extraction of metadata such as title, author, and keywords, theory extraction from scientific literature is rarely explored, especially for social and behavioral science (SBS) domains. One challenge of applying supervised learning methods is the lack of a large number of labeled samples for training. In this paper, we propose an automated framework based on distant supervision that leverages entity mentions from Wikipedia to build a ground truth corpus consisting of more than 4500 automatically annotated sentences containing theory/model mentions. We use this corpus to train models for theory extraction in SBS papers. We compared four deep learning architectures and found that RoBERTa-BiLSTM-CRF performed best, with a precision as high as 89.72%. The model shows promise for extension to domains other than SBS. The code and data are publicly available at https://github.com/lamps-lab/theory.
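The distant-supervision idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation (that is in the linked repository). It assumes a small hand-written list of theory names standing in for the Wikipedia entity mentions, a simple regex tokenizer, and a hypothetical `THEORY` label with BIO tags; the function name `bio_tag` is likewise an illustrative choice.

```python
import re

# Assumption: a simple word/punctuation tokenizer, not the paper's tokenizer.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def bio_tag(sentence, theory_names):
    """Label tokens B-THEORY / I-THEORY / O by matching known theory names.

    `theory_names` stands in for entity mentions harvested from Wikipedia.
    """
    tokens = TOKEN_PATTERN.findall(sentence)
    lowered = [t.lower() for t in tokens]
    tags = ["O"] * len(tokens)

    # Match longer names first so "social learning theory" is preferred
    # over the shorter, overlapping "learning theory".
    for name in sorted(theory_names, key=lambda n: -len(n.split())):
        name_tokens = TOKEN_PATTERN.findall(name.lower())
        n = len(name_tokens)
        if n == 0:
            continue  # skip empty names
        for i in range(len(tokens) - n + 1):
            span_is_free = all(t == "O" for t in tags[i:i + n])
            if lowered[i:i + n] == name_tokens and span_is_free:
                tags[i] = "B-THEORY"
                for j in range(i + 1, i + n):
                    tags[j] = "I-THEORY"
    return list(zip(tokens, tags))


if __name__ == "__main__":
    names = ["learning theory", "social learning theory"]
    text = "We apply social learning theory to explain peer effects."
    for token, tag in bio_tag(text, names):
        print(f"{token}\t{tag}")
```

Sentences labeled this way could serve as automatically annotated training data for a sequence tagger; exact string matching is the simplest possible heuristic and will miss paraphrased or abbreviated mentions.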

Rights

© 2022 The Owner/Authors

This work is licensed under a Creative Commons Attribution International 4.0 License (CC BY 4.0).

Data Availability

Article states: The code and data are publicly available at: https://github.com/lamps-lab/theory

Original Publication Citation

Wei, X., Salsabil, L., & Wu, J. (2022). Theory entity extraction for social and behavioral sciences papers using distant supervision. In DocEng '22: Proceedings of the 22nd ACM Symposium on Document Engineering (7). Association for Computing Machinery. https://doi.org/10.1145/3558100.3563855

ORCID

0000-0002-6162-2896 (Salsabil), 0000-0003-0173-4463 (Wu)
