Document Type

Conference Paper

Publication Date




Publication Title

DocEng '22: Proceedings of the 22nd ACM Symposium on Document Engineering


7 (1-4)

Conference Name

DocEng '22, September 20-23, 2022, Virtual, California.


Theories and models, which are common in scientific papers in almost all domains, usually provide the foundations of theoretical analysis and experiments. Understanding the use of theories and models can shed light on the credibility and reproducibility of research works. Compared with the extraction of metadata such as title, author, and keywords, theory extraction in scientific literature is rarely explored, especially for social and behavioral science (SBS) domains. One challenge of applying supervised learning methods is the lack of a large number of labeled samples for training. In this paper, we propose an automated framework based on distant supervision that leverages entity mentions from Wikipedia to build a ground truth corpus consisting of more than 4500 automatically annotated sentences containing theory/model mentions. We use this corpus to train models for theory extraction in SBS papers. We compared four deep learning architectures and found that RoBERTa-BiLSTM-CRF performed best, with a precision as high as 89.72%. The model shows promise for extension to domains other than SBS. The code and data are publicly available at
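The distant supervision idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the seed list, the `THEORY` tag name, the tokenizer, and the function names `bio_tag` and `build_corpus` are all assumptions made for illustration. The sketch assumes a list of known theory names (standing in for entity mentions harvested from Wikipedia) and uses exact string matching to assign BIO tags to sentence tokens, keeping only sentences that contain at least one match.

```python
import re

# Hypothetical seed list; a stand-in for entity mentions gathered from Wikipedia.
SEED_THEORIES = [
    "social learning theory",
    "prospect theory",
    "theory of planned behavior",
]


def tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


def bio_tag(sentence, seeds=SEED_THEORIES):
    """Return (token, tag) pairs, tagging exact seed matches with B-/I-THEORY."""
    tokens = tokenize(sentence)
    lowered = [t.lower() for t in tokens]
    tags = ["O"] * len(tokens)
    # Try longer seeds first so a long mention is not split by a shorter one.
    for seed in sorted(seeds, key=lambda s: -len(tokenize(s))):
        seed_tokens = tokenize(seed.lower())
        n = len(seed_tokens)
        for i in range(len(tokens) - n + 1):
            window_free = all(tag == "O" for tag in tags[i:i + n])
            if lowered[i:i + n] == seed_tokens and window_free:
                tags[i] = "B-THEORY"
                for j in range(i + 1, i + n):
                    tags[j] = "I-THEORY"
    return list(zip(tokens, tags))


def build_corpus(sentences, seeds=SEED_THEORIES):
    """Keep only sentences in which at least one seed mention was found."""
    corpus = []
    for sentence in sentences:
        tagged = bio_tag(sentence, seeds)
        if any(tag != "O" for _, tag in tagged):
            corpus.append(tagged)
    return corpus
```

A corpus built this way could then serve as training data for a sequence labeling model. Exact matching is only the simplest possible labeling rule, and it will miss paraphrased or abbreviated mentions.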


© 2022 The Owner/Authors

This work is licensed under a Creative Commons Attribution International 4.0 License (CC BY 4.0).

Data Availability

Article states: The code and data are publicly available at:

Original Publication Citation

Wei, X., Salsabil, L., & Wu, J. (2022). Theory entity extraction for social and behavioral sciences papers using distant supervision. In DocEng '22: Proceedings of the 22nd ACM Symposium on Document Engineering (7). Association for Computing Machinery.


0000-0002-6162-2896 (Salsabil), 0000-0003-0173-4463 (Wu)