Extracting Datasets, Methods, and Projects for ACL Anthology Papers (ODU PURS)
Description/Abstract/Artist Statement
A benefit of the increasingly interconnected world is the amount of information available to draw on; however, this also increases the volume of noise when trying to find resources related to a particular topic of interest. Resources have been developed over the years to facilitate the discovery of previously published research papers containing named entities, such as people, organizations, and locations, but to find the datasets and methods mentioned in the free text, a human must still read through each document in its entirety. This project develops a framework to automatically extract datasets and methods from scientific papers in the domain of Computer and Information Sciences and Engineering (CISE). We compared a heuristic method and a deep learning-based method, the latter of which was fine-tuned from a pre-trained language model. The ground truth was built by manually annotating a corpus of 500 abstracts of papers selected from the ACL Anthology; this corpus was used for fine-tuning the deep learning model and for evaluation. The deep learning model plus a classifier outperforms the heuristic model on both simple and complex sentences.
Faculty Advisor/Mentor
Jian Wu
College Affiliation
College of Sciences
Presentation Type
Oral Presentation
Disciplines
Data Science
Session Title
Interdisciplinary Research #1
Location
Zoom Room A
Start Date
3-20-2021 9:00 AM
End Date
3-20-2021 9:55 AM