Date of Award

Spring 5-2022

Document Type

Thesis

Degree Name

Master of Science (MS)

Department

Computer Science

Committee Director

Jian Wu

Committee Member

Vikas Ashok

Committee Member

Faryaneh Poursardar

Abstract

Accurately parsing citation strings is key to automatically building large-scale citation graphs, so a robust citation parser is an essential module in academic search engines. One limitation of state-of-the-art models (such as ParsCit and Neural-ParsCit) is the lack of a large-scale training corpus. Manually annotating hundreds of thousands of citation strings is laborious and time-consuming. This thesis presents a novel transformer-based citation parser that leverages the GIANT dataset, which consists of 1 billion synthesized citation strings covering over 1,500 citation styles. Rather than relying on handcrafted features, our model benefits from word embeddings and character-based embeddings by combining a bidirectional long short-term memory (BiLSTM) network with the Transformer and a Conditional Random Field (CRF). We varied the training data size from 500 to 1M and investigated the impact of training size on performance. We evaluated our models on the standard CORA benchmark and observed an increase in F1-score as the training size increased. The best performance was achieved when the training size was around 220K, with an F1-score of up to 100% on key citation fields. To the best of our knowledge, this is the first citation parser trained on a large-scale synthesized dataset. Project code and documentation can be found in this GitHub repository: https://github.com/lamps-lab/Citation-Parser.

Rights

In Copyright. URI: http://rightsstatements.org/vocab/InC/1.0/ This Item is protected by copyright and/or related rights. You are free to use this Item in any way that is permitted by the copyright and related rights legislation that applies to your use. For other uses you need to obtain permission from the rights-holder(s).

DOI

10.25777/qrv9-m891

ISBN

9798834007210

Recommended Citation

Uddin, MD S. "TransParsCit: A Transformer-Based Citation Parser Trained on Large-Scale Synthesized Data" (2022). Master of Science (MS), Thesis, Computer Science, Old Dominion University, DOI: 10.25777/qrv9-m891
https://digitalcommons.odu.edu/computerscience_etds/133

Included in

Computer Sciences Commons
