Date of Award

Spring 5-2022

Document Type


Degree Name

Master of Science (MS)


Computer Science

Committee Director

Jian Wu

Committee Member

Vikas Ashok

Committee Member

Faryaneh Poursardar


Accurately parsing citation strings is key to automatically building large-scale citation graphs, so a robust citation parser is an essential module in academic search engines. One limitation of state-of-the-art models (such as ParsCit and Neural-ParsCit) is the lack of a large-scale training corpus, because manually annotating hundreds of thousands of citation strings is laborious and time-consuming. This thesis presents a novel transformer-based citation parser that leverages the GIANT dataset, which consists of 1 billion synthesized citation strings covering over 1,500 citation styles. Instead of relying on handcrafted features, our model uses word embeddings and character-based embeddings, combining a bidirectional long short-term memory (BiLSTM) network with the Transformer and a Conditional Random Field (CRF). We varied the training data size from 500 to 1M and investigated its impact on performance. We evaluated our models on the standard CORA benchmark and observed an increase in F1-score as the training size increased. The best performance was achieved when the training size was around 220K, with an F1-score of up to 100% on key citation fields. To the best of our knowledge, this is the first citation parser trained on a large-scale synthesized dataset. Project code and documentation can be found on this GitHub repository:


In Copyright. URI: This Item is protected by copyright and/or related rights. You are free to use this Item in any way that is permitted by the copyright and related rights legislation that applies to your use. For other uses you need to obtain permission from the rights-holder(s).