Accurately parsing citation strings is key to automatically building large-scale citation graphs, so a robust citation parser is an essential module in academic search engines. One limitation of state-of-the-art models (such as ParsCit and Neural-ParsCit) is the lack of a large-scale training corpus: manually annotating hundreds of thousands of citation strings is laborious and time-consuming. This thesis presents a novel transformer-based citation parser that leverages the GIANT dataset, which consists of 1 billion synthesized citation strings covering over 1,500 citation styles. Instead of relying on handcrafted features, our model benefits from word embeddings and character-based embeddings by combining a bidirectional long short-term memory (BiLSTM) network with a Transformer and a Conditional Random Field (CRF). We varied the training data size from 500 to 1M citation strings and investigated the impact of training size on performance. We evaluated our models on the standard CORA benchmark and observed an increase in F1-score as the training size increased. The best performance occurred when the training size was around 220K, achieving an F1-score of up to 100% on key citation fields. To the best of our knowledge, this is the first citation parser trained on a large-scale synthesized dataset. Project code and documentation can be found in this GitHub repository: https://github.com/lamps-lab/Citation-Parser.
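To illustrate the task the abstract describes, the sketch below frames citation parsing as token-level sequence labeling: each token of a raw citation string receives a field tag. The field names and the toy rule-based labeler are purely illustrative stand-ins; the thesis itself uses a learned BiLSTM-Transformer-CRF model to predict the tag sequence.

```python
# Citation parsing as sequence labeling: one field tag per token.
# The tags ("author", "year", "title") and the naive rules below are
# illustrative only -- a stand-in for the learned model's predictions.
import re

def tokenize(citation: str):
    """Split a citation string into word and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", citation)

def toy_label(tokens):
    """Assign a field tag per token with naive handcrafted rules:
    tokens before a 4-digit year are 'author', the year itself is
    'year', and everything after is 'title'."""
    labels = []
    seen_year = False
    for tok in tokens:
        if re.fullmatch(r"(19|20)\d{2}", tok):
            labels.append("year")
            seen_year = True
        elif not seen_year:
            labels.append("author")
        else:
            labels.append("title")
    return labels

citation = "Councill , I . 2008 ParsCit : an open-source CRF reference string parsing package"
tokens = tokenize(citation)
print(list(zip(tokens, toy_label(tokens))))
```

A learned parser replaces these brittle rules with contextual features (word and character embeddings fed through BiLSTM/Transformer layers), while the CRF layer scores whole tag sequences so that, e.g., a "title" tag is unlikely to appear before the "author" span ends.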
Uddin, MD S.
"TransParsCit: A Transformer-Based Citation Parser Trained on Large-Scale Synthesized Data"
(2022). Master of Science (MS), Thesis, Computer Science, Old Dominion University, DOI: 10.25777/qrv9-m891