ETDPC: A Multimodality Framework for Classifying Pages in Electronic Theses and Dissertations
College
College of Sciences
Department
Computer Science
Graduate Level
Doctoral
Presentation Type
Poster Presentation
Abstract
Electronic Theses and Dissertations (ETDs) have been proposed, advocated, and generated for over 25 years. However, this corpus remains understudied because of characteristics, such as length and varied content, that distinguish it from regular conference proceedings and journal papers. ETDs, typically 100-400 pages long, are submitted in partial fulfillment of the requirements for academic degrees and are hosted by university libraries or centralized systems such as ProQuest. Despite their significance, ETD repositories lack efficient tools for content discovery, underscoring the need for segmentation to facilitate better knowledge extraction.
Segmentation is crucial because ETDs differ from traditional scholarly papers, presenting challenges such as varied scanned-image resolutions, complex document structures, limitations of existing frameworks, and a scarcity of training samples. Using a bottom-up approach, we previously contributed datasets and methods to segment ETDs [1]. That method automatically annotated major structural components but still performs poorly on minority classes (e.g., date, degree, equation, algorithm) due to the lack of training samples. Moreover, state-of-the-art models [2, 3] fine-tuned on ETDs do not generalize well (e.g., achieving 9% accuracy), and retraining them is non-trivial given the scarcity of data.
Therefore, we take a top-down approach by designing a new framework called ETDPC (Electronic Theses and Dissertation Page Classifier), which employs a two-stream multimodal model with a cross-attention network to classify ETD pages into 13 categories. ETDPC outperforms existing models, achieving F1 scores of 0.84–0.96 for 9 of the 13 categories. In addition, our contributions include augmentation methods that generate pseudo training samples for minority classes; the ETD500 dataset with annotations, page images (PNGs), text, and bounding boxes; and a quantitative analysis demonstrating the system's robustness.
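To make the two-stream architecture concrete, below is a minimal PyTorch sketch of a cross-attention page classifier in the spirit of ETDPC, assuming a ResNet-50 vision stream and a bert-base-uncased text stream (the ResNet and BERT choices follow the Keywords; the hidden size, pooling, and fusion details are illustrative assumptions, and the sketch uses standard multi-head attention rather than the talking-heads attention the Keywords mention).

```python
# A minimal sketch of a two-stream multimodal page classifier with
# cross-attention. Hyperparameters and fusion details are illustrative
# assumptions, not the exact ETDPC configuration.
import torch.nn as nn
from torchvision.models import resnet50
from transformers import BertModel

class TwoStreamPageClassifier(nn.Module):
    def __init__(self, num_classes=13, hidden=768):
        super().__init__()
        # Vision stream: ResNet-50 backbone; drop the final fc layer and
        # project the pooled 2048-d feature to the shared hidden size.
        backbone = resnet50(weights="IMAGENET1K_V2")
        self.vision = nn.Sequential(*list(backbone.children())[:-1])
        self.vision_proj = nn.Linear(2048, hidden)
        # Text stream: BERT encoder over the page's OCR text.
        self.text = BertModel.from_pretrained("bert-base-uncased")
        # Cross-attention: text tokens attend to the visual feature.
        self.cross_attn = nn.MultiheadAttention(hidden, num_heads=8,
                                                batch_first=True)
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, pixel_values, input_ids, attention_mask):
        v = self.vision(pixel_values).flatten(1)      # (B, 2048)
        v = self.vision_proj(v).unsqueeze(1)          # (B, 1, H)
        t = self.text(input_ids=input_ids,
                      attention_mask=attention_mask).last_hidden_state  # (B, L, H)
        fused, _ = self.cross_attn(query=t, key=v, value=v)
        pooled = fused.mean(dim=1)                    # mean-pool fused tokens
        return self.classifier(pooled)                # (B, 13) page-class logits
```

For brevity the image contributes a single pooled vector here; a richer variant would let text tokens attend over spatial feature-map patches instead.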
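Likewise, one plausible way to generate pseudo training samples for minority page classes is image-side augmentation of the scanned pages. The transforms below (slight rotation, brightness/contrast jitter, mild random crops) and the oversampling factor are illustrative assumptions, not the specific augmentation recipe used in the paper.

```python
# A hedged sketch of image augmentation to oversample minority page classes;
# the transforms and factor n are illustrative, not the exact ETDPC recipe.
from torchvision import transforms
from PIL import Image

minority_augment = transforms.Compose([
    transforms.RandomRotation(degrees=2),                   # slight scan skew
    transforms.ColorJitter(brightness=0.2, contrast=0.2),   # scanner variance
    transforms.RandomResizedCrop(size=(1024, 768), scale=(0.9, 1.0)),
])

def make_pseudo_samples(page_png_path: str, n: int = 5) -> list[Image.Image]:
    """Generate n perturbed copies of a minority-class page image."""
    page = Image.open(page_png_path).convert("RGB")
    return [minority_augment(page) for _ in range(n)]
```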
Keywords
Digital Libraries, AI, Multimodal, Vision Model (e.g., ResNet), Language Model (e.g., BERT with Talking Heads Attention), Data Augmentation
Comments
References: [1] A. Ahuja, A. Devera, and E. A. Fox, "Parsing electronic theses and dissertations using object detection," in Proceedings of the First Workshop on Information Extraction from Scientific Publications, (Online), pp. 121–130, Association for Computational Linguistics, Nov. 2022.
[2] Y. Xu, Y. Xu, T. Lv, L. Cui, F. Wei, G. Wang, Y. Lu, D. Florencio, C. Zhang, W. Che, M. Zhang, and L. Zhou, "LayoutLMv2: Multi-modal pre-training for visually-rich document understanding," in Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), (Online), pp. 2579–2591, Association for Computational Linguistics, Aug. 2021.
[3] S. Appalaraju, B. Jasani, B. U. Kota, Y. Xie, and R. Manmatha, "DocFormer: End-to-end transformer for document understanding," in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 993–1003, Oct. 2021.