Scholarly Big Data Quality Assessment: A Case Study of Document Linking and Conflation with S2ORC

Document Type

Conference Paper

Publication Date

2022

DOI

10.1145/3558100.3563850

Publication Title

DocEng '22: Proceedings of the 22nd ACM Symposium on Document Engineering

Pages

Article 16 (4 pp.)

Conference Name

ACM Symposium on Document Engineering 2022 (DocEng '22), September 20-23, 2022, San Jose, California

Abstract

Recently, the Allen Institute for Artificial Intelligence released the Semantic Scholar Open Research Corpus (S2ORC), one of the largest open-access scholarly big datasets with more than 130 million scholarly paper records. S2ORC contains a significant portion of automatically generated metadata. The metadata quality could impact downstream tasks such as citation analysis, citation prediction, and link analysis. In this project, we assess the document linking quality and estimate the document conflation rate for the S2ORC dataset. Using semi-automatically curated ground truth corpora, we estimated that the overall document linking quality is high, with 92.6% of documents correctly linking to six major databases, but the linking quality varies depending on subject domains. The document conflation rate is around 2.6%, meaning that about 97.4% of documents are unique. We further quantitatively compared three near-duplicate detection methods using the ground truth created from S2ORC. The experiments indicated that locality-sensitive hashing was the best method in terms of effectiveness and scalability, achieving high performance (F1=0.960) and a much reduced runtime. Our code and data are available at https://github.com/lamps-lab/docconflation.
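
As a rough illustration of the locality-sensitive hashing approach the abstract mentions, the following is a minimal MinHash-with-banding sketch for finding near-duplicate titles. It is not the authors' implementation (see their repository for that). The function names, the use of character 3-gram shingles, and the parameters `num_perm` and `bands` are all assumptions chosen for illustration.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, k=3):
    """Character k-grams of a lowercased, whitespace-normalized string."""
    norm = " ".join(text.lower().split())
    return {norm[i:i + k] for i in range(max(1, len(norm) - k + 1))}


def minhash_signature(items, num_perm=64):
    """For each seeded hash function, keep the minimum hash over the set."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "big")  # distinct salt = distinct hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in items
        ))
    return signature


def candidate_pairs(docs, num_perm=64, bands=16):
    """Return pairs of document ids whose signatures agree on at least one band."""
    rows = num_perm // bands
    buckets = defaultdict(set)
    for doc_id, text in docs.items():
        sig = minhash_signature(shingles(text), num_perm)
        for b in range(bands):
            band = tuple(sig[b * rows:(b + 1) * rows])
            buckets[(b, band)].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Hypothetical records: "a" and "b" differ only in case and spacing,
# so they normalize to the same shingle set and share every band.
docs = {
    "a": "Scholarly Big Data Quality Assessment",
    "b": "scholarly  big data   QUALITY assessment",
    "c": "Neural OCR pipeline",
}
print(candidate_pairs(docs))
```

Only documents that land in the same bucket are compared, which is what lets this kind of method avoid an all-pairs comparison on large corpora. In practice the candidate pairs would still be verified with an exact similarity measure before being treated as conflated records.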

Rights

This work is licensed under a Creative Commons Attribution International 4.0 License.

Original Publication Citation

Wu, J., Hiltabrand, R., Soós, D., & Giles, C. L. (2022). Scholarly big data quality assessment: A case study of document linking and conflation with S2ORC. In DocEng '22: Proceedings of the 22nd ACM Symposium on Document Engineering (Article 16, pp. 1-4). Association for Computing Machinery. https://doi.org/10.1145/3558100.3563850

ORCID

0000-0003-0173-4463 (Wu)

Computer Science Faculty Publications
