Assessing Near Duplicity and Document Linking Fidelity of the Semantic Scholar Open Research Corpus

Description/Abstract/Artist Statement

Scholarly big data refers to the rapidly growing body of scholarly papers deposited in digital repositories and libraries. Recently, the Allen Institute for Artificial Intelligence released the Semantic Scholar Open Research Corpus (S2ORC), one of the largest open-access scholarly big datasets, with more than 130 million scholarly papers. Like many other scholarly big datasets, S2ORC contains automatically extracted metadata, which was further used for (1) disambiguating near-duplicate papers (different versions of a paper written toward the same submission) and (2) linking documents to external digital library databases. Imperfections in disambiguation and linking could affect downstream research such as citation analysis, citation prediction, and link analysis. In this project, we assessed (1) the near-duplicity quality of the S2ORC dataset and (2) the document linking fidelity using S2ORC metadata. We found that the data linking quality of Semantic Scholar (S2) is high but not perfect: accuracy ranges from 0.91 to 0.99, depending on subject domain and data curation method. The near-duplicate disambiguation of this corpus is also imperfect. We identified up to 6,000 near-duplicate articles among 150,000 randomly selected samples using different curation methods. Given that S2 contains 200 million paper records, data users should be aware of these caveats when performing data coverage and network analysis between S2 and other databases.
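To put the reported figures in perspective, the arithmetic can be sketched as below. This is an illustrative calculation only, not part of the poster's method: it assumes the upper-bound sample rate (6,000 in 150,000) carries over uniformly to all 200 million S2 records, which the abstract does not claim. The function names are hypothetical.

```python
def near_duplicate_rate(duplicates_found: int, sample_size: int) -> float:
    """Fraction of sampled records flagged as near-duplicates."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    return duplicates_found / sample_size


def extrapolate(rate: float, corpus_size: int) -> int:
    """Scale a sample rate to a corpus size.

    Assumption (not from the source): the sample rate holds uniformly
    across the whole corpus.
    """
    return round(rate * corpus_size)


# Figures taken from the abstract; "up to 6,000" makes this an upper bound.
rate = near_duplicate_rate(6_000, 150_000)
estimate = extrapolate(rate, 200_000_000)

print(f"Sample near-duplicate rate: {rate:.1%}")        # 4.0%
print(f"Naive upper-bound estimate: {estimate:,} records")  # 8,000,000 records
```

Under that assumption, a 4% upper-bound rate would correspond to roughly 8 million near-duplicate records, which is why the caveat matters for coverage and network analysis.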

Presenting Author Name/s

Ryan Hiltabrand

Faculty Advisor/Mentor

Jian Wu

College Affiliation

College of Sciences

Presentation Type

Poster

Disciplines

Data Science

Session Title

Poster Session

Location

Learning Commons @ Perry Library

Start Date

3-19-2022 9:00 AM

End Date

3-19-2022 11:00 AM

This document is currently not available here.
