Assessing Near Duplicity and Document Linking Fidelity of the Semantic Scholar Open Research Corpus
Description/Abstract/Artist Statement
Scholarly big data refers to the rapidly growing body of scholarly papers deposited in digital repositories and libraries. Recently, the Allen Institute for Artificial Intelligence released the Semantic Scholar Open Research Corpus (S2ORC), one of the largest open-access scholarly big datasets, with more than 130 million scholarly papers. Like many other scholarly big datasets, S2ORC contains automatically extracted metadata, which was further used for (1) disambiguating near-duplicate papers, that is, different versions of a paper written toward the same submission, and (2) linking documents to external digital library databases. Imperfections in these two processes could affect downstream research such as citation analysis, citation prediction, and link analysis. In this project, we assessed (1) the near-duplicity quality of the S2ORC dataset and (2) the document linking fidelity using S2ORC metadata. We found that the quality of data linking in Semantic Scholar (S2) is high but not perfect: accuracies range from 0.91 to 0.99 depending on subject domain and data curation method. Near-duplicate disambiguation in this corpus is also imperfect: we identified up to 6,000 near-duplicate articles in 150,000 randomly selected samples using different curation methods. Given that there are 200 million paper records in S2, data users should be aware of these caveats when performing data coverage and network analysis between S2 and other databases.
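To give a sense of scale, the sample figures quoted above can be turned into a rough back-of-the-envelope projection. The sketch below is illustrative only and is not a result reported by the authors: it assumes the 150,000-record sample is representative of all 200 million S2 records, and it uses the "up to 6000" figure, so the output should be read as a naive upper bound. The function names are hypothetical.

```python
# Figures quoted in the abstract.
SAMPLE_SIZE = 150_000                    # randomly selected records
MAX_NEAR_DUPLICATES_IN_SAMPLE = 6_000    # "up to 6000" near-duplicates found
S2_TOTAL_RECORDS = 200_000_000           # paper records in S2


def near_duplicate_rate(duplicates: int, sample_size: int) -> float:
    """Fraction of sampled records flagged as near-duplicates."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    return duplicates / sample_size


def extrapolate_count(duplicates: int, sample_size: int, total: int) -> int:
    """Scale the sample count to the full collection.

    ASSUMPTION (not from the source): the sample rate carries over
    unchanged to the whole collection. Integer arithmetic avoids
    floating-point rounding in the scaled count.
    """
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    return total * duplicates // sample_size


rate = near_duplicate_rate(MAX_NEAR_DUPLICATES_IN_SAMPLE, SAMPLE_SIZE)
estimate = extrapolate_count(
    MAX_NEAR_DUPLICATES_IN_SAMPLE, SAMPLE_SIZE, S2_TOTAL_RECORDS
)

print(f"Upper-bound near-duplicate rate in sample: {rate:.1%}")
print(f"Naive upper-bound projection for S2: {estimate:,} records")
```

Under these assumptions, a 4% sample rate would correspond to roughly 8 million records across S2, which illustrates why the caveat matters for coverage and network analyses even when the per-record error rate looks small.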
Faculty Advisor/Mentor
Jian Wu
College Affiliation
College of Sciences
Presentation Type
Poster
Disciplines
Data Science
Session Title
Poster Session
Location
Learning Commons @ Perry Library
Start Date
3-19-2022 9:00 AM
End Date
3-19-2022 11:00 AM