Quality Assessment of Scholarly Big Data
Description/Abstract/Artist Statement
Scholarly big data refers to the rapidly growing volume of scholarly data held in digital networks and libraries. Examples include the Semantic Scholar Open Research Corpus (S2ORC), Microsoft Academic Graph, and the US National Library of Medicine. These all use automated information extraction tools to collect metadata from scholarly articles. This automation introduces many sources of error because the models in the extraction libraries are imperfect. These digital libraries support many areas of analytical research, such as citation analysis, citation prediction, information extraction, and link analysis. This research uses metadata provided by S2ORC, which is compared to a ground truth dataset to assess data quality, including document conflation (near-duplicate identification), paper linkage, author name disambiguation, coverage, and freshness. We found that the quality of data linking in S2 is high but not perfect. Accuracies range from 0.91 to 0.99, depending on subject domain and data curation method. Given that S2 contains 200 million paper records, data users should take this into account when performing coverage and network analyses between S2 and other databases.
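To illustrate what an accuracy of 0.91 to 0.99 can mean at this scale, the following is a minimal sketch, not the method used in this research. It assumes that linkage accuracy is measured as the fraction of ground-truth records whose predicted link matches; the function names, identifiers, and toy data are hypothetical. Applying a single accuracy figure to all 200 million records is also an assumption made only for illustration, since the reported accuracies vary by subject domain and curation method.

```python
def linkage_accuracy(predicted, ground_truth):
    """Return the fraction of ground-truth records whose predicted link matches.

    Both arguments map a paper ID to the ID it is linked to in another
    database. This definition of accuracy is an assumption for illustration.
    """
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    correct = sum(
        1 for paper_id, true_link in ground_truth.items()
        if predicted.get(paper_id) == true_link
    )
    return correct / len(ground_truth)


def expected_errors(total_records, accuracy):
    """Estimate how many records would be mislinked at a given accuracy.

    Assumes the accuracy applies uniformly to every record, which is a
    simplification.
    """
    return round(total_records * (1 - accuracy))


# Hypothetical toy data: four ground-truth links, one predicted incorrectly.
ground_truth = {"p1": "x1", "p2": "x2", "p3": "x3", "p4": "x4"}
predicted = {"p1": "x1", "p2": "x2", "p3": "x9", "p4": "x4"}
accuracy = linkage_accuracy(predicted, ground_truth)

# Rough scale of possible linkage errors across 200 million records.
low_estimate = expected_errors(200_000_000, 0.99)
high_estimate = expected_errors(200_000_000, 0.91)
```

Under these assumptions, the estimate runs from about 2 million to about 18 million potentially mislinked records, which is why even a high accuracy deserves attention in coverage and network analyses.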
Faculty Advisor/Mentor
Jian Wu
College Affiliation
College of Sciences
Presentation Type
Poster
Disciplines
Databases and Information Systems | Data Science
Session Title
Interdisciplinary Research #8
Location
Zoom Room HH
Start Date
3-20-2021 3:00 PM
End Date
3-20-2021 3:55 PM