Quality Assessment of Scholarly Big Data

Description/Abstract/Artist Statement

Scholarly big data refers to the rapidly growing body of scholarly data held in digital networks and libraries. Datasets relevant to this research include the Semantic Scholar Open Research Corpus (S2ORC), Microsoft Academic Graph, and the US National Library of Medicine. All of these rely on automated information extraction tools to collect metadata from scholarly articles, and this automation introduces many sources of error because the models in the extraction libraries are imperfect. These digital libraries are used in many areas of analytical research, such as citation analysis, citation prediction, information extraction, and link analysis. This research compares metadata provided by S2ORC against a ground truth dataset to assess data quality, including document conflation (near-duplicate identification), paper linkage, author name disambiguation, coverage, and freshness. We found that the data linking quality of Semantic Scholar (S2) is high but not perfect: accuracies range from 0.91 to 0.99 depending on subject domain and data curation method. Given that there are 200 million paper records in S2, data users should take this into account when performing data coverage and network analysis between S2 and other databases.
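To make the evaluation idea concrete, the following is a minimal sketch of how a linking accuracy could be computed against a ground truth, and of what an accuracy in the reported range would mean at the scale of 200 million records. It is not the authors' evaluation code. The record IDs, the toy data, and the function names `linking_accuracy` and `expected_mislinked` are hypothetical, and the scale estimate assumes, purely for illustration, that a sample accuracy carries over uniformly to the whole corpus, which the abstract does not claim.

```python
from typing import Dict, Optional


def linking_accuracy(predicted: Dict[str, Optional[str]],
                     ground_truth: Dict[str, str]) -> float:
    """Fraction of ground-truth records whose predicted link equals the true link.

    A record missing from `predicted` counts as incorrect.
    """
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    correct = sum(
        1 for record_id, true_link in ground_truth.items()
        if predicted.get(record_id) == true_link
    )
    return correct / len(ground_truth)


def expected_mislinked(total_records: int, accuracy: float) -> int:
    """Rough count of mislinked records if `accuracy` held across the whole corpus.

    This uniform extrapolation is an illustrative assumption only.
    """
    return round(total_records * (1.0 - accuracy))


# Hypothetical toy data: record ID -> identifier of the paper it should link to.
ground_truth = {"p1": "doi:A", "p2": "doi:B", "p3": "doi:C", "p4": "doi:D"}
predicted = {"p1": "doi:A", "p2": "doi:B", "p3": "doi:X", "p4": "doi:D"}

print(f"linking accuracy: {linking_accuracy(predicted, ground_truth):.2f}")

# Both ends of the accuracy range reported in the abstract, at 200 million records.
for acc in (0.91, 0.99):
    print(f"accuracy {acc}: about {expected_mislinked(200_000_000, acc):,} records")
```

Under that uniform assumption, even the upper end of the range would leave on the order of millions of affected records, which is why the abstract advises data users to account for linking quality in coverage and network analysis.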

Presenting Author Name/s

Ryan Hiltabrand

Faculty Advisor/Mentor

Jian Wu

College Affiliation

College of Sciences

Presentation Type

Poster

Disciplines

Databases and Information Systems | Data Science

Session Title

Interdisciplinary Research #8

Location

Zoom Room HH

Start Date

3-20-2021 3:00 PM

End Date

3-20-2021 3:55 PM
