Date of Award
Summer 8-2023
Document Type
Thesis
Degree Name
Master of Science (MS)
Department
Computer Science
Committee Director
Michael L. Nelson
Committee Member
Michele C. Weigle
Committee Member
Jian Wu
Abstract
The definition of scholarly content has expanded to include the data and source code that contribute to a publication. While major archiving efforts to preserve conventional scholarly content, typically in PDFs (e.g., LOCKSS, CLOCKSS, Portico), are underway, no analogous effort has yet emerged to preserve the data and code referenced in those PDFs, particularly the scholarly code hosted online on Git Hosting Platforms (GHPs). Similarly, Software Heritage is working to archive public source code, but there is value in archiving the surrounding ephemera that provide important context to the code while maintaining their original URIs. In current implementations, source code and its ephemera are not preserved, which presents a problem for scholarly projects where reproducibility matters. To quantify the scope of this issue, we analyzed the use of GHP URIs in the arXiv and PMC corpora. In total, there were 253,590 URIs to GitHub, SourceForge, Bitbucket, and GitLab repositories across the 2.64 million publications. Authors have increasingly included GHP URIs in scholarly publications and, in 2021, one in five arXiv publications included a GitHub URI. Next, we analyzed the archival coverage of scholarly GHP URIs in Web archives and Software Heritage. Overall, 79.15% of GHP URIs were archived in the Web archives, while only 62.06% of GHP URIs were archived in Software Heritage. We used a machine learning classifier to identify other Open Access Data and Software (OADS) URIs outside of the four GHPs previously studied. We found almost 50,000 unique OADS hostnames and more non-GHP OADS URIs than GHP URIs. The prevalence of OADS URIs and the vast number of unique hostnames point to the utility of a classifier to identify OADS URIs, as opposed to manual enumeration.
Lastly, we found a statistically significant relationship between the popularity of a GitHub repository, as determined by engagement metrics, and archival coverage, indicating that less popular repositories are less likely to be archived and, thus, more vulnerable to being unrecoverable. The growing use of GHPs in scholarly publications points to an urgent and growing need for dedicated efforts to archive their holdings in order to preserve research code and its scholarly ephemera.
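As an illustration of the first step described in the abstract (finding URIs to the four GHPs in publication text), the following is a minimal Python sketch. It is not the thesis's actual extraction pipeline: the hostname table, the regular expression, and the function names `classify_ghp` and `count_ghp_uris` are assumptions made for this example only.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Assumed canonical domains for the four GHPs named in the abstract.
GHP_HOSTS = {
    "github.com": "GitHub",
    "gitlab.com": "GitLab",
    "bitbucket.org": "Bitbucket",
    "sourceforge.net": "SourceForge",
}

# Simple, deliberately loose URI matcher (an assumption, not a full URI grammar).
URI_PATTERN = re.compile(r"https?://[^\s<>\"')\]]+")


def classify_ghp(uri):
    """Return the GHP name for a URI, or None if its host is not a known GHP."""
    try:
        host = (urlsplit(uri).hostname or "").lower()
    except ValueError:
        # Malformed URI (for example, an unbalanced IPv6 bracket).
        return None
    if host.startswith("www."):
        host = host[4:]
    for domain, name in GHP_HOSTS.items():
        # Match the domain itself or any subdomain of it.
        if host == domain or host.endswith("." + domain):
            return name
    return None


def count_ghp_uris(text):
    """Count GHP URIs per platform in a block of extracted publication text."""
    counts = Counter()
    for match in URI_PATTERN.findall(text):
        uri = match.rstrip(".,;:")  # drop trailing sentence punctuation
        platform = classify_ghp(uri)
        if platform is not None:
            counts[platform] += 1
    return counts


if __name__ == "__main__":
    sample = (
        "Code at https://github.com/user/repo. "
        "Data: https://gitlab.com/g/p, see also http://example.org/x."
    )
    print(count_ghp_uris(sample))
```

A fixed hostname table like this covers only the four platforms studied first; the abstract's finding of almost 50,000 unique OADS hostnames is the reason a classifier was used for the broader OADS analysis rather than manual enumeration.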
Rights
In Copyright. URI: http://rightsstatements.org/vocab/InC/1.0/ This Item is protected by copyright and/or related rights. You are free to use this Item in any way that is permitted by the copyright and related rights legislation that applies to your use. For other uses you need to obtain permission from the rights-holder(s).
DOI
10.25777/sczh-wh93
ISBN
9798380395007
Recommended Citation
Escamilla, Emily.
"Assessing the Prevalence and Archival Rate of URIs to Git Hosting Platforms in Scholarly Publications"
(2023). Master of Science (MS), Thesis, Computer Science, Old Dominion University, DOI: 10.25777/sczh-wh93
https://digitalcommons.odu.edu/computerscience_etds/142
ORCID
0000-0003-3845-7842
Included in
Archival Science Commons, Cataloging and Metadata Commons, Computer Engineering Commons, Computer Sciences Commons