Date of Award

Summer 8-2023

Document Type


Degree Name

Master of Science (MS)


Computer Science

Committee Director

Michael L. Nelson

Committee Member

Michele C. Weigle

Committee Member

Jian Wu


The definition of scholarly content has expanded to include the data and source code that contribute to a publication. While major archiving efforts to preserve conventional scholarly content, typically in PDFs (e.g., LOCKSS, CLOCKSS, Portico), are underway, no analogous effort has yet emerged to preserve the data and code referenced in those PDFs, particularly the scholarly code hosted online on Git Hosting Platforms (GHPs). Similarly, Software Heritage is working to archive public source code, but there is value in archiving the surrounding ephemera that provide important context to the code while maintaining their original URIs. In current implementations, source code and its ephemera are not preserved, which presents a problem for scholarly projects where reproducibility matters. To quantify the scope of this issue, we analyzed the use of GHP URIs in the arXiv and PMC corpora. In total, there were 253,590 URIs to GitHub, SourceForge, Bitbucket, and GitLab repositories across the 2.64 million publications. Authors have increasingly included GHP URIs in scholarly publications and, in 2021, one in five arXiv publications included a GitHub URI. Next, we analyzed the archival coverage of scholarly GHP URIs in Web archives and Software Heritage. Overall, 79.15% of GHP URIs were archived in Web archives, while only 62.06% of GHP URIs were archived in Software Heritage. We used a machine learning classifier to identify other Open Access Data and Software (OADS) URIs outside of the four GHPs previously studied. We found almost 50,000 unique OADS hostnames and more non-GHP OADS URIs than GHP URIs. The prevalence of OADS URIs and the vast number of unique hostnames point to the utility of a classifier for identifying OADS URIs, as opposed to manual enumeration.
Lastly, we found a statistically significant relationship between the popularity of a GitHub repository, as determined by engagement metrics, and its archival coverage, indicating that less popular repositories are less likely to be archived and, thus, more vulnerable to being unrecoverable. The growing use of GHPs in scholarly publications points to an urgent and growing need for dedicated efforts to archive their holdings in order to preserve research code and its scholarly ephemera.
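
As a rough illustration of the first step described above (finding GHP URIs in publication text), the following minimal sketch matches URIs against the four platforms by hostname. It is not the thesis's actual pipeline: the hostname table, the regular expression, and the function names `classify_ghp` and `count_ghp_uris` are assumptions made for this example, and project subdomains (for instance on SourceForge) are deliberately not handled.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Assumed canonical hostnames for the four GHPs named in the abstract.
GHP_HOSTS = {
    "github.com": "GitHub",
    "gitlab.com": "GitLab",
    "bitbucket.org": "Bitbucket",
    "sourceforge.net": "SourceForge",
}

# Simple, permissive URI matcher (an assumption; real extraction from PDFs
# has to cope with line breaks, hyphenation, and other noise).
URI_PATTERN = re.compile(r"https?://[^\s<>\"')\]]+")


def classify_ghp(uri):
    """Return the platform name for a GHP URI, or None for any other URI."""
    host = (urlparse(uri).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    return GHP_HOSTS.get(host)


def count_ghp_uris(text):
    """Count GHP URIs per platform in a block of publication text."""
    counts = Counter()
    for uri in URI_PATTERN.findall(text):
        # Drop sentence punctuation that the pattern may have captured.
        platform = classify_ghp(uri.rstrip(".,;"))
        if platform:
            counts[platform] += 1
    return counts
```

Calling `count_ghp_uris` on the extracted text of each publication and summing the resulting counters would give per-platform totals. Exact hostname matching keeps the sketch simple, but it only works for a short, known list of platforms, which is the limitation that motivates a classifier for the much larger set of OADS hostnames.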


In Copyright. URI: This Item is protected by copyright and/or related rights. You are free to use this Item in any way that is permitted by the copyright and related rights legislation that applies to your use. For other uses you need to obtain permission from the rights-holder(s).