Document Type

Conference Paper

Publication Date

2025

DOI

10.1145/3717867.3717920

Publication Title

Websci '25: Proceedings of the 17th ACM Web Science Conference 2025

Pages

449-459

Conference Name

Websci '25: Proceedings of the 17th ACM Web Science Conference 2025, 20-24 May 2025, New Brunswick, New Jersey

Abstract

Software is often developed using version control software, such as Git, and hosted on centralized Web hosts, such as GitHub and GitLab. These Web-hosted software repositories are made available to users in the form of traditional HTML Web pages for each source file and directory, as well as a presentational home page and various descriptive pages. We examined more than 12,000 Web-hosted Git repository project home pages, primarily from GitHub, to measure how well their presentational components are preserved in the Internet Archive, as well as the source trees of the collected GitHub repositories to assess the extent to which their source code has been preserved. We found that more than 31% of the archived repository home pages examined exhibited some form of minor page damage and 1.6% exhibited major page damage. We also found that of the source trees analyzed, less than 5% of their source files were archived, on average, with the majority of repositories not having source files saved in the Internet Archive at all. The highest concentration of archived source files was among those linked directly from repositories’ home pages, at a rate of 14.89% across all available repositories, dropping off sharply at deeper levels of a repository’s directory tree.

Rights

© 2025 Copyright held by the owner/authors.

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) License.

Original Publication Citation

Calano, D., Nelson, M., Weigle, M. (2025). GitHub repository complexity leads to diminished web archive availability. In M. Twyman, S. Rajtmajer, V. K. Singh, F. Morstatter, H. Liu, J. Sun, K. Ognyanova, & M. Weber (Eds.), Websci '25: Proceedings of the 17th ACM Web Science Conference 2025 (pp. 449-459). Association for Computing Machinery. https://doi.org/10.1145/3717867.3717920

ORCID

0000-0002-8710-2274 (Calano), 0000-0003-3749-8116 (Nelson), 0000-0002-2787-7166 (Weigle)
