Scholarly Very Large Data: Challenges for Digital Libraries (White Paper)

Document Type

White Paper

Publication Date

2020

Publication Title

Large Scale Networking (LSN) Workshop on Huge Data: A Computing, Networking, and Distributed Systems Perspective



Conference Name

Large Scale Networking (LSN) Workshop on Huge Data: A Computing, Networking and Distributed Systems Perspective, April 13-14, 2020, Chicago, Illinois


The volume of scholarly data has been growing exponentially over the last 50 years. The total number of open access documents is estimated to reach 35 million by 2022. The total amount of data to be handled, including crawled documents, the production repository, metadata, extracted content, and their replications, can be as high as 350 TB. Academic digital library search engines face significant challenges in maintaining sustainable services. We discuss these challenges and propose feasible solutions for key modules in the digital library architecture, including document storage, data extraction, the database, and the index. We use CiteSeerX as a case study.


© 2020 Association for Computing Machinery.

"ACM treats links as citations (references to objects) rather than as incorporations (embedding of objects). Permission is not needed to create links. ACM encourages the widespread distribution of links to the definitive version of records of its copyrighted works and does not require that authors obtain prior permission to include such links in their new works."

Original Publication Citation

Wu, J., & Giles, C. L. (2020) Scholarly very large data: Challenges for digital libraries [White paper]. Association for Computing Machinery.


ORCID

0000-0003-0173-4463 (Wu)