Document Type
Conference Paper
Publication Date
2025
Publication Title
CEUR Workshop Proceedings: Proceedings of the Workshop on Artificial Intelligence and the Science of Science co-located with the 25th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL 2025)
Volume
4161
Pages
1-13
Conference Name
AI4SciSci 2025: Workshop on Artificial Intelligence and the Science of Science, December 15-19, 2025, Virtual & DeKalb, Illinois, USA
Abstract
Verifying scientific claims is challenging for the general public because most people lack domain knowledge. Manual verification by domain experts is accurate but does not scale to the rising number of scientific claims on the Web. Whether emerging large language models (LLMs) and large reasoning models can be used for scientific claim verification, and how their performance compares with that of humans, remain open research questions. To this end, we developed MSVEC2, a new benchmark consisting of 138 claims drawn from credible fact-verification websites and science news outlets. Both human and LLM participants were given two tasks. Task 1 asks the tester (LLM or human) to judge the truthfulness of a claim using only prior knowledge. Task 2 asks the tester to determine the stance of a scientific claim relative to the abstract of a research paper. The LLMs evaluated include GPT-3.5, GPT-4, GPT-4o, GPT-o1, and DeepSeek-R1; for the human study, we recruited 23 college students from various majors. We found that all LLMs scored higher than human testers in both F1 and accuracy on truthfulness classification (Task 1), with GPT-4o achieving the highest F1 score among the models. LLM performance on stance classification (Task 2) depended on the prompting configuration, with chain-of-thought prompting yielding consistent improvements for all LLMs except GPT-o1. However, even the best LLM performance is still insufficient for reliable scientific claim verification under standard prompt settings.
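The abstract describes the two evaluation tasks but this record does not reproduce the prompts themselves. Below is a minimal Python sketch of how such zero-shot queries might be issued; the prompt wording, label sets (TRUE/FALSE, SUPPORT/REFUTE), and the use of the OpenAI Chat Completions client are illustrative assumptions, not the paper's actual setup.

```python
# Minimal sketch of the two-task protocol summarized in the abstract.
# Prompt wording and label sets are illustrative assumptions; the paper's
# exact prompts are not given in this record.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def ask(model: str, prompt: str) -> str:
    """Send a single zero-shot prompt and return the model's reply."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()


def task1_truthfulness(model: str, claim: str) -> str:
    # Task 1: judge a claim from prior knowledge alone (no evidence given).
    prompt = (
        "Using only your prior knowledge, is the following scientific claim "
        f"true or false? Answer with one word, TRUE or FALSE.\n\nClaim: {claim}"
    )
    return ask(model, prompt)


def task2_stance(model: str, claim: str, abstract: str, cot: bool = False) -> str:
    # Task 2: classify the stance of a paper abstract toward the claim.
    prompt = (
        f"Claim: {claim}\n\nAbstract: {abstract}\n\n"
        "Does the abstract SUPPORT or REFUTE the claim?"
    )
    if cot:  # chain-of-thought variant: elicit reasoning before the label
        prompt += " Think step by step, then give your final answer."
    return ask(model, prompt)


if __name__ == "__main__":
    # Hypothetical claim for demonstration only; not from the MSVEC2 benchmark.
    print(task1_truthfulness("gpt-4o", "Vaccines cause autism."))
```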
Rights
© 2025 Copyright for this paper by its authors.
Use permitted under a Creative Commons Attribution 4.0 International (CC BY 4.0) License.
Original Publication Citation
Curtis, B., Dzhaman, S., Maisonave, M., & Wu, J. (2025). Humans vs. LLMs on open domain scientific claim verification: A baseline study. CEUR Workshop Proceedings, 4161, 1-13. https://ceur-ws.org/Vol-4161/paper1.pdf
Repository Citation
Curtis, B., Dzhaman, S., Maisonave, M., & Wu, J. (2025). Humans vs. LLMs on open domain scientific claim verification: A baseline study. CEUR Workshop Proceedings, 4161, 1-13. https://ceur-ws.org/Vol-4161/paper1.pdf
ORCID
0009-0002-7518-9883 (Maisonave), 0000-0003-0173-4463 (Wu)
Included in
Artificial Intelligence and Robotics Commons, Information Literacy Commons, Scholarly Communication Commons
Comments
Publisher landing page: https://ceur-ws.org/Vol-4161/