Document Type

Conference Paper

Publication Date

2025

Publication Title

CEUR Workshop Proceedings: Proceedings of the Workshop on Artificial Intelligence and the Science of Sciences co-located with the 25th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL 2025)

Volume

4161

Pages

1-13

Conference Name

AI4SciSci 2025: Workshop on Artificial Intelligence and the Science of Science, December 15-19, 2025, Virtual & DeKalb, Illinois, USA

Abstract

Verifying scientific claims is challenging for the general public because most people lack domain knowledge. Manual verification by domain experts is accurate but not scalable to the rising number of scientific claims on the Web. Whether emerging large language models (LLMs) and large reasoning models can be used for scientific claim verification, and how their performance compares to humans, remain open research questions. To this end, we developed a new benchmark, MSVEC2, consisting of 138 claims from credible fact-verification websites and science news outlets. Two tasks were given to both human and LLM participants. Task 1 asks the tester (LLM or human) to judge the truthfulness of a claim using only prior knowledge. Task 2 asks the tester to determine the stance of a scientific claim relative to the abstract of a research paper. The LLMs evaluated include GPT-3.5, GPT-4, GPT-4o, GPT-o1, and DeepSeek-R1. We recruited 23 college students across various majors for the human study. We found that all LLMs achieved higher F1 and accuracy than human testers in truthfulness classification (Task 1), with GPT-4o achieving the highest F1 score among all models. The performance of LLMs in stance classification (Task 2) depended on the prompting configuration, with chain-of-thought prompting yielding consistent improvements for all LLMs except GPT-o1. However, even the best LLM performance is still not sufficient for reliable scientific claim verification under standard prompt settings.

Comments

Publisher landing page: https://ceur-ws.org/Vol-4161/

Rights

© 2025 Copyright for this paper by its authors.

Use permitted under a Creative Commons Attribution 4.0 International (CC BY 4.0) License.

Original Publication Citation

Curtis, B., Dzhaman, S., Maisonave, M., & Wu, J. (2025). Humans vs. LLMs on open domain scientific claim verification: A baseline study. CEUR Workshop Proceedings, 4161, 1-13. https://ceur-ws.org/Vol-4161/paper1.pdf

ORCID

0009-0002-7518-9883 (Maisonave), 0000-0003-0173-4463 (Wu)
