Document Type

Conference Paper

Publication Date

2025

Publication Title

CEUR Workshop Proceedings: Proceedings of the Workshop on Artificial Intelligence and the Science of Sciences co-located with the 25th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL 2025)

Volume

4161

Pages

1-13

Conference Name

AI4SciSci 2025: Workshop on Artificial Intelligence and the Science of Science, December 15-19, 2025, Virtual & DeKalb, Illinois, USA

Abstract

Verifying scientific claims is challenging for the general public because most people lack domain knowledge. Manual verification by domain experts is accurate but not scalable to the rising number of scientific claims on the Web. Whether emerging large language models (LLMs) and large reasoning models can be used for scientific claim verification, and how their performance compares to humans, remain open research questions. To this end, we developed a new benchmark, MSVEC2, consisting of 138 claims from credible fact-verification websites and science news outlets. Two tasks were given to both human and LLM participants. Task 1 asks the tester (LLM or human) to judge the truthfulness of a claim using only prior knowledge. Task 2 asks the tester to determine the stance of a scientific claim relative to the abstract of a research paper. The LLMs evaluated include GPT-3.5, GPT-4, GPT-4o, GPT-o1, and DeepSeek-R1. We recruited 23 college students across various majors for the human study. We found that all LLMs achieved higher F1 and accuracy than human testers in truthfulness classification (Task 1), with GPT-4o achieving the highest F1 score among all models. The performance of LLMs in stance classification (Task 2) depended on the prompting configuration, with chain-of-thought prompting yielding consistent improvements for all LLMs except GPT-o1. However, even the best LLM performance is still not sufficient for reliable scientific claim verification under standard prompt settings.

Comments

Publisher landing page: https://ceur-ws.org/Vol-4161/

Rights

© 2025 Copyright for this paper by its authors.

Use permitted under a Creative Commons Attribution 4.0 International (CC BY 4.0) License.

Original Publication Citation

Curtis, B., Dzhaman, S., Maisonave, M., & Wu, J. (2025). Humans vs. LLMs on open domain scientific claim verification: A baseline study. CEUR Workshop Proceedings, 4161, 1-13. https://ceur-ws.org/Vol-4161/paper1.pdf

ORCID

0009-0002-7518-9883 (Maisonave), 0000-0003-0173-4463 (Wu)
