Computer Science Faculty Publications

Position: Benchmarking is Broken - Don't Let AI Be Its Own Judge

Zerui Cheng, Princeton University
Stella Wohnig, CISPA Helmholtz Center for Information Security
Ruchika Gupta, Michigan State University
Samiul Alam, Ohio State University
Tassallah Abdullahi, Brown University
João Alves Ribeiro, Massachusetts Institute of Technology
Christian Nielsen-Garcia, University of California, Los Angeles
Saif Mir, Ohio State University
Siran Li, University of Tubingen
Jason Orender, Old Dominion University
Seyed Ali Bahrainian, University of Tubingen
Daniel Kirste, Technical University of Munich
Aaron Gokaslan, Cornell University
Carsten Eickhoff, University of Tubingen
Pramod Viswanath, Princeton University
Ruben Wolff, Forest AI

Document Type

Report

Publication Date

2025

DOI

10.13140/RG.2.2.33834.94408/1

Pages

1-12

Abstract

The meteoric rise of Artificial Intelligence (AI), with its rapidly expanding market capitalization, presents both transformative opportunities and critical challenges. Chief among these is the urgent need for a new, unified paradigm for trustworthy evaluation, as current benchmarks increasingly reveal critical vulnerabilities. Issues like data contamination and selective reporting by model developers fuel hype, while inadequate data quality control can lead to biased evaluations that, even if unintentionally, may favor specific approaches. As a flood of participants enters the AI space, this "Wild West" of assessment makes distinguishing genuine progress from exaggerated claims exceptionally difficult. Such ambiguity blurs scientific signals and erodes public confidence, much as unchecked claims would destabilize financial markets reliant on credible oversight from agencies like Moody’s.

Human exams like the SAT or GRE have achieved recognized standards of fairness and credibility; why settle for less in evaluating AI, especially given its profound societal impact? This position paper argues that the current laissez-faire approach is unsustainable. We contend that true, sustainable AI advancement demands a paradigm shift: a unified, live, and quality-controlled benchmarking framework robust by construction, not by mere courtesy and goodwill. To this end, we dissect the systemic flaws undermining today’s AI evaluation, distill the essential requirements for a new generation of assessments, and introduce a roadmap embodying this paradigm. Our goal is to pave the way for evaluations that can restore integrity and deliver genuinely trustworthy measures of AI progress.

Comments

This report is early-stage research and may not have been peer reviewed yet.

Rights

Published under the terms of the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) License.

Original Publication Citation

Cheng, Z., Wohnig, S., Gupta, R., Alam, S., Abdullahi, T., Alves Ribeiro, J., Nielsen-Garcia, C., Mir, S., Li, S., Orender, J., Bahrainian, S. A., Kirste, D., Gokaslan, A., Eickhoff, C., Viswanath, P., & Wolff, R. (2025). Position: Benchmarking is broken-Don't let AI be its own judge. https://doi.org/10.13140/RG.2.2.33834.94408/1

ORCID

0000-0001-7396-9996 (Orender)

Included in

Artificial Intelligence and Robotics Commons, Information Security Commons, Science and Technology Policy Commons