
Automatic Reviewers Fail to Detect Faulty Reasoning in Research Papers: A New Counterfactual Evaluation Framework

Dycke, Nils
Gurevych, Iryna
Department
Natural Language Processing
Type
Journal article
License
http://creativecommons.org/licenses/by/4.0/
Language
English
Abstract
Large Language Models (LLMs) have great potential to accelerate and support scholarly peer review and are increasingly used as fully automatic review generators (ARGs). However, potential biases and systematic errors may pose significant risks to scientific integrity; understanding the specific capabilities and limitations of state-of-the-art ARGs is essential. We focus on a core reviewing skill that underpins high-quality peer review: detecting faulty research logic. This involves evaluating the internal consistency between a paper’s results, interpretations, and claims. We present a fully automated counterfactual evaluation framework that isolates and tests this skill under controlled conditions. Testing a range of ARG approaches, we find that, contrary to expectation, flaws in research logic have no significant effect on their output reviews. Based on our findings, we derive three actionable recommendations for future work and release our counterfactual dataset and evaluation framework publicly.
Citation
N. Dycke, I. Gurevych, "Automatic Reviewers Fail to Detect Faulty Reasoning in Research Papers: A New Counterfactual Evaluation Framework," Transactions of the Association for Computational Linguistics, vol. 14, pp. 465-488, 2026, https://doi.org/10.1162/tacl.a.642.
Source
Transactions of the Association for Computational Linguistics
Keywords
46 Information and Computing Sciences, 4608 Human-Centred Computing
Publisher
MIT Press