Style Over Substance: Evaluation Biases for Large Language Models
Wu, Minghao ; Aji, Alham Fikri
Department
Natural Language Processing
Type
Conference proceeding
Date
2025
Language
English
Abstract
As large language models (LLMs) continue to advance, accurately and comprehensively evaluating their performance becomes increasingly challenging. Ranking the relative performance of LLMs based on Elo ratings, according to human or LLM judgment, is becoming more popular. However, the extent to which humans and LLMs are capable of serving as reliable evaluators remains uncertain. This study investigates the behavior of crowd-sourced and expert annotators, as well as LLMs, when comparing outputs from different models. To achieve this, we curate a dataset of intentionally flawed, machine-generated answers. Our findings reveal a concerning bias in the evaluation process: answers with factual errors are rated more favorably than answers that are too short or contain grammatical errors. To address this issue, we propose independently evaluating machine-generated text across multiple dimensions, rather than merging all evaluation aspects into a single score. We instantiate this idea with the Elo rating system, resulting in the Multi-Elo Rating System (MERS). Empirical results from our study reveal that this proposed approach significantly enhances the quality of LLM-based evaluations, particularly in terms of factual accuracy. However, there is no significant improvement in crowd-sourced evaluations, indicating the need for further investigation.
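The abstract's core proposal is to keep a separate Elo rating per evaluation dimension instead of one merged score. A minimal sketch of that idea is below; the dimension names and the K-factor are illustrative assumptions, not the authors' implementation, and the update rule is the standard Elo formula (expected score E_A = 1 / (1 + 10^((R_B − R_A)/400)), then R_A ← R_A + K(S_A − E_A)).

```python
# Hypothetical sketch of per-dimension Elo updates, not the paper's code.

def expected_score(r_a, r_b):
    """Expected win probability of A over B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def elo_update(r_a, r_b, outcome_a, k=32):
    """One pairwise comparison: outcome_a is 1.0 (A wins), 0.5 (tie), 0.0 (A loses)."""
    e_a = expected_score(r_a, r_b)
    return r_a + k * (outcome_a - e_a), r_b + k * ((1.0 - outcome_a) - (1.0 - e_a))

# Illustrative dimensions; the paper's actual axes may differ.
DIMENSIONS = ("accuracy", "helpfulness", "language")

def multi_elo_update(ratings_a, ratings_b, outcomes, k=32):
    """Update each dimension's rating independently from per-dimension judgments,
    rather than folding all aspects into a single score."""
    for dim in DIMENSIONS:
        ratings_a[dim], ratings_b[dim] = elo_update(
            ratings_a[dim], ratings_b[dim], outcomes[dim], k
        )
    return ratings_a, ratings_b
```

With both models starting at 1000 in every dimension, a comparison where A wins on accuracy, ties on helpfulness, and loses on language moves only the corresponding ratings (e.g. accuracy goes to 1016 vs. 984 with K = 32), so a factually weak but fluent answer can no longer trade style for substance in a single merged score.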
Citation
M. Wu and A. F. Aji, “Style Over Substance: Evaluation Biases for Large Language Models,” Proceedings - International Conference on Computational Linguistics, COLING, vol. Part, pp. 297–312, Jan. 2025.
Source
Proceedings - International Conference on Computational Linguistics, COLING
Keywords
Evaluation biases, Large Language Models (LLMs), Elo rating system, Multi-Elo Rating System (MERS), Factual accuracy
Publisher
Association for Computational Linguistics
