
Advancing LLM Evaluation Through Open-Style Questions: The Open-LLM-Leaderboard

Myrzakhan, Aidar
Department
Machine Learning
Embargo End Date
2025-05-30
Type
Thesis
Date
2025
Language
English
Abstract
The rapid advancement of Large Language Models (LLMs) has led to significant improvements in natural language processing capabilities. However, current evaluation frameworks predominantly rely on multiple-choice questions (MCQs), which suffer from inherent limitations, including selection bias and random guessing, that are particularly problematic for smaller models. This thesis introduces the Open-LLM-Leaderboard, a novel benchmark that transforms evaluation methodology by utilizing open-style questions instead of traditional multiple-choice formats. Unlike MCQs, where models select from predefined options, open-style questions require models to generate answers independently, providing a more authentic assessment of their knowledge and reasoning abilities. Our approach implements a systematic coarse-to-fine filtering process to convert suitable MCQs into open-style questions, creating a diverse benchmark spanning domains such as mathematics, medicine, science, and the humanities. Through comprehensive experiments across various LLM architectures, we demonstrate that open-style evaluation significantly reduces bias and differentiates more accurately between models of varying sizes. The Open-LLM-Leaderboard reveals that model rankings differ substantially from those produced by MCQ-based benchmarks, highlighting the influence of evaluation methodology on our understanding of true model capabilities. This thesis presents advances in LLM evaluation methodology, along with practical applications for enhancing model performance, providing a comprehensive framework for more accurate assessment and improvement of language models in an increasingly diverse AI landscape.
Citation
Aidar Myrzakhan, “Advancing LLM Evaluation Through Open-Style Questions: The Open-LLM-Leaderboard,” Master of Science thesis, Machine Learning, MBZUAI, 2025.
Keywords
Large Language Model (LLM), Evaluation, Multiple-Choice Questions, Open-Style Questions