EduAdapt: A Question Answer Benchmark Dataset for Evaluating Grade-Levels in LLMs
Author
Naeem, Numaan
Department
Natural Language Processing
Embargo End Date
30/05/2025
Type
Thesis
Date
2025
Language
English
Collections
Research Projects
Abstract
Large language models (LLMs) are rapidly changing the way we think about education. These powerful AI systems can answer questions, explain complex concepts, and even generate educational content across a range of subjects, from mathematics and physics to computer science. However, despite their impressive performance on standardized exams and academic benchmarks, many LLMs still struggle with a fundamental challenge: adapting their responses to suit the needs of students at different grade levels. This limitation becomes especially clear in K–12 education, where age-appropriate explanations and vocabulary are critical for effective learning. While older students and professionals can benefit greatly from LLM-generated content, younger learners often receive responses that are too complex, too vague, or simply not developmentally appropriate. As digital learning tools become more common in classrooms and at home, the lack of grade-level adaptability in AI systems risks leaving younger students behind. In this thesis, we address this challenge by introducing a large-scale synthetic benchmark dataset specifically designed to evaluate how well LLMs can tailor their outputs for different educational stages. To the best of our knowledge, no existing dataset provides comprehensive coverage of all grade levels in any scientific domain. Our dataset fills this gap by offering grade-level annotated question-answer pairs across 11 distinct scientific fields, ranging from Grade 1 through Grade 12. It comprises nearly 52k QA pairs spanning a wide range of complexity and subject matter, making it the first of its kind to support fine-grained evaluation of educational content generation at scale. We test a diverse set of LLMs, from compact 1.5-billion-parameter models to state-of-the-art 24-billion-parameter systems, to assess how effectively these models can generate grade-appropriate responses. Our findings reveal a consistent trend: while larger models generally outperform smaller ones, most still struggle to produce suitable content for early-grade learners (Grades 1–6). These results highlight the urgent need for improved fine-tuning methods, grade-aware prompting strategies, and training resources that emphasize readability, clarity, and cognitive appropriateness for different age groups. Ultimately, this work contributes both a novel dataset and an evaluation framework for aligning LLM capabilities with the developmental needs of students. By addressing the challenge of grade-level adaptation, we take a significant step toward realizing the full potential of generative AI in education, making it more inclusive, effective, and responsive to learners at all stages.
Citation
Numaan Naeem, “EduAdapt: A Question Answer Benchmark Dataset for Evaluating Grade-Levels in LLMs,” Master of Science thesis, Natural Language Processing, MBZUAI, 2025.
Keywords
Educational NLP, Large Language Model (LLM), Benchmarking
