
SciGrade: A Multi-Grade Scientific Text Simplification and Readability Dataset.

Khan, Fatimah Lyba
Department
Natural Language Processing
Embargo End Date
30/05/2025
Type
Thesis
Date
2025
Language
English
Abstract
This thesis investigates the use of Large Language Models (LLMs) for scientific text simplification, aiming to enhance accessibility across educational levels while maintaining scientific accuracy. It addresses the challenge of simplifying complex scientific texts for a wide audience, including students, professionals, and the general public. A key component is the development of SciGrade, a multi-grade dataset containing simplified versions of scientific texts spanning 10 science fields. The dataset serves as a benchmark for evaluating small and medium-scale LLMs on the simplification task. The thesis employs a multi-step evaluation framework combining automated metrics (e.g., SARI, BERTScore, FKGL, FRE), LLM-based assessments, and human annotations. Larger models such as Mistral-24B and Qwen-14B excel in similarity, accuracy, and readability, and self-reflection improves coherence and precision, offering scalable solutions for scientific communication. This work contributes to the advancement of AI-driven text simplification, with implications for education, cross-disciplinary research, and public science communication, facilitating wider access to scientific knowledge across diverse audiences.
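Two of the automated metrics named in the abstract, FKGL (Flesch-Kincaid Grade Level) and FRE (Flesch Reading Ease), are simple surface formulas over sentence and syllable counts. The sketch below computes them from their standard published formulas; the vowel-group syllable counter and function names are illustrative assumptions, not code from the thesis, so scores only approximate those of dedicated tools.

```python
import re

def count_syllables(word: str) -> int:
    # Rough heuristic: count vowel groups, discounting a trailing silent 'e'.
    groups = re.findall(r"[aeiouy]+", word.lower())
    n = len(groups)
    if word.lower().endswith("e") and n > 1:
        n -= 1
    return max(n, 1)

def readability(text: str) -> tuple[float, float]:
    # Returns (FRE, FKGL) using the standard Flesch formulas.
    sentences = max(len(re.findall(r"[.!?]+", text)), 1)
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    wps = len(words) / sentences   # average words per sentence
    spw = syllables / len(words)   # average syllables per word
    fre = 206.835 - 1.015 * wps - 84.6 * spw   # Flesch Reading Ease
    fkgl = 0.39 * wps + 11.8 * spw - 15.59     # Flesch-Kincaid Grade Level
    return fre, fkgl
```

A simplified text should score higher on FRE and lower on FKGL than its complex source, which is how these metrics signal readability gains in an evaluation framework like the one described above.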
Citation
Fatimah Lyba Khan, “SciGrade: A Multi-Grade Scientific Text Simplification and Readability Dataset,” Master of Science thesis, Natural Language Processing, MBZUAI, 2025.
Keywords
Scientific Text Simplification, Educational Grade Levels, Large Language Model (LLM), Automated Metrics, SciGrade Dataset, Self-reflection