Hallucinations in Speech Models Under Distribution Shifts: Comprehensive Detection and Analysis

Atwany, Hanin Zeyad
Department
Natural Language Processing
Embargo End Date
30/05/2025
Type
Thesis
Date
2025
Language
English
Abstract
Speech foundation models trained at massive scale, in terms of both model and data size, yield robust systems capable of performing multiple speech tasks, including automatic speech recognition (ASR). These models transcend language and domain barriers, yet measuring their performance effectively remains a challenge. Traditional metrics such as word error rate (WER) and character error rate (CER) are commonly used to evaluate ASR performance but often fail to reflect transcription quality in critical contexts, particularly when it comes to detecting fabricated outputs. This phenomenon, known as hallucination, is especially concerning in high-stakes domains such as healthcare, law, and aviation, where errors can have severe consequences. In this work, our objective is to study the impact of domain-specific knowledge, model structure (encoder vs. encoder-decoder), model size, and training paradigm (self-supervised vs. unsupervised) on Hallucination Error Rate (HER), and to highlight the importance of quantifying hallucination. We accomplish this through an in-depth analysis of 20 ASR models under synthetic noise (e.g., adversarial perturbations, white noise, pitch shift) and natural distribution shifts (e.g., medical and legal domains), quantifying the impact of each on HER. We also introduce an LLM-based error detection framework that classifies hallucination errors at two levels: coarse-grained and fine-grained. Our analysis reveals four critical findings: (1) Traditional metrics are misleading: a high WER can coincide with a low hallucination rate, and a low WER can conceal dangerous hallucinations. (2) Noise increases HER: adversarial perturbations have a greater impact on HER than common synthetic noise (e.g., white noise). (3) Distribution shift is strongly correlated with hallucination: HER is strongly correlated with domain shift (α=0.91), which points to the risks of deploying ASR models under mismatched conditions. (4) Scaling laws plateau: the benefits of scaling up model size diminish beyond a certain point (394M for Whisper). These findings emphasize the necessity of using HER alongside WER/CER for thorough ASR assessment, particularly in safety-critical areas. By uncovering the limitations of existing metrics and offering insights into hallucination hazards, our work supports more rigorous evaluation protocols for future speech systems.
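As a rough illustration of the first finding, the sketch below computes WER with a standard word-level edit distance and applies it to two hypothetical transcripts. The function name, the reference sentence, and both hypotheses are invented for illustration and are not taken from the thesis; the thesis's own HER computation and LLM-based classifier are not reproduced here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example sentences (assumptions, not data from the thesis).
reference = "take two tablets daily"
phonetic_error = "take to tablet daily"      # near-homophone slips
fabricated = "take ten tablets hourly"       # fluent but changes the instruction

wer_phonetic = word_error_rate(reference, phonetic_error)
wer_fabricated = word_error_rate(reference, fabricated)
print(wer_phonetic, wer_fabricated)
```

Both hypotheses differ from the reference by two substituted words out of four, so WER assigns them the same score even though only the second alters the meaning of the instruction. This is the kind of gap that a separate hallucination measure is meant to expose.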
Citation
Hanin Zeyad Atwany, “Hallucinations in Speech Models Under Distribution Shifts: Comprehensive Detection and Analysis,” Master of Science thesis, Natural Language Processing, MBZUAI, 2025.
Keywords
Hallucination, Speech, Foundation models, Encoder, Decoder, WER