
Rebooting Language Models for Speech

Djanibekov, Amirbek
Department
Natural Language Processing
Embargo End Date
20/05/2026
Type
Thesis
Date
2024
Language
English
Abstract
Integrating speech directly into the text domain has significantly improved on the traditional two-step pipeline of converting speech to text and then processing the text. Recent publications even integrate Large Language Models for contextual understanding of the speech modality. Most methods either reuse the output of an intermediate layer of a pre-trained model or place speech hidden representations directly into the text embedding space; an alternative is to query textual information from the context of speech representations, which offers efficiency gains such as reduced storage needs and parameter-efficient computation. In this study, we propose a new training protocol for speech that uses speech codes from a neural codec (EnCodec) model for Automatic Speech Recognition and Automatic Speech Translation, re-framing the sequence-classification objective as a generative one. Our experiments on the LibriSpeech dataset reveal that the proposed method is effective, though it still has difficulty matching the target text exactly. Evaluating the model's performance against established benchmarks, we find that the generated outputs correlate strongly with the semantic representation of the gold-standard labels.
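The core idea in the abstract, discretizing speech into codec codes and training a generative model to continue the code prefix with the transcript, can be sketched in miniature. This is a hedged illustration only: the toy codebook, frame values, separator token, and function names below are hypothetical and do not reproduce the thesis's actual EnCodec setup or training protocol.

```python
# Toy sketch: vector-quantize speech frames into discrete codebook indices,
# then concatenate codes and text ids into one generative token stream.
# Codebook, frames, and token ids are illustrative assumptions.

def quantize(frames, codebook):
    """Map each frame (a float vector) to the index of its nearest codebook entry."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(codebook)), key=lambda i: dist(f, codebook[i]))
            for f in frames]

def build_training_sequence(speech_codes, text_ids, sep_id):
    """Concatenate speech codes and target text ids into a single stream:
    a generative model is trained to continue the speech-code prefix
    with the transcript, re-framing classification as generation."""
    return speech_codes + [sep_id] + text_ids

# Hypothetical 4-entry codebook over 2-dimensional frames.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
frames = [(0.1, 0.1), (0.9, 0.2), (0.2, 0.8)]
codes = quantize(frames, codebook)
seq = build_training_sequence(codes, [10, 11], sep_id=99)
print(codes, seq)  # codes are indices into the codebook
```

In the actual method, the quantizer would be a trained neural codec and the continuation model a language model, but the interface is the same: speech becomes a sequence of discrete tokens that the generative objective can consume alongside text.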
Citation
A. Djanibekov, "Rebooting Language Models for Speech", M.S. Thesis, Natural Language Processing, MBZUAI, Abu Dhabi, UAE, 2024.