LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM
Shikhar, Sambal ; Kurpath, Mohammed Irfan ; Mullappilly, Sahal Shaji ; Lahoud, Jean ; Khan, Fahad Shahbaz ; Anwer, Rao Muhammad ; Khan, Salman ; Cholakkal, Hisham
Department
Computer Vision
Type
Conference proceeding
License
http://creativecommons.org/licenses/by/4.0/
Abstract
Recent advancements in speech-to-speech dialogue systems leverage LLMs for multimodal interactions, yet they remain hindered by fine-tuning requirements, high computational overhead, and text-speech misalignment. Existing speech-enabled LLMs often degrade conversational quality by modifying the LLM, thereby compromising its linguistic capabilities. In contrast, we propose LLMVoX, a lightweight 30M-parameter, LLM-agnostic, autoregressive streaming TTS system that generates high-quality speech with low latency, while fully preserving the capabilities of the base LLM. Our approach achieves a significantly lower Word Error Rate compared to speech-enabled LLMs, while operating at comparable latency. By decoupling speech synthesis from LLM processing via a multi-queue token streaming system, LLMVoX enables seamless, infinite-length dialogues. Its plug-and-play design also facilitates extension to various tasks with different backbones. Furthermore, LLMVoX generalizes to new languages with minimal dataset adaptation, attaining a low Character Error Rate on an Arabic speech task. Evaluations demonstrate that LLMVoX matches or surpasses existing speech-enabled LLMs in both speech quality and latency, while maintaining the original linguistic strengths of the LLM. Additionally, we have integrated LLMVoX with a Vision-Language Model to create an omni-model with speech, text, and vision capabilities, without requiring additional multimodal training.
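The abstract's key architectural idea is decoupling speech synthesis from LLM processing through a multi-queue token streaming system. A minimal sketch of that decoupling pattern is shown below, using Python threads and queues; this is an illustrative assumption about the general producer/consumer structure, not the authors' implementation, and `stream_dialogue`, the sentinel protocol, and the `audio(...)` placeholder are all hypothetical.

```python
# Hypothetical sketch (not the LLMVoX code): decoupling LLM token
# generation from TTS synthesis via queues, so speech synthesis never
# blocks the text stream -- the decoupling idea described in the abstract.
import queue
import threading

def stream_dialogue(tokens):
    """Pipe a stream of LLM tokens through a text queue into a TTS worker."""
    text_q = queue.Queue()   # LLM -> TTS
    audio_q = queue.Queue()  # TTS -> playback

    def llm_producer():
        for tok in tokens:          # stand-in for LLM token streaming
            text_q.put(tok)
        text_q.put(None)            # end-of-stream sentinel

    def tts_worker():
        while True:
            tok = text_q.get()
            if tok is None:
                audio_q.put(None)   # propagate end-of-stream
                break
            # stand-in for autoregressive speech-token generation
            audio_q.put(f"audio({tok})")

    threading.Thread(target=llm_producer).start()
    threading.Thread(target=tts_worker).start()

    # Drain the audio queue as chunks arrive (playback stand-in).
    chunks = []
    while True:
        chunk = audio_q.get()
        if chunk is None:
            break
        chunks.append(chunk)
    return chunks
```

Because neither side waits for the other to finish a full utterance, the same structure supports arbitrarily long dialogues: the text producer and speech consumer each run at their own pace, communicating only through the queues.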
Citation
S. Shikhar, M.I. Kurpath, S.S. Mullappilly, J. Lahoud, F.S. Khan, R.M. Anwer, S. Khan, H. Cholakkal, "LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM," in Findings of the Association for Computational Linguistics: ACL 2025, 2025, pp. 20481-20493.
Source
Findings of the Association for Computational Linguistics: ACL 2025
Conference
Findings of the Association for Computational Linguistics: ACL 2025
Publisher
Association for Computational Linguistics
