SpeechDialogueFactory: A Framework for Natural Speech Dialogue Generation

Wang, Minghan
Bai, Ye
Wang, Yuxia
Vu, Thuy-Trang
Shareghi, Ehsan
Haffari, Gholamreza
Department
Natural Language Processing
Type
Conference proceeding
Date
2025
Language
English
Abstract
High-quality speech conversational datasets are essential for developing and evaluating Speech-LLMs. However, collecting real-world recordings presents significant challenges, including high costs, privacy concerns, and inconsistent quality, while existing synthetic approaches often lack authenticity due to limited acoustic variety and insufficient paralinguistic information. We present SpeechDialogueFactory, a framework that addresses these limitations through a three-stage pipeline: generating comprehensive metadata, creating detailed scripts, and producing utterances enriched with paralinguistic features. Our framework retrieves speaker voices from a voice bank and leverages paralinguistic tags for expressive TTS. We also introduce an automated evaluation protocol that shows strong correlation with human assessments. Experimental results demonstrate that our synthesized dialogues achieve quality comparable to human recordings while offering greater flexibility and control.
Citation
M. Wang, Y. Bai, Y. Wang, T. T. Vu, E. Shareghi, and G. Haffari, “SpeechDialogueFactory: A Framework for Natural Speech Dialogue Generation,” Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, pp. 1758–1762, 2025, doi: 10.21437/INTERSPEECH.2025-2013
Source
Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Conference
26th Interspeech Conference 2025
Keywords
Data synthesis, dialogue generation, large language model, spoken dialogue
Publisher
International Speech Communication Association