
A Unified Model for Text-to-Speech and Speech-to-Text

Toyin, Hawau Olamide
Department
Machine Learning
Embargo End Date
2024-01-01
Type
Thesis
Date
2024
Language
English
Abstract
Conventionally, speech and text models are pre-trained separately with modality-specific objectives, resulting in two distinct large networks. SpeechT5, by contrast, introduces a unified-modal framework for self-supervised speech and text representation learning, optimized with a joint objective that aligns text and speech information in a shared semantic space. The framework consists of a shared transformer encoder-decoder accompanied by six auxiliary pre/post-nets that handle modality-specific data. To align textual and speech information in a unified semantic space, a cross-modal vector quantization approach, which randomly mixes speech/text hidden states with latent units, serves as the interface between the encoder and decoder. SpeechT5 has demonstrated superior performance across a range of spoken English tasks, including automatic speech recognition (ASR), speech synthesis, speech translation, voice conversion, speech enhancement, and speaker identification. However, these tasks are still tackled individually, each starting from the pre-trained self-supervised weights.

In this study, our objective is to consolidate the training process for Arabic ASR and text-to-speech (TTS), reducing computational demands while maintaining state-of-the-art performance. We accomplish this in two stages. First, we pre-train the SpeechT5 architecture on 1K hours of Arabic speech along with the corresponding transcriptions; the pre-trained model is then fine-tuned for the downstream speech tasks, specifically ASR and TTS. Through thorough evaluation, we highlight the importance of language-specific pre-training for downstream performance. In the second stage, we unify the training procedures for ASR and TTS.
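The cross-modal vector quantization interface can be sketched as follows: each hidden state is quantized to its nearest latent unit in a shared codebook, and a random subset of frames is replaced by its quantized unit before reaching the decoder. This is an illustrative numpy sketch; the function name, codebook size, and mixing probability are assumptions, not SpeechT5's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize_and_mix(states, codebook, mix_prob=0.3, rng=rng):
    """Nearest-neighbour quantization of hidden states against a shared
    codebook of latent units, then random mix-up: each frame is replaced
    by its quantized latent unit with probability `mix_prob`.
    (Hypothetical sketch; the real interface differs in detail.)"""
    # squared L2 distance from every state to every codebook entry
    dists = ((states[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    ids = dists.argmin(axis=1)        # nearest latent unit per frame
    quantized = codebook[ids]         # quantized states, same shape as input
    mask = rng.random(len(states)) < mix_prob
    mixed = np.where(mask[:, None], quantized, states)
    return mixed, ids

# toy example: 6 frames of 4-dim hidden states, codebook of 8 latent units
states = rng.normal(size=(6, 4))
codebook = rng.normal(size=(8, 4))
mixed, ids = quantize_and_mix(states, codebook)
```

Because the mixed sequence interleaves continuous states with discrete latent units, the decoder sees both modalities expressed in the same space, which is the mechanism the joint objective exploits.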
We achieve this by developing a unified automatic speech recognition and synthesis model, employing a transformer encoder, a task-specific decoder, and six auxiliary networks, all trained concurrently with a combined loss objective. Our evaluation demonstrates that the performance of our model is comparable to that of individually trained models. Additionally, we show that initializing from the pre-trained weights of the first stage further improves our model's performance.
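The combined loss objective can be illustrated with a minimal sketch: a shared encoder feeds an ASR head trained with cross-entropy over tokens and a TTS head trained with L1 mel regression, and the two losses are summed so both tasks update the shared parameters in one backward pass. All names, dimensions, and the equal loss weighting below are hypothetical stand-ins for the thesis's transformer encoder, task-specific decoder, and auxiliary networks.

```python
import numpy as np

rng = np.random.default_rng(1)
d, vocab, n_mels, T = 16, 32, 8, 10   # toy dimensions (assumed)

# shared encoder + task-specific heads, sketched as single linear maps
W_enc = rng.normal(size=(d, d)) * 0.1
W_asr = rng.normal(size=(d, vocab)) * 0.1   # ASR head -> token logits
W_tts = rng.normal(size=(d, n_mels)) * 0.1  # TTS head -> mel frames

def joint_step(x, tokens, mels):
    """One forward pass computing the combined ASR + TTS objective."""
    h = np.tanh(x @ W_enc)                          # shared encoder states
    logits = h @ W_asr
    logp = logits - np.log(np.exp(logits).sum(-1, keepdims=True))
    asr_loss = -logp[np.arange(T), tokens].mean()   # token cross-entropy
    tts_loss = np.abs(h @ W_tts - mels).mean()      # L1 mel regression
    return asr_loss + tts_loss                      # combined objective

x = rng.normal(size=(T, d))                  # dummy input features
tokens = rng.integers(0, vocab, size=T)      # dummy transcript tokens
mels = rng.normal(size=(T, n_mels))          # dummy target mel frames
loss = joint_step(x, tokens, mels)
```

Because both task losses flow through the shared encoder, a single optimizer step serves ASR and TTS simultaneously, which is the source of the computational savings the abstract describes.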
Citation
H. Toyin, "A Unified Model for Text-to-Speech and Speech-to-Text," M.S. thesis, Machine Learning, MBZUAI, Abu Dhabi, UAE, 2024.