A Comparative Study for Cross-Modality Evaluation of Text and Speech Instruction Tasks
Author
Sameed, Ashba
Department
Natural Language Processing
Embargo End Date
30/05/2025
Type
Thesis
Date
2025
Language
English
Abstract
The rapid advancements in multimodal learning have facilitated the development of models capable of processing and generating content across multiple modalities, such as text and speech. However, the evaluation of these models on cross-modal instruction-following tasks remains an open challenge. This thesis systematically assesses the effectiveness of two state-of-the-art multimodal models, Spirit-LM and Mini-Omni, in handling diverse text- and speech-based instruction-output tasks. The primary objective of this research is to investigate the capability of these models to generate coherent and contextually relevant responses while switching between text and speech modalities. Specifically, we examine their performance when provided with input in one modality and expected to generate output in another, as well as cases where input and output share the same modality. The key problem addressed in this study is the lack of standardized evaluation methodologies that can comprehensively measure the cross-modal generalization ability of such models. To this end, we conduct extensive experiments on a diverse set of datasets encompassing text- and speech-based tasks. We evaluate text generation performance using widely accepted metrics such as accuracy, BLEU, and ROUGE scores, while speech output quality is assessed using the Mean Opinion Score (MOS). Additionally, we analyze the robustness of these models by testing them on out-of-distribution data to assess their generalization across unseen instruction formats. Our findings reveal distinct strengths and limitations in each model's ability to handle cross-modal tasks. Spirit-LM demonstrates superior coherence and consistency in text-to-text and text-to-speech transformations, whereas Mini-Omni exhibits stronger performance in speech-based tasks but struggles to maintain contextual accuracy in cross-modal scenarios. These insights provide valuable guidance for future research on optimizing multimodal architectures for real-world applications.
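As a concrete illustration of the evaluation protocol described above, the sketch below scores hypothetical text outputs with corpus BLEU (via the sacrebleu package) and ROUGE-L (via the rouge_score package), and averages hypothetical listener ratings into a MOS. The example strings and ratings are placeholders for illustration, not data from the thesis.

import sacrebleu
from rouge_score import rouge_scorer

# Hypothetical model outputs and gold references for a text-to-text task;
# the strings are illustrative only.
hypotheses = ["the model answers in the requested modality",
              "speech instructions are transcribed first"]
references = ["the model responds in the requested modality",
              "speech instructions are transcribed first"]

# Corpus-level BLEU. sacrebleu expects a list of reference streams,
# so the single reference set is wrapped in an outer list.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU: {bleu.score:.2f}")

# ROUGE-1 and ROUGE-L F-measures, averaged over the evaluation pairs.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
avg_rouge_l = sum(
    scorer.score(ref, hyp)["rougeL"].fmeasure
    for ref, hyp in zip(references, hypotheses)
) / len(hypotheses)
print(f"ROUGE-L (mean F1): {avg_rouge_l:.3f}")

# Speech output quality is reported as a Mean Opinion Score (MOS):
# the arithmetic mean of listener ratings on a 1-5 scale.
listener_ratings = [4, 5, 3, 4, 4]  # hypothetical ratings for one utterance
mos = sum(listener_ratings) / len(listener_ratings)
print(f"MOS: {mos:.2f}")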
This study contributes to the broader field of multimodal learning by identifying key factors that influence model performance in cross-modal settings. The results offer practical implications for improving model training strategies, designing robust evaluation frameworks, and developing more effective multimodal AI systems for applications such as virtual assistants, automated transcription, and multimodal content generation.
Citation
Ashba Sameed, “A Comparative Study for Cross-Modality Evaluation of Text and Speech Instruction Tasks,” Master of Science thesis, Natural Language Processing, MBZUAI, 2025.
Keywords
Cross-modality, Multimodal models, Spirit-LM, Mini-Omni, Speech evaluation, Text evaluation
