CS-FLEURS: A Massively Multilingual and Code-Switched Speech Dataset
Yan, Brian ; Hamed, Injy ; Shimizu, Shuichiro ; Lodagala, Vasista Sai ; Chen, William ; Iakovenko, Olga ; Talafha, Bashar ; Hussein, Amir ; Polok, Alexander ; Chang, Kalvin ; and 8 more
Department
Natural Language Processing
Type
Conference proceeding
Date
2025
Language
English
Abstract
We present CS-FLEURS, a new dataset for developing and evaluating code-switched speech recognition and translation systems beyond high-resourced languages. CS-FLEURS consists of 4 test sets which cover in total 113 unique code-switched language pairs across 52 languages: 1) a 14 X-English language pair set with real voices reading synthetically generated code-switched sentences, 2) a 16 X-English language pair set with generative text-to-speech, 3) a 60 {Arabic, Mandarin, Hindi, Spanish}-X language pair set with generative text-to-speech, and 4) a 45 X-English lower-resourced language pair test set with concatenative text-to-speech. Besides the four test sets, CS-FLEURS also provides a training set with 128 hours of generative text-to-speech data across 16 X-English language pairs. Our hope is that CS-FLEURS helps to broaden the scope of future code-switched speech research.
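The per-set pair counts in the abstract (14, 16, 60, 45) add up to more than the 113 unique pairs reported, which suggests that some language pairs appear in more than one test set. The short Python sketch below only tallies the numbers stated in the abstract; the dictionary keys and the function name are illustrative labels chosen here, not official subset names or part of any released tooling.

```python
# Per-test-set language-pair counts as stated in the abstract.
# Keys are hypothetical labels for illustration, not official subset names.
TEST_SET_PAIRS = {
    "read_speech_x_english": 14,        # real voices, synthetic text
    "generative_tts_x_english": 16,     # generative text-to-speech
    "generative_tts_x_x": 60,           # {Arabic, Mandarin, Hindi, Spanish}-X
    "concatenative_tts_x_english": 45,  # lower-resourced, concatenative TTS
}
UNIQUE_PAIRS = 113  # total unique code-switched pairs reported


def pair_overlap(per_set, unique_total):
    """Return (sum of per-set counts, number of repeated pair entries)."""
    total = sum(per_set.values())
    return total, total - unique_total


total, repeated = pair_overlap(TEST_SET_PAIRS, UNIQUE_PAIRS)
print(total, repeated)  # 135 22
```

Under this reading, 22 of the 135 per-set entries repeat a pair already counted in another test set. The abstract does not say which pairs those are.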
Citation
B. Yan et al., “CS-FLEURS: A Massively Multilingual and Code-Switched Speech Dataset,” pp. 743–747, 2025, doi: 10.21437/INTERSPEECH.2025-2247.
Source
Proceedings of Interspeech 2025
Conference
Interspeech 2025
Keywords
Code-Switching, Code-Switched Speech Recognition, Multilingual Speech Recognition and Translation
Publisher
Delft University of Technology
