CS-FLEURS: A Massively Multilingual and Code-Switched Speech Dataset

Yan, Brian
Hamed, Injy
Shimizu, Shuichiro
Lodagala, Vasista Sai
Chen, William
Iakovenko, Olga
Talafha, Bashar
Hussein, Amir
Polok, Alexander
Chang, Kalvin
(and 8 more authors)
Abstract
We present CS-FLEURS, a new dataset for developing and evaluating code-switched speech recognition and translation systems beyond high-resourced languages. CS-FLEURS consists of 4 test sets that together cover 113 unique code-switched language pairs across 52 languages: 1) a 14 X-English language pair set with real voices reading synthetically generated code-switched sentences, 2) a 16 X-English language pair set with generative text-to-speech, 3) a 60 {Arabic, Mandarin, Hindi, Spanish}-X language pair set with generative text-to-speech, and 4) a 45 X-English lower-resourced language pair set with concatenative text-to-speech. Besides the four test sets, CS-FLEURS also provides a training set with 128 hours of generative text-to-speech data across 16 X-English language pairs. Our hope is that CS-FLEURS helps to broaden the scope of future code-switched speech research.
Citation
B. Yan et al., “CS-FLEURS: A Massively Multilingual and Code-Switched Speech Dataset,” in Proceedings of Interspeech 2025, 2025, pp. 743–747, doi: 10.21437/INTERSPEECH.2025-2247.
Source
Proceedings of Interspeech 2025
Conference
Interspeech 2025
Keywords
Code-Switching, Code-Switched Speech Recognition, Multilingual Speech Recognition and Translation
Publisher
Delft University of Technology