Crowdsource, Crawl, or Generate? Creating SEA-VL, a Multicultural Vision-Language Dataset for Southeast Asia
Author
Cahyawijaya, Samuel
Lovenia, Holy
Moniz, Joel Ruben Antony
Wong, Tack Hwa
Farhansyah, Mohammad Rifqi
Maung, Thant Thiri
Hudi, Frederikus
Anugraha, David
Habibi, Muhammad Ravi Shulthan
Qorib, Muhammad Reza
Agarwal, Amit
Imperial, Joseph Marvin
Patel, Hitesh Laxmichand
Feliren, Vicky
Nasution, Bahrul Ilmi
Rufino, Manuel Antonio
Winata, Genta Indra
Rajagede, Rian Adam
Catalan, Carlos Rafael
Imam, Mohamed Fazli Mohamed
Pattnayak, Priyaranjan
Pranida, Salsabila Zahirah
Pratama, Kevin
Bangera, Yeshil
Na-Thalang, Adisai
Monderin, Patricia Nicole
Song, Yueqi
Simon, Christian
Ng, Lynnette Hui Xian
Sapan, Richardy Lobo
Rafi, Taki Hasan
Wang, Bin
Supryadi
Veerakanjana, Kanyakorn
Ittichaiwong, Piyalitt
Roque, Matthew Theodore
Vincentio, Karissa
Kreangphet, Takdanai
Artkaew, Phakphum
Palgunadi, Kadek Hendrawan
Yu, Yanzhi
Hastuti, Rochana Prih
Nixon, William
Bangera, Mithil
Lim, Adrian Xuan Wei
Khine, Aye Hninn
Zhafran, Hanif Muhammad
Ferdinan, Teddy
Izzani, Audra Aurora
Singh, Ayushman
Evan, Evan
Krito, Jauza Akbar
Anugraha, Michael
Ilasariya, Fenal Ashokbhai
Li, Haochen
Daniswara, John Amadeo
Tjiaranata, Filbert Aurelian
Yulianrifat, Eryawan Presma
Udomcharoenchaikit, Can
Ansori, Fadil Risdian
Ihsani, Mahardika Krisna
Nguyen, Giang
Barik, Anab Maulana
Velasco, Dan John
Genadi, Rifo Ahmad
Saha, Saptarshi
Wei, Chengwei
Flores, Isaiah Edri W.
Ko Han, Kenneth Chen
Santos, Anjela Gail D.
Lim, Wan Shen
Phyo, Kaung Si
Santos, Tim
Dwiastuti, Meisyarah
Luo, Jiayun
Cruz, Jan Christian Blaise
Hee, Ming Shan
Hanif, Ikhlasul Akmal
Al Hakim, M. Alif
Sya’ban, Muhammad Rizky
Kerdthaisong, Kun
Miranda, Lester James Validad
Koto, Fajri
Fatyanosa, Tirana Noor
Aji, Alham Fikri
Rosal, Jostin Jerico
Kevin, Jun
Wijaya, Robert
Kampman, Onno P.
Zhang, Ruochen
Karlsson, Börje F.
Limkonchotiwat, Peerat
Department
Natural Language Processing
Type
Conference proceeding
Date
2025
Language
English
Abstract
Despite Southeast Asia's (SEA) extraordinary linguistic and cultural diversity, the region remains significantly underrepresented in vision-language (VL) research, resulting in AI models that inadequately capture SEA cultural nuances. To fill this gap, we present SEA-VL, an open-source initiative dedicated to developing culturally relevant high-quality datasets for SEA languages. By involving contributors from SEA countries, SEA-VL ensures better cultural relevance and diversity, fostering greater inclusivity of underrepresented languages and cultural depictions in VL research. Our methodology employed three approaches: community-driven crowdsourcing with SEA contributors, automated image crawling, and synthetic image generation. We evaluated each method's effectiveness in capturing cultural relevance. We found that image crawling achieves approximately 85% cultural relevance while being more cost- and time-efficient than crowdsourcing, whereas synthetic image generation failed to accurately reflect SEA cultural nuances and contexts. Collectively, we gathered 1.28 million culturally relevant SEA images, a collection more than 50 times larger than other existing datasets. This work bridges the representation gap in SEA, establishes a foundation for developing culturally aware AI systems for this region, and provides a replicable framework for addressing representation gaps in other underrepresented regions.
Citation
S. Cahyawijaya et al., “Crowdsource, Crawl, or Generate? Creating SEA-VL, a Multicultural Vision-Language Dataset for Southeast Asia,” in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, vol. 1, pp. 18685–18717, Aug. 2025, doi: 10.18653/v1/2025.acl-long.916.
Source
Proceedings of the Annual Meeting of the Association for Computational Linguistics
Conference
63rd Annual Meeting of the Association for Computational Linguistics, ACL 2025
Keywords
Multicultural Vision-Language Dataset, Southeast Asia, Crowdsourcing vs Crawling vs Generation, Cultural Relevance in VL, Image-Text Dataset Creation, Multilingual Multimodal Benchmark, Under-represented Languages in AI, Data Collection Trade-offs
Publisher
Association for Computational Linguistics
