Marco-Bench-MIF: On Multilingual Instruction-Following Capability of Large Language Models
Zeng, Bo; Lyu, Chenyang; Liu, Sinuo; Zeng, Mingyan; Wu, Minghao; Ni, Xuanfan; Shi, Tianqi; Liu, Yefeng; Zhu, Chenyu; Li, Ruizhe; et al.
Department
Machine Learning
Type
Conference proceeding
Date
2025
Language
English
Abstract
Instruction-following capability has become a key ability to evaluate in Large Language Models (LLMs) (Brown et al., 2020; OpenAI, 2023; Bai et al., 2023). However, existing datasets, such as IFEval (Zhou et al., 2023; Zeng et al., 2024), are either predominantly monolingual and centered on English or simply machine-translated into other languages, limiting their applicability in multilingual contexts. In this paper, we present a carefully curated, localized multilingual extension of IFEval named Marco-Bench-MIF, covering 30 languages with varying levels of localization. Our benchmark addresses linguistic constraints (e.g., modifying capitalization requirements for Chinese) and cultural references (e.g., substituting region-specific company names in prompts) via a hybrid pipeline combining translation with verification. Through a comprehensive evaluation of 20+ LLMs on Marco-Bench-MIF, we find that: (1) there is a 25-35% accuracy gap between high- and low-resource languages; (2) model scale affects performance substantially, by 45-60%, yet script-specific challenges persist; and (3) machine-translated data underestimates accuracy by 7-22% compared with localized data. Our analysis identifies key challenges in multilingual instruction following, including keyword consistency preservation and compositional constraint adherence across languages. Marco-Bench-MIF will be made publicly available to the community.
Citation
B. Zeng et al., “Marco-Bench-MIF: On Multilingual Instruction-Following Capability of Large Language Models,” vol. 1, pp. 24058–24072, Aug. 2025, doi: 10.18653/V1/2025.ACL-LONG.1172.
Source
Proceedings of the Annual Meeting of the Association for Computational Linguistics
Conference
63rd Annual Meeting of the Association for Computational Linguistics, ACL 2025
Keywords
Multilingual Instruction-Following, Instruction Tuning Benchmark, Low-Resource Languages, Localization vs Machine-Translation, Large Language Models, Cross-script Challenges, Accuracy Gap High/Low-Resource, Cultural & Linguistic Adaptation
Publisher
Association for Computational Linguistics
