OpenING: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation
Zhou, Pengfei ; Peng, Xiaopeng ; Song, Jiajun ; Li, Chuanhao ; Xu, Zhaopan ; Yang, Yue ; Guo, Ziyao ; Zhang, Hao ; Lin, Yuqi ; He, Yefei ; and 8 more authors
Department
Computer Vision
Type
Conference proceeding
Date
2025
Language
English
Abstract
Multimodal Large Language Models (MLLMs) have made significant strides in visual understanding and generation tasks. However, generating interleaved image-text content remains a challenge, which requires integrated multimodal understanding and generation abilities. While the progress in unified models offers new solutions, existing benchmarks are insufficient for evaluating these methods due to data size and diversity limitations. To bridge this gap, we introduce OpenING, a comprehensive benchmark comprising 5,400 high-quality human-annotated instances across 56 real-world tasks. OpenING covers diverse daily scenarios such as travel guide, design, and brainstorming, offering a robust platform for challenging interleaved generation methods. In addition, we present IntJudge, a judge model for evaluating open-ended multimodal generation methods. Trained with a novel data pipeline, our IntJudge achieves an agreement rate of 82.42% with human judgments, outperforming GPT-based evaluators by 11.34%. Extensive experiments on OpenING reveal that current interleaved generation methods still have substantial room for improvement. Key findings on interleaved image-text generation are further presented to guide the development of next-generation models.
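The agreement rate quoted in the abstract can be illustrated with a minimal sketch. It assumes each comparison is a pairwise verdict labeled "A", "B", or "tie", and that agreement means an exact label match between judge and human; the function name, labels, and sample votes below are hypothetical, and the paper's actual protocol may differ.

```python
from typing import Sequence


def agreement_rate(judge: Sequence[str], human: Sequence[str]) -> float:
    """Fraction of comparisons where the judge's verdict equals the human's.

    Assumption: verdicts are simple labels such as "A", "B", or "tie",
    aligned by position. This is an illustrative metric, not the
    benchmark's official scoring code.
    """
    if len(judge) != len(human):
        raise ValueError("judge and human verdict lists must have equal length")
    if not judge:
        raise ValueError("at least one comparison is required")
    matches = sum(j == h for j, h in zip(judge, human))
    return matches / len(judge)


# Hypothetical verdicts for five pairwise comparisons.
human_votes = ["A", "B", "tie", "A", "B"]
judge_votes = ["A", "B", "A", "A", "A"]

rate = agreement_rate(judge_votes, human_votes)
print(f"Agreement with human judgments: {rate:.2%}")
```

Under this definition, a judge that matches humans on 3 of 5 comparisons scores 60%; treating ties differently (for example, excluding them) would change the number.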
Citation
P. Zhou et al., “OpenING: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation,” 2025. Accessed: Jun. 24, 2025. [Online]. Available: https://opening-benchmark.github.io
Source
Proceedings of the Computer Vision and Pattern Recognition Conference
Conference
Computer Vision and Pattern Recognition Conference (CVPR), 2025
Publisher
Computer Vision Foundation
