MoMentS: A Comprehensive Multimodal Benchmark for Theory of Mind
Villa-Cueva, Emilio; Ahmed, SM Masrur; Chevi, Rendi; Cruz, Jan Christian Blaise; Elzeky, Kareem; Cristobal, Fermin; Aji, Alham Fikri; Wang, Skyler; Mihalcea, Rada; Solorio, Thamar
Department
Natural Language Processing
Type
Conference proceeding
License
http://creativecommons.org/licenses/by/4.0/
Abstract
Understanding Theory of Mind (ToM) is essential for building socially intelligent multimodal agents capable of perceiving and interpreting human behavior. We introduce MoMentS (Multimodal Mental States), a comprehensive benchmark designed to assess the ToM capabilities of multimodal large language models (MLLMs) through realistic, narrative-rich scenarios presented in short films. MoMentS includes over 2,300 multiple-choice questions spanning seven distinct ToM categories. The benchmark features long video context windows and realistic social interactions that provide deeper insight into characters’ mental states. We evaluate several MLLMs and find that although vision generally improves performance, models still struggle to integrate it effectively. For audio, providing dialogues as audio does not consistently outperform transcript-based inputs. Our findings highlight the need to improve multimodal integration and point to open challenges that must be addressed to advance AI’s social understanding.
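Since this record does not include the benchmark's data format or evaluation code, the following is only a minimal sketch of how per-category multiple-choice accuracy might be computed for MoMentS-style items. The field names (question, options, answer_idx, tom_category) and the predict_choice stub are hypothetical illustrations, not the authors' released schema; a real evaluation would also pass the video, audio, or transcript context to the model.

```python
from collections import defaultdict

# Hypothetical MoMentS-style items; field names are assumptions, not the released schema.
items = [
    {"question": "Why does the character hesitate before answering?",
     "options": ["She is bored", "She suspects he is lying",
                 "She forgot the question", "She is distracted by noise"],
     "answer_idx": 1,
     "tom_category": "belief"},
    {"question": "What does the character want the visitor to do?",
     "options": ["Leave immediately", "Stay for dinner", "Apologize", "Call later"],
     "answer_idx": 3,
     "tom_category": "desire"},
]

def predict_choice(item):
    """Stand-in for an MLLM call that returns the index of the chosen option.

    A real pipeline would feed the long video/audio or transcript context
    along with the question and options; here we simply pick option 0.
    """
    return 0

def evaluate(items):
    """Compute overall and per-ToM-category multiple-choice accuracy."""
    correct, total = defaultdict(int), defaultdict(int)
    for item in items:
        cat = item["tom_category"]
        total[cat] += 1
        if predict_choice(item) == item["answer_idx"]:
            correct[cat] += 1
    per_category = {cat: correct[cat] / total[cat] for cat in total}
    overall = sum(correct.values()) / sum(total.values())
    return overall, per_category

if __name__ == "__main__":
    overall, per_category = evaluate(items)
    print(f"overall accuracy: {overall:.2f}")
    for cat, acc in sorted(per_category.items()):
        print(f"{cat}: {acc:.2f}")
```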
Citation
E. Villa-Cueva, S.M.M. Ahmed, R. Chevi, J.C.B. Cruz, K. Elzeky, F. Cristobal, A.F. Aji, S. Wang, R. Mihalcea, T. Solorio, "MoMentS: A Comprehensive Multimodal Benchmark for Theory of Mind," in Findings of the Association for Computational Linguistics: EMNLP 2025, Association for Computational Linguistics, 2025, pp. 22591-22611.
Conference
Findings of the Association for Computational Linguistics: EMNLP 2025
Publisher
Association for Computational Linguistics
