
When Large Vision Language Models Meet Multimodal Sequential Recommendation: An Empirical Study

Zhou, Peilin
Liu, Chao
Ren, Jing
Zhou, Xinfeng
Xie, Yueqi
Cao, Meng
Rao, Zhongtao
Huang, Youliang
Chong, Dading
Liu, Junling
et al. (4 additional authors)
Abstract
As multimedia content continues to grow on the web, the integration of visual and textual data has become a crucial challenge for web applications, particularly in recommendation systems. Large Vision Language Models (LVLMs) have demonstrated considerable potential across a variety of tasks that require such multimodal integration. However, their application to multimodal sequential recommendation (MSR) has not been extensively studied. To bridge this gap, we introduce MSRBench, the first comprehensive benchmark designed to systematically evaluate LVLM integration strategies in web-based recommendation scenarios. We benchmark three state-of-the-art LVLMs, i.e., GPT-4 Vision, GPT-4o, and Claude-3-Opus, on the next-item prediction task using the constructed Amazon Review Plus dataset, which includes additional item descriptions generated by LVLMs. Our evaluation examines five integration strategies: using LVLMs as a recommender, an item enhancer, a reranker, and in various combinations of these roles. The benchmark results reveal that 1) using LVLMs as rerankers is the most effective strategy, significantly outperforming strategies that rely on LVLMs to directly generate recommendations or only enhance items; 2) GPT-4o consistently achieves the best performance across most scenarios, particularly when employed as a reranker; 3) the computational inefficiency of LVLMs presents a major barrier to their widespread adoption in real-time multimodal recommendation systems. © 2025 Copyright held by the owner/author(s).
Citation
P. Zhou et al., “When Large Vision Language Models Meet Multimodal Sequential Recommendation: An Empirical Study,” WWW 2025 - Proceedings of the ACM Web Conference, vol. 25, pp. 275–292, Apr. 2025, doi: 10.1145/3696410.3714764.
Source
WWW 2025 - Proceedings of the ACM Web Conference
Conference
34th ACM Web Conference, WWW 2025
Keywords
Benchmark, Large Vision Language Model, Multimodal Recommendation
Publisher
Association for Computing Machinery