Surfer: A World Model-Based Framework for Vision-Language Robot Manipulation

Ren, Pengzhen
Zhang, Kaidong
Zheng, Hetao
Li, Zixuan
Wen, Yuhang
Zhu, Fengda
Ma, Shikui
Liang, Xiaodan
Department
Computer Vision
Type
Journal article
Date
2025
Language
English
Abstract
Making a model accurately understand and follow natural language instructions while performing actions consistent with world knowledge is a key challenge in robot manipulation. This mainly involves reasoning over fuzzy human instructions and adhering to physical knowledge, so an embodied intelligent agent must be able to model world knowledge from its training data. However, most existing vision-and-language robot manipulation methods operate in unrealistic simulators and language settings and lack explicit modeling of world knowledge. To bridge this gap, we introduce a novel and simple robot manipulation framework called Surfer. Built on a world model, it treats robot manipulation as a state transition of the visual scene and decouples it into two parts: action and scene. The model's generalization to new instructions and new scenes is then enhanced by explicitly modeling action and scene prediction over multimodal information. In addition, we built a robot manipulation simulation platform that supports physical execution based on the MuJoCo physics engine. It can automatically generate demonstration training data and test data, effectively reducing labor costs. To comprehensively and systematically evaluate the visual-language understanding and physical execution of manipulation models, we also created a robot manipulation benchmark with different difficulty levels, called SeaWave. It contains four vision-language manipulation tasks of increasing difficulty and provides a standardized testing platform for embodied AI agents in multimodal environments. Overall, we hope Surfer can surf freely on the robot's SeaWave benchmark. Extensive experiments show that Surfer consistently and significantly outperforms all baselines on all manipulation tasks.
On average, Surfer achieved a success rate of 54.74% on the four defined levels of manipulation tasks, exceeding the best baseline's 51.07%. The simulator, code, and benchmarks are released at https://pzhren.github.io/Surfer.
Citation
P. Ren et al., "Surfer: A World Model-Based Framework for Vision-Language Robot Manipulation," in IEEE Transactions on Neural Networks and Learning Systems, doi: 10.1109/TNNLS.2025.3594117
Source
IEEE Transactions on Neural Networks and Learning Systems, 2025
Keywords
Large Language Model, Manipulation Benchmark, Robot Manipulation, World Model
Publisher
IEEE