PIVOT-R: Primitive-Driven Waypoint-Aware World Model for Robotic Manipulation

Zhang, Kaidong
Ren, Pengzhen
Lin, Bingqian
Lin, Junfan
Ma, Shikui
Xu, Hang
Liang, Xiaodan
Department
Computer Vision
Type
Conference proceeding
Date
2024
Language
English
Abstract
Language-guided robotic manipulation is a challenging task that requires an embodied agent to follow abstract user instructions to accomplish various complex manipulation tasks. Previous work generally maps instructions and visual perceptions directly to low-level executable actions, neglecting the modeling of critical waypoints (e.g., key states of “close to/grab/move up” in action trajectories) in manipulation tasks. To address this issue, we propose a PrImitive-driVen waypOinT-aware world model for Robotic manipulation (PIVOT-R) that focuses solely on the prediction of task-relevant waypoints. Specifically, PIVOT-R consists of a Waypoint-aware World Model (WAWM) and a lightweight action prediction module. The former performs primitive action parsing and primitive-driven waypoint prediction, while the latter focuses on decoding low-level actions. We also design an asynchronous hierarchical executor (AHE) for PIVOT-R, which runs different modules of the model at different execution frequencies, thereby reducing computational redundancy and improving execution efficiency. PIVOT-R outperforms state-of-the-art (SoTA) open-source models on the SeaWave benchmark, achieving an average relative improvement of 19.45% across four levels of instruction tasks. Moreover, compared to the synchronously executed PIVOT-R, the execution efficiency of PIVOT-R with AHE increases 28-fold, with only a 2.9% drop in performance. These results provide compelling evidence that PIVOT-R can significantly improve both the performance and efficiency of robotic manipulation.
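The idea of running modules at different execution frequencies can be illustrated with a minimal sketch. This is not the authors' implementation: the class `RatedModule`, the function `run_episode`, the placeholder callables, and the periods (10, 3, 1) are all hypothetical and chosen only for illustration. For simplicity, the sketch is a single-threaded multi-rate loop rather than a truly asynchronous executor.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class RatedModule:
    """Hypothetical wrapper: re-runs `fn` only every `period` control steps."""
    name: str
    period: int
    fn: Callable[[Any], Any]
    cached: Any = None
    calls: int = 0

    def maybe_run(self, step: int, inp: Any) -> Any:
        # Refresh the cached output on this module's own schedule;
        # on all other steps, reuse the most recent output.
        if step % self.period == 0:
            self.cached = self.fn(inp)
            self.calls += 1
        return self.cached


def run_episode(num_steps: int = 30):
    # Placeholder callables standing in for the three stages named in the
    # abstract. The periods are assumptions, not values from the paper.
    parser = RatedModule("primitive_parsing", 10, lambda obs: f"primitive@{obs}")
    waypoint = RatedModule("waypoint_prediction", 3, lambda x: f"waypoint<{x[1]}>")
    action = RatedModule("action_decoding", 1, lambda x: f"action<{x[0]}|{x[1]}>")

    actions = []
    for step in range(num_steps):
        obs = step  # stand-in for a camera observation
        primitive = parser.maybe_run(step, obs)
        wp = waypoint.maybe_run(step, (obs, primitive))
        actions.append(action.maybe_run(step, (obs, wp)))

    calls = {m.name: m.calls for m in (parser, waypoint, action)}
    return actions, calls
```

Calling `run_episode(30)` returns one action per step while the two slower stages are invoked far less often, which is the source of the reduced redundancy in this toy setting.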
Citation
K. Zhang et al., “PIVOT-R: Primitive-Driven Waypoint-Aware World Model for Robotic Manipulation,” Adv Neural Inf Process Syst, vol. 37, pp. 54105–54136, Dec. 2024, Accessed: Mar. 24, 2025. [Online]. Available: https://abliao.github.io/PIVOT-R
Source
Advances in Neural Information Processing Systems (NeurIPS 2024)
Keywords
Primitive-driven waypoint-aware world model, language-guided robotic manipulation, asynchronous hierarchical executor, task-relevant waypoint prediction, computational efficiency
Publisher
NeurIPS