Language-Conditioned Waypoint Predictor for Continuous Vision-and-Language Navigation

Wang, Zeyu
Qi, Yuankai
An, Dong
Yang, Xu
Li, Hongxin
Zhang, Zhaoxiang
Department
Computer Vision
Type
Conference proceeding
Date
2025
Language
English
Abstract
Waypoint prediction is a popular technique for Vision-and-Language Navigation in Continuous Environments (VLN-CE), which abstracts navigable locations as waypoints to ease the subsequent action prediction. Nevertheless, we find that current waypoint predictors are not always accurate, which limits overall navigation performance. One possible reason is the lack of language context, which leads to a failure to generate waypoints for critical locations mentioned in the instructions. To this end, we propose a novel framework for training a language-conditioned waypoint predictor. First, because VLN-CE agents ground instructions in the environment when navigating, we employ a pre-trained agent to encode language for the waypoint predictor. Second, the language-conditioned waypoint predictor is trained with data collected using the same agent. Third, we train a new VLN-CE navigation agent with the proposed waypoint predictor. Fourth, the disparity between the language encoder agent and the navigation agent motivates a cycle training scheme that alternately trains the agent and the waypoint predictor, further enhancing the performance of both. Experimental results show that our waypoint predictor's performance surpasses all existing ones. With better waypoints, the gap between waypoint-based methods and their upper bound narrows by about 60%.
Citation
Z. Wang, Y. Qi, D. An, X. Yang, H. Li and Z. Zhang, "Language-Conditioned Waypoint Predictor for Continuous Vision-and-Language Navigation," 2025 IEEE International Conference on Multimedia and Expo (ICME), Nantes, France, 2025, pp. 1-6, doi: 10.1109/ICME59968.2025.11209085.
Source
Proceedings - IEEE International Conference on Multimedia and Expo
Conference
2025 IEEE International Conference on Multimedia and Expo, ICME 2025
Keywords
embodied AI, multi-modality, vision-and-language navigation
Publisher
IEEE