
RoomTour3D: Geometry-Aware Video-Instruction Tuning for Embodied Navigation

Han, Mingfei
Ma, Liang
Zhumakhanova, Kamila
Radionova, Ekaterina
Zhang, Jingyi
Chang, Xiaojun
Liang, Xiaodan
Laptev, Ivan
Author
Han, Mingfei; Ma, Liang; Zhumakhanova, Kamila; Radionova, Ekaterina; Zhang, Jingyi; Chang, Xiaojun; Liang, Xiaodan; Laptev, Ivan
Department
Computer Vision
Type
Conference proceeding
Date
2025
Language
English
Abstract
Vision-and-Language Navigation (VLN) suffers from the limited diversity and scale of training data, primarily constrained by the manual curation of existing simulators. To address this, we introduce RoomTour3D, a video-instruction dataset derived from web-based room tour videos that capture real-world indoor spaces and human walking demonstrations. Unlike existing VLN datasets, RoomTour3D leverages the scale and diversity of online videos to generate open-ended human walking trajectories and open-world navigable instructions. To compensate for the lack of navigation data in online videos, we perform 3D reconstruction and obtain 3D trajectories of walking paths augmented with additional information on the room types, object locations and 3D shape of surrounding scenes. Our dataset includes ~100K open-ended description-enriched trajectories with ~200K instructions, and 17K action-enriched trajectories from 1847 room tour environments. We demonstrate experimentally that RoomTour3D enables significant improvements across multiple VLN tasks including CVDN, SOON, R2R, and REVERIE. Moreover, RoomTour3D facilitates the development of trainable zero-shot VLN agents, showcasing the potential and challenges of advancing towards open-world navigation.
Citation
M. Han et al., "RoomTour3D: Geometry-Aware Video-Instruction Tuning for Embodied Navigation," 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 2025, pp. 27586-27596, doi: 10.1109/CVPR52734.2025.02569.
Source
Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
Conference
2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2025
Keywords
Embodied Navigation, Large Vision-language Models, Room Tour Videos, Vision-and-language Navigation
Publisher
IEEE