Scene Understanding for Vision-Language Navigation and Embodied Question Answering
Zhumakhanova, Kamila
Author
Department
Computer Vision
Embargo End Date
2026-05-30
Type
Thesis
Date
2025
Language
English
Abstract
Vision-Language Navigation (VLN) and Embodied Question Answering (EQA) are two fundamental challenges in the development of Embodied AI, both requiring seamless alignment between visual and natural language understanding. VLN research is often constrained by the limited diversity and scale of training data, as most datasets rely on manually curated simulators. To address this, we introduce RoomTour3D, a large-scale video-instruction dataset derived from web-based room tour videos that capture real-world indoor environments and human navigation behavior. Unlike existing VLN datasets, RoomTour3D leverages the scale and diversity of online videos to generate open-ended human walking trajectories and open-world navigable instructions. Since web-based videos lack explicit navigation data, we reconstruct 3D scenes and extract structured trajectories enriched with room type annotations, object locations, and spatial scene geometry. Our dataset consists of approximately 100K description-enriched trajectories with 200K navigation instructions, along with 17K action-enriched trajectories spanning 1,847 room tour environments. Experimental results demonstrate that RoomTour3D yields significant performance gains across multiple VLN benchmarks, including CVDN, SOON, R2R, and REVERIE. Meanwhile, Embodied Question Answering, formulated as a zero-shot video question answering problem, requires processing large amounts of video data, which can prevent models from capturing the fine-grained details needed to answer a question. To tackle this, we propose a novel Relevance and Reverse Logic (RRL) filtering pipeline that selects only the most relevant frames while incorporating reverse question logic to enhance robustness against hallucinations in VLMs. Our approach achieves performance comparable to the baseline on the OpenEQA dataset while reducing visual data processing by up to 80%, making it significantly more efficient.
By addressing these challenges in VLN and EQA, our work contributes to the advancement of Embodied AI, offering scalable real-world training data and more efficient visual reasoning strategies.
Citation
Kamila Zhumakhanova, “Scene Understanding for Vision-Language Navigation and Embodied Question Answering,” Master of Science thesis, Computer Vision, MBZUAI, 2025.
Keywords
Vision-Language Navigation, Embodied Question Answering, Video Question Answering
