Learning Fine-Grained Alignment for Aerial Vision-Dialog Navigation
Su, Yifei ; An, Dong ; Chen, Kehan ; Yu, Weichen ; Ning, Baiyang ; Ling, Yonggen ; Huang, Yan ; Wang, Liang
Department
Computer Vision
Type
Conference proceeding
Date
2025
Language
English
Abstract
Aerial Vision-Dialog Navigation (AVDN) is a new task that requires drones to navigate to a target location based on human-robot dialog history. This paper focuses on the critical fine-grained cross-modal alignment problem in AVDN, which requires the drone to align language entities with visual landmarks in top-down views. To achieve this, we first construct a Fine-Grained AVDN (FG-AVDN) dataset via a semi-automatic annotation pipeline, providing diverse multimodal annotations at the entity-landmark level. Based on this dataset, we propose a novel Fine-grained Entity-Landmark Alignment (FELA) method that learns the cross-modal alignment explicitly. Concretely, FELA first boosts the drone’s visual understanding with a precise semantic grid representation, which captures the environmental semantics and spatial structure simultaneously. Subsequently, to learn the entity-landmark alignment, we devise cross-modal auxiliary tasks from three perspectives: grounding, captioning, and contrastive learning. Extensive experiments demonstrate that our explicit entity-landmark alignment learning is beneficial for AVDN. As a result, FELA achieves leading performance, with improvements of 3.2% in SR and 4.9% in GP over prior work.
Citation
Y. Su et al., “Learning Fine-Grained Alignment for Aerial Vision-Dialog Navigation,” Proceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 7, pp. 7060–7068, Apr. 2025, doi: 10.1609/AAAI.V39I7.32758.
Source
Proceedings of the AAAI Conference on Artificial Intelligence
Conference
39th Annual AAAI Conference on Artificial Intelligence, AAAI 2025
Keywords
Contrastive learning, Robot learning, Alignment problems, Cross-modal, Fine-grained, Human-robot dialogue, Language entities, Location-based, Target location, Top-down, Visual landmarks, Robots
Publisher
Association for the Advancement of Artificial Intelligence
