
Learning Fine-Grained Alignment for Aerial Vision-Dialog Navigation

Su, Yifei
An, Dong
Chen, Kehan
Yu, Weichen
Ning, Baiyang
Ling, Yonggen
Huang, Yan
Wang, Liang
Abstract
Aerial Vision-Dialog Navigation (AVDN) is a new task that requires drones to navigate to a target location based on human-robot dialog history. This paper focuses on the critical fine-grained cross-modal alignment problem in AVDN, requiring the drone to align language entities with visual landmarks in top-down views. To achieve this, we first construct a Fine-Grained AVDN (FG-AVDN) dataset via a semi-automatic annotation pipeline, providing diverse multimodal annotations at the entity-landmark level. Based on this, a novel Fine-grained Entity-Landmark Alignment (FELA) method is proposed to learn the cross-modal alignment explicitly. Concretely, FELA first boosts the drone’s visual understanding with a precise semantic grid representation, which captures the environmental semantics and spatial structure simultaneously. Subsequently, to learn the entity-landmark alignment, we devise cross-modal auxiliary tasks from three perspectives: grounding, captioning, and contrastive learning. Extensive experiments demonstrate that our explicit entity-landmark alignment learning is beneficial for AVDN. As a result, FELA achieves leading performance with 3.2% SR and 4.9% GP improvements over prior art.
Citation
Y. Su et al., “Learning Fine-Grained Alignment for Aerial Vision-Dialog Navigation,” Proceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 7, pp. 7060–7068, Apr. 2025, doi: 10.1609/AAAI.V39I7.32758.
Source
Proceedings of the AAAI Conference on Artificial Intelligence
Conference
39th Annual AAAI Conference on Artificial Intelligence, AAAI 2025
Keywords
Contrastive Learning, Robot learning, Alignment Problems, Cross-modal, Fine grained, Human-robot dialogue, Language entities, Learn+, Location based, Target location, Topdown, Visual landmarks, Robots
Publisher
Association for the Advancement of Artificial Intelligence