
Language-Driven Autonomous Driving: From Open-Vocabulary Perception to Explainable Actions

Ishaq, Ayesha
Department
Computer Vision
Embargo End Date
2025-05-30
Type
Thesis
Date
2025
Language
English
Abstract
Language provides a natural and intuitive interface for humans to interact with autonomous systems, enabling commands, queries, and interpretable feedback. In the context of autonomous driving, this communication is vital for building systems that are adaptable, transparent, and aligned with human intent. This thesis explores how language can be integrated across different levels of the autonomous driving stack, with a focus on both interaction and reasoning. The first part of this work enables open-ended queries through an open-vocabulary 3D multi-object tracking framework. By leveraging 2D open-vocabulary detections and introducing techniques such as class-agnostic tracking and confidence score prediction, our approach extends beyond fixed object categories to detect and track novel objects. This allows users to issue language-based queries involving previously unseen object classes, increasing the flexibility and usability of the system in complex environments. Building on this foundation, the second part of the thesis investigates how language can support interpretability and explainability in autonomous driving. We introduce a method for integrating tracking information into large multimodal models, capturing the rich spatio-temporal dynamics of real-world scenes. Through a dedicated trajectory encoder, a multimodal fusion strategy, and pretraining tailored for driving tasks, our model improves perception, prediction, and planning while enabling more informative, language-based explanations of the system’s decisions. Finally, we present DriveLMM-o1, a new reasoning dataset, test benchmark, and model designed to evaluate step-by-step logical inference in autonomous driving contexts. To support transparent decision-making, we introduce an evaluation metric that considers both the coherence of intermediate reasoning steps and the accuracy of the final answer. Our model trained on DriveLMM-o1 demonstrates significant improvements in reasoning capabilities and interpretability, outperforming existing open-source approaches. Collectively, this thesis highlights the role of language not just as a tool for human-to-vehicle communication, but also as a powerful mechanism for making autonomous systems more transparent, flexible, and aligned with human expectations.
Citation
Ayesha Ishaq, “Language-Driven Autonomous Driving: From Open-Vocabulary Perception to Explainable Actions,” Master of Science thesis, Computer Vision, MBZUAI, 2025.
Keywords
Autonomous Driving, Open Vocabulary, Vision Language Models, Multimodal Models, 3D Tracking, Visual Reasoning