MAviS: An Audio-Visual Conversational Assistant for Avian Species
Kryklyvets, Yevheniia
Author
Supervisor
Department
Computer Vision
Embargo End Date
2026-05-30
Type
Thesis
Date
2025
License
Language
English
Abstract
Accurate and precise species identification is crucial for biodiversity conservation, ecological monitoring, and species preservation efforts. However, existing multimodal large language models (MM-LLMs) lack the domain-specific knowledge necessary for high-precision tasks, resulting in suboptimal performance in species classification and habitat assessment. These limitations stem from the absence of curated multimodal datasets, suboptimal cross-modal representations, and challenges in aligning image, audio, and text modalities. Addressing this gap requires the development of high-quality multimodal datasets suited for fine-grained identification of closely related classes. This thesis introduces the MAviS (Multimodal Avian Species) Framework, a comprehensive framework designed to improve species identification through a combination of a large-scale dataset, benchmarking tools, and a fine-tuned MM-LLM. The MAviS Dataset integrates images, audio, and textual descriptions of more than 900 bird species sourced from Tree of Life, iNaturalist, and BirdCLEF, providing a rich multimodal resource for training and evaluation. Additionally, we present MAviS-Bench, a standardized benchmark that evaluates the effectiveness of MM-LLMs in recognizing bird species. To ensure reliable and biologically meaningful evaluation, we introduce MAviS-Eval, a reference-based metric inspired by recent work on LLM-as-a-judge and adapted to the fine-grained requirements of ecological classification, providing robust, interpretable assessments of model output quality. It complements MAviS-Bench by serving as the evaluation backbone for scoring species-identification performance. Building on this dataset, we introduce MAviS-CPM (MAviS-finetuned MiniCPM-o-2.6), a domain-specific MM-LLM optimized for species classification and detailed species-description generation.
We evaluated MAviS-CPM against leading proprietary models (GPT-4o, Gemini 1.5) and open-source MM-LLMs (MiniCPM-o-2.6 and Phi-4), demonstrating that multimodal pretraining followed by domain-specific fine-tuning significantly improves recognition accuracy and cross-modal alignment. Our evaluation uses quantitative (statistical and model-based) methods to validate the framework's effectiveness. The findings of this research emphasize the need for domain-adaptive MM-LLMs in ecological applications, addressing critical challenges in species identification and supporting AI-driven conservation initiatives. By integrating AI with sustainability efforts, MAviS contributes to achieving Sustainable Development Goal (SDG) 15, Life on Land, and indirectly supports broader global sustainability goals. It establishes a strong foundation for future advances in multimodal learning for biodiversity research, with potential applications in automated wildlife monitoring, habitat assessment, community science, and conservation policy development.
Citation
Yevheniia Kryklyvets, "MAviS: An Audio-Visual Conversational Assistant for Avian Species," Master of Science thesis, Computer Vision, MBZUAI, 2025.
Source
Conference
Keywords
Multimodal Learning, Benchmark Datasets, Audio-Visual Grounding, Wildlife-Aware Language Models
