Domain-Specialized Vision-Language Pre-training via Cross-Model Alignment for Fine-Grained Zero-Shot Recognition

Nawaz, Umair
Department
Computer Vision
Embargo End Date
2025-05-30
Type
Thesis
Date
2025
License
Language
English
Abstract
Large-scale vision-language pre-training leverages substantial amounts of image-text data and has recently shown promising performance and generalizability on diverse downstream tasks. While these vision-language foundation models achieve impressive zero-shot performance on downstream tasks with standard natural imagery, they struggle in specialized domains such as agriculture, which typically require visual recognition at a fine-grained level. This performance degradation is largely due to domain shift: the intrinsic characteristics of domain-specialized data (e.g., subtle visual cues, specific environmental conditions, and expert-driven annotations in agricultural data) are not adequately captured by general-purpose training corpora. Moreover, tasks in the specialized agricultural domain, such as disease diagnosis, nutrient-deficiency detection, and breed classification, demand encoding fine-grained features that standard foundation models typically struggle to capture. In this thesis, we introduce AgriCLIP, a vision-language foundation model dedicated to agriculture and livestock. To train the proposed foundation model, we curate a specialized dataset, named ALive, comprising approximately 600,000 image-text pairs. The dataset is carefully curated to cover diverse domains of agriculture, including crops, livestock, and fisheries. Rather than relying on exhaustive manual annotations, we employ a customized prompt-generation strategy that integrates metadata and class-specific information to generate descriptive text for each image. This method mitigates the scarcity of expert annotations and enriches the dataset with nuanced, domain-specific context. Building upon this dataset, we propose a multi-stage training pipeline that combines two complementary learning paradigms. In the first stage, contrastive learning aligns image and text representations at a global level, thereby establishing a robust foundation for semantic understanding.
In the second stage, we integrate a self-supervised learning technique focused on extracting fine-grained features, enabling the model to capture the subtle visual cues that are essential for specialized tasks such as livestock breed classification and nutrient-deficiency detection. This dual approach ensures that AgriCLIP is not only adept at global semantic alignment but also attentive to the minute details that matter in agricultural applications. The proposed approach is validated on 20 diverse downstream tasks comprising around 300,000 previously unseen images, achieving an absolute gain of 9.07% in average zero-shot classification accuracy over the baseline CLIP model.
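The first-stage objective described in the abstract follows the standard CLIP recipe: matched image-text pairs in a batch are pulled together while all other pairings serve as negatives. The thesis text does not include code; the following is a minimal NumPy sketch of such a symmetric contrastive (InfoNCE) loss, assuming unit-normalized embeddings and a fixed temperature. Function names and the temperature value here are illustrative, not taken from AgriCLIP itself.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Project embeddings onto the unit sphere so dot products are cosine similarities.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    Row i of image_emb and row i of text_emb form a positive pair; every
    other pairing in the batch acts as an in-batch negative.
    """
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    logits = img @ txt.T / temperature       # (B, B) similarity matrix
    labels = np.arange(logits.shape[0])      # pair i should match pair i

    def cross_entropy(lg):
        # Log-softmax over each row, then pick the diagonal (matched pair).
        lg = lg - lg.max(axis=1, keepdims=True)  # numerical stability
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

# Toy batch: 4 pairs of 8-dimensional embeddings.
rng = np.random.default_rng(0)
img = rng.standard_normal((4, 8))
loss_mismatched = clip_contrastive_loss(img, rng.standard_normal((4, 8)))
loss_matched = clip_contrastive_loss(img, img)  # perfectly aligned pairs
assert loss_matched < loss_mismatched
```

The second-stage self-supervised component for fine-grained features (e.g., a DINO-style objective) would be trained on top of or alongside this global alignment; the abstract does not specify its exact form, so it is not sketched here.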
Citation
Umair Nawaz, “Domain-Specialized Vision-Language Pre-training via Cross-Model Alignment for Fine-Grained Zero-Shot Recognition,” Master of Science thesis, Computer Vision, MBZUAI, 2025.
Source
Conference
Keywords
Contrastive Learning, Self-Supervised Learning, Cross-Model Alignment, Domain-Specific Tuning, Vision-Language Models
Subjects
Publisher
DOI
Full-text link