VLDUS: Vision-language distillated unseen synthesizer for zero-shot object detection
Yan, Caixia ; Jiao, Muyan ; Xue, Nuohan ; Zhang, Weizhan ; Wang, Jiahao ; Chang, Xiaojun ; Tian, Feng
Department
Computer Vision
Type
Journal article
Language
English
Abstract
Generative methods have shown promising performance on zero-shot object detection (ZSD) by synthesizing visual features of unseen classes from semantic embeddings. Although they largely compensate for the lack of training samples, they learn the feature synthesizer for unseen classes solely from the limited training data of seen classes, which leads to poor diversity and generalization of the synthesized unseen samples. To overcome this challenge, we develop a Vision-Language Distillated Unseen Synthesizer (VLDUS), which establishes a novel knowledge distillation-based feature generation paradigm for ZSD. To regulate the synthesized feature space, VLDUS uses two complementary generative distillation strategies that distill rich image-text knowledge from a pre-trained CLIP model into the synthesizer. To mitigate over-fitting to seen classes, VLDUS performs feature-aligned generative distillation on the discriminator's embedding space so that it learns from the CLIP embedding space, which endows the synthesizer with strong generalization ability. To guarantee the intra-class diversity of synthesized unseen features, relation-aligned generative distillation further distills the diversified image-text correlations from the pre-trained CLIP model into the synthesizer. Extensive experiments on MS COCO 2014, PASCAL VOC 2007/2012, and DIOR demonstrate that VLDUS generates unseen features with both high intra-class diversity and inter-class separability, and thus outperforms state-of-the-art methods by a large margin on both ZSD and generalized ZSD (GZSD) tasks. Our code is publicly available at https://github.com/Xxxnh/VLDUS.
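The abstract names two distillation terms but does not give their formulas. As a rough illustration only, the sketch below assumes that feature-aligned distillation can be read as a cosine-distance penalty between a student embedding and its paired CLIP embedding, and that relation-aligned distillation can be read as a KL divergence between feature-to-text similarity distributions. The function names, the temperature parameter, and both loss forms are hypothetical and are not taken from the paper or its repository.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    # Cosine similarity of two non-zero vectors.
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def softmax(scores, temperature=1.0):
    # Numerically stable softmax; temperature is an assumed hyperparameter.
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def feature_aligned_loss(student_feats, teacher_feats):
    # Assumed form: mean (1 - cosine) over paired student/teacher embeddings.
    pairs = list(zip(student_feats, teacher_feats))
    return sum(1.0 - cosine(s, t) for s, t in pairs) / len(pairs)

def relation_distribution(feat, text_embeds, temperature=1.0):
    # Distribution over text embeddings induced by cosine similarity.
    return softmax([cosine(feat, t) for t in text_embeds], temperature)

def relation_aligned_loss(student_feats, teacher_feats, text_embeds,
                          temperature=1.0):
    # Assumed form: mean KL(teacher || student) between the
    # feature-to-text similarity distributions of each pair.
    total = 0.0
    for s, t in zip(student_feats, teacher_feats):
        p = relation_distribution(t, text_embeds, temperature)
        q = relation_distribution(s, text_embeds, temperature)
        total += sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return total / len(student_feats)
```

Under these assumptions, both terms are zero when the student embeddings coincide with the teacher embeddings and grow as the two drift apart, either in direction (first term) or in how they relate to the class text embeddings (second term).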
Citation
C. Yan, M. Jiao, N. Xue, W. Zhang, J. Wang, X. Chang, et al., "VLDUS: Vision-language distillated unseen synthesizer for zero-shot object detection," Neural Networks, vol. 201, pp. 108899-108899, 2026, https://doi.org/10.1016/j.neunet.2026.108899.
Source
Neural Networks
Keywords
46 Information and Computing Sciences, 4611 Machine Learning
Publisher
Elsevier
