
Mettle: Meta-Token Learning for Memory-Efficient Audio-Visual Adaptation

Zhou, Jinxing
Li, Zhihui
Yu, Yongqiang
Zhou, Yanghao
Guo, Ruohao
Li, Guangyao
Mao, Yuxin
Han, Mingfei
Chang, Xiaojun
Wang, Meng
Abstract
Mainstream research in audio-visual learning has focused on designing task-specific expert models, primarily implemented through sophisticated multimodal fusion approaches. Recently, a few efforts have aimed to develop more task-independent, universal audio-visual embedding networks that encode advanced representations for use in various downstream tasks. This is typically achieved by fine-tuning large pretrained transformers, such as Swin-V2-L and HTS-AT, in a parameter-efficient manner, e.g., by tuning only a few adapter layers inserted into the pretrained backbone. Although such methods are parameter-efficient, they incur significant training memory consumption because gradients must be backpropagated through the deep transformer backbone, which limits accessibility for researchers with constrained computational resources. In this paper, we present Meta-Token Learning (Mettle), a simple and memory-efficient method for adapting large-scale pretrained transformer models to downstream audio-visual tasks. Instead of sequentially modifying the output feature distribution of the transformer backbone, Mettle uses a lightweight Layer-Centric Distillation (LCD) module to distill, in parallel, the intact audio or visual features embedded by each transformer layer into compact meta-tokens. This distillation process balances pretrained-knowledge preservation with task-specific adaptation. The obtained meta-tokens can be directly applied to classification tasks such as audio-visual event localization and audio-visual video parsing. To further support fine-grained segmentation tasks such as audio-visual segmentation, we introduce a Meta-Token Injection (MTI) module, which uses the audio and visual meta-tokens distilled from the top transformer layer to guide feature adaptation in earlier layers. Extensive experiments on multiple audio-visual benchmarks demonstrate that our method significantly reduces memory usage and training time while maintaining parameter efficiency and competitive accuracy.
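The memory saving claimed in the abstract comes from distilling each frozen layer's features into a few meta-tokens without backpropagating through the backbone. The sketch below illustrates that core idea only; it is not the paper's implementation, and all names, shapes, and the single-head cross-attention pooling are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def distill_meta_tokens(layer_feats, meta_tokens):
    """Cross-attention pooling: compact learnable meta-tokens (queries)
    attend to the frozen features of one transformer layer (keys/values).
    The backbone features are treated as constants, so training would only
    update the meta-tokens and this lightweight module, never the deep
    backbone -- a hypothetical stand-in for Layer-Centric Distillation."""
    d = meta_tokens.shape[-1]
    attn = softmax(meta_tokens @ layer_feats.T / np.sqrt(d))  # (M, N)
    return attn @ layer_feats                                 # (M, d)

# Toy shapes (assumed): 196 patch tokens of dim 64 from one frozen layer,
# distilled into 4 meta-tokens; each layer is processed independently.
rng = np.random.default_rng(0)
feats = rng.standard_normal((196, 64))   # frozen layer output
meta = rng.standard_normal((4, 64)) * 0.02  # learnable meta-tokens
out = distill_meta_tokens(feats, meta)
print(out.shape)  # (4, 64)
```

Because each layer's features are consumed in parallel rather than modified in sequence, per-layer distillers like this one can be trained with only the shallow module on the gradient path, which is consistent with the memory argument above.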
Citation
J. Zhou et al., "Mettle: Meta-Token Learning for Memory-Efficient Audio-Visual Adaptation," in IEEE Transactions on Pattern Analysis and Machine Intelligence, doi: 10.1109/TPAMI.2025.3642821
Source
IEEE Transactions on Pattern Analysis and Machine Intelligence
Keywords
Audio-Visual Event Localization, Audio-Visual Segmentation, Audio-Visual Video Parsing, Memory Efficient Learning, Parameter Efficient Learning
Publisher
IEEE