γ-MoD: Exploring Mixture-of-Depth Adaptation for Multimodal Large Language Models
Luo Yaxin ; Luo Gen ; Ji Jiayi ; Zhou Yiyi ; Sun Xiaoshuai ; Shen Zhiqiang ; Ji Rongrong
Department
Machine Learning
Type
Conference proceeding
Date
2025
Language
English
Abstract
Despite the significant progress of multimodal large language models (MLLMs), their high computational cost remains a barrier to real-world deployment. Inspired by the mixture of depths (MoDs) in natural language processing, we aim to address this limitation from the perspective of “activated tokens”. Our key insight is that if most tokens are redundant for a layer's computation, then they can be skipped directly via the MoD layer. However, directly converting the dense layers of MLLMs to MoD layers leads to substantial performance degradation. To address this issue, we propose an innovative MoD adaptation strategy for existing MLLMs called γ-MoD. In γ-MoD, a novel metric is proposed to guide the deployment of MoDs in the MLLM, namely the rank of attention maps (ARank). Through ARank, we can effectively identify which layers are redundant and should be replaced with MoD layers. Based on ARank, we further propose two novel designs to maximize the computational sparsity of the MLLM while maintaining its performance, namely the shared vision-language router and masked routing learning. With these designs, more than 90% of the dense layers of the MLLM can be effectively converted to MoD ones. To validate our method, we apply it to three popular MLLMs and conduct extensive experiments on 9 benchmark datasets. Experimental results not only validate the significant efficiency benefits of γ-MoD for existing MLLMs but also confirm its generalization ability across various MLLMs. For example, with a minor performance drop, i.e., -0.9%, γ-MoD can reduce the training and inference time of LLaVA-HR by 31.0% and 53.2%, respectively. © 2025 13th International Conference on Learning Representations, ICLR 2025. All rights reserved.
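A minimal sketch, in PyTorch, of how the ARank idea described in the abstract could be estimated: ARank is taken here as the mean numerical rank of a layer's attention maps across heads, so that low-rank (redundant) layers can be flagged as candidates for MoD conversion. The toy inputs, the head-averaging scheme, and the function name are illustrative assumptions, not the authors' released implementation.

# Illustrative sketch of an ARank-style metric (assumption, not the paper's code).
import torch

def arank(attn_maps: torch.Tensor) -> float:
    """attn_maps: (num_heads, seq_len, seq_len) attention matrices from one layer.
    Returns the mean numerical rank across heads; lower values suggest that the
    layer's token interactions are more redundant."""
    ranks = torch.linalg.matrix_rank(attn_maps)  # batched rank, one value per head
    return ranks.float().mean().item()

if __name__ == "__main__":
    torch.manual_seed(0)
    heads, seq_len = 8, 64
    # Redundant layer: every query attends with the same distribution -> rank ~1.
    shared = torch.softmax(torch.randn(heads, 1, seq_len), dim=-1)
    redundant = shared.expand(heads, seq_len, seq_len).contiguous()
    # Informative layer: a distinct attention pattern per query -> near-full rank.
    informative = torch.softmax(torch.randn(heads, seq_len, seq_len), dim=-1)
    print("ARank (redundant layer):  ", arank(redundant))
    print("ARank (informative layer):", arank(informative))

Under this sketch, layers whose ARank falls below some chosen threshold would be the ones replaced with MoD layers; the actual threshold and routing designs (shared vision-language router, masked routing learning) are described in the paper itself.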
Citation
Y. Luo et al., “γ-MoD: Exploring Mixture-of-Depth Adaptation for Multimodal Large Language Models,” International Conference on Learning Representations, vol. 2025, pp. 72170–72183, May 2025
Source
13th International Conference on Learning Representations, ICLR 2025
Conference
13th International Conference on Learning Representations, ICLR 2025
Publisher
International Conference on Learning Representations, ICLR
