Tackling Real-world Complexity: Hierarchical Modeling and Dynamic Prompting for Multimodal Long Document Classification
Liu, Tengfei ; Hu, Yongli ; Li, Mingjie ; Yi, Junfei ; Chang, Xiaojun ; Gao, Junbin ; Yin, Baocai
Department
Computer Vision
Type
Journal article
Date
2025
Language
English
Abstract
With the rapid growth of internet content, multimodal long document data has become increasingly prominent, drawing significant attention from researchers. However, most existing methods focus on scenarios where all modalities are present, often overlooking the more challenging and realistic cases in which the image modality is missing. To address this limitation, we propose a robust multimodal long document classification (MLDC) framework that integrates hierarchical modeling and dynamic prompting to handle complex multimodal long document data. Our approach begins by combining hierarchical modeling with an Adaptive Correlation Multimodal Transformer (ACMT) to capture relationships between text and images at both the section and sentence levels. We also introduce a Dynamic Prompt Generation (DPG) module at both levels to enhance the model’s robustness to missing image data. By evaluating sample uncertainty, the DPG module dynamically adjusts both the number of prompts and the prompts themselves, allowing the model to adapt to the varying needs of different samples. Finally, a Hierarchical Heterogeneous Graph (HHG) is introduced to enhance feature interactions across levels, further improving the coherence and accuracy of the model. Extensive experiments on four multimodal long document datasets demonstrate that our model outperforms existing state-of-the-art MLDC methods under various conditions.
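The abstract states that the DPG module adjusts the number of prompts according to sample uncertainty, but it does not say how uncertainty is measured or mapped to a prompt count. The sketch below is only an illustration of that general idea, not the authors' method: it assumes uncertainty is the normalized predictive entropy of a class-probability vector and that the prompt count scales linearly between two bounds. The function names and the `min_prompts` and `max_prompts` parameters are hypothetical.

```python
import math


def predictive_entropy(probs):
    """Shannon entropy (in nats) of a class-probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def num_prompts(probs, min_prompts=1, max_prompts=8):
    """Map sample uncertainty to a prompt count.

    Assumption (not from the paper): uncertainty is entropy divided by
    its maximum, log(K), and the count grows linearly with it.
    """
    k = len(probs)
    if k < 2:
        # A single class carries no uncertainty.
        return min_prompts
    u = predictive_entropy(probs) / math.log(k)
    u = min(max(u, 0.0), 1.0)  # guard against floating-point drift
    return min_prompts + round(u * (max_prompts - min_prompts))


if __name__ == "__main__":
    confident = [0.97, 0.01, 0.01, 0.01]
    uncertain = [0.40, 0.30, 0.20, 0.10]
    print(num_prompts(confident), num_prompts(uncertain))
```

Under these assumptions, a sample the model is confident about receives few prompts and an ambiguous sample receives more; the paper's DPG module may use a different uncertainty estimate and also adapts the prompt contents, which this sketch does not attempt.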
Citation
T. Liu et al., "Tackling Real-world Complexity: Hierarchical Modeling and Dynamic Prompting for Multimodal Long Document Classification," in IEEE Transactions on Circuits and Systems for Video Technology, doi: 10.1109/TCSVT.2025.3537759
Source
IEEE Transactions on Circuits and Systems for Video Technology
Keywords
Adaptation models, Transformers, Data models, Correlation, Circuits and systems, Electronic mail, Robustness, Uncertainty, Training, Complexity theory
Publisher
IEEE
