CILF-CIAE: CLIP-driven image-language fusion for correcting inverse age estimation
Shou, Yuntao ; Meng, Tao ; Ai, Wei ; Yin, Nan ; Li, Keqin
Department
Machine Learning
Type
Journal article
Date
2026
Language
English
Abstract
The age estimation task aims to predict the age of an individual by analyzing facial features in an image. Advances in age estimation can improve the efficiency and accuracy of applications such as age verification and secure access control. In recent years, contrastive language-image pre-training (CLIP) has been widely used in multimodal tasks and has made notable progress in computer vision. However, the application of CLIP and of error feedback mechanisms to age estimation has not been investigated, and existing Transformer-based methods incur quadratic complexity, and hence high memory usage, when modeling images globally. To tackle these issues, we propose a novel CLIP-driven Image-Language Fusion for Correcting Inverse Age Estimation (CILF-CIAE). Specifically, we first introduce the CLIP model to extract image features and text semantic information, and map them into a highly semantically aligned high-dimensional feature space. Next, we design a new Transformer architecture (i.e., FourierFormer) to achieve channel evolution and spatial interaction of images and to fuse image and text semantic information. Compared with the quadratic complexity of the attention mechanism, the proposed FourierFormer has linear-logarithmic, i.e., O(n log n), complexity. To further narrow the semantic gap between image and text features, we utilize an efficient contrastive multimodal learning module that supervises the multimodal fusion process of FourierFormer through a contrastive image-text matching loss, thereby improving the interaction between modalities. Finally, we introduce reversible age estimation, which uses end-to-end error feedback to reduce the error rate of age predictions. Extensive experiments on six benchmark datasets demonstrate that CILF-CIAE consistently outperforms advanced methods such as LRA-GNN and MCGRL. For example, our method achieves an MAE of 1.68 on MORPH-S2, significantly lower than 2.21 (LRA-GNN) and 1.77 (MCGRL), highlighting its superior accuracy and robustness in real-world age estimation scenarios.
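For illustration, the sketch below shows, under our own assumptions, the two core components the abstract describes: an FNet-style Fourier mixing block, whose FFT-based token mixing costs O(n log n) instead of attention's O(n^2) pairwise comparisons, and a CLIP-style symmetric contrastive loss for image-text matching. The names (FourierMixer, FourierFormerBlock, clip_contrastive_loss) and all hyperparameters are hypothetical placeholders, not the authors' released code.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FourierMixer(nn.Module):
    # Hypothetical FNet-style mixer: a 2D FFT over the token and channel
    # axes mixes spatial and channel information in O(n log n) time,
    # replacing the O(n^2) pairwise attention map.
    def forward(self, x):
        # x: (batch, tokens, channels); keep the real part so the block
        # stays real-valued for the residual connection.
        return torch.fft.fft2(x, dim=(-2, -1)).real

class FourierFormerBlock(nn.Module):
    def __init__(self, dim, mlp_ratio=4):
        super().__init__()
        self.mixer = FourierMixer()
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim),
            nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim),
        )

    def forward(self, x):
        x = x + self.mixer(self.norm1(x))  # spatial interaction (token mixing)
        x = x + self.mlp(self.norm2(x))    # channel evolution (per-token MLP)
        return x

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    # Symmetric InfoNCE over a batch of paired image/text embeddings:
    # matched pairs sit on the diagonal of the similarity matrix.
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature
    labels = torch.arange(img.size(0), device=img.device)
    return 0.5 * (F.cross_entropy(logits, labels)
                  + F.cross_entropy(logits.t(), labels))

# Usage sketch: fuse 196 patch tokens of width 512, then align pooled
# image embeddings with text embeddings from a (hypothetical) CLIP encoder.
block = FourierFormerBlock(dim=512)
tokens = torch.randn(8, 196, 512)  # batch of 8 images
fused = block(tokens)
loss = clip_contrastive_loss(fused.mean(dim=1), torch.randn(8, 512))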
Citation
Y. Shou, T. Meng, W. Ai, N. Yin, and K. Li, “CILF-CIAE: CLIP-driven image-language fusion for correcting inverse age estimation,” Neural Networks, vol. 197, p. 108518, May 2026, doi: 10.1016/j.neunet.2025.108518.
Source
Neural Networks
Keywords
Age estimation, CLIP, Transformer, Image-language fusion, Fourier transform, Error correction
Publisher
Elsevier
