
Mining the Salient Spatio-Temporal Feature with S2TF-Net for action recognition

Liu, Xiaoxi
Liu, Ju
Gu, Lingchen
Li, Yafeng
Chang, Xiaojun
Nie, Feiping
Department
Computer Vision
Type
Journal article
Date
2025
Language
English
Abstract
Recently, 3D Convolutional Neural Networks (3D ConvNets) have been widely used for action recognition and have achieved satisfactory performance. However, the most informative action features are often drowned out by large amounts of irrelevant information, which greatly increases the difficulty of video representation. To find a generic, cost-efficient approach that balances parameter count and performance, we present S2TF-Net, a novel network built on a 3D ConvNet backbone that mines the Salient Spatio-Temporal Feature for action recognition. First, we extract the salient features of each 3D residual block by constructing a multi-scale module for Salient Semantic Feature mining (SSF-Module). Then, to preserve the salient features during pooling operations, we establish a Two-branch Salient Feature Preserving Module (TSFP-Module). With a proper loss function, these two modules can be attached to most 3D ResNet backbones in an “easy-to-concat” fashion, allowing more accurate classification even with a shallower network. Finally, we conduct experiments on three popular action recognition datasets, where S2TF-Net is competitive with deeper 3D backbones and with current state-of-the-art results. With P3D, 3D ResNet, Non-local I3D and X3D as baselines, the proposed method improves each of them to varying degrees. In particular, for Non-local I3D ResNet, S2TF-Net improves accuracy by 4.1%, 3.0% and 4.6% on the Kinetics-400, UCF101 and HMDB51 datasets, reaching 74.8%, 95.1% and 80.9%, respectively. We hope this study will provide useful inspiration and experience for future research on more cost-effective methods. Code is released here: https://github.com/xiaoxiAries/S2TFNet.
Citation
X. Liu, J. Liu, L. Gu, Y. Li, X. Chang, and F. Nie, “Mining the Salient Spatio-Temporal Feature with S2TF-Net for action recognition,” Signal Process. Image Commun., vol. 138, p. 117381, Oct. 2025, doi: 10.1016/J.IMAGE.2025.117381.
Source
Signal Processing: Image Communication
Keywords
Video classification, Action recognition, 3D residual block, Salient features, Pooling
Publisher
Elsevier