ASTAnet: Transformer-based Siamese Network for Robust Audio-to-Audio Alignment in Amateur User Generated Audio Clips

Singh, Malya
Choudhary, Priyankar
El Saddik, Abdulmotaleb
Saini, Mukesh Kumar
Department
Computer Vision
Type
Conference proceeding
Date
2025
Language
English
Abstract
Audio alignment involves synchronizing two or more audio recordings. Existing methods depend on handcrafted features and struggle with precision in lengthy or noisy recordings. Deep learning techniques have proven effective across various domains; however, their application to audio-to-audio alignment is still in its infancy. We propose ASTAnet, a framework that integrates a Vision Transformer for feature extraction with a Siamese network for similarity estimation. With timestamp positional encoding, ASTAnet improves temporal precision and reduces alignment errors using a contrastive learning objective based on Euclidean distance. Our experiments achieve an overall mean absolute error of 0.005, a 1.8× improvement over previous work. Extensive evaluations demonstrate its effectiveness, particularly on longer and more varied audio recordings.
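The abstract mentions a contrastive learning objective based on Euclidean distance between the Siamese branches' embeddings. As a minimal sketch, assuming the standard margin-based contrastive loss (the function name, margin value, and toy inputs below are illustrative, not taken from the paper):

```python
import numpy as np

def contrastive_loss(emb_a, emb_b, label, margin=1.0):
    """Margin-based contrastive loss over paired embeddings (illustrative sketch).

    label = 1 for aligned (matching) audio pairs, 0 for non-matching pairs.
    Matching pairs are pulled together; non-matching pairs are pushed apart
    until their Euclidean distance exceeds the margin.
    """
    d = np.linalg.norm(emb_a - emb_b, axis=-1)  # Euclidean distance per pair
    loss = label * d**2 + (1 - label) * np.maximum(margin - d, 0.0) ** 2
    return loss.mean()

# Toy example: one matching pair (close embeddings) and one
# non-matching pair already farther apart than the margin.
a = np.array([[0.0, 0.0], [0.0, 0.0]])
b = np.array([[0.1, 0.0], [2.0, 0.0]])
y = np.array([1, 0])
print(contrastive_loss(a, b, y))
```

In a training loop, `emb_a` and `emb_b` would be the outputs of the two weight-sharing branches for a pair of audio segments; how ASTAnet constructs pairs and sets the margin is not specified in this record.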
Citation
M. Singh, P. Choudhary, A. El Saddik and M. Saini, "ASTAnet: Transformer-based Siamese Network for Robust Audio-to-Audio Alignment in Amateur User Generated Audio Clips," 2025 IEEE International Conference on Multimedia and Expo (ICME), Nantes, France, 2025, pp. 1-6, doi: 10.1109/ICME59968.2025.11209138.
Source
Proceedings - IEEE International Conference on Multimedia and Expo
Conference
2025 IEEE International Conference on Multimedia and Expo, ICME 2025
Keywords
Audio alignment, Siamese network, Vision transformer
Publisher
IEEE