
Provably Learning a Multi-head Attention Layer

Chen, Sitan
Li, Yuanzhi
Department
Machine Learning
Type
Conference proceeding
Date
2025
Language
English
Abstract
The multi-head attention layer is one of the key components of the transformer architecture that sets it apart from traditional feed-forward models. Given a sequence length $k$, attention matrices $\Theta_1, \dots, \Theta_m \in \mathbb{R}^{d \times d}$, and projection matrices $W_1, \dots, W_m \in \mathbb{R}^{d \times d}$, the corresponding multi-head attention layer $F: \mathbb{R}^{k \times d} \to \mathbb{R}^{k \times d}$ transforms length-$k$ sequences of $d$-dimensional tokens $X \in \mathbb{R}^{k \times d}$ via $F(X) \triangleq \sum_{i=1}^{m} \mathrm{softmax}(X \Theta_i X^\top) X W_i$. In this work, we initiate the study of provably learning a multi-head attention layer from random examples and give the first nontrivial upper and lower bounds for this problem. Provided $\{W_i, \Theta_i\}$ satisfy certain non-degeneracy conditions, we give a $(dk)^{O(m^3)}$-time algorithm that learns $F$ to small error given random labeled examples drawn uniformly from $\{\pm 1\}^{k \times d}$. We also prove computational lower bounds showing that in the worst case, exponential dependence on the number of heads $m$ is unavoidable. We chose to focus on Boolean $X$ to mimic the discrete nature of tokens in large language models, though our techniques naturally extend to standard continuous settings, e.g. Gaussian. Our algorithm, which is centered around using examples to sculpt a convex body containing the unknown parameters, is a significant departure from existing provable algorithms for learning feed-forward networks, which predominantly exploit fine-grained algebraic and rotation invariance properties of the Gaussian distribution. In contrast, our analysis is more flexible as it primarily relies on various upper and lower tail bounds for the input distribution and "slices" thereof. © 2025 Copyright is held by the owner/author(s). Publication rights licensed to ACM.
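
To make the definition concrete, the following is a minimal NumPy sketch of $F(X) \triangleq \sum_{i=1}^{m} \mathrm{softmax}(X \Theta_i X^\top) X W_i$ as defined in the abstract, assuming the softmax is applied row-wise to the $k \times k$ score matrix; the function names and the random $1/\sqrt{d}$ parameter scaling are illustrative choices, not taken from the paper.

import numpy as np

def row_softmax(A):
    # Numerically stable softmax applied to each row of A.
    A = A - A.max(axis=-1, keepdims=True)
    E = np.exp(A)
    return E / E.sum(axis=-1, keepdims=True)

def multi_head_attention(X, Thetas, Ws):
    # F(X) = sum_i softmax(X Theta_i X^T) X W_i, with X in R^{k x d} and
    # Theta_i, W_i in R^{d x d}; the softmax is taken row-wise over the
    # k x k attention scores (assumed convention, matching the abstract).
    out = np.zeros_like(X, dtype=float)
    for Theta, W in zip(Thetas, Ws):
        scores = X @ Theta @ X.T              # k x k attention scores
        out += row_softmax(scores) @ X @ W    # attend, then project
    return out

# Illustrative instance: Boolean inputs as in the paper's learning setup.
rng = np.random.default_rng(0)
m, k, d = 2, 5, 8
Thetas = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(m)]
Ws = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(m)]
X = rng.choice([-1.0, 1.0], size=(k, d))  # X uniform on {±1}^{k x d}
Y = multi_head_attention(X, Thetas, Ws)   # labeled example (X, Y = F(X))

In the learning problem studied in the paper, the algorithm receives i.i.d. labeled pairs (X, F(X)) of this form and must output a hypothesis close to F.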
Citation
S. Chen and Y. Li, “Provably Learning a Multi-head Attention Layer,” in Proceedings of the 57th Annual ACM Symposium on Theory of Computing (STOC 2025), pp. 1744–1754, Jun. 2025, doi: 10.1145/3717823.3718174.
Source
Proceedings of the Annual ACM Symposium on Theory of Computing
Conference
57th Annual ACM Symposium on Theory of Computing, STOC 2025
Keywords
attention, PAC learning, supervised learning, transformers
Publisher
Association for Computing Machinery