What can a Single Attention Layer Learn? A Study Through the Random Features Lens

07/21/2023
by Hengyu Fu, et al.

Attention layers – which map a sequence of inputs to a sequence of outputs – are core building blocks of the Transformer architecture, which has achieved significant breakthroughs in modern artificial intelligence. This paper presents a rigorous theoretical study of the learning and generalization of a single multi-head attention layer, with a sequence of key vectors and a separate query vector as input. We consider the random-feature setting where the attention layer has a large number of heads, with randomly sampled frozen query and key matrices and trainable value matrices. We show that such a random-feature attention layer can express a broad class of target functions that are permutation invariant to the key vectors. We further provide quantitative excess risk bounds for learning these target functions from finite samples, using random-feature attention with finitely many heads. Our results feature several implications unique to the attention structure compared with existing random-features theory for neural networks, such as (1) advantages in sample complexity over standard two-layer random-feature networks; (2) concrete and natural classes of functions that can be learned efficiently by a random-feature attention layer; and (3) the effect of the sampling distribution of the query-key weight matrix (the product of the query and key matrices), where Gaussian random weights with a non-zero mean yield better sample complexities than the zero-mean counterpart for learning certain natural target functions. Experiments on simulated data corroborate our theoretical findings and further illustrate the interplay between the sample size and the complexity of the target function.
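For concreteness, the setting described above can be sketched in a few lines of NumPy: many heads, each with a frozen, randomly sampled query-key weight matrix and a trainable value readout, producing an output that is permutation invariant in the key vectors. This is a minimal illustrative sketch and not the authors' code; all names, dimensions, the scalar-output simplification, and the sampling choices below are assumptions for illustration.

```python
# Minimal sketch of a random-feature attention layer as described in the abstract:
# M heads with frozen, randomly sampled query-key weight matrices W_m and
# trainable value weights v_m. All names and dimensions are illustrative.
import numpy as np

rng = np.random.default_rng(0)
d, N, M = 8, 16, 512  # input dimension, number of key vectors, number of heads

# Frozen random query-key weights, one per head. The abstract notes that the
# sampling distribution matters; a non-zero mean (loc != 0) is the variant
# discussed in item (3).
W = rng.normal(loc=0.0, scale=1.0 / np.sqrt(d), size=(M, d, d))

# Trainable value weights (only these would be fit to data).
v = rng.normal(size=(M, d))

def random_feature_attention(q, K):
    """q: query vector of shape (d,); K: key vectors of shape (N, d) -> scalar."""
    out = 0.0
    for m in range(M):
        scores = K @ (W[m] @ q)              # attention logits for head m
        attn = np.exp(scores - scores.max())
        attn /= attn.sum()                   # softmax over the N key vectors
        out += attn @ (K @ v[m])             # linear value readout of the keys
    return out / M

q = rng.normal(size=d)
K = rng.normal(size=(N, d))
print(random_feature_attention(q, K))
```

In the learning problem studied in the paper, only the value weights are trained while the query-key weights stay frozen, so fitting the layer reduces to regression over the M random attention features; permuting the rows of K leaves the output unchanged, matching the permutation-invariant function classes considered in the analysis.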

