A Theoretical Analysis on Feature Learning in Neural Networks: Emergence from Inputs and Advantage over Fixed Features

by   Zhenmei Shi, et al.

An important characteristic of neural networks is their ability to learn representations of the input data with effective features for prediction, which is believed to be a key factor to their superior empirical performance. To better understand the source and benefit of feature learning in neural networks, we consider learning problems motivated by practical data, where the labels are determined by a set of class relevant patterns and the inputs are generated from these along with some background patterns. We prove that neural networks trained by gradient descent can succeed on these problems. The success relies on the emergence and improvement of effective features, which are learned among exponentially many candidates efficiently by exploiting the data (in particular, the structure of the input distribution). In contrast, no linear models on data-independent features of polynomial sizes can learn to as good errors. Furthermore, if the specific input structure is removed, then no polynomial algorithm in the Statistical Query model can learn even weakly. These results provide theoretical evidence showing that feature learning in neural networks depends strongly on the input structure and leads to the superior performance. Our preliminary experimental results on synthetic and real data also provide positive support.


page 1

page 2

page 3

page 4


Random Feature Amplification: Feature Learning and Generalization in Neural Networks

In this work, we provide a characterization of the feature-learning proc...

Provable Guarantees for Nonlinear Feature Learning in Three-Layer Neural Networks

One of the central questions in the theory of deep learning is to unders...

Hybrid-DNNs: Hybrid Deep Neural Networks for Mixed Inputs

Rapid development of big data and high-performance computing have encour...

Gradient Starvation: A Learning Proclivity in Neural Networks

We identify and formalize a fundamental gradient descent phenomenon resu...

Over-parameterised Shallow Neural Networks with Asymmetrical Node Scaling: Global Convergence Guarantees and Feature Learning

We consider the optimisation of large and shallow neural networks via gr...

The Benefits of Mixup for Feature Learning

Mixup, a simple data augmentation method that randomly mixes two data po...

The Curious Case of Benign Memorization

Despite the empirical advances of deep learning across a variety of lear...

Please sign up or login with your details

Forgot password? Click here to reset