The Benefits of Mixup for Feature Learning

by Difan Zou, et al.

Mixup, a simple data augmentation method that randomly mixes two data points via linear interpolation, has been extensively applied in various deep learning applications to gain better generalization. However, the theoretical underpinnings of its efficacy are not yet fully understood. In this paper, we seek a fundamental understanding of the benefits of Mixup. We first show that Mixup using different linear interpolation parameters for features and labels can still achieve performance similar to that of standard Mixup. This indicates that the intuitive linearity explanation in Zhang et al. (2018) may not fully explain the success of Mixup. We then perform a theoretical study of Mixup from the feature learning perspective. We consider a feature-noise data model and show that Mixup training can effectively learn the rare features (appearing in a small fraction of data) from their mixture with the common features (appearing in a large fraction of data). In contrast, standard training can learn only the common features and fails to learn the rare features, and thus suffers from poor generalization performance. Moreover, our theoretical analysis shows that the benefits of Mixup for feature learning are mostly gained in the early training phase, based on which we propose to apply early stopping in Mixup. Experimental results verify our theoretical findings and demonstrate the effectiveness of early-stopped Mixup training.
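
To make the interpolation concrete, the following is a minimal NumPy sketch of batch-level Mixup, not the paper's implementation. The function name `mixup_batch`, the in-batch permutation pairing, and the fixed values 0.7 and 0.4 are illustrative assumptions; drawing the mixing weight from Beta(α, α) follows the standard Mixup of Zhang et al. (2018). Passing a separate `lam_y` gives one possible reading of "different linear interpolation parameters for features and labels"; the exact scheme used in the paper may differ.

```python
import numpy as np

def mixup_batch(x, y, lam_x, lam_y=None, rng=None):
    """Mix each example with a randomly chosen partner from the same batch.

    x: (n, d) feature array; y: (n, k) one-hot (or soft) label array.
    lam_x weights the features; lam_y weights the labels.
    If lam_y is None, the same weight is used for both (standard Mixup).
    """
    rng = np.random.default_rng() if rng is None else rng
    if lam_y is None:
        lam_y = lam_x
    perm = rng.permutation(len(x))            # partner index for each example
    x_mix = lam_x * x + (1.0 - lam_x) * x[perm]
    y_mix = lam_y * y + (1.0 - lam_y) * y[perm]
    return x_mix, y_mix, perm

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))                   # toy features
y = np.eye(2)[[0, 1, 0, 1]]                   # toy one-hot labels

# Standard Mixup: one weight, drawn from Beta(alpha, alpha) with alpha = 1.
lam = rng.beta(1.0, 1.0)
x_std, y_std, perm_std = mixup_batch(x, y, lam, rng=rng)

# Decoupled variant (assumed form): different weights for features and labels.
x_dec, y_dec, perm_dec = mixup_batch(x, y, lam_x=0.7, lam_y=0.4, rng=rng)
```

In both variants each mixed label row still sums to one, since it is a convex combination of two one-hot vectors; only the decoupled variant breaks the exact correspondence between the feature weight and the label weight.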




