What can linearized neural networks actually say about generalization?
For certain infinitely-wide neural networks, the neural tangent kernel (NTK) theory fully characterizes generalization. However, for the networks used in practice, the empirical NTK represents only a rough first-order approximation of these architectures. Still, a growing body of work keeps leveraging this approximation to successfully analyze important deep learning phenomena and derive algorithms for new applications. In our work, we provide strong empirical evidence to determine the practical validity of such approximation by conducting a systematic comparison of the behaviour of different neural networks and their linear approximations on different tasks. We show that the linear approximations can indeed rank the learning complexity of certain tasks for neural networks, albeit with important nuances. Specifically, we discover that, in contrast to what was previously observed, neural networks do not always perform better than their kernel approximations, and reveal that their performance gap heavily depends on architecture, number of samples and training task. In fact, we show that during training, deep networks increase the alignment of their empirical NTK with the target task, which explains why linear approximations at the end of training can better explain the dynamics of deep networks. Overall, our work provides concrete examples of novel deep learning phenomena which can inspire future theoretical research, as well as provides a new perspective on the use of the NTK approximation in deep learning.
READ FULL TEXT