Learning threshold neurons via the "edge of stability"

NeurIPS 2023

Existing analyses of neural network training often operate under the unrealistic assumption of an extremely small learning rate. This lies in stark contrast to practical wisdom and empirical studies, such as the work of J. Cohen et al. (ICLR 2021), which exhibit startling new phenomena (the "edge of stability" or "unstable convergence") and potential benefits for generalization in the large learning rate regime. Despite a flurry of recent works on this topic, however, the latter effect is still poorly understood. In this paper, we take a step towards understanding genuinely non-convex training dynamics with large learning rates by performing a detailed analysis of gradient descent for simplified models of two-layer neural networks. For these models, we provably establish the edge of stability phenomenon and discover a sharp phase transition for the step size below which the neural network fails to learn "threshold-like" neurons (i.e., neurons with a non-zero first-layer bias). This elucidates one possible mechanism by which the edge of stability can in fact lead to better generalization, as threshold neurons are basic building blocks with useful inductive bias for many tasks.
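
To make the phenomenon concrete, here is a minimal sketch (not the paper's setup; the data, network size, and step size are assumptions chosen purely for illustration) that trains a small two-layer network with full-batch gradient descent at a deliberately large step size while tracking the sharpness, i.e. the largest eigenvalue of the training-loss Hessian, against the stability threshold 2/η:

```python
# Minimal edge-of-stability sketch (assumed toy setup, not the paper's model):
# full-batch gradient descent on a tiny two-layer ReLU network at a large step
# size, while monitoring the sharpness (top Hessian eigenvalue) against 2/eta.
import torch

torch.manual_seed(0)

# Toy 1-D regression data (assumed; any smooth target serves the illustration).
X = torch.linspace(-2, 2, 64).unsqueeze(1)   # shape (64, 1)
y = torch.tanh(3 * X)                        # shape (64, 1)

hidden = 8
eta = 0.08   # deliberately large step size (illustrative value)

def unpack(theta):
    # Parameter layout: W1 (hidden x 1), b1 (hidden), w2 (hidden).
    W1 = theta[:hidden].view(hidden, 1)
    b1 = theta[hidden:2 * hidden]
    w2 = theta[2 * hidden:]
    return W1, b1, w2

def loss_fn(theta):
    W1, b1, w2 = unpack(theta)
    h = torch.relu(X @ W1.T + b1)   # first layer, with bias
    pred = h @ w2                   # second layer, no bias
    return 0.5 * ((pred.unsqueeze(1) - y) ** 2).mean()

theta = torch.randn(3 * hidden) * 0.5
theta.requires_grad_(True)

for step in range(501):
    loss = loss_fn(theta)
    (grad,) = torch.autograd.grad(loss, theta)
    with torch.no_grad():
        theta -= eta * grad
    if step % 100 == 0:
        # With only 24 parameters the exact Hessian is cheap to form.
        H = torch.autograd.functional.hessian(loss_fn, theta.detach())
        sharpness = torch.linalg.eigvalsh(H).max().item()
        print(f"step {step:4d}  loss {loss.item():.4f}  "
              f"sharpness {sharpness:7.2f}  2/eta {2 / eta:.2f}")
```

In an edge-of-stability run one expects the reported sharpness to rise towards 2/η and then hover around it rather than settling well below; whether a particular run enters that regime depends on the step size, the data, and the initialization.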

Physics of AI

We propose an approach to the science of deep learning that roughly follows what physicists do to understand reality: (1) explore phenomena through controlled experiments, and (2) build theories based on toy mathematical models and non-fully-rigorous mathematical reasoning. I illustrate (1) with the LEGO study (LEGO stands for Learning Equality and Group Operations), where we observe how transformers learn to solve simple linear systems of equations. I will also briefly illustrate (2) with an analysis of the emergence of threshold units when training a two-layer neural network to solve a simple sparse coding problem. The latter analysis connects to the recently discovered edge of stability phenomenon.
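
As a stand-in for the sparse coding experiment (the talk's exact data model is not reproduced here; the data generator, network width, and step sizes below are assumptions for illustration), the following sketch trains a two-layer ReLU network with first-layer biases by full-batch gradient descent at a small and at a large step size, and reports how far the learned biases move from zero:

```python
# Stand-in for the sparse-coding experiment (assumed data model, illustration
# only): every input carries small background noise; positive examples also
# get a unit spike on one random coordinate, so a useful ReLU neuron has to
# learn a negative first-layer bias to threshold away the noise.
import torch

torch.manual_seed(1)

d, n = 20, 512
noise = 0.3 * torch.rand(n, d)
labels = (torch.rand(n) < 0.5).float()
spikes = torch.zeros(n, d)
idx = torch.randint(0, d, (n,))
spikes[torch.arange(n), idx] = labels   # spike only on positive examples
X = noise + spikes

def train(eta, steps=2000, hidden=16):
    # Two-layer ReLU network with first-layer bias, trained by full-batch GD.
    W = torch.randn(hidden, d) * 0.1
    b = torch.zeros(hidden)
    v = torch.randn(hidden) * 0.1
    for p in (W, b, v):
        p.requires_grad_(True)
    for _ in range(steps):
        logits = torch.relu(X @ W.T + b) @ v
        loss = torch.nn.functional.binary_cross_entropy_with_logits(logits, labels)
        gW, gb, gv = torch.autograd.grad(loss, (W, b, v))
        with torch.no_grad():
            W -= eta * gW
            b -= eta * gb
            v -= eta * gv
    return loss.item(), b.detach()

for eta in (0.05, 1.0):   # "small" vs "large" step size (illustrative values)
    final_loss, bias = train(eta)
    print(f"eta={eta:4.2f}  final loss={final_loss:.3f}  "
          f"mean bias={bias.mean().item():+.3f}  min bias={bias.min().item():+.3f}")
```

The qualitative claim is that below a critical step size the biases remain near zero, so no threshold-like behavior emerges, while sufficiently large step sizes drive them negative; the specific step sizes at which this transition appears depend on the data model and the initialization.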