March 25, 2023

2023 Workshop on Machine Learning Theory and Foundations

Location: Beijing, China

Sebastien Bubeck
Microsoft Research
Talk title: Analysis of a Toy Case for Emergence

  • Emergent behavior in transformers is disrupting the field of deep learning (which itself has disrupted the field of machine learning, which itself has disrupted … you get the idea). What is going on? I will present an analysis of the non-convex dynamics that lead to emergence in a special toy case (the sparse coding model), which can be viewed as an abstraction of the emergence of edge detectors in convolutional neural networks. The surprising part is that we connect emergence to the well-documented phenomenon of instability in training neural networks.

    Joint work with Kwangjun Ahn, Sinho Chewi, Yin Tat Lee, Felipe Suarez and Yi Zhang

  • Sebastien Bubeck leads the Machine Learning Foundations group at Microsoft Research Redmond. He is now bullish on AGI.
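
For readers unfamiliar with the sparse coding model mentioned in the abstract, the sketch below generates data in the standard way (a sparse code passed through a random dictionary). The dimensions, sparsity level, and noise scale are illustrative assumptions, not the exact setup analyzed in the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
d, p, n, k = 20, 50, 1000, 3   # ambient dim, dictionary size, samples, sparsity

# Random dictionary with unit-norm columns ("atoms").
D = rng.normal(size=(d, p))
D /= np.linalg.norm(D, axis=0, keepdims=True)

# Each sample activates only k atoms: a sparse code z, observed as x = D @ z + noise.
Z = np.zeros((p, n))
for i in range(n):
    support = rng.choice(p, size=k, replace=False)
    Z[support, i] = rng.normal(size=k)
X = D @ Z + 0.01 * rng.normal(size=(d, n))

print(X.shape)   # (20, 1000): inputs on which a network would be trained to recover the atoms
```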


Yuan Cao
University of Hong Kong
Talk title: Benign Overfitting in Two-layer Convolutional Neural Networks

  • Modern neural networks often have great expressive power and can be trained to overfit the training data, while still achieving good test performance. This phenomenon is referred to as “benign overfitting”. Recently, a line of work has emerged that studies benign overfitting from a theoretical perspective. However, these works are limited to linear models or kernel/random-feature models, and there is still a lack of theoretical understanding of when and how benign overfitting occurs in neural networks. In this talk, I will present learning guarantees for two-layer convolutional neural networks (CNNs). We first show that when the signal-to-noise ratio satisfies a certain condition, a two-layer CNN trained by gradient descent can achieve arbitrarily small training and test loss. On the other hand, when this condition does not hold, overfitting becomes harmful and the obtained CNN can only achieve a constant level of test loss. Together, these results demonstrate a sharp phase transition between benign and harmful overfitting, driven by the signal-to-noise ratio. We also construct an example where gradient descent can train a two-layer CNN to obtain small test errors, while Adam can only achieve constant-level test errors. This further demonstrates the impact of optimization algorithms on benign and harmful overfitting.

  • Yuan Cao is an assistant professor in the Department of Statistics and Actuarial Science and the Department of Mathematics at the University of Hong Kong. Before joining HKU, he was a postdoctoral scholar at UCLA working with Professor Quanquan Gu. He received his B.S. from Fudan University and his Ph.D. from Princeton University. Yuan’s research interests include the theory of deep learning, non-convex optimization, and high-dimensional statistics.
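
Analyses in this line of work are typically stated for a signal-plus-noise data model. The sketch below generates such data with the signal-to-noise ratio exposed as a knob; the parameter names and values are illustrative assumptions rather than the exact patch-based model of the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, snr = 200, 50, 2.0        # input dimension, sample size, signal strength (illustrative)

mu = np.zeros(d)
mu[0] = snr                      # a fixed signal direction whose length sets the SNR
y = rng.choice([-1.0, 1.0], size=n)                       # balanced binary labels
X = y[:, None] * mu[None, :] + rng.normal(size=(n, d))    # label-aligned signal + Gaussian noise

# The question studied in the talk: for which combinations of signal strength, sample
# size, and dimension does a two-layer CNN that interpolates (X, y) still generalize?
print(X.shape, y.shape)   # (50, 200) (50,)
```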


Li Dong
Microsoft Research Asia
Talk title: Why Can GPT Learn In-Context? Language Models Implicitly Perform Gradient Descent as Meta-Optimizers

  • Large pretrained language models have shown surprising In-Context Learning (ICL) ability. With a few demonstration input-label pairs, they can predict the label for an unseen input without additional parameter updates. Despite the great success in performance, the working mechanism of ICL remains an open problem. To better understand how ICL works, this paper explains language models as meta-optimizers and understands ICL as a kind of implicit finetuning. Theoretically, we show that Transformer attention has a dual form of gradient-descent-based optimization. Building on this, we understand ICL as follows: GPT first produces meta-gradients according to the demonstration examples, and then these meta-gradients are applied to the original GPT to build an ICL model. Experimentally, we comprehensively compare the behavior of ICL and explicit finetuning on real tasks to provide empirical evidence that supports our understanding. The results show that ICL behaves similarly to explicit finetuning at the prediction level, the representation level, and the attention-behavior level. Further, inspired by our understanding of meta-optimization, we design a momentum-based attention by analogy with the momentum-based gradient descent algorithm. Its consistently better performance over vanilla attention supports our understanding from another aspect and, more importantly, shows the potential of utilizing our understanding for future model design.

  • Li Dong is a Principal Researcher at Microsoft Research. Previously, he received his PhD from the School of Informatics at the University of Edinburgh. He has been focusing on large-scale self-supervised learning across tasks, languages, and modalities. Li’s research has been recognized with the AAAI/ACM SIGAI Doctoral Dissertation Award Runner-Up, the ACL-2018 Best Paper Honourable Mention, the AAAI-2021 Best Paper Runner-Up, and a fellowship from Microsoft. He has also served as an area chair for ACL, EMNLP, and NAACL multiple times.
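
The “dual form” is easiest to see for softmax-free (linear) attention, where attending to the demonstrations is exactly a weight update built from outer products of values and keys. The sketch below verifies that identity numerically; it deliberately omits the softmax and projection matrices that the paper also treats, so it is an illustration of the idea rather than the full construction.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_demo = 8, 5

# Keys/values derived from the demonstration pairs, plus one query token.
K = rng.normal(size=(n_demo, d))   # keys  k_i
V = rng.normal(size=(n_demo, d))   # values v_i
q = rng.normal(size=d)             # query

# Linear (softmax-free) attention over the demonstrations: sum_i v_i * (k_i . q)
attn_out = V.T @ (K @ q)

# Dual form: the same output is a weight update applied to the query,
# i.e. delta_W = sum_i v_i k_i^T acts like an accumulated (meta-)gradient step.
delta_W = V.T @ K
dual_out = delta_W @ q

print(np.allclose(attn_out, dual_out))   # True
```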


Cong Fang
Peking University
Talk title: Environment Invariant Linear Least Squares

  • We consider a multiple-environment linear regression model, in which data from multiple experimental settings are collected. The joint distribution of the response variable and covariates may vary across environments, yet the conditional expectation of the response given the unknown set of important variables is invariant. Such a statistical model is related to the problems of endogeneity, transfer learning, and causal inference. We construct a novel environment invariant linear least squares (EILLS) objective function, a multiple-environment version of linear least squares that leverages this conditional-expectation invariance structure together with the heterogeneity among different environments to determine the true parameter. Our proposed method is applicable under minimal structural assumptions. We establish non-asymptotic bounds on the estimation error of the EILLS estimator in the presence of endogenous variables. Moreover, we further show that the sparsity-penalized EILLS estimator can achieve variable selection consistency in high-dimensional regimes. These non-asymptotic results demonstrate the sample efficiency of the EILLS estimator and its capability to circumvent the curse of endogeneity in an algorithmic manner without any prior structural knowledge.

  • Cong Fang is an assistant professor at Peking University. He received his Ph.D. from Peking University in 2019 and was a postdoctoral researcher at Princeton University in 2020 and at the University of Pennsylvania in 2021. He works on the foundations of machine learning, with special interests in optimization, learning theory, neural network theory, and sampling algorithms.
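
As a rough illustration of the invariance idea (not the exact EILLS objective from the paper), the sketch below combines a pooled least-squares loss with a penalty on per-environment residual-covariate correlations, which vanishes when the regression relies only on invariant variables. The penalty weight, data model, and environments are hypothetical.

```python
import numpy as np

def multi_env_invariance_objective(beta, envs, gamma=10.0):
    """Pooled least squares plus a penalty on per-environment residual-covariate
    correlations (an illustrative variant of the invariance idea, not the exact
    EILLS objective)."""
    loss, penalty = 0.0, 0.0
    n_total = sum(len(y) for _, y in envs)
    for X, y in envs:
        r = y - X @ beta
        loss += np.sum(r ** 2) / n_total
        # If beta uses only invariant variables, E_e[x_j * r] is ~0 in every environment.
        penalty += np.sum((X.T @ r / len(y)) ** 2)
    return loss + gamma * penalty

# Toy usage with two synthetic environments sharing the same invariant coefficients.
rng = np.random.default_rng(0)
envs = []
for shift in (0.0, 2.0):
    X = rng.normal(size=(200, 3)) + shift
    y = X @ np.array([1.0, -0.5, 0.0]) + 0.1 * rng.normal(size=200)
    envs.append((X, y))
print(multi_env_invariance_objective(np.array([1.0, -0.5, 0.0]), envs))
```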


Jonathan Frankle
MosaicML
Talk title: Faster Neural Network Training, Algorithmically

  • Training modern neural networks is time-consuming, expensive, and energy-intensive. As neural network training costs double every few months, it is difficult for researchers and businesses without immense budgets to keep up, especially as hardware improvements stagnate. In this talk, I will describe my favored approach for managing this challenge: changing the workload itself – the training algorithm. Unlike most workloads in computer science, machine learning is approximate, and we need not worry about changing the underlying algorithm so long as we properly account for the consequences. I will discuss how we have put this approach into practice at MosaicML, including the dozens of algorithmic changes we have studied (which are freely available open source), the science behind how these changes interact with each other (the composition problem), and how we evaluate whether these changes have been effective. I will also detail several surprises we have encountered and lessons we have learned along the way. In the time since we began this work, we have reduced the training times of standard computer vision models by 5-7x and standard language models by 2-3x, and we’re just scratching the surface. I will close with a number of open research questions we have encountered that merit the attention of the research community. This is the collective work of a dozen empirical deep learning researchers at MosaicML, and I’m simply the messenger.

  • Jonathan Frankle is Chief Scientist at MosaicML, where he leads the company’s research team toward the goal of developing more efficient algorithms for training neural networks. In his PhD at MIT, he empirically studied deep learning with Prof. Michael Carbin, specifically the properties of sparse networks that allow them to train effectively (his “Lottery Ticket Hypothesis” – ICLR 2019 Best Paper). In addition to his technical work, he is actively involved in policymaking around challenges related to machine learning. He will be joining the computer science faculty at Harvard in the fall of 2023. He earned his BSE and MSE in computer science at Princeton and has previously spent time at Google Brain, Facebook AI Research, and Microsoft as an intern and Georgetown Law as an “Adjunct Professor of Law.”


Surya Ganguli
Stanford University
Talk title: Beyond Neural Scaling Laws: Towards Data Efficient Deep Learning

  • Widely observed neural scaling laws, in which error falls off as a power of the training set size, model size, or both, have driven substantial performance improvements in deep learning. However, these improvements through scaling alone come at considerable cost in compute and energy. We show how to break beyond this power-law scaling with respect to data, sometimes achieving exponential scaling, both in theory and in practice, through careful data pruning. We additionally develop a new simple, cheap, and scalable self-supervised data-pruning algorithm that achieves performance comparable to the best supervised data-pruning algorithms (which, in contrast, require class labels). We empirically demonstrate better-than-power-law scaling on ResNets trained on CIFAR-10, SVHN, and ImageNet. We furthermore extend our empirical work to data pruning at web scale, showing how to prune an already highly curated subset of 440M LAION image-text pairs down to 270M without suffering any loss in accuracy on more than 20 downstream tasks. This work suggests that careful data curation and selection may constitute the next arena for performance gains in machine learning.

  • Surya Ganguli triple-majored in physics, mathematics, and EECS at MIT, completed a PhD in string theory at Berkeley, and did a postdoc in theoretical neuroscience at UCSF. He has also done AI research at both Google Brain and Meta AI. He is now an associate professor of Applied Physics at Stanford, where he leads the Neural Dynamics and Computation Lab. His research spans neuroscience, machine learning, and physics, focusing on understanding and improving how both biological and artificial neural networks learn striking emergent computations. He has been awarded a Swartz Fellowship in computational neuroscience, a Burroughs Wellcome Career Award, a Terman Award, two NeurIPS Outstanding Paper Awards, a Sloan Fellowship, a James S. McDonnell Foundation scholar award in human cognition, a McKnight Scholar award in neuroscience, a Simons Investigator Award in the mathematical modeling of living systems, and an NSF CAREER award.
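
A minimal sketch of self-supervised pruning by prototype distance, in the spirit of the metric described in the abstract: cluster the embeddings, score each example by its distance to its cluster centroid, and keep one end of the ranking. The clustering backbone, keep fraction, and the choice to retain the "hard" (far-from-prototype) examples are illustrative assumptions; the paper studies when each end of the ranking should be kept.

```python
import numpy as np
from sklearn.cluster import KMeans

def prune_by_prototype_distance(embeddings, keep_frac=0.7, n_clusters=10, seed=0):
    """Keep the examples farthest from their cluster prototype ("hard" examples)."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(embeddings)
    dists = np.linalg.norm(embeddings - km.cluster_centers_[km.labels_], axis=1)
    n_keep = int(keep_frac * len(embeddings))
    return np.argsort(dists)[-n_keep:]   # indices of the retained examples

# Toy usage on random "embeddings" standing in for self-supervised features.
emb = np.random.default_rng(0).normal(size=(1000, 32))
kept = prune_by_prototype_distance(emb, keep_frac=0.5)
print(len(kept))   # 500
```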


Boris Hanin
Princeton University
Talk title: Bayesian Interpolation with Deep Linear Networks

  • This talk is based on joint work (arXiv:2212.14457) with Alexander Zlokapa, which gives exact non-asymptotic formulas for Bayesian posteriors in deep linear networks. After providing some general motivation, I will focus on explaining results of two kinds. First, I will state a precise result showing that infinitely deep linear networks compute optimal posteriors starting from universal, data-agnostic priors. Second, I will explain how a novel scaling parameter – given by # data * depth / width – controls the effective depth and complexity of the posterior.

  • Professor Hanin is an Assistant Professor in Princeton’s Operations Research and Financial Engineering Department, working on theoretical machine learning, probability, and spectral theory.


Di He
Peking University
Talk title: Which Graph Neural Network Can Provably Solve Practical Problems?

  • Designing expressive Graph Neural Networks (GNNs) is a central topic in learning on graph-structured data. While numerous approaches have been proposed to improve GNNs with respect to the Weisfeiler-Lehman (WL) test, for most of them there is still a lack of deep understanding of what additional power they can systematically and provably gain. In this work, we take a fundamentally different perspective to study the expressive power of GNNs beyond the WL test. Specifically, we introduce a novel class of expressivity metrics via graph biconnectivity and highlight their importance in both theory and practice. As biconnectivity can be easily computed using simple algorithms with linear computational cost, it is natural to expect that popular GNNs can learn it easily as well. However, after a thorough review of prior GNN architectures, we surprisingly find that most of them are not expressive for any of these metrics. We introduce a principled and efficient approach called Generalized Distance Weisfeiler-Lehman (GD-WL), which is provably expressive for all biconnectivity metrics. Practically, we show that GD-WL can be implemented by a Transformer-like architecture that preserves expressiveness and enjoys full parallelizability. A set of experiments on both synthetic and real datasets demonstrates that our approach can consistently outperform prior GNN architectures.

  • Di He is an Assistant Professor at Peking University. He was previously a Senior Researcher in the Machine Learning Group (now AI4Science) at Microsoft Research Asia. Di’s main research interests include representation learning (mainly focusing on learning representations of languages and graphs), trustworthy machine learning, and learning methods for scientific problems. His work aims to develop efficient algorithms that can capture accurate and robust features from data through deep neural networks. Di has served as an area chair for top machine learning and artificial intelligence conferences, including ICML, NeurIPS, ICLR, and CVPR.
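
The biconnectivity quantities in question (cut vertices and cut edges) are cheap to compute with classical graph algorithms, and GD-WL aggregates generalized distances such as shortest-path distances. The sketch below shows both on a toy graph using networkx; the graph itself is an illustrative example.

```python
import networkx as nx

# A small graph: two triangles connected by a single "bridge" edge.
G = nx.Graph([(0, 1), (1, 2), (2, 0),      # triangle A
              (3, 4), (4, 5), (5, 3),      # triangle B
              (2, 3)])                     # bridge between them

# Biconnectivity structure that the talk's expressivity metrics are built around:
print(sorted(nx.articulation_points(G)))   # cut vertices: [2, 3]
print(sorted(nx.bridges(G)))               # cut edges:    [(2, 3)]

# Distance features of the kind GD-WL-style methods aggregate (here, hop distances).
dist = dict(nx.all_pairs_shortest_path_length(G))
print(dist[0][5])                          # distance from node 0 to node 5 is 3
```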


Kenji Kawaguchi
National University of Singapore
Talk title: On the Theoretical Understanding of Mixup

  • Mixup is a popular data augmentation technique for training deep neural networks where additional samples are generated by linearly interpolating pairs of inputs and their labels. This technique is known to improve the generalization performance in many learning paradigms and applications. In this talk, I will discuss some of the theoretical understandings of Mixup.

  • Kenji Kawaguchi is a Presidential Young Professor in the Department of Computer Science at National University of Singapore. Kenji Kawaguchi received his Ph.D. in Computer Science from MIT. He then joined Harvard University as a postdoctoral fellow. He was also an invited participant at the University of Cambridge, Isaac Newton Institute for Mathematical Sciences program on “Mathematics of Deep Learning”. His research interests include deep learning, as well as artificial intelligence (AI) in general. His research lab aims to have a positive feedback loop between theory and practice in deep learning research through collaborations with researchers from both practice and theory sides.
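
A minimal NumPy sketch of the mixup augmentation described in the abstract: sample a Beta-distributed mixing coefficient and convex-combine random pairs of inputs and their one-hot labels. The batch shapes and the alpha value are illustrative choices.

```python
import numpy as np

def mixup_batch(x, y_onehot, alpha=0.2, rng=None):
    """Mixup: convex-combine random pairs of inputs and their one-hot labels."""
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.beta(alpha, alpha)            # mixing coefficient in [0, 1]
    perm = rng.permutation(len(x))          # random pairing within the batch
    x_mix = lam * x + (1 - lam) * x[perm]
    y_mix = lam * y_onehot + (1 - lam) * y_onehot[perm]
    return x_mix, y_mix

# Toy usage: a batch of four 32x32 "images" with 3 classes.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 32, 32))
y = np.eye(3)[[0, 2, 1, 0]]
x_mix, y_mix = mixup_batch(x, y, rng=rng)
print(x_mix.shape, y_mix.shape)   # (4, 32, 32) (4, 3)
```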


Zhiyuan Li
Stanford University
Talk title: How Does Sharpness-Aware Minimization Minimize Sharpness?

  • Sharpness-Aware Minimization (SAM) is a highly effective regularization technique for improving the generalization of deep neural networks in various settings. However, the underlying workings of SAM remain elusive because of various intriguing approximations in its theoretical characterizations. SAM intends to penalize a notion of sharpness of the model but implements a computationally efficient variant; moreover, a third notion of sharpness is used for proving generalization guarantees. The subtle differences among these notions of sharpness can indeed lead to significantly different empirical results. This paper rigorously nails down the exact sharpness notion that SAM regularizes and clarifies the underlying mechanism. We also show that the two approximation steps in the original motivation of SAM individually lead to inaccurate local conclusions, but their combination accidentally reveals the correct effect when full-batch gradients are applied. Furthermore, we prove that the stochastic version of SAM in fact regularizes the third notion of sharpness mentioned above, which is most likely to be the preferred notion for practical performance. The key mechanism behind this intriguing phenomenon is the alignment between the gradient and the top eigenvector of the Hessian when SAM is applied.

  • Zhiyuan Li is an incoming tenure-track assistant professor at the Toyota Technological Institute at Chicago (TTIC), starting in Fall 2023, and is currently a postdoc in Stanford CS. He obtained his Ph.D. in computer science from Princeton University in 2022. His research focuses on machine learning theory, especially the generalization of overparametrized models and non-convex optimization. He is a recipient of the Microsoft Research PhD Fellowship.
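
A minimal sketch of the SAM update itself, in the full-batch form discussed in the abstract: ascend to a first-order worst-case point within an L2 ball of radius rho, then apply the gradient computed there to the original weights. The toy quadratic and hyperparameters are illustrative assumptions.

```python
import numpy as np

def sam_step(w, grad_fn, lr=0.1, rho=0.05):
    """One full-batch SAM step: (1) perturb the weights along the normalized
    gradient by radius rho, (2) take a gradient step at the perturbed point,
    applied to the original weights."""
    g = grad_fn(w)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)   # first-order worst-case perturbation
    g_sharp = grad_fn(w + eps)                    # gradient at the perturbed weights
    return w - lr * g_sharp

# Toy usage on a 2-D quadratic with a sharp and a flat direction.
H = np.diag([10.0, 0.1])
grad = lambda w: H @ w
w = np.array([1.0, 1.0])
for _ in range(50):
    w = sam_step(w, grad)
print(w)   # ends up near the origin (within roughly rho along the sharp direction)
```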


Qiang Liu
University of Texas at Austin
Talk title: Flow Straight and Fast: A Simple and Unified Approach to Generative Modeling, Domain Transfer, and Optimal Transport

  • We consider the problem of learning a transport mapping between two distributions that are only observed through unpaired data points. This problem provides a unified framework for a variety of fundamental tasks in machine learning: generative modeling is about transforming a Gaussian (or other elementary) random variable into realistic data points; domain transfer concerns transferring data points from one domain to another; optimal transport (OT) solves the more challenging problem of finding a “best” transport map that minimizes a certain transport cost. Unfortunately, despite this unified view, there is no single algorithm that can solve the transport mapping problem efficiently in all settings. Existing algorithms need to be developed case by case and tend to be complicated and computationally expensive.

    In this talk, I will show that the problems above can be addressed in a unified and surprisingly simple way. The algorithm, called rectified flow, learns an ordinary differential equation (ODE) model to transfer between the two distributions by following straight paths as much as possible. The algorithm only requires solving a sequence of nonlinear least squares optimization problems, which is guaranteed to yield couplings that are monotonically non-increasing with respect to all convex transport costs. Straight paths are special and preferred because they are the shortest paths between two points and can be simulated exactly without time discretization, yielding computationally efficient models. In practice, the ODE models learned by our method can generate high-quality images with a single discretization step, a significant speedup over existing diffusion generative models. Moreover, with a proper modification, our method can be used to solve OT problems on high-dimensional continuous distributions, a challenging problem for which no well-accepted efficient algorithms exist.

  • Qiang Liu is an assistant professor of Computer Science at UT Austin. He is interested in studying and developing fundamental yet computationally feasible algorithms for basic learning, inference, and optimization problems, and in exploring their applications.
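
A compact sketch of the rectified-flow recipe described in the abstract, under illustrative choices of target distribution, network size, and step counts: regress a velocity field onto the straight-line displacement x1 - x0 at interpolated points, then sample by integrating the learned ODE with a handful of Euler steps.

```python
import torch
import torch.nn as nn

# Rectified-flow sketch: learn v(x_t, t) so that dx/dt = v(x, t) transports a Gaussian
# source onto a toy 2-D target. Straight-line couplings x_t = (1 - t) x0 + t x1 give
# the regression target x1 - x0.
torch.manual_seed(0)

def sample_target(n):          # toy target: mixture of two Gaussians
    centers = torch.tensor([[-2.0, 0.0], [2.0, 0.0]])
    idx = torch.randint(0, 2, (n,))
    return centers[idx] + 0.3 * torch.randn(n, 2)

v = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.Adam(v.parameters(), lr=1e-3)

for step in range(2000):                      # nonlinear least squares on straight paths
    x0 = torch.randn(256, 2)                  # source samples
    x1 = sample_target(256)                   # (unpaired) target samples
    t = torch.rand(256, 1)
    xt = (1 - t) * x0 + t * x1
    loss = ((v(torch.cat([xt, t], dim=1)) - (x1 - x0)) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# Sampling: integrate the ODE with a few Euler steps (nearly straight flows need very few).
x = torch.randn(1000, 2)
n_steps = 4
with torch.no_grad():
    for i in range(n_steps):
        t = torch.full((1000, 1), i / n_steps)
        x = x + v(torch.cat([x, t], dim=1)) / n_steps
print(x.mean(0), x.std(0))   # roughly matches the bimodal target's statistics
```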


Maxim Raginsky
University of Illinois at Urbana-Champaign (UIUC)
Talk title: Variational Principles for Mirror Descent and Mirror Langevin Dynamics

  • Mirror descent, introduced by Nemirovsky and Yudin in the 1970s, is a primal-dual convex optimization method that can be tailored to the geometry of the optimization problem at hand through the choice of a strongly convex distance-generating potential function. It arises as a basic primitive in a variety of applications, including large-scale optimization, machine learning, and control. In this talk, based on joint work with Belinda Tzen, Anant Raj, and Francis Bach, I will discuss a variational formulation of mirror descent and of its stochastic variant, mirror Langevin dynamics. The main idea, inspired by classic work of Brezis and Ekeland, is to show that mirror descent emerges as a closed-loop solution for a certain optimal control problem, and the Bellman value function is given by the Bregman divergence, in the dual space, between the initial condition and the global minimizer of the objective function. This formulation has several interesting corollaries and implications, including a form of implicit regularization, which I will discuss.

  • Maxim Raginsky received the B.S. and M.S. degrees in 2000 and the Ph.D. degree in 2002 from Northwestern University, all in Electrical Engineering. He has held research positions at Northwestern, the University of Illinois at Urbana-Champaign (where he was a Beckman Foundation Fellow from 2004 to 2007), and Duke University. In 2012, he returned to UIUC, where he is currently a Professor and William L. Everitt Fellow in the Department of Electrical and Computer Engineering and the Coordinated Science Laboratory. He also holds a courtesy appointment in the Department of Computer Science. Prof. Raginsky’s interests cover probability and stochastic processes, deterministic and stochastic control, machine learning, optimization, and information theory. Much of his recent research is motivated by fundamental questions in the modeling, learning, and simulation of nonlinear dynamical systems, with applications to advanced electronics, autonomy, and artificial intelligence. Prof. Raginsky was a Program Co-Chair of the 2022 Conference on Learning Theory (COLT).
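
For concreteness, the sketch below shows mirror descent with the negative-entropy potential on the probability simplex (the exponentiated-gradient update), a textbook instance of the primal-dual scheme discussed in the abstract; the variational/optimal-control formulation of the talk is not reproduced here.

```python
import numpy as np

def entropy_mirror_descent(grad_fn, x0, lr=0.1, n_steps=200):
    """Mirror descent on the probability simplex with the negative-entropy potential,
    i.e. the exponentiated-gradient update x_{k+1} proportional to x_k * exp(-lr * grad)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_steps):
        x = x * np.exp(-lr * grad_fn(x))   # step in the dual (mirror) space
        x /= x.sum()                       # map back onto the simplex
    return x

# Toy usage: minimize a linear cost <c, x> over the simplex; the minimizer puts all
# mass on the cheapest coordinate.
c = np.array([0.8, 0.3, 0.5])
x_star = entropy_mirror_descent(lambda x: c, np.ones(3) / 3)
print(x_star)   # concentrates on coordinate 1 (cost 0.3)
```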


Masashi Sugiyama
RIKEN/The University of Tokyo
Talk title: Adapting to Distribution Shifts: Recent Advances in Importance Weighting Methods

  • Distribution shifts are conceivable in practical machine learning scenarios, such as when a model is trained on data collected in different environments, or when a model is used in a test environment that is different from the training environment. The use of an importance-weighted loss function is a classical approach to deal with such distribution shifts. In this talk, I will give an overview of our recent advances in importance-based distribution shift adaptation methods, including joint importance-predictor estimation for covariate shift adaptation, dynamic importance weighting for joint distribution shift adaptation, and multistep class prior shift adaptation.

  • Masashi Sugiyama received his Ph.D. in Computer Science from the Tokyo Institute of Technology in 2001. He has been a professor at the University of Tokyo since 2014, and simultaneously the director of the RIKEN Center for Advanced Intelligence Project (AIP) since 2016. His research interests include theories and algorithms of machine learning. In 2022, he received the Award for Science and Technology from the Japanese Minister of Education, Culture, Sports, Science and Technology. He was program co-chair of the Neural Information Processing Systems (NeurIPS) conference in 2015, the International Conference on Artificial Intelligence and Statistics (AISTATS) in 2019, and the Asian Conference on Machine Learning (ACML) in 2010 and 2020. He is (co-)author of Machine Learning in Non-Stationary Environments (MIT Press, 2012), Density Ratio Estimation in Machine Learning (Cambridge University Press, 2012), Statistical Reinforcement Learning (Chapman & Hall, 2015), Introduction to Statistical Machine Learning (Morgan Kaufmann, 2015), and Machine Learning from Weak Supervision (MIT Press, 2022).
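
The classical importance-weighting correction reweights each training loss by the density ratio p_test(x)/p_train(x). The sketch below assumes the densities are known in closed form for a 1-D toy problem; in practice the ratio itself must be estimated (e.g., jointly with the predictor, as in the methods described above).

```python
import numpy as np
from scipy.stats import norm

def importance_weighted_loss(losses, x, log_p_test, log_p_train):
    """Covariate-shift correction: reweight per-example training losses by the
    density ratio w(x) = p_test(x) / p_train(x)."""
    w = np.exp(log_p_test(x) - log_p_train(x))
    return np.mean(w * losses)

# Toy usage: Gaussian train/test covariate distributions with shifted means.
rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=500)     # drawn from p_train
losses = (x - 1.0) ** 2                           # stand-in per-example losses
iw = importance_weighted_loss(losses, x,
                              log_p_test=norm(1.0, 1.0).logpdf,
                              log_p_train=norm(0.0, 1.0).logpdf)
print(iw, losses.mean())   # the weighted loss estimates the risk under the test distribution
```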


Zhiqin Xu
Shanghai Jiao Tong University
Talk title: Condensation in Deep Learning

  • Why do neural networks (NNs) that look so complex usually generalize well? To understand this problem, we identify some simple implicit regularizations that arise during the training of NNs. The first is the frequency principle: NNs learn from low frequencies to high frequencies. The second is parameter condensation, a feature of the non-linear training process, which makes the effective network size much smaller. Based on condensation, we find an intrinsic embedding principle of the NN loss landscape and develop a rank-analysis framework to quantitatively understand how much data an overparameterized NN needs in order to generalize well.

  • Zhi-Qin John Xu is an associate professor at Shanghai Jiao Tong University (SJTU). Zhi-Qin obtained a B.S. in Physics (2012) and a Ph.D. in Mathematics (2016) from SJTU. Before joining SJTU, he worked as a postdoc at NYUAD and the Courant Institute from 2016 to 2019. He has published papers in JMLR, AAAI, NeurIPS, JCP, CiCP, SIMODS, etc. He is a managing editor of the Journal of Machine Learning.
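
The frequency principle mentioned in the abstract can be observed directly on a 1-D toy problem: fit a target with one low- and one high-frequency component and track the residual energy at each frequency during training. The network size, learning rate, and frequencies below are illustrative choices.

```python
import numpy as np
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.linspace(0, 1, 256).unsqueeze(1)
y = torch.sin(2 * np.pi * x) + 0.5 * torch.sin(2 * np.pi * 10 * x)   # 1 Hz + 10 Hz target

net = nn.Sequential(nn.Linear(1, 128), nn.Tanh(), nn.Linear(128, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

def band_error(residual, k):
    """Magnitude of the fitting residual at frequency index k (via the real FFT)."""
    spec = np.abs(np.fft.rfft(residual.detach().numpy().ravel()))
    return spec[k]

for step in range(3001):
    loss = ((net(x) - y) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    if step % 1000 == 0:
        r = net(x) - y
        print(step, f"low-freq err {band_error(r, 1):.2f}", f"high-freq err {band_error(r, 10):.2f}")
# Typically the 1 Hz component is fit long before the 10 Hz component.
```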


Yang Yuan
Tsinghua University
Talk title: Contrastive Learning Is Spectral Clustering on Similarity Graph

  • Contrastive learning is a powerful self-supervised learning method, but we have limited theoretical understanding of how and why it works. In this paper, we prove that contrastive learning is equivalent to spectral clustering on the similarity graph. Using this equivalence as the building block, we investigate the CLIP model and rigorously characterize how similar multi-modal objects are embedded together via representation theorems. Inspired by the theory, we propose new kernels that can achieve better performance than the standard kernel on several vision datasets.

  • Yang Yuan is now an assistant professor at IIIS, Tsinghua University. He finished his undergraduate studies at Peking University in 2012. He then received his PhD from Cornell University in 2018, advised by Professor Robert Kleinberg. During his PhD, he was a visiting student at MIT/Microsoft Research New England (2014-2015) and Princeton University (Fall 2016). Before joining Tsinghua, he spent one year at the MIT Institute for Foundations of Data Science (MIFODS) as a postdoctoral researcher. He works on AI + healthcare, AI interpretability, and AI systems.
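
The spectral-clustering side of the claimed equivalence is easy to make concrete: given a similarity graph over (augmentation-related) data points, embed each point with the bottom nontrivial eigenvectors of the normalized graph Laplacian. The block-structured toy similarity matrix below is an illustrative stand-in for the augmentation graph studied in the talk.

```python
import numpy as np

def spectral_embedding(S, dim=2):
    """Embed nodes of a similarity graph S using the bottom nontrivial eigenvectors
    of the normalized graph Laplacian (the spectral-clustering representation)."""
    d = S.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = np.eye(len(S)) - D_inv_sqrt @ S @ D_inv_sqrt      # normalized Laplacian
    vals, vecs = np.linalg.eigh(L)
    return vecs[:, 1:dim + 1]                              # skip the trivial eigenvector

# Toy similarity graph: two tightly connected groups with weak cross links.
S = np.full((10, 10), 0.05)
S[:5, :5] = S[5:, 5:] = 1.0
np.fill_diagonal(S, 0.0)
emb = spectral_embedding(S, dim=1)
print(np.sign(emb.ravel()))   # the two groups separate by sign in the embedding
```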