Machine learning’s rapid emergence and pervasive impact have revolutionized industries and societies across the globe. Its ability to extract insights, recognize patterns, and make intelligent predictions from vast amounts of data has paved the way for a new era of progress. From traffic and weather prediction to speech recognition and advanced medical diagnostics, machine learning keeps pushing the boundaries of what is possible, inviting us to explore new frontiers of innovation.
The International Conference on Machine Learning (ICML 2023) serves as a global platform where researchers, academics, and industry professionals gather to share their pioneering work and advancements in the field of machine learning. As a supporter of machine learning research, Microsoft takes an active role in ICML, not only as a sponsor but also as a significant research contributor.
The breadth of contributions from Microsoft researchers and their collaborators at ICML reflects the diverse range of possibilities for applying machine learning.
Here are some of the highlights:
Oral sessions
BEATs: Audio Pre-Training with Acoustic Tokenizers
Sanyuan Chen, Yu Wu, Chengyi Wang, Shujie Liu, Daniel Tompkins, Zhuo Chen, and Furu Wei explore the growth of self-supervised learning (SSL) across language, vision, speech, and audio domains. They propose an iterative framework, BEATs, which combines acoustic tokenizers and audio SSL models and promotes semantic-rich discrete label prediction, facilitating the abstraction of high-level audio semantics. Experimental results demonstrate BEATs’ effectiveness, achieving state-of-the-art performance on various audio classification benchmarks, including AudioSet-2M and ESC-50.
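To make the iterative recipe concrete, here is a toy Python sketch (not the authors’ code) of how the two components alternate: a simple k-means codebook stands in for the acoustic tokenizer, and train_ssl_model is a hypothetical placeholder for the masked-prediction audio SSL model.

```python
import numpy as np

def train_tokenizer(features, num_codes=32, iters=10):
    """Fit a k-means codebook so continuous features can be mapped to discrete labels."""
    rng = np.random.default_rng(0)
    codebook = features[rng.choice(len(features), num_codes, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((features[:, None] - codebook[None]) ** 2).sum(-1), axis=1)
        for c in range(num_codes):
            if np.any(labels == c):
                codebook[c] = features[labels == c].mean(axis=0)
    return codebook

def tokenize(features, codebook):
    """Assign each feature frame to its nearest code (its discrete label)."""
    return np.argmin(((features[:, None] - codebook[None]) ** 2).sum(-1), axis=1)

def train_ssl_model(audio_features, discrete_labels):
    """Hypothetical placeholder: in the paper this is a masked-prediction transformer
    trained to predict the discrete labels; here we return the raw features so only
    the loop structure is visible."""
    return audio_features

# Iterative loop: the tokenizer supplies labels for SSL training, and the trained
# SSL model's representations are used to refit the next tokenizer.
audio_features = np.random.randn(256, 16)   # toy stand-in for audio frames
representations = audio_features
for iteration in range(3):
    codebook = train_tokenizer(representations)                  # acoustic tokenizer
    labels = tokenize(audio_features, codebook)                  # discrete label targets
    representations = train_ssl_model(audio_features, labels)    # audio SSL model
```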
Representation Learning with Multi-Step Inverse Kinematics: An Efficient and Optimal Approach to Rich-Observation RL
Zakaria Mhammedi, Dylan Foster, and Alexander Rakhlin introduce MusIK, a computationally efficient algorithm for sample-efficient reinforcement learning with complex observations. MusIK overcomes limitations of existing methods by achieving rate-optimal sample complexity under minimal statistical assumptions. It combines systematic exploration with multi-step inverse kinematics, in which the learner predicts its own actions from current and future observations.
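As a rough illustration of the multi-step inverse kinematics objective (a sketch, not the paper’s implementation): a shared encoder maps the current observation and a k-step-ahead observation into a latent space, and a per-horizon head predicts the action taken at the current step. All dimensions and the two-layer encoder are illustrative.

```python
import torch
import torch.nn as nn

class MultiStepInverseKinematics(nn.Module):
    def __init__(self, obs_dim, num_actions, latent_dim=64, max_horizon=8):
        super().__init__()
        # Shared encoder mapping raw observations to the learned latent representation.
        self.encoder = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, latent_dim)
        )
        # One action-classification head per prediction horizon k = 1..max_horizon.
        self.heads = nn.ModuleList(
            [nn.Linear(2 * latent_dim, num_actions) for _ in range(max_horizon)]
        )

    def forward(self, obs_now, obs_future, k):
        z = torch.cat([self.encoder(obs_now), self.encoder(obs_future)], dim=-1)
        return self.heads[k - 1](z)  # logits over the action taken at the current step

# One training step: cross-entropy between predicted and actually-taken actions.
model = MultiStepInverseKinematics(obs_dim=32, num_actions=4)
obs_t, obs_tk = torch.randn(16, 32), torch.randn(16, 32)  # toy batch, horizon k = 3
actions = torch.randint(0, 4, (16,))
loss = nn.functional.cross_entropy(model(obs_t, obs_tk, k=3), actions)
loss.backward()
```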
Using Large Language Models to Simulate Multiple Humans and Replicate Human Subject Studies
Gati Aher, Rosa Arriaga, and Adam Tauman Kalai present the Turing Experiment (TE), a novel approach for evaluating how well language models can simulate different aspects of human behavior. Unlike the traditional Turing Test, a TE requires a representative sample of participants, as in human subject research. The methodology enables the replication of well-established findings from economics, psycholinguistics, and social psychology, such as the Ultimatum Game, Garden Path Sentences, the Milgram Shock Experiment, and the Wisdom of Crowds. Results demonstrate successful replication in the first three TEs, while uncovering a “hyper-accuracy distortion” in some language models during the last TE.
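For a flavor of how a TE is run, the sketch below poses the same Ultimatum Game scenario to many simulated participants, varying only the participant’s name, and aggregates the responses. query_language_model is a hypothetical placeholder for whatever completion API is used, and the prompt wording is illustrative rather than the paper’s.

```python
from collections import Counter

def query_language_model(prompt: str) -> str:
    """Hypothetical placeholder for a language model completion call; it returns a
    canned response here so the sketch runs end to end."""
    return "accept the offer"

PROMPT_TEMPLATE = (
    "{name} is offered ${offer} out of a $10 pot by another player. "
    "If {name} rejects the offer, both players get nothing. "
    "{name} decides to"
)

def run_ultimatum_te(names, offer):
    """Simulate one participant per name and tally accept/reject decisions."""
    responses = Counter()
    for name in names:
        completion = query_language_model(PROMPT_TEMPLATE.format(name=name, offer=offer))
        responses["accept" if "accept" in completion.lower() else "reject"] += 1
    return responses

print(run_ultimatum_te(["Alice", "Omar", "Mei", "Carlos"], offer=3))
```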
Other paper highlights
Bayesian Estimation of Differential Privacy
Differentially private stochastic gradient descent (SGD) algorithms provide formal privacy guarantees for training ML models, offering better protection against practical attacks. Researchers estimate protection levels using confidence intervals for the privacy parameter ε derived from membership inference attacks, but obtaining actionable intervals requires training an impractically large number of models. Santiago Zanella-Béguelin, Lukas Wutschitz, Shruti Tople, Ahmed Salem, Victor Ruehle, Andrew Paverd, Mohammad Naseri, Boris Köpf, and Daniel Jones propose a novel, more efficient Bayesian approach that brings privacy estimates within reach of practitioners. It reduces the required sample size by computing a posterior for ε from the joint posterior of the false positive and false negative rates of membership inference attacks. The authors also implement an end-to-end system for privacy estimation that integrates this approach with state-of-the-art membership inference attacks and evaluate it on text and vision classification tasks.
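The following is a simplified Monte Carlo sketch of that idea (not the paper’s exact estimator): Beta posteriors are placed on the attack’s false positive and false negative rates, and each sampled pair is pushed through the standard (ε, δ)-DP hypothesis-testing bound to obtain a posterior over ε. The counts and δ below are illustrative.

```python
import numpy as np

def epsilon_posterior(fp, tn, fn, tp, delta=1e-5, num_samples=100_000, seed=0):
    """Sample from a posterior over epsilon given membership inference attack outcomes."""
    rng = np.random.default_rng(seed)
    # Beta(1 + errors, 1 + correct) posteriors on the attack's error rates (uniform priors).
    fpr = rng.beta(1 + fp, 1 + tn, num_samples)
    fnr = rng.beta(1 + fn, 1 + tp, num_samples)
    # An (epsilon, delta)-DP mechanism forces fpr + exp(epsilon) * fnr >= 1 - delta
    # (and symmetrically), so each sampled (fpr, fnr) pair yields a lower bound on epsilon.
    bound = np.maximum((1 - delta - fpr) / fnr, (1 - delta - fnr) / fpr)
    return np.log(np.maximum(bound, 1.0))

# Example: attack outcomes on 1,000 member and 1,000 non-member samples.
samples = epsilon_posterior(fp=50, tn=950, fn=400, tp=600)
low, high = np.quantile(samples, [0.025, 0.975])
print(f"95% credible interval for epsilon: [{low:.2f}, {high:.2f}]")
```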
Magneto: A Foundation Transformer
Model architectures across language, vision, speech, and multimodal domains are converging. Despite sharing the “transformer” name, these areas use different implementations for better performance. Hongyu Wang, Shuming Ma, Shaohan Huang, Li Dong, Wenhui Wang, Zhiliang Peng, Yu Wu, Payal Bajaj, Saksham Singhal, Alon Benhaim, Barun Patra, Zhun Liu, Vishrav Chaudhary, Xia Song, and Furu Wei call for developing a foundation transformer for true general-purpose modeling, one that serves as a go-to architecture for various tasks and modalities with guaranteed training stability. This work introduces Magneto, a transformer variant, to meet that goal. The authors propose Sub-LayerNorm for good expressivity and an initialization strategy theoretically derived from DeepNet for stable scaling. Extensive experiments demonstrate its superior performance and better stability compared with the de facto transformer variants designed for various applications, including language modeling, machine translation, vision pretraining, speech recognition, and multimodal pretraining.
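To illustrate the Sub-LayerNorm idea, here is a minimal PyTorch sketch of a feed-forward sublayer with an extra LayerNorm inserted just before the output projection, in addition to the usual pre-sublayer normalization. The widths are illustrative, and the DeepNet-derived initialization the paper pairs with Sub-LN is omitted.

```python
import torch
import torch.nn as nn

class SubLNFeedForward(nn.Module):
    def __init__(self, d_model=512, d_ffn=2048):
        super().__init__()
        self.norm_in = nn.LayerNorm(d_model)   # standard normalization at the sublayer input
        self.fc1 = nn.Linear(d_model, d_ffn)
        self.norm_mid = nn.LayerNorm(d_ffn)    # extra Sub-LN before the output projection
        self.fc2 = nn.Linear(d_ffn, d_model)

    def forward(self, x):
        h = nn.functional.gelu(self.fc1(self.norm_in(x)))
        return x + self.fc2(self.norm_mid(h))  # residual connection around the sublayer

y = SubLNFeedForward()(torch.randn(2, 16, 512))  # (batch, sequence, d_model)
```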
NeuralStagger: Accelerating Physics-Constrained Neural PDE Solver with Spatial-Temporal Decomposition
Neural networks accelerate partial differential equation (PDE) solutions but need physics constraints to generalize and to reduce reliance on data. Ensuring accuracy and stability requires resolving the smallest-scale physics, which increases computational costs due to large inputs, outputs, and networks. Xinquan Huang, Wenlei Shi, Qi Meng, Yue Wang, Xiaotian Gao, Jia Zhang, and Tie-Yan Liu propose an acceleration methodology, NeuralStagger, which spatially and temporally decomposes the original learning task into several coarser-resolution subtasks. They define a coarse-resolution neural solver for each subtask, requiring fewer computational resources, and jointly train them with a physics-constrained loss. The solution is obtained quickly thanks to perfect parallelism, while the trained solvers provide the flexibility to simulate at various resolutions.
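The spatial part of the decomposition can be pictured with a few lines of numpy (a toy sketch; the physics-constrained loss and the temporal staggering are omitted): a fine 2-D field is split into interleaved coarse sub-fields by strided slicing, each of which would be handled by its own coarse-resolution solver in parallel and then reassembled.

```python
import numpy as np

def decompose(field, s=2):
    """Split an (H, W) field into s*s interleaved coarse sub-fields of shape (H//s, W//s)."""
    return [field[i::s, j::s] for i in range(s) for j in range(s)]

def reassemble(subfields, s=2):
    """Inverse of decompose: interleave the coarse sub-fields back onto the fine grid."""
    h, w = subfields[0].shape
    field = np.empty((h * s, w * s), dtype=subfields[0].dtype)
    for idx, sub in enumerate(subfields):
        i, j = divmod(idx, s)
        field[i::s, j::s] = sub
    return field

fine = np.random.rand(64, 64)            # toy fine-resolution field
coarse_inputs = decompose(fine, s=2)     # four 32x32 subtasks
# Each coarse solver would advance its sub-field one step; identity stands in here.
coarse_outputs = [sub for sub in coarse_inputs]
assert np.allclose(reassemble(coarse_outputs, s=2), fine)
```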
Streaming Active Learning with Deep Neural Networks
Active learning is perhaps most naturally posed as an online learning problem. However, prior active learning approaches with deep neural networks assume offline access to the entire dataset ahead of time. Akanksha Saran, Safoora Yousefi, Akshay Krishnamurthy, John Langford, and Jordan Ash propose VeSSAL, a new algorithm for batch active learning with deep neural networks in streaming settings, which samples groups of points to query for labels at the moment they are encountered. The approach trades off between the uncertainty and diversity of queried samples to match a desired query rate without requiring any hand-tuned hyperparameters. This paper expands the applicability of deep neural networks to realistic active learning scenarios, such as applications relevant to HCI and settings with large, fractured datasets.
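The streaming behavior can be sketched as follows (a simplified illustration in the spirit of the approach, not the authors’ exact estimator): a running covariance of the embeddings summarizes what has already been queried, each arriving point is scored by how much it would expand the covered volume, and the score is rescaled on the fly so that roughly the desired fraction of the stream is selected.

```python
import numpy as np

class StreamingSampler:
    def __init__(self, dim, query_rate=0.1, reg=1.0):
        self.cov = reg * np.eye(dim)   # regularized running covariance of queried embeddings
        self.query_rate = query_rate
        self.score_sum = 0.0
        self.count = 0
        self.rng = np.random.default_rng(0)

    def observe(self, embedding):
        """Return True if this point should be queried for a label."""
        score = embedding @ np.linalg.solve(self.cov, embedding)  # uncertainty/diversity score
        self.score_sum += score
        self.count += 1
        # Rescale so that, on average, query_rate of the stream gets selected.
        prob = min(1.0, self.query_rate * score * self.count / self.score_sum)
        if self.rng.random() < prob:
            self.cov += np.outer(embedding, embedding)  # queried points update the coverage
            return True
        return False

sampler = StreamingSampler(dim=16)
stream = np.random.randn(1000, 16)   # toy stream of embeddings
queried = sum(sampler.observe(x) for x in stream)
print(f"queried {queried} of {len(stream)} points")
```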
For the complete list of accepted publications by Microsoft researchers, please see the publication list on the Microsoft at ICML 2023 page.