{"id":1109289,"date":"2024-12-09T10:37:27","date_gmt":"2024-12-09T18:37:27","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-story&p=1109289"},"modified":"2025-08-28T20:47:46","modified_gmt":"2025-08-29T03:47:46","slug":"microsoft-at-neurips-2024-advancing-ai-research-across-domains","status":"publish","type":"msr-story","link":"https:\/\/www.microsoft.com\/en-us\/research\/story\/microsoft-at-neurips-2024-advancing-ai-research-across-domains\/","title":{"rendered":"Microsoft at NeurIPS 2024: Advancing AI research across domains"},"content":{"rendered":"\n
<\/span>
<\/div>
\n
\n
<\/div>\n\n\n\n

Microsoft at NeurIPS 2024: Advancing AI research across domains<\/h1>\n\n\n\n
<\/div>\n<\/div>\n<\/div><\/div>\n\n\n\n
\n
\n\t\n\t
\n\t\t
\n\t\t\t
<\/div>\t\t<\/div>\n\t<\/div>\n\n\t<\/div>\n\n\n\n
\n
<\/div>\n\n\n\n
\n
<\/div>\n\n\n\n

Microsoft is proud to sponsor the 38th Conference on Neural Information Processing Systems (NeurIPS 2024), a leading global forum for machine learning and AI.<\/h3>\n\n\n\n
\n
Microsoft at NeurIPS AI experience<\/a><\/div>\n\n\n\n
Microsoft at NeurIPS 2024<\/a><\/div>\n<\/div>\n\n\n\n

The event gathers researchers, industry leaders, and practitioners to exchange ideas, address challenges, and advance innovations to shape the future of AI. Lidong Zhou<\/a>, managing director of Microsoft Research Asia<\/a>, will be one of this year\u2019s keynote speakers.<\/p>\n\n\n\n

More than 100 papers by Microsoft researchers and collaborators have been accepted at NeurIPS 2024, including five oral presentations and 19 spotlight sessions. While these research projects cover a broad range of topics, a shared theme ties them together: advancing the efficiency, scalability, and robustness of machine learning models while addressing real-world challenges like human-centric interaction and cultural considerations.<\/p>\n\n\n\n

\n

Visit us at Booth #445<\/a><\/p>\n<\/blockquote>\n<\/div>\n\n\n\n

<\/div>\n<\/div>\n\n\n
\n

NeurIPS oral presentations<\/h2>\n<\/div>\n\n\n
\n
\n
\n
\"illustration<\/figure>\n<\/div>\n\n\n\n
\n

Not All Tokens Are What You Need for Pretraining<\/h2>\n\n\n\n

Recipient of “Best Paper Runner Up Award”<\/strong>
Yeyun Gong<\/em><\/a>, <\/em>Xiao Liu<\/em><\/a>, Yelong Shen, Ruochen Xu, Jian Jiao, Nan Duan, <\/em>Weizhu Chen<\/em><\/a>

<\/a>Rho-1 is a new language model that uses selective language modeling. Unlike traditional language models that predict every next token, Rho-1 selectively trains on tokens aligned with the desired distribution. This involves scoring pretraining tokens using a reference model and then training the language model with a focused loss on tokens with higher scores.<\/p>\n\n\n\n
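The scoring-and-training step described above can be sketched in a few lines. This is a minimal, illustrative sketch of selective language modeling, assuming per-token losses from the training and reference models are already available; `selective_lm_loss` and its arguments are hypothetical names, not the authors' implementation.

```python
# Illustrative sketch of selective language modeling (SLM) as described for
# Rho-1: a reference model scores each pretraining token, and the training
# loss is focused on the tokens whose excess loss (training loss minus
# reference loss) is highest. Names and the keep ratio are assumptions.

def selective_lm_loss(train_token_losses, ref_token_losses, keep_ratio=0.6):
    """Average the training loss over only the top-`keep_ratio` fraction
    of tokens, ranked by excess loss (train - reference)."""
    excess = [t - r for t, r in zip(train_token_losses, ref_token_losses)]
    k = max(1, int(len(excess) * keep_ratio))
    # Indices of the k tokens with the largest excess loss.
    selected = sorted(range(len(excess)), key=lambda i: excess[i], reverse=True)[:k]
    return sum(train_token_losses[i] for i in selected) / k

# Tokens the model already predicts as well as the reference contribute
# nothing; tokens with large excess loss dominate the objective.
train = [2.0, 0.5, 3.0, 0.4]
ref = [0.5, 0.4, 0.5, 0.5]
loss = selective_lm_loss(train, ref, keep_ratio=0.5)  # keeps tokens 2 and 0
```

The design point is that the gradient signal concentrates on tokens aligned with the desired distribution rather than being spread uniformly over every next-token prediction.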

\n
Read the paper<\/a><\/div>\n\n\n\n
Listen to the podcast<\/a><\/div>\n<\/div>\n<\/div>\n<\/div>\n\n\n\n
<\/div>\n\n\n\n
\n
\n

<\/h2>\n<\/div>\n\n\n

<\/p>\n<\/div>\n\n\n\n

<\/div>\n\n\n\n
\n
\n
\n
\n

Reinforcement Learning Under Latent Dynamics: Toward Statistical and Algorithmic Modularity<\/h4>\n\n\n\n

Philip Amortila, Dylan J. Foster<\/a>, Nan Jiang, Akshay Krishnamurthy<\/a>, Zakaria Mhammedi
<\/em>
This research investigates reinforcement learning under general latent dynamics, demonstrating that traditional function approximation becomes intractable with rich observations unless latent pushforward coverability is present. The authors also developed efficient reductions to adapt latent Markov decision process (MDP) algorithms for complex observations, providing a foundation for a unified statistical and algorithmic theory for reinforcement learning under latent dynamics.<\/p>\n\n\n\n

\n
Read the paper<\/a><\/div>\n\n\n\n
Listen to the podcast<\/a><\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n\n\n\n
\n
\n
\n

CVQA: Culturally-diverse Multilingual Visual Question Answering Benchmark<\/h4>\n\n\n\n

David Romero, Chenyang Lyu, Haryo Akbarianto Wibowo, Teresa Lynn, Injy Hamed, Aditya Nanda Kishore, Aishik Mandal, Pranjal Chitale<\/span><\/a>, et al.<\/em>

CVQA is a culturally diverse, multilingual visual question-answering benchmark that involves native speakers and cultural experts in the data collection process. It includes culturally driven images and questions from 30 countries across four continents, covering 31 languages and 13 scripts, and provides a total of 10k questions. While it is a challenging benchmark for current state-of-the-art multimodal large language models (MLLMs), it is also a tool to assess cultural capabilities and biases in these models.<\/p>\n\n\n\n

\n
Read the paper<\/a><\/div>\n\n\n\n
Listen to the podcast<\/a><\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n\n\n\n
\n
\n
\n

VASA-1: Lifelike Audio-Driven Talking Faces Generated in Real Time<\/h4>\n\n\n\n

Sicheng Xu, Guojun Chen<\/a>, Yu-Xiao Guo, Jiaolong Yang<\/a>, Chong Li, Zhenyu Zang, Yizhong Zhang<\/a>, Xin Tong, Baining Guo<\/a><\/em>

VASA is a framework for generating lifelike talking faces with visual affective skills (VAS) from a static image and an audio clip. The premier model, VASA-1, synchronizes lip movements with speech while capturing facial nuances and natural head motions, enabled by a holistic facial dynamics and head movement generation model and an expressive face latent space built from video data.<\/p>\n\n\n\n

\n
Read the paper<\/a><\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n\n\n\n
\n
\n
\n

You Only Cache Once: Decoder-Decoder Architectures for Language Models<\/h4>\n\n\n\n

Yutao Sun, Li Dong<\/a>, Yi Zhu<\/a>, Shaohan Huang<\/a>, Wenhui Wang, Shuming Ma, Quanlu Zhang, Jianyong Wang, Furu Wei<\/a><\/em>

You only cache once (YOCO) is a decoder-decoder architecture for LLMs that reduces GPU memory usage by caching key-value pairs only once, while retaining global attention. A self-decoder encodes key-value caches that are reused by a cross-decoder that leverages cross-attention. This enables YOCO to speed up the prefill stage through a computation flow that allows early exit without altering the final output.<\/p>\n\n\n\n
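A toy sketch of the caching idea follows, assuming a single key-value cache produced once by the self-decoder and reused by every cross-attention layer; the shapes, layer count, and random projections are illustrative stand-ins, not the paper's architecture.

```python
import numpy as np

# Illustrative sketch of the YOCO idea: a "self-decoder" produces one shared
# key-value cache, and every "cross-decoder" layer reuses that single cache
# via cross-attention, so KV memory stays O(1) in depth rather than growing
# with the number of layers.

rng = np.random.default_rng(0)
d, seq = 8, 5

def attention(q, k, v):
    scores = q @ k.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

x = rng.normal(size=(seq, d))

# Self-decoder: encode the sequence and cache K/V exactly once.
shared_k = x @ rng.normal(size=(d, d))
shared_v = x @ rng.normal(size=(d, d))

# Cross-decoder: every layer cross-attends to the same cached K/V
# instead of maintaining its own per-layer cache.
h = x
for _ in range(4):
    q = h @ rng.normal(size=(d, d))
    h = h + attention(q, shared_k, shared_v)
```

In a per-layer-cache transformer, the loop above would store a fresh `(k, v)` pair per layer; reusing one cache is what reduces GPU memory and enables the early-exit prefill computation flow the summary describes.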

\n
Read the paper<\/a><\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n\n\n\n
<\/div>\n\n\n\n
\n
\n
<\/div>\n\n\n\n
\n
<\/div>\n\n\n\n
\n\n\n
\n

NeurIPS spotlight sessions<\/h3>\n<\/div>\n\n\n

ERBench: An Entity-Relationship based Automatically Verifiable Hallucination Benchmark for Large Language Models<\/strong><\/a>
Jio Oh, Soyeon Kim, Junseok Seo,
Jindong Wang<\/a>, Ruochen Xu, Xing Xie<\/a>, Steven Euijong Whang<\/em>
To thoroughly analyze LLMs, the authors propose ERBench, which automatically converts any relational database into a benchmark based on the entity-relationship model.<\/p>\n\n\n\n

A Study of Plasticity Loss in On-Policy Deep Reinforcement Learning<\/a><\/strong>
Arthur Juliani,
Jordan Ash<\/a><\/em>
The authors conduct extensive experiments on plasticity loss in on-policy deep reinforcement learning and on various mitigation methods.<\/p>\n\n\n\n

Advancing Spiking Neural Networks for Sequential Modeling through Central Pattern Generators<\/a><\/strong>
Changze Lv, <\/em>
Dongqi Han<\/em><\/a>, <\/em>Yansen Wang<\/em><\/a>, Xiaoqing Zheng, Xuanjing Huang, <\/em>Dongsheng Li<\/em><\/a><\/em>
CPG-PE is a novel positional encoding (PE) technique for spiking neural networks inspired by central pattern generators in the human brain.<\/p>\n\n\n\n

Assouad, Fano, and Le Cam with Interaction: A Unifying Lower Bound Framework and Characterization for Bandit Learnability<\/strong>
<\/a>Fan Chen,
Dylan J. Foster<\/a>, Yanjun Han, Jian Qian, Alexander Rakhlin, Yunbei Xu<\/em>
The authors develop a unified framework for lower-bound methods in statistical estimation and interactive decision making, providing a single view of these previously distinct methodologies.<\/p>\n\n\n\n

BPQP: A Differentiable Convex Optimization Framework for Efficient End-to-End Learning<\/strong><\/a>
Xiao Yang<\/em><\/a>, Xu Yang, <\/em>Weiqing Liu<\/em><\/a>, Lewen Wang, <\/em>Jiang Bian<\/em><\/a>
To enhance efficiency, the authors reformulate the backward pass as a simplified and decoupled quadratic programming problem by leveraging the structural properties of the Karush\u2013Kuhn\u2013Tucker (KKT) matrix.<\/p>\n\n\n\n

Compositional Generalization Across Distributional Shifts with Sparse Tree Operations<\/strong><\/a>
Paul Smolensky<\/em><\/a>, <\/em>Jianfeng Gao<\/em><\/a>, <\/em>Roland Fernandez<\/em><\/a>
This work investigates a unified neurosymbolic system in which transformations in the network can be interpreted as symbolic and neural computation simultaneously, extending a unified neurosymbolic architecture with sparse tree operations.<\/p>\n\n\n\n

Dataset and Lessons Learned from the 2024 SaTML LLM Capture-the-Flag Competition<\/strong><\/a>
Edoardo Debenedetti, Javier Rando, Daniel Paleka, Fineas Silaghi, Dragos Albastroiu, Niv Cohen, Yuval Lemberg, <\/em>
Reshmi Ghosh<\/em><\/a>, Ahmed Salem, Rui Wen, <\/em>Giovanni Cherubin<\/em><\/a>, <\/em>Santiago Zanella-B\u00e9guelin<\/em><\/a>, Robin Schmid, Victor Klemm, Takahiro Miki, Chenhao Li, Stefan Kraft, Mario Fritz, Florian Tramer, Sahar Abdelnabi, Lea Sch\u00f6nherr<\/em>
This report summarizes insights from a capture-the-flag competition at IEEE SaTML 2024, which highlighted the challenges in defending large language model systems against malicious message attacks.<\/p>\n\n\n\n

Diffusion for World Modeling: Visual Details Matter in Atari<\/strong><\/a>
Eloi Alonso, Adam Jelley, Vincent Micheli, Anssi Kanervisto, Amos Storkey, <\/em>
Tim Pearce<\/em><\/span><\/a>, Fran\u00e7ois Fleuret<\/em>
This work presents DIAMOND (diffusion as a model of environment dreams), an open-source reinforcement learning agent trained in a diffusion world model.<\/p>\n\n\n\n

DISCOVERYWORLD: A Virtual Environment for Developing and Evaluating Automated Scientific Discovery Agents<\/strong><\/a>
Peter Alexander Jansen, <\/em>
Marc-Alexandre C\u00f4t\u00e9<\/em><\/em><\/a>, Tushar Khot, Erin Bransom, Bhavana Dalvi, Bodhisattwa Prasad Majumder, Oyvind Tafjord, Peter Clark<\/em>
DISCOVERYWORLD is an open-source virtual environment for developing and benchmarking an agent’s ability to perform complete scientific discovery cycles, with 120 challenging tasks spanning diverse topics.<\/p>\n\n\n\n

Efficient Adversarial Training in LLMs with Continuous Attacks<\/strong>
<\/a>Sophie Xhonneux, <\/em>
Alessandro Sordoni<\/em><\/a>, Stephan G\u00fcnnemann, Gauthier Gidel, Leo Schwinn<\/em>
This research introduces an efficient approach to adversarial training by calculating attacks in the LLM\u2019s continuous embedding space.<\/p>\n\n\n\n

Generalized Linear Bandits with Limited Adaptivity<\/a><\/strong>
Ayush Sawarni, Nirjhar Das, Siddharth Barman,
Gaurav Sinha<\/a><\/em>
This paper studies the generalized linear contextual bandit problem under limited adaptivity and introduces two algorithms, B-GLinCB and RS-GLinCB, to address two prevalent settings.<\/p>\n\n\n\n

Human-Aware Vision-and-Language Navigation: Bridging Simulation to Reality with Dynamic Human Interactions<\/strong><\/a> 
Minghan Li, Heng Li, Zhi-Qi Cheng, Yifei Dong, Yuxuan Zhou, Jun-Yan He, <\/em>
Qi Dai<\/em><\/a>, Teruko Mitamura, Alexander G. Hauptmann <\/em>
This research introduces Human-Aware Vision-and-Language Navigation (HA-VLN), extending traditional VLN by incorporating dynamic human activities and relaxing key assumptions.<\/p>\n\n\n\n

Identifying Equivalent Training Dynamics<\/strong>
<\/a>William T. Redman, <\/em>
Juan M. Bello-Rivas<\/em><\/span><\/a>, M. Fonoberova, Ryan Mohr, I. Kevrekidis, Igor Mezi\u0107<\/em>
Using advances in Koopman operator theory, the authors developed a framework for identifying conjugate and nonconjugate training dynamics.<\/p>\n\n\n\n

Implicit Curriculum in Procgen Made Explicit<\/strong>
<\/a>
Kaixin Wang<\/em><\/span><\/a>, Xinchao Wang<\/em>
This work investigates the learning process under multi-level training in Procgen, revealing a gradual shift from easy to hard contexts that suggests an implicit curriculum.<\/p>\n\n\n\n

Is Behavior Cloning All You Need? Understanding Horizon in Imitation Learning<\/strong>
<\/a>
Dylan J. Foster<\/em><\/a>, <\/em>Adam Block<\/em><\/a>, Dipendra Misra<\/em>
The authors show they can achieve horizon-independent sample complexity in offline imitation learning when the range of the cumulative payoffs and an appropriate notion of supervised learning complexity for the policy class are controlled.<\/p>\n\n\n\n

MInference: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention<\/a><\/strong>
Huiqiang Jiang<\/a>, Yucheng Li, Chengruidong Zhang<\/a>, Qianhui Wu<\/a>, Xufang Luo<\/a>, Surin Ahn, Zhenhua Han, Amir H. Abdi, Dongsheng Li<\/a>, Chin-Yew Lin<\/a>, Yuqing Yang<\/a>, Lili Qiu<\/a><\/em>
MInference is a sparse computation method designed to accelerate the pre-filling stage of long-sequence processing. It identifies three unique patterns in long-context attention matrices that can be exploited for efficient sparse computation on GPUs.<\/p>\n\n\n\n
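As a rough illustration of how a sparse attention pattern cuts pre-fill work, here is a hypothetical mask combining a causal local window with a "vertical" global column; the real method's dynamic pattern identification and GPU kernels are far more sophisticated than this sketch.

```python
# Hypothetical sparse causal attention mask: each query attends only to a
# local window of recent keys plus a few chosen "vertical" columns, so the
# number of computed entries is far below the dense causal count.

def sparse_attention_mask(n, window=2, vertical_cols=(0,)):
    """Boolean causal mask keeping a local window and chosen global columns."""
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            causal = j <= i
            local = i - j < window
            vertical = j in vertical_cols
            mask[i][j] = causal and (local or vertical)
    return mask

mask = sparse_attention_mask(6, window=2, vertical_cols=(0,))
kept = sum(sum(row) for row in mask)  # attention entries actually computed
full = 6 * 7 // 2                     # dense causal entries for n = 6
```

For longer sequences the gap widens: the local window grows linearly in sequence length while the dense causal matrix grows quadratically, which is where the pre-fill speedup comes from.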

The Power of Resets in Online Reinforcement Learning<\/strong>
<\/a>Zakaria Mhammedi,
Dylan J. Foster<\/a>, Alexander Rakhlin<\/em>
This study explores the potential of simulators through reinforcement learning with local simulator access, an RL protocol where the agent is allowed to reset to previously observed states and follow their dynamics during training.<\/p>\n\n\n\n

VideoGUI: A Benchmark for GUI Automation from Instructional Videos<\/strong><\/a>
Kevin Qinghong Lin, <\/em>
Linjie Li<\/em><\/a>, Difei Gao, Qinchen Wu, Mingyi Yan, Zhengyuan Yang, <\/em>Lijuan Wang<\/em><\/a>, Mike Zheng Shou<\/em>
This research introduces VideoGUI, a novel multi-modal benchmark designed to evaluate GUI assistants on visual-centric GUI tasks through a hierarchical process, allowing for identification of the specific levels at which they may fail.<\/p>\n\n\n\n

Voila-A: Aligning Vision-Language Models with User\u2019s Gaze Attention<\/strong>
<\/a>Kun Yan, <\/em>
Lei Ji<\/em><\/a>, Zeyu Wang, Yuntao Wang, Nan Duan, Shuai Ma<\/em>
The authors introduce gaze information, feasibly collected by AR or VR devices, and propose a novel approach for gaze alignment to enhance the interpretability and effectiveness of these models in real-world applications.<\/p>\n\n\n\n

\n

Explore Microsoft’s 100+ accepted papers<\/a><\/strong><\/p>\n<\/blockquote>\n\n\n\n


\n\n\n
\n

Microsoft at ML4H 2024<\/h3>\n<\/div>\n\n\n

Co-located with NeurIPS is the AHLI Machine Learning for Health (ML4H) Symposium<\/a>, an event that unites machine learning researchers, clinicians, and healthcare data experts to advance AI applications in healthcare. Microsoft’s contribution of four papers to this symposium underscores its commitment to improving medical imaging and clinical workflows through AI, focusing on accuracy, efficiency, and interpretability.<\/p>\n\n\n\n

\n
Accepted papers<\/a><\/div>\n<\/div>\n\n\n\n
\n<\/div>\n\n\n\n
<\/div>\n<\/div>\n\n\n\n
<\/div>\n\n\n\n
\n
\n

Other resources<\/h3>\n<\/div>\n\n\n\n
\n
\n
\n

NeurIPS 2024 booth schedule<\/a><\/p>\n\n\n\n

NeurIPS 2024 career opportunities<\/a><\/p>\n\n\n\n

ML4H 2024 accepted papers<\/a><\/p>\n<\/div>\n\n\n\n

\n

Microsoft Research Podcast<\/a><\/p>\n\n\n\n

Microsoft Research Blog<\/a><\/p>\n\n\n\n

Microsoft Research Forum series<\/a><\/p>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n\n\n\n

<\/div>\n<\/article>\n","protected":false},"excerpt":{"rendered":"

We\u2019re excited to be a part of #NeurIPS2024! Explore the future of AI with over 100 groundbreaking papers, including oral and spotlight sessions, on reinforcement learning, advanced language model training, and multilingual, culturally inclusive benchmarks.<\/p>\n","protected":false},"featured_media":1110402,"template":"","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","footnotes":""},"research-area":[13561,13556,13562,13551,13554],"msr-locale":[268875],"msr-post-option":[],"class_list":["post-1109289","msr-story","type-msr-story","status-publish","has-post-thumbnail","hentry","msr-research-area-algorithms","msr-research-area-artificial-intelligence","msr-research-area-computer-vision","msr-research-area-graphics-and-multimedia","msr-research-area-human-computer-interaction","msr-locale-en_us"],"related-researchers":[],"related-publications":[],"related-downloads":[],"related-videos":[],"related-projects":[],"related-groups":[],"related-events":[],"related-posts":[],"msr_impact_theme":[],"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-story\/1109289","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-story"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/msr-story"}],"version-history":[{"count":89,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-story\/1109289\/revisions"}],"predecessor-version":[{"id":1111377,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-story\/1109289\/revisions\/1111377"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media\/1110402"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=1109289"}],"wp:term":[{"t
axonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=1109289"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=1109289"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=1109289"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}