Research Forum Brief | February 2025

LLMs for safe low-level programming

Presented by Aseem Rastogi and Pantazis Deligiannis at Microsoft Research Forum, February 2025


“We created a tool called RustAssistant that leverages the power of state-of-the-art LLMs to help developers by automatically suggesting fixes for Rust compilation errors.”

Pantazis Deligiannis, Principal Research Engineer, Microsoft Research FoSSE (Future of Scalable Software Engineering)

Transcript: Lightning Talk

LLMs for safe low-level programming

Aseem Rastogi, Principal Researcher, Microsoft Research FoSSE (Future of Scalable Software Engineering)
Pantazis Deligiannis, Principal Research Engineer, Microsoft Research FoSSE (Future of Scalable Software Engineering)

This talk covers two technical results from ICSE 2025 on using large language models (LLMs) for safe low-level programming. The results demonstrate LLMs inferring machine-checkable memory safety invariants in legacy C code and how LLMs assist in fixing compilation errors in Rust codebases.

Microsoft Research Forum, February 25, 2025

FRIEDERIKE NIEDTNER, Principal Technical Research Program Manager, Microsoft Research AI Frontiers: The following talk combines two projects that both harness the LLM’s capabilities to understand and produce code. Both aim to help developers tackle the difficulties of safe low-level programming. One to ensure memory safety in legacy C code; the other presents RustAssistant, a tool for developers to automatically fix compilation errors in Rust. 

ASEEM RASTOGI: Hi, my name is Aseem Rastogi, and I’m a researcher in the Future of Scalable Software Engineering organization in Microsoft Research. I’m going to talk to you about our paper, “LLM Assistance for Memory Safety.” This paper will be presented at the 47th International Conference on Software Engineering in May later this year.  

The lack of memory safety in low-level languages like C and C++ is one of the leading causes of software security vulnerabilities. For instance, a study by Microsoft estimated that 70% of the security bugs that Microsoft fixes and assigns a CVE every year are due to memory safety issues. Researchers have proposed safe dialects of C, for example, Checked C, that—with the help of additional source-level annotations—provide memory safety guarantees with low performance overheads. However, the cost of adding these annotations and the code restructuring required to enable them becomes a bottleneck in the adoption of these tools. In general, application of formal verification to real software faces the same challenge.  

In our paper, we explore the use of pretrained large language models to help with the task of code restructuring and inferring source annotations required to adopt Checked C. Let’s consider an example that takes an array of integers as input and sums the first n elements. To reason about the memory safety of this function, Checked C requires an annotation on p. One such annotation is as shown here. This tells the compiler that p is an array with at least n elements, which is enough to ensure the safety of memory accesses in this function. It also helps impose an explicit obligation on the callers of this function that they must pass an appropriately sized array to it. 
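To make this concrete, here is a minimal sketch of what the annotated function could look like in Checked C. This is an illustrative reconstruction of the example described above, not the exact code from the talk:

```c
// Illustrative Checked C sketch: the count(n) bound on p is the kind of
// source-level annotation described above (reconstruction, not the talk's code).
int sum(_Array_ptr<int> p : count(n), int n) {
    int total = 0;
    for (int i = 0; i < n; i++) {
        total += p[i];  // the compiler can now prove this access stays in bounds
    }
    return total;
}
```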

Our goal is to infer such annotations with the help of LLMs. For this problem, LLMs seem like a perfect match. It is hard to encode reasoning about real-world code and complex code patterns in symbolic tools. LLMs, on the other hand, have demonstrated tremendous code comprehension and reasoning capabilities similar to what programmers have, even for real-world code. Second, LLM hallucinations might lead to incorrect annotations, but they cannot compromise memory safety. Once the annotations are added to the code, the Checked C compiler guarantees memory safety even when the annotations are incorrect. This way, we get the best of both worlds!

However, working with LLMs for whole program transformations in large codebases represents another challenge. We need to break the task into smaller subtasks that can fit into LLM prompts while adding relevant symbolic context to each prompt. Put another way, in order for LLMs to be able to reason like programmers, we need to provide them context that a programmer would otherwise consider. Our paper presents a framework for doing just that with the help of program dependence graphs working in tandem with LLMs. We implement our ideas in a tool called MSA and evaluate it on real-world codebases ranging up to 20,000 lines of code. We observe that MSA can infer 86% of the annotations that state-of-the-art symbolic tools cannot. Although our paper focuses on memory safety, our methodology is more general and can be used to effectively leverage LLMs for scaling the use of formal verification to real software—most importantly, doing so without compromising on the soundness guarantees. We are really excited about this research direction. 

Up next, my colleague Pantazis will tell you about how we are leveraging LLMs to make it easier for the programmers to adopt Rust. Thank you.

PANTAZIS DELIGIANNIS: Hello, everyone. I’m Pantazis, and today I will be presenting our work on leveraging the power of large language models for safe low-level programming. Specifically, I will focus on our recent paper about RustAssistant, which is a tool that uses LLMs to automatically fix compilation errors in code written in Rust. This work was done together with other individuals who are listed on the screen and will appear at the International Conference on Software Engineering later this spring.

OK, let’s dive in! Why do we care about safe low-level programming with Rust? So the Rust programming language, with its memory and concurrency safety guarantees, has established itself as a viable choice for building low-level software systems over the traditional, unsafe alternatives like C and C++. These guarantees come from a strong ownership-based type system, which enforces memory and concurrency safety at compile time. However, Rust poses a steep learning curve for developers, especially when they encounter compilation errors related to advanced language features such as ownership, lifetimes, or traits. At the same time, Rust is becoming increasingly popular every year, so as more and more developers adopt Rust for writing critical software systems, it is essential to tackle the difficulty of writing code in Rust.

In Microsoft Research, we created a tool called RustAssistant that leverages the power of state-of-the-art LLMs to help developers by automatically suggesting fixes for Rust compilation errors. Our tool uses a careful combination of prompting techniques as well as iteration between a large language model and the Rust compiler to deliver high-accuracy fixes. RustAssistant is able to achieve an impressive peak accuracy of roughly 74% on real-world compilation errors in popular open-source Rust repositories on GitHub.   

OK, let’s now see how RustAssistant works step by step. Let’s begin with the first step: building the code and parsing the build errors. Such errors can range from simple syntax mistakes to very complicated issues involving traits, lifetimes, or ownership rules in Rust code spread across multiple files. So when a developer writes Rust code that doesn’t compile, the Rust compiler generates detailed error messages that include the error code, the location of the error, as well as documentation and examples related to this error code. 

To illustrate this process, let’s look at this very simple example on the screen. In this case, the developer is trying to compare a custom VerbosityLevel enumeration in their code using the greater-or-equal operator. However, the Rust compiler throws an error, stating that this binary operation cannot be applied to VerbosityLevel. The compiler suggests that the reason behind this error is because VerbosityLevel does not implement a trait that is required for performing such comparisons in Rust. This detailed error message is precisely what RustAssistant captures at this step, preparing it for the next stage of processing.  
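As a rough reconstruction of the on-screen example (the exact code from the talk is not reproduced here), the failing code could look like this:

```rust
#[derive(Debug)]
enum VerbosityLevel {
    Quiet,
    Normal,
    Verbose,
}

fn log_error(msg: &str, level: VerbosityLevel) {
    // error[E0369]: binary operation `>=` cannot be applied to type `VerbosityLevel`
    if level >= VerbosityLevel::Normal {
        eprintln!("{msg}");
    }
}
```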

At the next step, RustAssistant takes this detailed error information that is generated in the previous step and focuses on extracting the specific parts of the code that are directly relevant to this error. Looking at the example on the screen, the code snippets related to the enumeration and its use in the log_error function are automatically extracted by our tool. This includes not only the problematic line of code but also other code snippets that provide necessary context for understanding and resolving the error. The tool also captures the error details, such as the error code and the accompanying compiler suggestion about the missing trait for performing the comparison. These extracted code snippets and error details are then packaged into a prompt for the LLM. This ensures that the LLM receives only the essential information required to suggest an accurate fix without being overwhelmed by irrelevant parts of the codebase. This careful localization step is crucial for both efficiency and accuracy, especially when dealing with very large codebases.  

Now let’s move to the last step. Here, RustAssistant sends the carefully localized prompt, which includes the error details and the relevant code snippets, to the large language model API. The LLM generates a proposed fix, formatted as a code diff—in other words, it does not include the entire code snippet for efficiency but only the new, edited, or deleted code lines. For example, in the case of our build error, the LLM suggests adding the missing traits to the enumeration, as shown here on the screen. This fix ensures that the comparison using the greater-or-equal operator will now work as intended. Next, RustAssistant parses this suggested fix and applies the changes to the appropriate file in the codebase. Once the fixes are applied, our tool runs the Rust compiler again to verify whether the build error has been resolved. If the code compiles, then great news! The process is now complete, and we can do further validations like running any unit tests.
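For the example above, the suggested fix would amount to deriving the missing comparison traits on the enumeration, along these lines (again a sketch consistent with the described fix, not the verbatim diff):

```rust
// The comparison operators require PartialOrd (with Ord giving a total order);
// for a fieldless enum, deriving these compares variants by declaration order.
#[derive(Debug, PartialEq, Eq, PartialOrd, Ord)]
enum VerbosityLevel {
    Quiet,
    Normal,
    Verbose,
}
```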

However, if new errors appear or if the fix doesn’t fully resolve the issue, RustAssistant sends the updated context back to the LLM, iterating until the code compiles error free. And this iterative process allows our tool to handle complex, multi-step fixes while ensuring correctness and alignment with the developer’s intent. Of course, the example that I showed here is a very simple one, but you can imagine the tool being able to fix much more complicated build errors. 
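Here is a minimal sketch of that iterate-until-clean loop as a Python driver around the Rust compiler. The helpers make_prompt, call_llm, and apply_diff are hypothetical stand-ins for RustAssistant's prompt construction, LLM call, and diff application:

```python
import json
import subprocess

def fix_build_loop(repo_path: str, max_iters: int = 10) -> bool:
    """Sketch of the build -> localize -> prompt -> patch loop described above."""
    for _ in range(max_iters):
        build = subprocess.run(
            ["cargo", "check", "--message-format=json"],
            cwd=repo_path, capture_output=True, text=True,
        )
        errors = [
            msg
            for line in build.stdout.splitlines()
            if line.startswith("{")
            and (msg := json.loads(line)).get("reason") == "compiler-message"
            and msg["message"]["level"] == "error"
        ]
        if not errors:
            return True  # clean build; unit tests could run next
        prompt = make_prompt(errors, repo_path)  # hypothetical: error details + localized snippets
        diff = call_llm(prompt)                  # hypothetical: LLM returns a code diff
        apply_diff(repo_path, diff)              # hypothetical: patch the affected files
    return False
```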

To summarize, I presented a quick walkthrough of how RustAssistant can be used to help developers automatically fix build errors in their Rust codebases. In our paper, we evaluated RustAssistant on the top hundred Rust repositories on GitHub and showed that it can achieve an impressive peak accuracy of roughly 74% on real-world compilation errors. We invite you to read our ICSE paper as it not only discusses the evaluation results in detail but also dives into interesting technical details, such as how we designed our prompts as well as various techniques that we developed for scaling RustAssistant on very large codebases without losing accuracy. 

Thank you for listening. 

AutoGen v0.4: Reimagining the foundation of agentic AI for scale, extensibility, and robustness

Presented by Gagan Bansal at Microsoft Research Forum, February 2025


“When we released AutoGen, one of the first things that the developers absolutely loved about it was its simplicity and the many pre-built agents and teams that it provided, such as the user proxy agent and the assistant agent, and the group chat between multiple agents. With the AutoGen AgentChat layer, we are maintaining these features and adding tons of more essential features such as streaming support, serialization, state management and memory for agents, and finally full type support for a better development experience.”

Gagan Bansal, Senior Researcher, Microsoft Research AI Frontiers

Transcript: Lightning Talk

AutoGen v0.4: Reimagining the foundation of agentic AI for scale, extensibility, and robustness

Gagan Bansal, Senior Researcher, Microsoft Research AI Frontiers

This talk introduces a transformative update to the AutoGen framework that builds on user feedback and redefines modularity, stability, and flexibility to empower the next generation of agentic AI research and applications.

Microsoft Research Forum, February 25, 2025

FRIEDERIKE NIEDTNER, Principal Technical Research Program Manager, Microsoft Research AI Frontiers: The following talk invites us to follow the journey of AutoGen from a leading open-source framework for multi-agent applications to a complete redesign that lays the foundation for the future of agentic AI research and applications with the release of AutoGen 0.4. The framework’s new layered architecture provides flexibility and scalability and includes an ecosystem of extensions and applications, some created by the same team, such as Magentic-One, a team of generalist agents, and AutoGen Studio, a low-code developer tool. AutoGen 0.4 is also a story about collaboration between MSR, partners within Microsoft, and a vibrant open-source community.

GAGAN BANSAL: Hi, I am Gagan Bansal and I am a researcher at Microsoft Research AI Frontiers. And today I’ll talk about some exciting technical updates to AutoGen, a leading open-source framework for agentic AI. And although I am presenting, this is joint work with many incredible colleagues and interns at Microsoft over the last year.

AutoGen is a leading open-source framework for multi-agent applications that we released in fall 2023. It enables developers and researchers to create intelligent applications using large language models, tool use, and multi-agent collaboration patterns. With AutoGen, our goal has been to lead the innovation in agentic AI research. When we first launched AutoGen in Fall 2023, it quickly became the leading open-source framework for agentic AI, and it continues to empower developers and researchers in many, many domains, including business process automation, marketing, finance, security, and others. 

Since AutoGen’s launch, we’ve not just been maintaining it. We’ve been listening closely to feedback from developers and researchers, and in this rapidly evolving landscape of AI progress, their expectations were high. Users told us that they needed greater modularity and the ability to reuse agents seamlessly. They also asked for better support for debugging and scaling their agentic solutions. And finally, there were many asks to enhance the code quality and maturity of the platform.

Pursuing these needs required us to question our assumptions and even possibly reimagine the platform. So, in early 2024, we used these learnings to experiment with alternate architectures, and we ended up adopting an actor model for multi-agent orchestration. The actor model is a well-known programming model for concurrent programming and distributed systems. Here, actors are the computational building blocks that can exchange messages and also perform work. In Fall 2024, we announced a preview of this version and this new year, we’re thrilled to announce a full release. In summary, AutoGen v0.4 is our response to address our users’ feedback in this evolving landscape of AI research. AutoGen is now not just a framework, but it’s a whole ecosystem for agentic AI. It provides you with a framework that lets you build sophisticated agents and multi-agent applications, and it also provides you with developer tools and many well-defined applications.

Let me first tell you about the AutoGen framework. At the heart of this release is a layered architecture that is designed for flexibility and scalability. At the base is AutoGen Core. This layer implements the actor model for agents. Building on core is AutoGen AgentChat. This layer provides a simple and easy to use API that is perfect for rapid prototyping. And building on Core and AgentChat is Extensions. 

This layer provides advanced clients, agents and teams, and integrations with third party software. This layered architecture is nice because whether you are an advanced developer or a researcher prototyping new ideas, AutoGen provides you with the tools you need for your project’s stage of development. The Core implements an actor model for agentic AI. At the highest level, this implementation provides two key features. 

The first is asynchronous message exchange between agents. It does so by providing a runtime, and then it also provides event-driven agents that perform computations in response to these messages. There are several implications of this design, and one of them is that it decouples how the messages are delivered between the agents from how the agents handle them. This naturally improves the modularity and scalability of agentic workflows built with AutoGen, especially for deployment. 

The Core’s event-driven architecture provides several other benefits. For example, it provides affordances to observe and control agent behavior, which is crucial for responsible development of agentic technology. It also enables running multiple agents on different processes and even implementing them using different languages. Finally, it enables developers to implement a large class of multi-agent patterns, including static and dynamic workflows. 
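To give a feel for the Core layer, here is a minimal sketch of an event-driven agent and runtime, based on the autogen-core API as I understand it; treat the exact names and signatures as assumptions and check the official documentation:

```python
import asyncio
from dataclasses import dataclass

from autogen_core import (AgentId, MessageContext, RoutedAgent,
                          SingleThreadedAgentRuntime, message_handler)

@dataclass
class Greeting:
    content: str

class Greeter(RoutedAgent):
    """An event-driven agent: it performs work only in response to messages."""
    def __init__(self) -> None:
        super().__init__("A greeter agent")

    @message_handler
    async def on_greeting(self, message: Greeting, ctx: MessageContext) -> Greeting:
        return Greeting(content=f"Hello, {message.content}!")

async def main() -> None:
    runtime = SingleThreadedAgentRuntime()  # message delivery is decoupled from handling
    await Greeter.register(runtime, "greeter", lambda: Greeter())
    runtime.start()
    reply = await runtime.send_message(Greeting("Forum"), AgentId("greeter", "default"))
    print(reply.content)
    await runtime.stop_when_idle()

asyncio.run(main())
```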

When we released AutoGen, one of the first things that the developers absolutely loved about it was its simplicity and the many pre-built agents and teams that it provided, such as the user proxy agent and the assistant agent, and the group chat between multiple agents. With the AutoGen AgentChat layer, we are maintaining these features and adding tons of more essential features such as streaming support, serialization, state management and memory for agents, and finally full type support for a better development experience.
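And here is what rapid prototyping on the AgentChat layer can look like, in a short hedged sketch (the model choice and client setup are illustrative, not prescribed):

```python
import asyncio

from autogen_agentchat.agents import AssistantAgent
from autogen_ext.models.openai import OpenAIChatCompletionClient

async def main() -> None:
    model_client = OpenAIChatCompletionClient(model="gpt-4o")  # any supported client works
    assistant = AssistantAgent("assistant", model_client=model_client)
    # run() returns the full task result; run_stream() yields messages as they
    # arrive, reflecting the streaming support mentioned above.
    result = await assistant.run(task="Summarize the actor model in one sentence.")
    print(result.messages[-1].content)

asyncio.run(main())
```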

Please check out the link below for the migration guide. Finally, the Extensions layer provides advanced runtimes, tools, clients, and ecosystem integrations that continuously expand the framework’s capabilities. In addition to the framework, this new release also provides upgrades to essential developer tools and applications built using AutoGen. And here I’ll briefly mention two of them. In late 2023, we also released AutoGen Studio, which is a low-code tool for authoring multi-agent applications.

And we are excited to announce that with version 0.4, Studio has received massive upgrades. It now supports a drag-and-drop multi-agent builder. It supports real-time updates as agents solve tasks, flow visualizations and execution controls, so that the users remain in control, and component galleries so that the community can discover and build on each other’s work. We’ve always believed that the framework should enable state-of-the-art applications for solving complex tasks with agents, which is why we’ve been building applications with the framework ourselves and using that to guide the framework’s development.

Last year, we released Magentic-One, a state-of-the-art multi-agent team for solving file- and web-related tasks built using AutoGen. And now its developer API and general capabilities, such as sophisticated orchestrators and specialized agents like the WebSurfer and the FileSurfer, are available in the AutoGen ecosystem. For us, this new ecosystem is only the beginning and sets the stage for future innovation in agentic AI.

Over the past two years, our team has made early progress in AI agents and we continue to deeply think about the changing landscape of current AI research and continue to invest in taking steps to help lead the innovation on agents. And by the way, we’re also working closely with our colleagues at Semantic Kernel, to provide an enterprise ready multi-agent runtime for AutoGen. 

Thank you for attending Microsoft Research Forum. Please check out these links to learn more about AutoGen.

Belief state transformers

Presented by John Langford at Microsoft Research Forum, February 2025


“That ability to condition on generation, rather than evaluate the generation, ends up being amazingly useful in terms of giving you a more honest valuation of the generated text.”

John Langford, Partner Research Manager, Microsoft Research AI Frontiers

Transcript: Lightning Talk

Belief state transformers

John Langford, Partner Research Manager, Microsoft Research AI Frontiers

This talk showcases a new transformer architecture that generates compact belief states for goal-conditioned planning, enhancing planning algorithms’ efficiency and effectiveness.

Microsoft Research Forum, February 25, 2025

FRIEDERIKE NIEDTNER, Principal Technical Research Program Manager, Microsoft Research AI Frontiers: Transformer models have brought us a revolution in language modeling with their capability to generate impressive language with many emergent properties. At the same time, LLMs have a number of weaknesses, one being that they are not very good at evaluating their own output. Let’s hear how the new Belief State Transformer architecture unlocks new abilities by combining a standard GPT-style architecture of a forward encoder for token prediction with an additional backward encoder.

JOHN LANGFORD: I’m John Langford. I’d like to tell you about belief state transformers, which is a new paper we have on arXiv, and which has also been accepted at ICLR [International Conference on Learning Representations]. There are many coauthors on this paper. I’d like to thank them, particularly Edward, who did much of the work here.

To start with, let’s talk about standard GPT-style transformers. In standard GPT-style transformers, you have a sequence of symbols which are going into a forward encoder, and then the forward encoder outputs some information to the output head, and then the output head predicts the final token. So, this is a straightforward approach and yet amazingly powerful. It’s kind of the key backbone behind GPT-4 and other language models.

For the purposes of research, though, we need to have something to think about, to complain about, and I’m going to complain about self-evaluation. Often these language models can’t be used to evaluate their own output too well, because the generation of the next token is done by exactly the mechanism you would use to evaluate it in that output head. So, this is kind of like grading yourself, and like grading yourself you can miss things that an independent grader would actually see pretty well. 

Right, so a belief state transformer changes the architecture. And so, it’s taking two transformers and grafting them together. One of them is going to be the standard forward encoder on the prefix. And then we’re also going to have another transformer, which is a backward encoder on the suffix. These are both going to put out some information, which goes to the output head. And the output head is going to predict the next token and the previous token. So, it’s the next token of the prefix and the previous token of the suffix. Something to worry about with these transformers is the computation. So, these are transformers obviously doing more computation. But it turns out that this “more computation” is only in a constant factor of more computation. 
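In code, the grafting can be pictured like this: a minimal PyTorch sketch, where the sizes, layer choices, and the single prefix/suffix pair per call are illustrative assumptions rather than the paper’s exact configuration:

```python
import torch
import torch.nn as nn

class BeliefStateTransformer(nn.Module):
    def __init__(self, vocab_size: int, d_model: int = 256, n_heads: int = 4, n_layers: int = 4):
        super().__init__()
        make_encoder = lambda: nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True), n_layers
        )
        self.embed = nn.Embedding(vocab_size, d_model)
        self.forward_encoder = make_encoder()   # reads the prefix left to right
        self.backward_encoder = make_encoder()  # reads the suffix right to left
        self.head = nn.Linear(2 * d_model, 2 * vocab_size)  # next + previous token logits

    def forward(self, prefix: torch.Tensor, suffix: torch.Tensor):
        causal = nn.Transformer.generate_square_subsequent_mask
        f = self.forward_encoder(self.embed(prefix),
                                 mask=causal(prefix.size(1)), is_causal=True)
        b = self.backward_encoder(self.embed(suffix.flip(1)),
                                  mask=causal(suffix.size(1)), is_causal=True)
        # Concatenate the prefix belief state with the backward suffix summary, then
        # predict the next token of the prefix and the previous token of the suffix.
        h = torch.cat([f[:, -1], b[:, -1]], dim=-1)
        next_logits, prev_logits = self.head(h).chunk(2, dim=-1)
        return next_logits, prev_logits
```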

And the key observation here is that in the forward encoder, just doing the attention, what you’re going to use in the GPT-style transformer, is already order N-squared [N²]. Every token looks at every previous token in order to figure out what information is necessary to predict the next token. In the belief state transformer, that happens twice. You have two different transformers, each with their own attention, and so you pay a factor of two.

And then, in addition, you’re going to pay because the number of times you evaluate the head, the output head, is order N-squared because there are order N-squared prefix/suffix pairs. So, there’s a constant factor increase in computation, which is problematic, but it’s not like the end of the world. You can subsample or things like that. And what you get in return is order N-squared gradients rather than order N gradients.

In a standard GPT-style transformer, you only have order N gradients because you only have order N symbols, and you get one gradient per symbol. Here you get order N-squared gradients because you have order N-squared prefix/suffix pairs. That means there’s many more ways to get information out of a sequence. And that unlocks the possibility of learning new things that were previously unlearnable. 
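A small sketch of where those order N-squared training signals come from: every (prefix, suffix) pair yields a next-token and a previous-token target (the exact pairing scheme in the paper may differ):

```python
def prefix_suffix_training_pairs(tokens):
    """Enumerate the O(N^2) (prefix, suffix) training pairs of a sequence.
    For a prefix ending at i and a suffix starting at j, the targets are the
    next token after the prefix and the previous token before the suffix."""
    n = len(tokens)
    for i in range(n - 1):             # prefix = tokens[0..i]
        for j in range(i + 2, n + 1):  # suffix = tokens[j..]; j = n means empty suffix
            yield (tokens[: i + 1], tokens[j:],
                   tokens[i + 1],      # next-token target
                   tokens[j - 1])      # previous-token target
```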

Okay, so now let’s go on to the belief state. Why are we talking about a belief state when we say belief state transformer? Well, it turns out you can prove a theorem. And this theorem says that the output of the forward encoder is a belief state for the prefix. So what that means is that the output of the forward encoder will converge to all the information necessary to predict the future. So that’s all symbols after the prefix. So, that ability to create a compact belief state is new with belief state transformers, something that previously we only really knew how to do with state space machines.

Okay, so let’s try this out. Looking at Tiny Stories. Tiny Stories is a dataset where you have a bunch of children’s stories, which are generated by GPT-4.

We’re going to feed a prefix and a suffix into our system, and it’s going to fill in the middle, which is what happens in blue. And then for a baseline, we’re going to compare the fill-in-the-middle approach to using GPT-style transformers. So the way the fill-in-the-middle approach works with GPT-style transformers is you take the prefix, and then you add the suffix, and then you just predict the tokens after that. 

So that works reasonably well. This is very commonly used. And now if we have these two different approaches the question is how do we actually value these different approaches? Which one is better? So, the way we’re going to judge this is we’re going to ask GPT-4 which is better in various ways: syntax, style, and so forth. And then we’ll ask it for a summary judgment, which is a standard technique. 

We looked at what it was doing, and it seemed very reasonable. And in doing this, we end up with the belief state transformer winning about a factor of three more often than the GPT-style transformer. So that’s huge. It’s so huge that you really want to understand why. And it seems like the key here is self-evaluation. So, under the hood, we’re actually running each of these, say 120 times, using a beam search. The code for that is on the right. So, given the beam search, you have several different possible completions. And now how do you choose which completion to actually use? Because you have to pick one of these. You’re trying to pick a completion. And for the GPT-style transformer, there’s only one way to really do this. The way is you take the next head, and you use it as a probability function, and you look at the probability of the sequence of tokens which is produced.  
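That baseline scoring rule is just the total log-probability the model’s own next-token head assigns to the completion, as in this small sketch:

```python
import torch
import torch.nn.functional as F

def sequence_logprob(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """Score a beam-search completion by its own next-token probabilities.
    logits: [T, vocab] next-token logits at each step; tokens: [T] produced tokens."""
    logp = F.log_softmax(logits, dim=-1)
    return logp.gather(-1, tokens.unsqueeze(-1)).squeeze(-1).sum()
```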

That works reasonably well. It actually does improve picking out a high-probability sequence of tokens versus a lower probability sequence of tokens. But it’s not as much as you get with the belief state transformer. And the reason why is the self-grading issue that I was talking about earlier. There’s many ways that a system could be blind to its own mistakes. With the belief state transformer, though, you have another option, because the next head can instead condition on the generated data and run over the suffix in order to value the generated data. 

So, that ability to condition on generation, rather than evaluate the generation, ends up being amazingly useful in terms of giving you a more honest valuation of the generated text. All right, so just to summarize, we have this belief state transformer. This learns a compact belief state, which is a new thing in transformers. It gives us a way to have a simple set of values, which summarize all information we need to predict the future. 

And this seems to provide a very strong form of self-evaluation, which is potentially very useful in many situations where you’re trying to use test-time compute, or even using test-time compute to further create training data. So, there’s more in the paper. There are some other things that you can do with this transformer that are kind of new.

I think the biggest question in my mind is what happens when you scale this up? And, of course, we’re working on that. That’s one of the great things about being in MSR [Microsoft Research]. They have some GPUs to scale this up to much larger datasets. So, stay tuned. And, thank you. 

Magma: A foundation model for multimodal AI agents

Presented by Jianwei Yang at Microsoft Research Forum, February 2025


“In this project we developed the first agentic foundation model, Magma, that can understand multimodal input and also take action in both digital and physical environments.”

Jianwei Yang, Principal Researcher, Microsoft Research Redmond

Transcript: Lightning Talk

Magma: A foundation model for multimodal AI agents

Jianwei Yang, Principal Researcher, Microsoft Research Redmond

This talk introduces Magma, a new multimodal agentic foundation model designed for UI navigation in digital environments and robotics manipulation in physical settings. It covers two new techniques, Set-of-Mark and Trace-of-Mark, for action grounding and planning, and details the unified pretraining pipeline that learns agentic capabilities.

Microsoft Research Forum, February 25, 2025

FRIEDERIKE NIEDTNER, Principal Technical Research Program Manager, Microsoft Research AI Frontiers: The following talk introduces Magma, an agentic foundation model, meaning a generalist model that has agentic abilities like perceiving its environment, reasoning, and taking actions to achieve goals. Magma can understand multimodal inputs and predict actions for real-world goals in both the digital and physical world.

JIANWEI YANG: Welcome everyone. My name is Jianwei Yang. I’m a researcher from MSR [Microsoft Research] Deep Learning Group and very excited to talk about Magma, the most recent work to build the foundation for multimodal AI agents.

When talking about multimodal agents, I would like to walk us through the multimodal models people have built in the past five years. Five years ago, vision-language models or multimodal models were mostly built on top of the BERT architecture. Typically, these models contain less than 1 billion parameters, and the training data is usually a small amount of images. Later on, the CLIP model came out from OpenAI. It scaled up their multimodal training to billions of images. Back then, we built our own multimodal foundation model called Florence. Although the model size is still relatively small, it shows strong open-vocabulary and zero-shot recognition capability across a range of visual domains.

Most recently, we entered the era of large multimodal models. Connecting multimodal vision models, such as CLIP, with large language models, such as GPT, unlocks many advanced multimodal capabilities. Now we can have a multimodal chatbot such as GPT-4o or the small Phi-3.5-vision, which can see, talk, and reason.

Nowadays, most of the existing multimodal models are built to make good sense of the world. They still lack the ability to interact with the world, either virtually or physically. They cannot directly interact with the world, as their inputs are captured by different sensors, leaving the large foundation models detached from the environment. We believe that a multimodal AI model should not only understand the inputs but also interact with the environment as an agent in a human-like manner.

However, nowadays, we are still facing a big gap between AI and humans in performing tasks as simple as web navigation and manipulation. With this in mind, we developed Magma, a foundation model for multimodal agents. We are striving for a single foundation model, which is a large multimodal model that can understand the visual and textual inputs, and also predict actions for a real-world goal.

The whole model is pretty simple and straightforward. As you can see, it follows a common design and takes an image, a video, and a task prompt as inputs and then generates textual, spatial, and action outputs for different tasks. The goal is to create a generalizable system capable of performing a wide range of agentic tasks in both digital and physical environments.

As we well know, pretraining large foundation models requires large-scale data. In this project, we explore a new way of leveraging a wide range of human instructional videos for our model pretraining. The temporal motions in this video data are used as supervision for action grounding in pretraining. Below are four sample videos and the corresponding object motions. As you can see, the motion represented by the object trajectory can clearly indicate the actions taken by humans and robots.

However, the raw motions in the videos cannot be directly used, as they are usually very noisy and do not necessarily capture the meaningful objects in the scene. We need a way to convert the motions to meaningful actions for agentic models to learn. To achieve this goal, we introduce two techniques: Set-of-Mark for images and Trace-of-Mark for videos and robot data. Set-of-Mark is our earlier proposed method, which has been widely used by the community for UI and robotics tasks as it helps to ground agent actions spatially in the images.
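As a rough illustration of the Set-of-Mark idea (the real pipeline’s region proposals and rendering details are more involved), numbered marks can be overlaid on candidate regions so the model grounds its actions to mark IDs rather than raw pixel coordinates:

```python
from PIL import Image, ImageDraw

def overlay_set_of_mark(image: Image.Image, boxes) -> Image.Image:
    """Draw numbered marks over candidate regions (UI elements or objects).
    The boxes are assumed to come from an external region-proposal model."""
    annotated = image.copy()
    draw = ImageDraw.Draw(annotated)
    for mark_id, (x0, y0, x1, y1) in enumerate(boxes, start=1):
        draw.rectangle((x0, y0, x1, y1), outline="red", width=3)
        draw.text((x0 + 4, y0 + 4), str(mark_id), fill="red")
    return annotated
```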

Trace-of-Mark, on the other hand, is our newly developed method to capture the motions of foreground objects. The resulting traces, along with the actions, are shown at the bottom. In the end, we compiled roughly 20 million training samples, which contain images, video data, and also robotics data. Each of them serves slightly different goals. Given the pretraining data, we use a unified pretraining objective, which is similar to pretraining a large language model.

More specifically, our model takes this data as input, and then predicts verbal, spatial, and action outputs. During the pretraining, we prompt the model for action planning. At the top, we compare different amounts of pretraining data. As we can see, the more data we use for the pretraining, the better our model is for action grounding and planning. At the bottom, we prompt the model with different task prompts. It shows good generalization ability across tasks given the same image input.

After the whole pretraining, we evaluated our model in a zero-shot manner on different tasks. From left to right, we evaluated on spatial grounding, digital UI navigation, and physical robot manipulation. Our Magma model shows advantages over the counterpart methods, including GPT-4V. Note that our model is the first and only model that can perform all three agentic tasks simultaneously.

Given the pretrained Magma model, we can configure it for robotics manipulation. Using the same amount of robot data as OpenVLA, the Magma model almost doubles the performance in different simulated environments. This indicates the effectiveness of our pretraining techniques and the potential of leveraging unlabeled image and video data for agentic pretraining. Afterwards, we further fine-tune our model for real-world robot manipulation and UI navigation.

At the top, we tested both seen and unseen tasks, and Magma showed much better performance compared with OpenVLA, though both methods are fine-tuned in exactly the same way. In the bottom table, we compare Magma with other methods on a more realistic UI navigation benchmark called Mind2Web. Using only image data as input, our Magma model achieved state-of-the-art performance in terms of the success rate.

To summarize, in this project we developed the first agentic foundation model, Magma, that can understand multimodal input and also take action in both digital and physical environments. Considering the limited amount of pretraining data, we proposed two techniques, Set-of-Mark and Trace-of-Mark, to leverage large amounts of images and videos without human labels for model pretraining. 

In the end, we get a very capable foundation model for a wide range of multimodal tasks, including both understanding and action prediction. We have released our code and model. Feel free to try it out by yourself. At last, I want to highlight that this is joint work by many teammates in the Deep Learning group and also MSR [Microsoft Research] as well as many external collaborators.

Thank you all for your attention. 

Chimera: Accurate synthesis prediction by ensembling models with diverse inductive biases

Presented by Marwin Segler at Microsoft Research Forum, February 2025


“We think, in the future, predictive synthesis will really help chemists to accelerate the discovery of new essential molecules.”

Marwin Segler, Principal Research Manager, Microsoft Research AI for Science

Transcript: Lightning Talk

Chimera: Accurate synthesis prediction by ensembling models with diverse inductive biases

Marwin Segler, Principal Research Manager, Microsoft Research AI for Science

This talk addresses chemical synthesis in drug discovery with a learning-to-rank framework that integrates AI-based models, significantly boosting prediction accuracy and producing predictions preferred by chemists.

Microsoft Research Forum, February 25, 2025

WILL GUYMAN, Group Product Manager, Healthcare AI Models: To design a new medication to defeat an illness, scientists need to predict which blend of molecules can be transformed into medicines, a tedious process that traditionally takes decades and can cost billions. Researchers at Microsoft Research and Novartis have been developing a novel approach to addressing a major bottleneck in this process called retrosynthesis, figuring out how to start from a target molecule and plan the chemical steps needed to make it. In practical terms, that means cutting down on trial-and-error experiments, speeding up how quickly researchers can create new molecules, and ultimately lowering the time and cost needed to develop new treatments. I’ll now hand it over to Marwin, who will explain how this technology works in detail and discuss its potential impact on future drug discovery.

MARWIN SEGLER: Hi, I’m Marwin, principal researcher at Microsoft Research AI for Science. And on behalf of the team, especially Chris and Guoqing, I’m going to tell you a story about life and death. Small organic molecules are central to human well-being. As agrochemicals, they help to feed the planet; as drugs, they keep us healthy and, hopefully, help to prevent us from dying too early; and as materials, they help to improve the quality of our lives. To get access to small molecules, one needs to synthesize them in the lab via a synthesis route. And the synthesis route, you can think about it like a cooking recipe, where we start from the ingredients and then run several steps until we reach the final product. And to plan a synthesis, chemists often start with the targets that they want to make and then work their way recursively backward to the starting materials.

However, synthesis can be super challenging as reactions can fail, and also in this multi-step synthesis the errors can compound. And this is one of the reasons why small molecule drug discovery, for example, is so much slower and more expensive than protein design, where we have had so many recent breakthroughs. And AI models that could help chemists to find better synthesis routes will really have a profound impact on how small molecules are discovered and produced, with the potential to really accelerate the discovery of much-needed new functional organic molecules. So how can we address this major bottleneck? First, we need the synthesis prediction model. And this model takes in a molecule, a target molecule, and predicts a list of feasible reverse chemical reactions, basically. 

And this is similar to learning how to predict the moves in a chess game. But in chess it’s relatively simple because we actually don’t really need a model, because we can just implement the rules of the game perfectly as a very simple program. But chemistry is much, much more complicated. So, we need to learn this model from actual experimental data. 

We can thus think about this model that we’re learning as a chemical generative world model that predicts which reactions are feasible for a given molecule in a given situation. And once we have such a model, we can plug it into a search algorithm and recursively apply it to get a full multi-step synthesis route. 

How can we model chemical reactions? We can represent molecules either as graph structures or using SMILES [Simplified Molecular Input Line Entry System]. SMILES is basically the token sequence representation of the graph and carries the same information. Now, given the target product that we want to make, we could either use an auto-regressive model to generate the SMILES sequence of the reactants de novo.  

That is, from scratch, token by token. If one knows about language models, it’s very similar to that. This is very appealing, because it leaves it to stochastic gradient descent to figure out this process end to end. However, it has a disadvantage, because in chemical reactions usually only a small part of the molecule changes, and with this de novo model we need to copy the whole molecule, including the unchanged parts.

So here, in this reaction, only the marked parts change. So what we could do as well is just predict these edits that we have to apply to the molecule. And these edits can be very well represented using simple rules or so-called templates, which we can just derive from the training data. We could just predict those edits. 
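To make the template idea concrete, here is a small RDKit sketch applying an illustrative retro-template (ester to acid plus alcohol); this is a toy example, not a template from the paper:

```python
from rdkit import Chem
from rdkit.Chem import AllChem

# Reaction SMARTS written in the retro direction: product pattern >> reactant patterns.
retro_template = AllChem.ReactionFromSmarts(
    "[C:1](=[O:2])[O:3][C:4]>>[C:1](=[O:2])[O:3].[OH][C:4]"
)
product = Chem.MolFromSmiles("CC(=O)OCC")  # ethyl acetate
for reactant_set in retro_template.RunReactants((product,)):
    print(".".join(Chem.MolToSmiles(m) for m in reactant_set))
    # e.g. CC(=O)O.CCO -> acetic acid + ethanol
```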

Now how do we implement these models? The de novo prediction model, we can implement as a sequence-to-sequence model using modern transformers, using grouped multi-query attention and modern activations. 

An edit-based model, we can implement via a dual GNN that encodes both the product and these edit templates. And then we perform classification over which is the most appropriate template in our database, or collection of templates, to apply to the molecule.

Now there is an additional complication, which is where do we apply this template in the molecule? Because there can be multiple matches. And for this we have an additional localization model that tells us where the template optimally matches in the molecule. And this second one we can also train with stochastic gradient descent. Now we have the best of both worlds.

But how do we combine the outputs of these two models together? And again, we need something which is learned. And we came up with a new learning-to-rank strategy, where we have an additional model that scores the outputs that the different models provide and then produces a rescoring, which we can use to rerank the outputs of the models to build an extremely powerful ensemble. And by combining these two models of complementary inductive bias, you will see we get extremely exciting results.
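Schematically, the learning-to-rank ensemble can be sketched like this, with hypothetical model and ranker interfaces standing in for the learned components:

```python
def ensemble_rerank(product_smiles, de_novo_model, template_model, ranker, k=50):
    """Merge candidates from both models and re-rank them with a learned scorer.
    predict() and score() are hypothetical interfaces for illustration."""
    candidates = (de_novo_model.predict(product_smiles, n=k)
                  + template_model.predict(product_smiles, n=k))
    scored = [(ranker.score(product_smiles, c), c) for c in candidates]
    scored.sort(key=lambda sc: sc[0], reverse=True)
    return [c for _, c in scored[:k]]
```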

But first, we need to make sure that we’re really able to check what’s going on. And the issue with chemical data is that it often has temporal bias, so it’s somewhat clustered over time. So, if we randomly split the data, we get this weird time-machine effect. So, what we did instead was to make a clean time-split of the data. So we train our model only based on reaction data from patents that was published up to 2023. And then test the models on data published from 2024 onwards.

And then as a metric, we asked the model to make 50 predictions for a test set product, and then measure how many times the model was able to recover the ground truth reactants on this data. Now, we can measure how the models are doing in different regimes, and what we’re seeing with all the models that have been published for the baselines is that they tend to work super well when there’s a lot of data, as in the typical deep learning regime. However, reactions where we don’t have lots of examples in the training data are very often super important for synthesis strategy. So, you can see this here, where we show the performance of the models by the frequency with which different reaction classes occur in the training data. And so far, this has been a major limitation of deep learning models in this domain.
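The recovery metric just described can be sketched as follows, comparing canonicalized SMILES so that equivalent structures match (the model interface is hypothetical):

```python
from rdkit import Chem

def canonical(smiles: str) -> str:
    return Chem.MolToSmiles(Chem.MolFromSmiles(smiles))

def top_k_recovery(model, test_set, k=50):
    """Fraction of test products whose ground-truth reactants appear among
    the model's top-k predictions."""
    hits = 0
    for product, true_reactants in test_set:
        predictions = model.predict(product, n=k)
        if canonical(true_reactants) in {canonical(p) for p in predictions}:
            hits += 1
    return hits / len(test_set)
```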

Thankfully, with our model, we can basically not just outperform all the baselines that are typically used in the literature for the very frequent cases, but we can also really make progress on the classes where we don’t have that many training data points in our data sets. And we can even maintain very, very high performance in the cases where we just have two examples in the training set, which is usually super rare to achieve with deep learning models.

And even if we just have one example in the training data, even for the zero-shot case, we can still achieve reasonable performance, whereas the baselines basically drop off completely. And that’s super important for synthesis strategy.

Now, another question is how robust is the model when we move further away from the training data? And it’s very important in discovery because, by definition, we need to make predictions on new things, on new molecules that have never been made before. And we can measure that by chemical similarity. So how far the molecules in the test set are away from the training data. And existing baselines, they drop off quite a bit. But with our new ensemble, we can basically achieve a step change in how well we can predict the further we go away from the data, and we can completely maintain high performance even when we move very far away from the training data, giving us a sense of the out-of-distribution prediction capabilities of our model.

And why is it important? Again, in drug discovery, one needs to make new molecules that have never been made before. Structurally, very new, very different. And with these improvements, we can now apply synthesis prediction with much, much more confidence to new molecules. And to give you an example of how that would look in practice, here’s a synthesis route predicted by our model for a molecule you could typically expect in a drug discovery project, which is non-trivial, so it’s quite a long sequence of steps. 

And, just to give you an example of rare reaction classes, the model is able to predict this specific Hemetsberger-Knittel indole synthesis step, which maybe as a chemist you would not immediately think about. But the model is able to retrieve it and propose it. And in this context, it actually makes sense. So, to give you one example of how these rare reaction classes can be highly strategic.

And we think, in the future, predictive synthesis will really help chemists to accelerate the discovery of new essential molecules. And if that excites you, check out the extensive results in our paper, including validation with our great collaborators at Novartis. And up next, you’re going to hear from Jianwei Yang from the Microsoft Research AI Frontiers team to introduce Magma. 

Thank you for listening. 

Panel Discussion: AI for Precision Health: Learning the language of nature and patients

Hosted by Hoifung Poon, with Lili Qiu, Ava Amini, Carlo Bifulco, and Matthew Lungren at Microsoft Research Forum, February 2025


“For continuous health monitoring, what I see is, it’s more than just a technology advancement. It’s a paradigm shift in healthcare. So it means that healthcare is no longer constrained inside a hospital. It’s everywhere. No matter where we are, every step we move, every eye movement we make, every breath we take, every heartbeat we strike, every brainwave signal we generate is being sensed and analyzed to understand our health condition.”

Lili Qiu, Assistant Managing Director, Microsoft Research Asia

Transcript: Panel Discussion

AI for Precision Health: Learning the language of nature and patients

Hoifung Poon (host), General Manager, Microsoft Research Health Futures
Lili Qiu, Assistant Managing Director, Microsoft Research Asia
Ava Amini, Senior Researcher, Microsoft Research Health Futures
Carlo Bifulco, CMO, Providence Genomics
Matthew Lungren, Chief Scientific Officer, Microsoft Health and Life Sciences

This panel discussion with Microsoft researchers and external guests explores the transformative potential of generative AI in learning the language of nature and patients for precision health, from proteins to medical imaging, from electronic medical records to home health monitoring. Emphasis is placed on the end-to-end innovation cycle from foundational research to deep partnership to productization.

Microsoft Research Forum, February 25, 2025

WILL GUYMAN, Group Product Manager, Healthcare AI Models: To talk about generative AI models and the impact they are beginning to have in the real world, Hoifung Poon will host a panel discussion with a distinguished group of experts. Representing Microsoft Research are Ava Amini and Lili Qiu. They are joined by Microsoft’s Chief Scientific Officer in Health and Life Sciences, Dr. Matthew Lungren, who is also a practicing physician and maintains research and teaching roles at Stanford University in AI and medicine. To round out a stellar panel, Providence’s Chief Medical Officer Carlo Bifulco joins to provide another voice on the real-world implementation of generative AI in clinical settings. Back to you, Hoifung.

HOIFUNG POON: Hi, everyone. Welcome to Microsoft Research Forum. Today we will have a very exciting panel discussing AI for Precision Health. As we all know, there are two fundamental challenges in today’s healthcare. First, medicine can be very imprecise and often doesn’t work for the patient. For example, immunotherapy is the cutting-edge cancer treatment today, and you have some blockbuster drugs like Keytruda that can work miracles for some of the late-stage cancer patients. But overall, the survival rate is still hovering around 20 to 30%. The second challenge is that high-quality healthcare is just too expensive and very hard to scale. For example, even in the U.S., 85% of cancer patients are treated in rural or community hospitals that simply don’t have the kind of resources and care quality of the more comprehensive cancer centers like Mayo or Memorial Sloan Kettering. Sometimes disparities in simple information access can be life or death. So what’s really exciting these days is that the GenAI revolution brings us unprecedented capability. You can think about it as learning the language of nature and patients, and that starts to give us some very powerful tools to accelerate biomedical discovery so we can drastically improve healthcare quality, and also the opportunity to drastically reduce healthcare costs so that we can democratize high-quality healthcare for everyone.

So now, of course, this undertaking is a really, kind of moonshot aspiration and really takes way more than a village. Right? So at Microsoft Research, we are super blessed with the privilege to actually have fostered deep collaboration across amazing researchers all around the globe and also with our, kind of, beloved health and life sciences product division.

And also we have key external stakeholders, such as large healthcare systems. So, you will see that this panel really is a perfect reflection of that deep collaboration across the board. We will start with a quick intro by every panelist about who they are, what they are working on, and what gets them excited these days. And then we will dive deep into the key opportunities and challenges. So, maybe Ava, can we start with you?

AVA AMINI: Absolutely. Yeah. My name is Ava Amini, I’m a senior researcher at Microsoft Research, based in the New England Lab, and also working closely with Health Futures. My passion is really in building new AI methods to accelerate our ability to understand biology and also to design biology. And this is rooted in kind of a fundamental curiosity about bringing the power of computation and quantitative science and engineering to the biological world so that we can learn the language at which biology operates. And for me personally, I’m most interested in how behavior at the cellular level arises from the interactions and activity of individual biomolecules, how these processes become dysregulated in disease, and how we can understand those regulations and dysregulations to now develop more effective and more personalized treatments. So really thinking about the vision of bringing the power of AI to unlock new biological insight so that we can move towards our ability to create new therapies and better interventions that hopefully could be implemented in the clinic and at the patient level.

Poon: Well, that's fascinating: learning the language of biomolecules and, really, of fundamental biology. So, Lili, do you want to go next?

LILI QIU: Sure. I'm Lili Qiu, Assistant Managing Director of Microsoft Research Asia, mainly responsible for leading the Shanghai lab. Our mission is to develop sensing and machine learning for healthcare. Complementary to Ava, our focus is to learn the language of patients outside hospitals. With the rise of chronic diseases, unpredictable health crises, and the demand for personalized treatment, traditional hospital-based monitoring is no longer sufficient.

Consider a patient with cardiovascular disease who visits the hospital just a few times a year. On the surface, everything seems fine: stable blood pressure, a steady heart rate, and a normal respiration pattern. Yet between these visits, subtle fluctuations can occur, warning signs that remain invisible until they escalate into life-threatening emergencies. For me, this scenario isn't an abstract concept but a painful reality. I lost both my mom and my grandmom to cardiovascular disease while they were at home. Their passing revealed how critical it is to have better monitoring tools that extend beyond hospital wards. Their memory is what motivated me to explore and develop in-home continuous monitoring tools to prevent similar heartbreaking losses and protect patients even when they are not at the hospital.

Advancements in sensing and AI could make it possible to track individuals' health in real time by continuously monitoring vital signs and body movement. Healthcare providers can gain a deeper, more comprehensive, and more accurate understanding of a patient's condition. This could yield many benefits; here I will highlight three of them. First, we could do early detection and prevention. With continuous monitoring, deviations from normal health parameters can be detected early, allowing for timely intervention before a condition worsens (see the sketch after this answer).

Second, personalized and adaptive treatment plans become possible. No two patients are the same. Precision medicine requires individualized treatment plans based on real-time data. Continuous monitoring allows doctors to tailor medications, lifestyle recommendations, and therapy strategies based on actual patient behavior and physiological responses over time. Third, we could enhance patient engagement and quality of life while reducing healthcare costs.

When patients are equipped with real-time insight into their own health, they become more engaged in their well-being. Wearable devices and mobile apps empower individuals to make informed lifestyle choices, adhere to treatment plans, and actively participate in managing their health. This could not only improve patient outcomes but also significantly cut down healthcare costs.
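To make the early-detection point concrete, here is a minimal, purely illustrative sketch of flagging deviations from a rolling baseline in a continuously monitored vital sign. The window length, threshold, and synthetic heart-rate stream are invented for demonstration; this is not the method the team describes.

```python
# Illustrative sketch only: a rolling-baseline deviation detector for a
# continuously monitored vital sign. Window, threshold, and data are toy values.
from collections import deque
import math
import random

def detect_deviations(stream, window=120, z_threshold=3.0):
    """Yield (index, value, z-score) when a sample deviates from its rolling baseline."""
    history = deque(maxlen=window)
    for i, x in enumerate(stream):
        if len(history) == window:
            mean = sum(history) / window
            var = sum((h - mean) ** 2 for h in history) / window
            std = math.sqrt(var) or 1e-9  # guard against a zero-variance window
            z = (x - mean) / std
            if abs(z) > z_threshold:
                yield i, x, z
        history.append(x)  # baseline excludes the current sample

# Synthetic resting heart rate (~60 bpm) followed by a slow abnormal drift.
random.seed(0)
hr = [60 + random.gauss(0, 1.5) for _ in range(600)]
hr += [60 + 0.15 * t + random.gauss(0, 1.5) for t in range(200)]

for i, x, z in detect_deviations(hr):
    print(f"sample {i}: {x:.1f} bpm (z={z:+.1f})")
    break  # report only the first alert for brevity
```

In practice, a detector like this would run on-device as part of the edge-computing pipeline Lili describes later, with far more sophisticated models.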

Poon: Awesome. I really appreciate you sharing those really powerful personal stories. That reminds me: many years ago, one of my colleagues shared a similarly inspiring story. She had a brother who was diagnosed with cancer in Eastern Washington, in Spokane specifically.

They found a mutation that renders the standard of care ineffective, and normally the prognosis is about six months. My colleague persuaded him to literally drive two hours west to Fred Hutch, and there they found a matching trial that put the cancer in remission. That kind of powerful story raises the question: how can we democratize that kind of access for everyone? On that note, Carlo, do you want to go next? Because you are really deep in the trenches.

CARLO BIFULCO: Yeah, thank you, Hoifung. Indeed, in the trenches. My background is as a physician scientist, currently serving as CMO of Providence Genomics. For people unfamiliar with Providence, it's a big healthcare network on the West Coast, and our program's focus is really to bring precision medicine and personalized medicine to cancer patients in the context of that very diverse network. On the research side, my work is focused on genomics, but also spatial biology and AI, and the approach is really translational: we try to bring all these modalities to impact patient care in the future, and current patient care whenever we can. I've had the fortune of working with the Microsoft Research team now for, I think, six years, and we have so much convergence. We've had very interesting applications of NLP and AI in the real-world-evidence space and in clinical trial matching, and we have built an ecosystem extending to precision medicine with solutions like molecular tumor boards and other things that really, hopefully, impact patient care.

What I'm most excited about currently is an opportunity we had recently to make a paradigm shift in pathology. We brought generative AI and large foundation models to that field. I think we were one of the first groups to do so, with a very large, open-weight model called Prov-GigaPath.

It's more than 1 billion parameters, 1.3 billion if I remember correctly. Very impressive. We've had 500,000 downloads on Hugging Face, so we were super surprised by the impact, and very pleased by the impact this is having. For the future, we're working on integrating all these modalities, hopefully with things like genomics, spatial biology, and so forth.

Poon: Yeah, thanks so much, Carlo. Obviously, the disclaimer is that we have the super privilege of working with you; when I do the math, it is six years. And oftentimes the challenge is that we are all machine learning and modeling folks, so we think about machine learning and modeling, but there is also a lot of groundwork: how do we actually get the data, get the compute set up, adhere to privacy and compliance, and make sure everything is responsible? Having the privilege to collaborate with you so deeply also leads to a very exciting prospect: how can we take the learning from this kind of deep partnership and start to have an impact? Can we start to impact millions, if not billions, of patients? And for that we need scaling.

So, Matt, no pressure, but I really want to get your perspective here.

MATT LUNGREN: It's all about scale, Hoifung. I'm Matt Lungren, chief scientific officer at Microsoft Health and Life Sciences, and I've got kind of an interesting role. My background is as an academic physician, and I've spent a lot of time in both the clinical world and the core research world, bridging those gaps; that's starting to become a pattern in my career. Now I do the same thing, but to your point, Hoifung, it's definitely a scaled effort. We are a group of data scientists, PMs, and engineers focused on working bidirectionally, both across the organization and externally with partners. What we're looking to do is ask: what are the core problems in healthcare? Where are the gaps in the technology we see today? And what kinds of innovations are happening that might fill those gaps? The bidirectional aspect also means translating some of those gaps and opportunities in the healthcare space over to the core innovation areas, to say: if you had a technique that could address this, we could make a huge difference. And what's really wonderful about working in this collaborative group, as a great example, is that you've heard about biologic discovery and clinical translation, and you're hearing about how genomics and images come into play.

There's a beautiful story here about all of these modalities coming together, to your point, Hoifung, to learn the language of life, or the language of biology. And I think we're in a really interesting moment where we recognize that these powerful language models are very competent across a lot of different concepts, particularly in the tech space. Where we see the opportunity is to translate that ability to easily access information, process it, and potentially even enrich it, but also to leverage the data that we know carries much more information about our patients, obviously in the clinical and imaging space. So, I'm delighted about the opportunities to come. The model that Carlo referenced is currently in the Azure catalog, along with many other modalities, and it really unlocks opportunities for the healthcare developer ecosystem to say: you've gotten me pretty close to my goal line with some of the things you've developed; let me take it to the finish line and start to translate it at scale. And that's exactly what we're seeing.

Poon: Yeah, thanks a lot, Matt. It's really amazing to see your leadership in getting a lot of these models democratized in AI foundries and so forth, putting them in the hands of folks who can start playing with them and exploring applications. Super exciting. One theme I'm hearing is that, as you just mentioned, Matt, in the text modality we have learned relatively deeply, but there are many non-text biomedical modalities that seem to have a lot of unique challenges. So, Ava, coming back to you: you and the team have done some really amazing work, as you mentioned, on biomolecules, proteins, and so forth. What's top of mind in terms of the key challenges you see, and what are some of the learnings you have found?

Amini: Yeah. For me, what's really top of mind connects back to this theme of bidirectionality that Matt, Carlo, Lili, and yourself, Hoifung, have raised. That's really core to how we think about AI in the space of biology and how we actually operationalize it. It's not sufficient to just develop competent models that can learn over protein sequences or protein structures or other biological modalities. What we really care about in this next phase is how we bring those models to translation to unlock real biological discovery. As a step toward doing this, we're really excited about a long-term partnership we have with the Broad Institute of MIT and Harvard, specifically focusing on cancer cell biology and this notion of how we think about cell states in cancer beyond just the genomic measurements of point mutations and alterations in DNA: how to actually use AI to learn representations of cells as a whole and of how they interact with other local influences in their microenvironment. We've launched a research collaboration with the Broad Institute called Project Ex Vivo that really exemplifies this theme of bidirectionality. In fact, we have a dedicated lab group and wet-lab space at the Broad Institute where we, as computationalists and AI scientists, can go in and inform the design of the experiments that are generating data, and also validate the predictions of the models we're developing. That speaks to the core challenge for this next phase of AI for biology: how do we build synergistic relationships that are truly bidirectional and use them as a catalyst to bring these modeling efforts into the biological lab? So that's, at a high level, really exciting and really top of mind for me.

Poon: Awesome. That's really fascinating, because you're bridging the dry lab and the wet lab, getting that infinity loop kicking in and putting the discovery cycle on steroids. Along the same lines, Lili, you mentioned earlier that you've been focusing on continuous sensing opportunities, to some extent addressing gaps that conventional clinical data collection may not capture at high density. A similar question to you: what do you see as some of the major bottlenecks, and what are some of the key learnings?

Qiu: Sure, Hoifung. Continuous health monitoring has lots of advantages, but it does introduce significant challenges. Here I will quickly go over five main challenges that we've been looking at. First is achieving ease of use. Hospital-grade monitoring devices often require professional medical staff to operate, but in-home devices need to be very intuitive and user friendly so that elderly people, such as grandma and grandpa, can operate them independently. To address that, we are developing very simple wearable sensing, such as earphone sensing, that requires minimal user intervention.

Second, we need to achieve high precision and robustness. In-home health monitoring devices are typically designed with cost-efficiency in mind, and ideally they need to support noninvasive sensing. This makes it very challenging to achieve high accuracy and robustness. To enhance reliability, we are exploring hardware and software co-design. On the hardware side, we are developing advanced metasurfaces to dramatically enhance sensing resolution. Metasurfaces consist of many sub-wavelength cells, as shown here, and these elements are designed to manipulate waves in ways that conventional materials cannot. By carefully designing the geometry, material properties, and spatial arrangement, we can give metasurfaces very flexible control over wavefront shaping and achieve high-resolution sensing. On the software side, we are developing advanced machine learning algorithms to support multimodal data.

Third is increasing the sensing range and supporting mobility. Users expect continuous monitoring regardless of their location, so sensing should continue to work even when users are moving around their environment. We developed a metasurface-based approach to enhance the sensing range so that a user can be sensed even when they are far away, or even when they are covered with blankets. Fourth is mental-state sensing and training. Continuous sensing does not just track vital signs; it also provides real-time insight into a person's cognitive and emotional well-being. We are developing an approach to track a user's mental state. Meanwhile, we are also developing a large language model-based cognitive training tool that draws on each user's daily experience to reinforce episodic memory, which ensures these exercises feel not only relevant but also engaging. We demoed our cognitive training tool on last year's World Alzheimer's Day and received great feedback, and we think continuous at-home monitoring and tailored intervention can potentially improve cognitive care. Last but not least is achieving timely processing and energy efficiency. Continuous monitoring can generate a large volume of data that must be processed in real time on a user's device, so our team is actively working on edge computing and lightweight modeling to achieve low latency and enhance energy efficiency.

Poon: Wow, that's fascinating. Thanks so much for sharing. When I personally think about continuous monitoring, the first thing that comes to mind is continuous glucose monitoring for diabetes patients. But you touched on some of the cool sci-fi stuff, like metasurfaces; maybe people will eventually wear a shirt that keeps monitoring them, going beyond the metabolic to the mental and more. That's really fascinating. Now I want to move toward clinical applications. Carlo, you mentioned two themes earlier. One is that when we think about precision health, the marquee poster child is cancer, and you mentioned application areas like tumor boards and trial matching. The other is the technological front: how do we handle emerging modalities like pathology and even spatial omics? And you also touched on what Ava and Matt alluded to: how do we eventually pull all these modalities together? So, a similar question on both fronts, application and technology: what are some of the key challenges that really keep you up at night, and what are some of your learnings over the years?

Bifulco: Yeah. It's super exciting to hear from Ava and Lili about the progress that can be made in basic science and, at a personal level, in the collection of large datasets and so forth. On the clinical side, things are a little more challenging because we have a lot of overhead coming from the complexity of patient care, regulatory frameworks, and so forth. So the progress is great, but not always as great as I think it could be. There are still some sweet spots where I feel we can move rapidly, and those have to do with the areas of translation. Clinical trials are a setting where we can move very rapidly and really make an impact; biomarkers are another area where we can do the same; and real-world evidence is another area where things can move super rapidly. Having said that, I'm super optimistic about the impact of AI even in routine patient care. I don't want to sound too worried about this; I think there's a very bright future ahead of us, and it's almost inevitable. But I just wanted to acknowledge that there are some complexities and challenges in bringing these technologies to fruition in a conventional setting. All of those can be overcome, but they will require a big effort at multiple levels.

Poon: Awesome. And so now back to you, Matt. As you mentioned, you have this amazing, almost panoramic view, from academia to industry, and you are now leading our scientific effort in the Health and Life Sciences product division. Coming at it from all these angles, from foundational research to partnership to productization, how do we ultimately address this last mile that Carlo mentioned? How do we really make AI count for patient care?

Lungren: Yeah. I was just reflecting on some of the comments about the incredible innovations that have been going on. This is obviously a small sample of the tremendous work we see. And then I do my clinical time and I walk by a fax machine in my office, right? So there is a gap; there's no question about it, and Carlo knows it very well. On a macro scale, you can see it in the conferences: you can go to NeurIPS and see some incredible things, and then you can go to the American Medical Association conferences. There is a divide, but I think it's closing, and I'm extremely optimistic. Part of it comes from looking at successful applications of AI that solve a clinical problem, and I'll use one example that much of the audience may be familiar with. When you think about the immediate needs, before we can get to the future state and the things we want to see happen, we have to put out the initial fire. And the fire, of course, is burnout. We have a crisis in the U.S., obviously, but this is an issue across the world: clinicians have too much to do and too much complexity to handle, and they're looking for some time back, if I'm being honest. So when I look at some of these technologies, I say: we're going to save time before we save lives. The DAX Copilot solution, if you've heard of it, sounds very simple on its face; maybe you wouldn't present it as the keynote at a computer science conference, but it's solving an incredible problem. What's interesting is that by saving that doctor some time, you're also generating a comfort level: hey, this AI solution is actually making my life better. And to the point about the last mile, it's also in the last mile, and from that place we can build out. Just to use an example: Lili was talking about biometrics. Well, there's voice data that comes from that interaction that then gets transcribed into a note, and there are voice biometrics that we're all starting to become very familiar with. Now I can start to add that in, and there's almost a foothold into the real clinical world; you're building from a place that physicians are congregating around, or at least familiar with. So I'm extremely bullish. And that's beyond the fact that, maybe more so than ever, you've seen that the New England Journal of Medicine has a new journal about AI. That is a huge sign to me that there is an absolute Cambrian explosion of interest from the clinical side, not just in the technology but in how we use it and how we learn about applications for it. So, maybe more so than ever, I feel like that divide is starting to really close.

Poon: Yeah, that that’s a really amazing insight. You know, you mentioned about, DAX Copilot. Right. And but also this whole feel about kind of ambient intelligence, right? That it feels like, you know, even just a couple of years ago. This feels super kind of sci-fi novel concept that you could have patient and doctor talking. And then actually the computer actually take care of the notetaking, right? And maybe even start to alert, you know, some of the things that maybe the doctor may have forgotten and so forth. Even, again, even a couple of years ago, felt so sci-fi. But now actually, DAX has, right, like, as you mentioned, has already taken so much, kind of, [INAUDIBLE]. But also there is a lot of other kind of similar undertaking that really testify to how this notion that like the genAI revolution can really move things much faster, right? So that that is so, so exciting. And also, I really love what you bring up about like, save time before saving life. So by starting from sort of like productivity again, then eventually we could really get to the even the creativity gain. So that that’s super, super exciting. So, I think we can talk, you know, for hours and hours, but I think we, we have, some final chance to start to kind of elicit some of the other insights from you guys. And I think this will be a very fun run. So we’ll basically ask each of you to make a prediction for the future? Like what might happen in five years, right? And so, Ava, can we start from you?

Amini: I have the hard job of going first for this question. (laughing)

Poon: Or the easy one! (laughing)

Amini: I think, honestly, we're at an incredible moment, a true inflection point in our capabilities with technologies like AI, and it's really impossible to predict exactly what's going to happen next. But I have my vision and ideas about what that landscape could look like. Right now, we've seen breakthroughs in our ability to use the power of generative AI and foundation models to learn over individual biological modalities: for example, protein sequences, with our work on EvoDiff, or pathology data, with GigaPath and Virchow, and other biological modalities that are coming to the forefront. As an immediate step, I think we're going to see a shift toward bringing this information together, to be able to reason across the scales and the hierarchy of biology, from the level of individual molecules to cells to tissues to patients, and to put this information together from a modeling perspective and also from an application perspective. Looking a few years down the line, I hope that leads to the vision you raised, Hoifung, of bridging the iterative loop between the experimental world and the computational world, such that we have AIs that can help us propose new hypotheses and go test them in the lab. That's something I'm optimistic we'll see in five years, and hopefully it leads to drugs and therapies that are discovered or designed from scratch by AI and put into preclinical or clinical testing. I think that's a very real possibility. And finally, to close: this panel is really awesome because we have these different perspectives. There is someone like myself thinking about fundamental research at the biological scale; Lili, with the perspective of how we interact with humans in their real lives; the clinical partnership with Providence; and the scaling perspective that Matt brings. My ultimate vision is that five years from now, we at Microsoft have a way to bring these pieces together, to really create solutions at scale. That's pretty high level, but I think we are empowered to do it, and that's what I'm really excited about.

Poon: That's amazing. Just to echo your points: this kind of multi-scale learning that bridges different biological scales and mechanisms, and really closing the discovery-and-translation loop, is super exciting. So, Lili, you highlighted this really important part about filling in modalities that conventionally may not be collected. I'm curious what you think could happen in five years. Will we all have metasurface vests? (Laughing)

Qiu: Yeah, that's our hope. First of all, I'd like to express my deep appreciation for the opportunity to learn from the perspectives of Hoifung, Ava, Carlo, and Matt. For continuous health monitoring, what I see is more than just a technological advancement; it's a paradigm shift in healthcare. It means that healthcare is no longer constrained to the inside of a hospital; it's everywhere. No matter where we are, every step we take, every eye movement we make, every breath we take, every heartbeat, every brainwave signal we generate can be sensed and analyzed to understand our health condition. What we envision for the future is no more travel, no more waiting rooms, no more frequent needle pricks; just timely virtual interactions with doctors, empowered by advanced sensing and AI. And this shift means not only convenience but also hope. With the aid of continuous monitoring, even minor physiological fluctuations, such as a slight rise in blood pressure or subtle changes in heart rhythm or blood glucose level, can be detected before they escalate into life-threatening crises. By continuously tracking each patient's state, a doctor can tailor treatments and adapt immediately, rather than waiting on tests taken only at occasional hospital visits.

So our vision is to put advanced diagnostics and treatment at people's fingertips, to be used at home, making healthcare more proactive, personalized, and powerful than ever before.

Poon: Oh, that's amazing. Thanks so much, Lili. When I was posing this question about five years out, I was also thinking about how people constantly joke about what will happen five years from now, and obviously one of the hot topics these days is AGI. Apparently we may get AGI within five years. But when we think about, say, curing cancer, I joke with my team: let's try to solve cancer before we get AGI. So, Carlo, I'm curious about your take on the next five years. How close are we to fundamentally understanding and cracking some of the mechanisms of cancer? Will we be there in five years?

Bifulco: Yeah, great question, and a hard act to follow after Ava and Lili's visions. I don't know how close we will be to fully understanding cancer, but I feel very confident that we will look back at our current stage as a dark age, because I do think the impact of AI is going to be transformational. That applies to biology, as we heard before from Ava: we are currently generating so much data, and our capacity for generating it is so large, that we need AI to actually manage it. But it also goes back to what Lili said. I do think that in the future, AI will interact with people before they go to see a doctor, and ideally not just medically; it will really drive their health and their behaviors from a health point of view. In terms of medicine itself, I do think we will reach a state where physicians will not be allowed to see patients without an AI companion of some sort. I don't know if that's going to happen in five years, but I do think it's going to be inevitable at some point. And lastly, vis-a-vis Matt: radiologists, pathologists. I think pathology and radiology are going to fuse and merge. I'm not sure that's going to be great news for radiology, but I think all the imaging modalities are going to become one single thing, and AI is going to mediate the interaction between physicians and that kind of landscape.

Poon: Yeah, fascinating. And as Ava mentioned earlier, aside from the existing conventional imaging modalities, there are also a lot of emerging ones, like measuring cell states, and then there is coupling all of them together. It's like an incredible virtual microscope, to some extent, but also incredibly challenging because of, as you said, the high dimensionality, the noise, and everything else; really tailor-made for AI to hopefully help with. And so, the billion-dollar, or should we say trillion-dollar, question: how can we make this viable and put this technology to work, eventually democratizing high-quality healthcare for everyone? Matt, we really want to lean on you, with all your diverse perspectives. What do you see happening in five years?

Lungren: Well, I mean, I can at least predict that we won’t have fax machines anymore.

Poon: You sure? (laughing)

Lungren: I'm willing to bet on that one. Five years is hard, right? In the current environment, we're seeing some pretty surprising acceleration in general-purpose technologies, as you've heard from some of the speakers. We are clearly advancing science at a pace that at least I haven't seen in my career in the last 20 years. With that said, there are a couple of things I want to emphasize from what Lili said. I had a mentor once, the chair of a very prestigious department, who would go to ribbon-cutting ceremonies celebrating the opening of hospitals, and he always said: why are we doing that? We should be celebrating the closing of hospitals. We really should be advancing care to the point where a lot of this is preventative, where we're taking care of these things outside the four walls of the hospital, where errors can occur and where there are all kinds of potential risks beyond the conditions you already have, with all the complexity that entails. So I do celebrate that. In terms of how we scale to that point, one step is, well, it's not sexy to talk about, but it's data. We've done a tremendous job, for the most part, in starting to collect and digitize data and to be able to manipulate it in its individual modalities. What we haven't done a great job of is leveraging that data to drive insights, even with what we already have; this is beyond just new data and new discovery. In five years, I would be surprised if we don't look back and really facepalm at the idea that we were leaving all this data on the table, sitting in archives, using it only episodically in taking care of a patient, and not using it for insights such as real-world evidence: how can I find similar patients, and how can I leverage practice-based evidence and decision-making from real-world data? I think we're really close to at least seeing the possibility of a pathway toward that. When it comes to the connection between the biological space and, as Lili said, the wearable space, I think there are some connections that I would hazard to guess are really going to start to come to fruition. I'm thinking specifically about less invasive monitoring over time, and then tying that to biomarkers that may predict the early development of cancer and other chronic diseases. And then to Carlo's point: hey, I'd be great with partnering up with you in the reading room. We both love dark rooms, and we can work together to take care of patients in our dark rooms. I'm good with that.

Bifulco: I'm looking forward to it.

Poon: Well, that's incredible. One thing you mentioned, Matt, is the data. I learned from John Halamka at Mayo that, by some estimates, even today 3 billion people around the globe already have many of their medical records digitized. But when you think about it, what are we using that data for? Right now, it's mostly immediate diagnosis and some billing, and then it's locked in the attic gathering dust. So to your point, that's a super exciting opportunity: GenAI could potentially understand the language of that multimodal, longitudinal data and harness it for good use. This is amazing, and I'm really thankful to all the panelists. Today we heard everything from biomolecules and proteins to continuous sensing to patients, and from research to productization. I personally learned quite a bit, and I feel even more energized. So thank you very much, and thanks to everyone for listening.

So thank you.

Lungren: Thank you.

Qiu: Thank you.

Amini: Thank you.

Bifulco: Thank you.

The post Panel Discussion: AI for Precision Health: Learning the language of nature and patients appeared first on Microsoft Research.

Keynote: Multimodal generative AI for precision health http://approjects.co.za/?big=en-us/research/articles/keynote-multimodal-generative-ai-for-precision-health/ Tue, 25 Feb 2025 19:12:58 +0000 http://approjects.co.za/?big=en-us/research/?post_type=msr-blog-post&p=1123743 Hoifung Poon introduces an agenda in precision health, utilizing generative AI to pretrain high-fidelity patient embeddings from multimodal, longitudinal patient journeys. This approach unlocks population-scale real-world evidence, optimizing clinical care and accelerating biomedical discovery.

The post Keynote: Multimodal generative AI for precision health appeared first on Microsoft Research.

Presented by Hoifung Poon at Microsoft Research Forum, February 2025

Hoifung Poon

“Using Microsoft AI, Providence researchers were able to find actionable biomarkers for a majority of patients and, consequently, many patients were prescribed with precision therapy, which substantially increased overall survival.”

Hoifung Poon, General Manager, Microsoft Health Futures

Transcript: Keynote

Multimodal generative AI for precision health

Hoifung Poon, General Manager, Microsoft Health Futures

This talk introduces an agenda in precision health, utilizing generative AI to pretrain high-fidelity patient embeddings from multimodal, longitudinal patient journeys. This approach unlocks population-scale real-world evidence, optimizing clinical care and accelerating biomedical discovery.

Microsoft Research Forum, February 25, 2025

WILL GUYMAN, Group Product Manager, Healthcare AI Models: It is my pleasure to introduce my colleague Hoifung Poon, an expert in healthcare AI and general manager of Microsoft Health Futures, to talk about utilizing generative AI to enable precision healthcare. In addition to advancing the frontier of medical AI, Hoifung and Microsoft Research have deeply invested in bridging the gap between research and our clinical partners across the ecosystem. I always leave inspired after hearing Hoifung talk, and I’m sure you’ll feel the same. Over to you, Hoifung. 

HOIFUNG POON: Hi, everyone. My name is Hoifung Poon. I am general manager at Microsoft Health Futures. I lead Biomedical AI Research and Incubation for precision health, with a particular focus on advancing multimodal gen-AI [generative AI] to unlock population-scale real-world evidence. So, in the ideal world, we want every patient to be able to respond to the treatment they have been prescribed, as signified by the blue person here on the graph on the left.

In the real world, unfortunately, many patients do not respond to the treatment, as signified by the red person here. So this is obviously the fundamental challenge in biomedicine, and cancer is really the poster child of this problem. For example, immunotherapy is the cutting edge of cancer treatment. And, indeed, blockbuster drugs such as Keytruda can work miracles on some of the late-stage cancer patients.

However, the overall response rates still hover around 20 to 30%. Now when the standard of care fails, which is often the case for cancer, clinical trials become the last hope. Here’s Martin Tenenbaum, a successful AI researcher and e-commerce entrepreneur. At the peak of his career, Marty was diagnosed with late-stage melanoma. But fortunately for Marty, he was able to mobilize his network to find a matching trial that cured his cancer.

However, most patients are not as lucky or resourceful as Marty. Even in the US, only a small portion of patients are able to find matching trials, whereas a lot of cancer trials fail simply because they cannot find enough patients. Developing a new drug is notoriously hard, taking billions of dollars and over a decade. And this will become increasingly unsustainable in precision health, as we actually have to develop more drugs, each applicable to smaller subpopulations.

When we think about drug development, oftentimes the first thing that comes to mind is early discovery. Now, this is indeed super exciting and foundational, but in the grand scheme of things it's only 10 to 20% of the total cost. Most of the astronomical costs in drug development actually stem from the later stages of clinical trials and post-market. Interestingly, this also happens to be the lowest-hanging area, with immediate opportunity for major disruption.

For example, a single phase-three cancer trial can cost hundreds of millions of dollars, and we only get back a few thousand data points. The whole process is so inefficient. But there is a lot of potential in changing this by harnessing AI to unlock population-scale real-world evidence. In the past couple of decades, there has been rapid digitization of patient records.

And every day there are literally billions and billions of data points collected in routine clinical care about a patient journey from diagnosis to treatment to outcome. At the beginning of a patient journey, even the best doctor doesn’t have a perfect crystal ball on what might happen next. So, each journey is essentially a mini-trial, and each encounter brings forth new information.

If we can crack the code and unlock the insight underneath, this is essentially a population-scale free lunch. So, for example, here is the de-identified journey of a cancer patient, where each bar is a clinical note. You can see there are many, many note types, and each note contains a lot of detailed information about the patient journey.

Additionally, there are a lot of information-rich modalities, from medical imaging to multi-omics. Each of these modalities is trying to tell us something about the patient, but each is inherently limited. Only by assimilating all these modalities can we recapitulate a holistic patient representation. So, from a machine learning point of view, precision health amounts to learning a function that inputs a multimodal patient journey and outputs key medical events, such as disease progression and counterfactual tumor response.
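As a schematic of the learning problem just described, here is a hypothetical sketch in Python; every type and field name is an invented placeholder for illustration, not an actual API.

```python
# Schematic only: the "function" of precision health as described in the talk.
# All names here are hypothetical placeholders.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Encounter:
    timestamp: str
    note: Optional[str] = None       # clinical notes
    imaging: Optional[bytes] = None  # e.g., a pathology slide or CT scan
    omics: Optional[dict] = None     # e.g., mutation calls, expression data

@dataclass
class PatientJourney:
    encounters: List[Encounter]      # a longitudinal, multimodal time series

@dataclass
class PredictedEvents:
    disease_progression: float       # probability of progression
    tumor_response: float            # counterfactual response to a treatment

def precision_health_model(journey: PatientJourney) -> PredictedEvents:
    """Input: a multimodal patient journey. Output: key medical events."""
    raise NotImplementedError("This is the learning problem, not a solution.")
```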

If we can predict them well, we have essentially solved precision health. Now, of course, as you can guess, this is not so easy, right? So, a patient journey is not just a snapshot, but actually a longitudinal time series. More annoyingly, most of the information that we want to have is actually missing, and even the observable information can be very noisy and also contain a lot of biases.

But this is exactly why gen-AI can be so promising for precision health. The underpinning of gen-AI is a generative model over the joint distribution of all those clinical variables. This enables us to compress all the observable information into a patient embedding, which can then help predict the missing information. Predicting the next medical event is essentially a special case.

So, our overarching agenda essentially lies in how we can harness population-scale, real-world data to pretrain a high-fidelity patient embedding that can serve as a digital twin for the patient. And, given the patient embedding, we can then conduct patient reasoning at population scale. For example, after a cancer diagnosis, instead of spending months and tons of resources to seek a second opinion, we can essentially snap a finger to get millions of opinions from the most similar patients.

We can interrogate the patient journey, such as the treatment pathway and longitudinal outcomes, and this can immediately help improve patient care. We can also compare non-responders versus exceptional responders and start probing mysteries, such as why those 80% of patients do not respond to Keytruda. In this way, we can unlock emerging capabilities from population-scale real-world evidence that allow us to shatter the glass ceiling of today's healthcare common sense.

So this is very, very exciting, but the forward path is incredibly challenging. Even the best frontier models have a major competency gap for an ever-growing, long list of non-text modalities in biomedicine. So, over the past decade or so, we have blazed a new trail by essentially conducting curriculum learning over three giant free lunches.

The first free lunch stems from unimodal data, such as medical images. Here, a general recipe for self-supervision lies in pre-training modality-specific encoders and decoders that can compress the input into an embedding and then decompress it back to reproduce the original input. For text, we can simply piggyback on existing frontier models that are already very good at understanding and reasoning with biomedical text.
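A minimal sketch of this encode-compress-decode recipe follows, using a toy PyTorch autoencoder; the architecture and dimensions are illustrative stand-ins, not the much larger modality-specific encoders being described.

```python
# Illustrative sketch of unimodal self-supervised pre-training: compress an
# input into an embedding, decompress it back, and train on reconstruction.
import torch
import torch.nn as nn

class UnimodalAutoencoder(nn.Module):
    def __init__(self, in_dim=1024, emb_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, 512), nn.GELU(), nn.Linear(512, emb_dim))
        self.decoder = nn.Sequential(
            nn.Linear(emb_dim, 512), nn.GELU(), nn.Linear(512, in_dim))

    def forward(self, x):
        emb = self.encoder(x)      # modality-specific embedding
        recon = self.decoder(emb)  # attempt to reproduce the original input
        return emb, recon

model = UnimodalAutoencoder()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
batch = torch.randn(32, 1024)      # stand-in for featurized image patches

opt.zero_grad()
emb, recon = model(batch)
loss = nn.functional.mse_loss(recon, batch)  # self-supervised: no labels needed
loss.backward()
opt.step()
```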

Now, this general recipe is universally applicable and very powerful in biomedicine, but there is also a whole slew of modality-specific challenges that require major research innovations. For example, digital pathology is well known to contain a lot of key information about tumor microenvironments, such as how immune cells interact with cancer cells, which is crucial for deciphering resistance to immunotherapy.

Here, the transformer is the workhorse of gen-AI and, in theory, is perfect for modeling such complex global dependencies. However, pathology slides are among the largest images in the world. A single whole-slide image can be hundreds of thousands of times larger than standard web images, which means it would require billions of times more computation due to the quadratic growth of attention in transformers. To address this problem, a promising direction is to incorporate the idea of dilated attention, which originated in speech recognition, a field that also faces a big problem in modeling long contexts. For images, the transformer essentially works by having pixels pass messages to each other, which is why it leads to the quadratic growth in compute.

So in dilated attention, for smaller blocks in local neighborhoods, we keep using full self-attention with pairwise message passing. But when we pass messages in larger blocks, we instead elect representatives for the local neighborhoods and only pass messages among those representatives. For larger and larger blocks, we elect sparser and sparser representatives, and in this way we can cancel out the quadratic growth.
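Here is a simplified sketch of that idea, in the spirit of LongNet-style dilated attention: full attention among "representatives" within each block, with larger blocks using sparser representatives. The block sizes and dilation rates are toy values, and real implementations rotate the representative offsets across attention heads so every position is covered.

```python
# Simplified sketch of dilated attention; not GigaPath's actual configuration.
import torch
import torch.nn.functional as F

def attention(q, k, v):
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return F.softmax(scores, dim=-1) @ v

def dilated_attention(x, segment_lengths=(16, 64, 256), dilations=(1, 4, 16)):
    """x: (seq_len, dim). Averages attention over multiple (segment, dilation) scales."""
    n, _ = x.shape
    out = torch.zeros_like(x)
    for seg, r in zip(segment_lengths, dilations):
        y = torch.zeros_like(x)
        for start in range(0, n, seg):
            block = x[start:start + seg]
            rep = block[::r]  # every r-th token: the elected "representatives"
            # Full attention only among representatives; scatter results back.
            y[start:start + seg:r] = attention(rep, rep, rep)
        out += y / len(segment_lengths)
    return out

x = torch.randn(1024, 64)           # a long token sequence
print(dilated_attention(x).shape)   # torch.Size([1024, 64])
```

With these settings, each scale attends among at most 16 representatives per block, so the cost grows roughly linearly with sequence length rather than quadratically.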

So, by adapting dilated attention to digital pathology, and in collaboration with Providence Health System and the University of Washington, we created GigaPath, the world's first digital pathology foundation model that can truly scale the transformer to whole-slide images. This paper was published in Nature last year, and we are very excited to see that, in the few months since its publication, GigaPath has already been downloaded well over half a million times across the globe.

We are super psyched to see the community’s interest. And we have also made a ton of progress in other modalities, such as CT and spatial multi-omics. So, the unimodal pre-training is a very good first step, but there are even bigger challenges. So, for example, a pathology foundation model may learn to map a tumor lesion somewhere in the embedding space, whereas a CT foundation model may map it elsewhere.

Each modality is trying to tell us something about the patient, but each is speaking its own distinct language. This is essentially analogous to the translation problem for human languages. In translation, to deal with the multilingual explosion, machine translation systems usually introduce a resource-rich language, such as English, as an interlingua to bridge among the low-resource languages.

For example, there may not be any parallel data between a language in Africa and a regional language in India, but we can translate from the African language to English and then from English to the Indian language. This is, indeed, how commercial machine translation systems scale to hundreds of languages in the world. So, we propose to follow the same recipe in dealing with the multimodal complexity in biomedicine by introducing an interlingua modality, and text is an ideal candidate to serve as this interlingua.

We already have very powerful frontier models for the biomedical text modality. Moreover, for any non-text modality under the sun, the study of that modality involves natural language, which means there are a lot of readily available modality-text pairs, such as a pathology slide and the corresponding pathology report. We can piggyback on the unimodal pre-training from the first stage by reusing those encoders and decoders, and then focus on using the modality-text pairs to pre-train a lightweight adapter layer. The adapter layer essentially translates from the modality embedding to the text-semantic space. This enables all the modalities to start to speak the same language, and it also helps propagate a lot of the rich prior knowledge already captured in the text-semantic space back to the individual modalities to help with their interpretation.
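A minimal sketch of the adapter idea: a small trainable projection maps frozen modality embeddings into the language model's token-embedding space. The dimensions, shapes, and module names are hypothetical, not the actual LLaVA-Med internals.

```python
# Illustrative sketch of a lightweight modality-to-text adapter. A frozen
# modality encoder and a frozen LLM are bridged by a small trainable piece.
import torch
import torch.nn as nn

modality_dim, text_dim = 768, 4096   # e.g., image-encoder dim -> LLM hidden dim

adapter = nn.Sequential(             # the only trainable component here
    nn.Linear(modality_dim, text_dim), nn.GELU(), nn.Linear(text_dim, text_dim))

image_embeddings = torch.randn(1, 196, modality_dim)  # frozen encoder output
image_tokens = adapter(image_embeddings)              # now in the LLM's space

text_tokens = torch.randn(1, 32, text_dim)            # embedded report/question
llm_input = torch.cat([image_tokens, text_tokens], dim=1)
print(llm_input.shape)  # torch.Size([1, 228, 4096]), fed to the frozen LLM
```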

For more detail about this general recipe, you can check out our LLaVA-Med paper, which was spotlighted at NeurIPS. Here I also want to add that the LLaVA paradigm represents a trailblazing innovation at MSR [Microsoft Research]: harnessing the text-processing capability of frontier models to synthesize multimodal instruction-following data. This has since become standard practice, including in training multimodal Phi and other popular vision-language models.

Now, we can extend this recipe to include a pixel-level decoder for holistic image analysis. This enabled us to develop BiomedParse, which can conduct object recognition, detection, and segmentation in one fell swoop through a unified natural-language interface: you can essentially talk to the image to conduct an analysis. BiomedParse is a single foundation model that attains state-of-the-art performance across nine modalities and six major object types.

It was just published in Nature Methods and, in the same issue, Nature Methods also published an external review that called BiomedParse a groundbreaking biomed AI foundation model and said that the implications of BiomedParse are profound. These are all very exciting, but we still have one last giant free lunch, which lies in the patient journeys themselves.

So, recall that GPT essentially learns by predicting the next token, next token, next token. In the same way, our patient embedding can learn by predicting the next medical event, and the next medical event. In this way, we can turn every single patient journey into a self-supervision training instance. We have conducted some initial explorations on structured medical events using a public dataset.
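As an illustration of that analogy, here is a toy sketch of turning a patient journey into next-event-prediction training instances; the event vocabulary and the journey itself are invented examples, not real data.

```python
# Illustrative sketch: each prefix of a patient journey predicts the next
# medical event, analogous to next-token prediction in GPT.
journey = ["diagnosis:NSCLC", "biopsy", "biomarker:PD-L1_high",
           "treatment:pembrolizumab", "imaging:partial_response",
           "progression", "treatment:chemotherapy"]

vocab = {event: i for i, event in enumerate(sorted(set(journey)))}

training_instances = [
    ([vocab[e] for e in journey[:t]], vocab[journey[t]])  # (context, target)
    for t in range(1, len(journey))
]

for context, target in training_instances[:3]:
    print(f"context={context} -> next event={target}")
```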

Interestingly, the scaling laws established for text are not very far from what we observe for structured medical events, and we are now extending the study to much larger datasets. Ultimately, we can imagine embeddings not just for patients but also for interventions, clinical trials, and so forth. In this way, we can potentially develop a universal embedding calculus for precision health.

As we mentioned earlier, clinical trials are the perfect beachhead. Providence is the third-largest health system in the US, and they have been using our research system daily in their tumor board to screen thousands of patients a year, including for this high-profile trial featured by The New York Times. Using Microsoft AI, Providence researchers were able to find actionable biomarkers for a majority of patients and, consequently, many patients were prescribed precision therapy, which substantially increased overall survival.

So this is super, super exciting. Ultimately, the dream is to drastically scale high-quality healthcare and drastically reduce healthcare costs, and thereby democratize such high-quality healthcare for essentially everyone. With the clinical trial matching capability, we can also essentially snap a finger and construct virtual case arms and control arms, and then conduct clinical research, hypothesis generation, and testing using real-world data.

A lot of those marquee lung cancer trials that can cost hundreds of millions of dollars to run can be simulated using real-world data, as we have shown with Providence collaborators, including for the original Keytruda trial.

Now, obviously, with exciting moonshots such as precision health, it takes way more than a village. At Microsoft Research, we are super blessed to be able to collaborate in depth with talented researchers across Microsoft Research itself, as well as with academia and with many key health stakeholders, such as large health systems and life-sciences companies. Many of the frontier biomedical models we have highlighted in this talk are already publicly available in Azure AI Foundry.

Now, obviously, much more remains to be done. But even with what we have today, there is already a lot of positive disruption we can bring forth to scale drug development and improve patient care.

Thank you.

The post Keynote: Multimodal generative AI for precision health appeared first on Microsoft Research.
