ADeLe: Predicting and explaining AI performance across tasks
http://approjects.co.za/?big=en-us/research/blog/adele-predicting-and-explaining-ai-performance-across-tasks/
Wed, 01 Apr 2026


At a glance

  • AI benchmarks report performance on specific tasks but provide limited insight into underlying capabilities; ADeLe evaluates models by scoring both tasks and models across 18 core abilities, enabling direct comparison between task demands and model capabilities.
  • Using these ability scores, the method predicts performance on new tasks with ~88% accuracy, including for models such as GPT-4o and Llama-3.1.
  • It builds ability profiles and identifies where models are likely to succeed or fail, highlighting strengths and limitations across tasks.
  • By linking outcomes to task demands, ADeLe explains differences in performance, showing how it changes as task complexity increases.

AI benchmarks report how large language models (LLMs) perform on specific tasks but provide little insight into the underlying capabilities that drive that performance. They do not explain failures or reliably predict outcomes on new tasks. To address this, Microsoft researchers, in collaboration with Princeton University and Universitat Politècnica de València, introduce ADeLe (opens in new tab) (AI Evaluation with Demand Levels), a method that characterizes both models and tasks using a broad set of capabilities, such as reasoning and domain knowledge, so performance on new tasks can be predicted and linked to specific strengths and weaknesses in a model.

In a paper published in Nature, “General Scales Unlock AI Evaluation with Explanatory and Predictive Power (opens in new tab),” the team describes how ADeLe moves beyond aggregate benchmark scores. Rather than treating evaluation as a collection of isolated tests, it represents both benchmarks and LLMs using the same set of capability scores. These scores can then be used to estimate how a model will perform on tasks it has not encountered before. The research was supported by Microsoft’s Accelerating Foundation Models Research (AFMR) grant program.

ADeLe-based evaluation

ADeLe scores tasks across 18 core abilities, such as attention, reasoning, and domain knowledge, assigning each task a value from 0 to 5 for each ability based on how much the task requires it. For example, a basic arithmetic problem might score low on quantitative reasoning, but an Olympiad-level proof would score much higher.

Evaluating a model across many such tasks produces an ability profile—a structured view of where the model performs well and where it breaks down. Comparing this profile to the demands of a new task makes it possible to identify the specific gaps that lead to failure. The process is illustrated in Figure 1.

Figure 1. Top: (1) Model performance on the ADeLe benchmark and (2) the resulting ability profiles, showing each model’s strengths and limitations across core abilities. Bottom: (1) Application of 18 scoring criteria to each task and (2) the resulting task profiles, showing the abilities each task requires.
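To make the comparison between task demands and model abilities concrete, the sketch below shows one simple way such profiles could be compared to flag likely failure dimensions. The ability names, values, and margin are illustrative assumptions, not values from the paper.

```python
# Illustrative sketch: comparing a task's demand profile (0-5 per ability)
# against a model's ability profile to flag likely failure dimensions.
# Ability names and values are hypothetical, not taken from the paper.

task_demands = {"quantitative_reasoning": 4.0, "domain_knowledge": 2.0, "attention": 1.5}
model_abilities = {"quantitative_reasoning": 3.2, "domain_knowledge": 4.5, "attention": 3.8}

def likely_failure_dimensions(demands, abilities, margin=0.0):
    """Return abilities where the task demands more than the model can supply."""
    return {
        ability: round(level - abilities.get(ability, 0.0), 2)
        for ability, level in demands.items()
        if level - abilities.get(ability, 0.0) > margin
    }

print(likely_failure_dimensions(task_demands, model_abilities))
# {'quantitative_reasoning': 0.8}  -> the likely source of failure on this task
```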

Evaluating ADeLe

Using ADeLe, the team evaluated a range of AI benchmarks and model behaviors to understand what current evaluations capture and what they miss. The results show that many widely used benchmarks provide an incomplete and sometimes misleading picture of model capabilities and that a more structured approach can clarify those gaps and help predict how models will behave in new settings.

ADeLe shows that many benchmarks do not isolate the abilities they are intended to measure or only cover a limited range of difficulty levels. For example, a test designed to evaluate logical reasoning may also depend heavily on specialized knowledge or metacognition. Others focus on a narrow range of difficulty, omitting both simpler and more complex cases. By scoring tasks based on the abilities they require, ADeLe makes these mismatches visible and provides a way to diagnose existing benchmarks and design better ones.

Applying this framework to 15 LLMs, the team constructed ability profiles using 0–5 scores for each of 18 abilities. For each ability, the team measured how performance changes with task difficulty and used the difficulty level at which the model has a 50% chance of success as its ability score. Figure 2 illustrates these results as radial plots that show where the model performs well and where it breaks down.

Figure 2. Ability profiles for 15 LLMs across 18 abilities. Left: OpenAI models. Middle: Llama models. Right: DeepSeek-R1 distilled models.
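As a rough illustration of how an ability score like those in Figure 2 can be read off, the snippet below fits a logistic curve to synthetic per-task outcomes and reports the demand level at which success probability crosses 50%. It mirrors the idea described above but is not the paper's exact procedure; the data are made up.

```python
# Minimal sketch with synthetic data: estimate an ability score as the demand
# level (0-5) at which a model's success probability crosses 50%.
import numpy as np
from sklearn.linear_model import LogisticRegression

levels = np.array([[0], [0], [1], [1], [2], [2], [3], [3], [4], [4], [5], [5]], dtype=float)
solved = np.array([1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0])  # 1 = model solved the task

clf = LogisticRegression().fit(levels, solved)
# Fitted curve: p(success) = sigmoid(w * level + b); it crosses 0.5 where level = -b / w.
ability_score = -clf.intercept_[0] / clf.coef_[0, 0]
print(f"ability score ≈ {ability_score:.2f}")
```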

This analysis shows that models differ in their strengths and weaknesses across abilities. Newer models generally outperform older ones, but not consistently across all abilities. Performance on knowledge-heavy tasks depends strongly on model size and training, while reasoning-oriented models show clear gains on tasks requiring logic, learning, abstraction, and social inference. These patterns typically require multiple, separate analyses across different benchmarks and can still produce conflicting conclusions when task demands are not carefully controlled. ADeLe surfaces them within a single framework.

ADeLe also enables prediction. By comparing a model’s ability profile to the demands of a task, it can forecast whether the model will succeed, even on tasks that are unfamiliar. In experiments, this approach achieved approximately 88% accuracy for models like GPT-4o and LLaMA-3.1-405B, outperforming traditional methods. This makes it possible to both explain and anticipate potential failures before deployment, improving the reliability and predictability of AI model assessment.

Whether AI systems can truly reason is a central debate in the field. Some studies report strong reasoning performance, while others show that models break down as tasks become more demanding. These results largely reflect differences in task difficulty. ADeLe shows that benchmarks labeled as measuring “reasoning” vary in what they require, from basic problem-solving to tasks that combine the need for advanced logic, abstraction, and domain knowledge. The same model can score above 90% on lower-demand tests and below 15% on more demanding ones, reflecting differences in task requirements rather than a change in capability.

Reasoning-oriented models like OpenAI’s o1 and GPT-5 show measurable gains over standard models—not only in logic and mathematics but also in interpreting user intent. However, performance declines as task demands increase. AI systems can reason, but only up to a point, and ADeLe identifies where that point is for each model.


Looking ahead

ADeLe is designed to evolve alongside advances in AI and can be extended to multimodal and embodied AI systems. It also has the potential to serve as a standardized framework for AI research, policymaking, and security auditing.

More broadly, it advances a more systematic approach to AI evaluation—one that explains system behavior and predicts performance. This work builds on earlier efforts, including Microsoft research on applying psychometrics to AI evaluation and recent work on Societal AI, both of which emphasize the importance of rigorous AI evaluation.

As general-purpose AI systems continue to outpace existing evaluation methods, approaches like ADeLe offer a path toward more rigorous and transparent assessment in real-world use. The research team is working to expand this effort through a broader community. Additional experiments, benchmark annotations, and resources are available on GitHub (opens in new tab).

AsgardBench: A benchmark for visually grounded interactive planning
http://approjects.co.za/?big=en-us/research/blog/asgardbench-a-benchmark-for-visually-grounded-interactive-planning/
Thu, 26 Mar 2026


At a glance

  • To successfully complete tasks, embodied AI agents must ground and update their plans based on visual feedback.
  • AsgardBench isolates whether agents can use visual observations to revise their plans as tasks unfold.
  • Spanning 108 controlled task instances across 12 task types, the benchmark requires agents to adapt their plans based on what they observe.
  • Because objects can be in different positions and states (e.g., clean or dirty), the same instruction can require different action sequences, even in the same environment.

Imagine a robot tasked with cleaning a kitchen. It needs to observe its environment, decide what to do, and adjust when things don’t go as expected, for example, when the mug it was tasked to wash is already clean, or the sink is full of other items. This is the domain of embodied AI: systems that perceive their environment and act within it.

The field has made rapid progress, but evaluating these systems is harder than it looks. Many benchmarks test perception, navigation, and physical control all at once, making it difficult to isolate whether an AI agent is actually using what it perceives to make better decisions or just getting lucky because the environment is predictable enough to script around.

To address this, we created AsgardBench. In the paper, “AsgardBench — Evaluating Visually Grounded Interactive Planning Under Minimal Feedback,” we describe how this benchmark poses a simple but demanding challenge: give an AI agent a household task, let it observe the environment through images, and see whether it can adjust its plan when what it perceives contradicts what it anticipated. Can it notice that the mug it needs to clean is already in the sink, or that it isn’t, and behave accordingly? That is the core question AsgardBench is designed to answer.

Built on AI2-THOR, an interactive 3D simulation environment used to train and evaluate AI agents on household tasks, AsgardBench positions agents near objects and gives them a small, fixed set of actions, such as find, pickup, put, clean, and toggle_on/off. At each turn, the agent proposes a full sequence of steps to complete the task, but only the first step executes. Throughout, the focus is squarely on plan adaptation: not whether an agent can navigate a room or manipulate an object, but whether it can use what it perceives to revise its next step.

For example, the agent may discover a mug to be clean, dirty, or filled with coffee, or it may observe that a sink contains many other items, so the same instruction can require different action sequences as the task unfolds. This process is illustrated in Figure 1.

Figure 1: Agent observations and corresponding action plans in AsgardBench. Each image is paired with the plan generated from that observation. This illustrates how AsgardBench requires agents to update or change their plans based on new visual evidence rather than following a fixed sequence.

How it works

Agents start in interaction-ready positions, so navigation and viewpoint selection are not factors. A find action brings objects into view, and the environment handles the details of container sizing and placement, so the agent does not need to reason about which cabinet or countertop to use. The only inputs are color images, a history of attempted actions with simple success or failure signals, and the agent’s own record of what it plans to do next.

At each turn, the agent proposes a complete sequence of steps to finish the task, but only the first step proceeds. It then receives new images and a simple signal—did that action succeed or fail? This prevents the agent from scripting everything upfront and forces it to re-evaluate and revise its plan at every step. Built-in limits on total steps and repeated actions prevent endless loops. Because the environment provides only simple feedback, the agent must be able to notice what it perceives (e.g., whether a mug is dirty, whether a faucet is running) and keep track of where it is in the task from one step to the next.
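A minimal sketch of this loop is shown below. The environment and agent interfaces are hypothetical stand-ins for the pattern just described, not AsgardBench’s actual API, and the step cap is an assumed value.

```python
# Sketch of the AsgardBench-style interaction loop described above.
# `env` and `agent` are hypothetical interfaces, not the benchmark's real API.

MAX_STEPS = 30  # assumed cap on total steps to prevent endless loops

def run_episode(env, agent, instruction):
    history = []                                  # (action, succeeded) pairs
    images = env.reset(instruction)               # initial color images only
    for _ in range(MAX_STEPS):
        # The agent proposes a full plan each turn, but only the first step runs.
        plan = agent.propose_plan(instruction, images, history)
        if not plan:
            break
        action = plan[0]
        images, succeeded, done = env.step(action)
        history.append((action, succeeded))       # minimal feedback: success/failure only
        if done:
            return True
    return False
```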

Evaluating AsgardBench

We tested several leading vision-capable models on AsgardBench and observed that high-performing models require visual grounding to consistently succeed. Across the models, visual input substantially improved performance: most models more than doubled success rates when given images versus text-only descriptions of the scene. This is in contrast to some prior benchmarks where agents could perform reasonably well without vision by relying on textual feedback on what went wrong.

Providing that kind of detailed failure information raises performance for all models in AsgardBench, too, but it can mask the real problem. The strongest vision-capable models still outperform text-only agents even when those agents are given detailed feedback, demonstrating that the benchmark requires visual grounding that text alone cannot replicate. Performance on AsgardBench is illustrated in Figure 2.

Figure 2. Success rates for image-based and text-only conditions. Visual input substantially improves performance for all but the weakest agents, while text-only performance remains low, indicating that AsgardBench requires perception-based reasoning.

The results also revealed where today’s agents consistently fall short. Across all models, the same problems kept appearing: agents attempted infeasible actions (e.g., trying to clean a mug that was not in the sink), got stuck in repeated action loops, misinterpreted subtle visual cues (on/off, clean/dirty), and lost track of where they were in the task from one step to the next. This points to three weaknesses: the inability to distinguish subtle visual details in cluttered scenes, the inability to maintain an accurate picture of task progress across multiple steps, and the inability to consistently translate what the agent sees into timely updates to its plan. Taken together, these point to where the next generation of embodied agents will need to improve.


Implications and looking ahead

AsgardBench is useful as both a diagnostic and development tool. By varying what feedback agents receive (none, minimal, or detailed), researchers can isolate whether performance gains come from better perception, better memory, or better planning. Promising directions include systems that combine stronger visual understanding with better state tracking, training approaches that emphasize learning to repair plans mid-task, and evaluation methods that measure not just whether an agent succeeds but how well it adapted along the way.

The failure patterns AsgardBench surfaces point toward a concrete next step: building systems that can make finer visual distinctions, keep track of what changed more reliably across steps, and learn to revise plans mid-task rather than plowing ahead on a script. Agents that make progress on these challenges should be meaningfully better equipped for the messiness of real-world environments: unexpected object states, cluttered scenes, and the constant need to adapt.

AsgardBench is open source and available on GitHub (opens in new tab), providing a foundation for advancing research in visually grounded planning.

Acknowledgements

We thank the AI2-THOR community for building the simulation platform and making reproducible embodied evaluation possible.

GroundedPlanBench: Spatially grounded long-horizon task planning for robot manipulation
http://approjects.co.za/?big=en-us/research/blog/groundedplanbench-spatially-grounded-long-horizon-task-planning-for-robot-manipulation/
Thu, 26 Mar 2026


At a glance

  • VLM-based robot planners struggle with long, complex tasks because natural-language plans can be ambiguous, especially when specifying both actions and locations.
  • GroundedPlanBench evaluates whether models can plan actions and determine where they should occur across diverse, real-world robot scenarios.
  • Video-to-Spatially Grounded Planning (V2GP) is a framework that converts robot demonstration videos into spatially grounded training data, enabling models to learn planning and grounding jointly.
  • Grounded planning improves both task success and action accuracy, outperforming decoupled approaches in benchmark and real-world evaluations.

Vision-language models (VLMs) use images and text to plan robot actions, but they still struggle to decide what actions to take and where to take them. Most systems split these decisions into two steps: a VLM generates a plan in natural language, and a separate model translates it into executable actions. This approach often breaks down for long, complex tasks because natural-language plans can be ambiguous or even hallucinated when specifying actions and locations (Figure 1). Because planning and spatial reasoning are handled separately, errors in one stage can propagate to the next. This raises a key question: can a VLM determine both what to do and where to do it simultaneously?

Figure 1. Failure cases for a VLM-based task planner. Given the instruction “discard all paper cups to bin,” the planner produces an action sequence with ambiguous cup references and a hallucinated step, “place inside the cabinet.” Ambiguous language and grounding lead to non-executable plans.

Planning with spatial grounding

To address this problem, we developed GroundedPlanBench (opens in new tab). In our paper, “Spatially Grounded Long-Horizon Task Planning in the Wild,” we describe how this new benchmark evaluates whether VLMs can plan actions and determine where those actions should occur across diverse real-world environments. We also built Video-to-Spatially Grounded Planning (V2GP), a framework that converts robot demonstration videos into training data to help VLMs learn this capability.

Evaluating both open- and closed-source VLMs on the benchmark, we found that grounded planning for long, complex tasks remains challenging. At the same time, V2GP improves both planning and grounding, with gains validated on our benchmark and in real-world experiments using robots.

How GroundedPlanBench works

To create realistic robot scenarios, we built our benchmark from 308 robot manipulation scenes in the Distributed Robot Interaction Dataset (DROID) (opens in new tab), a large collection of recordings of robots performing tasks. We worked with experts to review each scene and define tasks that a robot could perform. Each task was written in two styles: explicit instructions that clearly describe the actions (e.g., “put a spoon on the white plate”) and implicit instructions that describe the goal more generally (e.g., “tidy up the table”).

For each task, the plan was broken down into four basic actions—grasp, place, open, and close—each tied to a specific location in the image. Grasp, open, and close actions were linked to a box drawn around the target object, while place actions were linked to a box showing where the object should be placed.
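To illustrate what such spatially grounded steps might look like as data, here is a small sketch of a possible plan representation. The field names, coordinate convention, and example values are assumptions for illustration, not the benchmark’s actual schema.

```python
# Illustrative representation of a spatially grounded plan: each action is
# tied to a bounding box in the image. Field names and coordinates are
# hypothetical, not the benchmark's actual schema.
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in pixels

@dataclass
class GroundedAction:
    verb: str    # "grasp", "place", "open", or "close"
    target: str  # natural-language description of the object or region
    box: Box     # grasp/open/close: box around the object; place: box around the placement region

plan: List[GroundedAction] = [
    GroundedAction("grasp", "spoon", (412, 230, 468, 290)),
    GroundedAction("place", "white plate", (120, 310, 260, 420)),
]
```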

Figure 2 illustrates medium- and long-duration tasks, along with their explicit and implicit instructions. In total, GroundedPlanBench contains 1,009 tasks, ranging from 1–4 actions (345 tasks) to 5–8 (381) and 9–26 (283).

Figure 2. Examples of tasks in GroundedPlanBench, comparing explicit and implicit instructions: one about placing bottles and a cup into a sink, and another about placing eggs and vegetables into a silver bowl. Implicit instructions summarize explicit object lists into higher-level descriptions.

How V2GP works

The V2GP framework first uses the recorded gripper signals to detect moments when the robot interacts with objects. It then generates a text description of the manipulated object with a multimodal language model. Guided by this description, the system tracks the object across the video using Meta’s open-vocabulary image and video segmentation model, SAM3. The system then constructs grounded plans from the tracking results, identifying the object’s location at the moment it is grasped and where it is placed.

This process is illustrated in Figure 3. It yielded 43K grounded plans with varying lengths: 34,646 plans with 1–4 actions, 4,368 with 5–8 actions, and 4,448 with 9–26 actions.

Figure 3. Overview of the V2GP framework, which converts robot demonstration videos into spatially grounded plans: videos are segmented into temporal sub-actions, matched with active objects, spatially grounded with grasp boxes and placement points, and converted into unified training samples containing language instructions and structured action plans.
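The outline below sketches the pipeline just described as a single function, with the individual components passed in as callables. It is a hypothetical rendering of the steps, not the released V2GP implementation.

```python
# Hypothetical sketch of the V2GP pipeline described above. The three
# components (event detection, object description, tracking) are passed in
# as callables; this is not the released implementation.

def video_to_grounded_plan(video, gripper_signal, instruction,
                           detect_interaction_events, describe_active_object, track_object):
    actions = []
    # 1. Detect grasp/release moments from the recorded gripper signal.
    for event in detect_interaction_events(gripper_signal):
        frame = video.frame_at(event.grasp_time)
        # 2. Describe the manipulated object with a multimodal language model.
        object_text = describe_active_object(frame)
        # 3. Track the object through the video with an open-vocabulary segmenter.
        track = track_object(video, object_text)
        # 4. Read off grounded actions: where the object is grasped and where it is placed.
        actions.append({"action": "grasp", "object": object_text,
                        "box": track.box_at(event.grasp_time)})
        actions.append({"action": "place", "object": object_text,
                        "box": track.box_at(event.release_time)})
    return {"instruction": instruction, "actions": actions}
```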

Evaluating decoupled versus grounded planning

To evaluate models on GroundedPlanBench in real-world robotic settings, we used Qwen3-VL (opens in new tab) as our base model. Qwen3-VL is a vision-language model that processes text, images, and video to support multimodal reasoning, and it performs well on standard multimodal reasoning benchmarks without additional training. We first evaluated it, along with proprietary models, on GroundedPlanBench without any task-specific training (Table 1). We then fine-tuned it on V2GP training data and compared it with a decoupled approach, in which planning and grounding are handled separately.

In this setup, a VLM first generated a plan describing what the robot should do. We used GPT-5.2 or Qwen3-VL-4B for this step. The plan was then passed to a spatial grounding model, Embodied-R1 (opens in new tab), which converted the plans into executable signals. Embodied-R1 is a large vision-language model trained for embodied reasoning and pointing, where the model identifies specific locations in the image to guide the robot’s actions. We selected it for spatial grounding because its training targets embodied spatial reasoning and point-based localization, making it well suited for grounding model outputs to specific locations in an image.

Figure 4 highlights a key limitation of this approach: ambiguity in natural language. For example, Qwen3-VL-4B generated grasp actions by referring to “napkin on the table” for all four napkins in the scene, leading Embodied-R1 to ground each action to the same napkin. GPT-5.2 produced more descriptive phrases, such as “top-left napkin” or “upper-center napkin,” but these were still too imprecise for the model to reliably distinguish between them and were again grounded to the same object.

Figure 4. Decoupled vs. grounded planning for the instruction “Put four napkins on the couch.” Several baseline methods ground actions to the wrong objects because of ambiguous language, while the grounded V2GP method correctly identifies the napkins and their placement locations.

This limitation becomes more pronounced in real-world robot manipulation, where environments are often cluttered and complex. As a result, decoupled approaches struggle to work reliably. In contrast, our approach, grounded planning, performs planning and grounding jointly within a single model and improves both planning and grounding performance.

Table 1 presents evaluation results for open- and closed-source VLMs on GroundedPlanBench. Multi-step planning and handling of implicit instructions were challenging for all models, while training Qwen3-VL-4B and Qwen3-VL-32B with V2GP led to significant improvements in grounded planning.

Table 1. Evaluation results on GroundedPlanBench for proprietary and open-source VLMs on explicit and implicit instructions of varying lengths. Task Success Rate (TSR) measures the percentage of tasks completed correctly, requiring all actions to be both correctly planned and spatially grounded. Action Recall Rate (ARR) measures the proportion of generated actions that match the sub-actions defined in the dataset, regardless of order. The V2GP approach improves performance on both metrics and achieves the best results (shown in bold).
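For reference, a minimal sketch of how these two metrics could be computed is shown below. Action matching is simplified here to exact string equality, which is an assumption; the benchmark’s actual criteria also check spatial grounding against annotated boxes.

```python
# Minimal sketch of the metrics in Table 1. Action matching is simplified to
# exact equality; the benchmark's real criteria also verify spatial grounding.

def task_success_rate(task_outcomes):
    """task_outcomes: list of booleans, one per task (all steps planned and grounded correctly)."""
    return sum(task_outcomes) / len(task_outcomes)

def action_recall_rate(predicted_actions, reference_actions):
    """Fraction of reference sub-actions recovered by the prediction, regardless of order."""
    matched = sum(1 for ref in reference_actions if ref in predicted_actions)
    return matched / len(reference_actions)

print(task_success_rate([True, False, True, True]))                      # 0.75
print(action_recall_rate(["grasp cup", "place sink", "open tap"],
                         ["grasp cup", "place sink", "open tap", "close tap"]))  # 0.75
```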


Implications and looking forward

Integrating planning and grounding within a single model offers a path to more reliable robot manipulation in real-world settings. Rather than relying on separate stages, this approach keeps decisions about what to do and where to act tightly coupled. Even so, models still struggle with longer, multi-step tasks and implicit instructions: they must reason over longer sequences of actions, maintain consistency across many steps, and interpret goals described indirectly, as in everyday language.

Looking ahead, a promising direction combines grounded planning with world models, which enable robots to predict the outcomes of actions before executing them. Together, these capabilities could allow robots to decide what to do, where to act, and what will happen next, bringing us closer to systems that can plan and act reliably in the real world.

Acknowledgements

This research was conducted in collaboration with Korea University, Microsoft Research, and the University of Wisconsin-Madison, and was supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant (No. RS-2025-25439490) funded by the Korea government (MSIT).

Will machines ever be intelligent?
http://approjects.co.za/?big=en-us/research/podcast/will-machines-ever-be-intelligent/
Mon, 23 Mar 2026

Are machines truly intelligent? AI researchers Subutai Ahmad and Nicolò Fusi join Doug Burger to compare transformer-based AI with the human brain, exploring continual learning, efficiency, and whether today’s models are on a path toward human intelligence.


Technical advances are moving at such a rapid pace that it can be challenging to define the tomorrow we’re working toward. In The Shape of Things to Come, Microsoft Research leader Doug Burger and experts from across disciplines tease out the thorniest AI issues facing technologists, policymakers, business decision-makers, and other stakeholders today. The goal: to amplify the shared understanding needed to build a future in which the AI transition is a net positive. 

In this first episode of the series, Burger is joined by Nicolò Fusi of Microsoft Research and Subutai Ahmad (opens in new tab) of Numenta to examine whether today’s AI systems are truly intelligent. They compare transformer-based large language models (LLMs) with the human brain’s distributed, continuously learning architecture, exploring differences in efficiency, representation, and sensory-motor grounding. The discussion probes what intelligence really means, where current models excel or fall short, and what future AI systems might need to bridge the gap.

Transcript

[MUSIC] 

DOUG BURGER: This is The Shape of Things to Come, a Microsoft Research Podcast. I’m your host, Doug Burger. In this series, we’re going to venture to the bleeding edge of AI capabilities, dig down into the fundamentals, really try to understand them, and think about how these capabilities are going to change the world—for better and worse.   

In today’s podcast, I’m bringing on two AI researcher-experts: Nicolò Fusi, who is an expert in digital, transformer-based large language model architectures and learning, and Subutai Ahmad, who is an expert in biological architectures, specifically the human brain. And the question we’re going to discuss is, are machines intelligent?  

And what I mean by that: are digital intelligence, large language models, on a path to surpass humans, or are the architectures just so fundamentally different that one will do one set of things well, the other will do something else very well? And so we’ll be debating the architecture of intelligence across digital implementations and biological implementations because the answer to that question, I think, really will determine the shape of things to come. 

[MUSIC FADES] 

I’d like to ask each of my guests to introduce themselves. Tell me a little bit about your background and what you’re currently working on—to the extent you can talk about it—in AI. So, Nicolò, would you please start? 

NICOLÒ FUSI: Yeah, thank you, Doug, for having us and having me here. It’s so much fun. So I’m Nicolò Fusi. I’m a researcher at MSR [Microsoft Research]. So Doug is my boss, so I will be very, very, very good to Doug in this podcast.  

No, but jokes aside, my own background is in Bayesian nonparametrics. That’s what I started studying. So Gaussian processes and things like that. And then equally, I would say, in computational biology, because I found it, like, one of the most interesting use cases for AI techniques. And that, kind of, has been true throughout my career. And pretty much like everybody else, eventually, I moved away from the kernel methods and the Bayesian nonparametrics and I started working more on language models, transformer models, with a particular eye towards information theory and the connection between information theory and generative modeling. And that’s, kind of, one of the main things I do today other than, kind of, managing the research of people who do much more interesting work than I do. [LAUGHS]  

BURGER: I have to interject there, Nicolò, because you dragged a piece of bait across my path.  

FUSI: I figured.  

BURGER: You know, at Microsoft Research, I have a management rule that I can’t tell anyone what to do because we hire some of the best people in the world. You have to trust them. And everyone is always completely free to call BS on me. And so Nicolò was joking there; [LAUGHTER] he does not have to toe the party line. In fact, I encourage him not to. So, so … 

FUSI: I just have to be well-behaved. That’s the only thing I will say. [LAUGHS] 

BURGER: Yeah. Thank you, thank you for baiting me. [LAUGHS] Because he knew exactly what he was doing. And I love him for it.  

Subutai, can you tell us a little bit about yourself? 

SUBUTAI AHMAD: Sure. Thank you so much, Doug, for having me. I’m really looking forward to the conversation between us all.  

So I see myself fundamentally as a computer scientist. You know, I’ve been studying computer science for longer than I care to admit. But something changed for me during my undergrad years. I decided to minor in cognitive psychology, and I started to get really interested in how the brain works. 

And to me, understanding intelligence and implementing intelligence was the hardest problem a computer scientists could ever solve. So I got very, very interested in that. You know, I couldn’t see how to really commercialize that. I was very interested in making products and stuff. So I stopped, you know, working on that for a while. I did a number of startups doing computer vision, you know, video processing, a lot of that stuff. 

And then when Jeff Hawkins started Numenta back in 2005 with the idea of really deeply understanding how the brain works and figuring out how to apply that to AI, for me, it was like all my worlds coming together. This, like, this is what I had to do. None of us thought [LAUGHS] it would take as long as it did. We spent the last couple of decades really deeply trying to understand neuroscience from a computer scientist—from a programmer’s—standpoint, the underlying algorithms. And that’s really what I’m passionate about, just trying to translate what we understand about the neuroscience to today’s AI.  

And in terms of what we’re working on today, it’s, you know, the human—maybe we’ll get into some of this—the brain is super efficient in how it works—power efficient, energy efficient—and we’re trying to embody those ideas and trying to make AI a lot more efficient than it is today. 

BURGER: Great. I think we’ll get into efficiency a little bit later in the podcast because that’s a subject that’s near and dear to my heart, you know, being a computer architect originally by training.  

I want to go back to, you know, one of the reasons I got involved with Numenta is, you know, Subutai and I have been exchanging emails, like, discussing collaborations, you know, visiting each other through the years, and the thing that really stuck with me was when I read one of the earlier books from Jeff On Intelligence (opens in new tab). And there was an example in the book that talked about how, you know, the human brain learns continuously. I think biological organisms in general learn continuously.  

And the anecdote that I remember was this anecdote if you’re walking down your basement steps, you know, you’re walking down the stairs to your basement and there’s one step that’s always been a few inches off and you decide to fix it, and so you raise it so it’s even with the others, and then the next time you go down the stairs, you don’t remember and you’re wildly off and, you know, you hit that step, you hit it earlier or later than you anticipated, you go out of balance. You’re flailing around. You know, you get all this adrenaline. You think you’re going to pitch headfirst down the stairs. Hopefully you don’t. And then the second time you do it, you’re a little off balance, but it’s not crazy. And the third time you maybe notice a little bit, and the fourth time, it’s, like, it’s your basement stairs. 

And so somewhere between that first time down and the third and fourth times down, there are molecular changes in your brain that have learned the new timing of your basement steps. And I remember just that example vividly from the book. And that got me thinking, wow, this is so different from the way our digital AI works. I’ll turn it over to you to comment for that and then I think we’ll go into the digital. 

AHMAD: Yeah, no, that’s a great example. I think it’s remarkable how our brain is constantly modeling our entire world at such a granular level, and we’re not even aware of it perceptually. Like, you know, that example of the steps is probably not … you wouldn’t consciously be aware of it, yet if something is different about anything in your world that you’re very familiar with, you’ll instantly notice it. And then you’ll, you know, you’ll update your world model, you’ll adjust, and you’ll continue on. It’s really remarkable how the brain’s able to do that so seamlessly. 

BURGER: And a lot of that is based on neurotransmitters, right? Because there’s just a … you know, when you have that physical reaction to “I’m about to pitch down the stairs,” you get a flood of transmitters that actually changes the way your brain’s learning or at least the rate. 

AHMAD: Yeah, there’s a flood of neurotransmitters and neuromodulators, as well, that invoke change, sometimes very rapidly. Another example, you know, if you touch a hot stove—that’s the canonical example—you will learn that very, very quickly. So there’s a lot of chemical changes that happen. But it’s also really interesting that we can update things and update our world knowledge without impacting everything else that we know. This is something that’s very, very different, again, from today’s AI models. We’re able to make these changes in a very contextual and very, sort of, fine-grained way.  

BURGER: So, Nicolò, I want to go and talk a little bit now to transformers. So I think, you know, you and I and Subutai were all working in the AI field, you know, many years before 2017, when the transformer hit. You know, I was building, you know, with my team hardware to accelerate RNNs [recurrent neural networks], LSTMs [long short-term memory], you know, which had this awful loop-carried dependence, you know, the bottlenecked computation, and then the transformer was just much more parallelizable.  

So what do you think’s really going on in these things? And maybe we could start—I know you and I have talked a lot about this—maybe just start with the major blocks. You know, you’ve got the attention layer. You’ve got the feedforward layer. You’ve got, you know, the encoder stack and the decoder stack and the latent space in between. Can you just, kind of, walk us through those pieces at a high level and tell us what you think is going on? 

FUSI: Yeah. Yeah, I mean, I have a very opinionated view of why transformers are so great.  

BURGER: That’s why you’re here. [LAUGHS] 

FUSI: Maybe, like, yeah, maybe I’ll inject it. I don’t know. I don’t think it’s a super novel creative opinion, but it is an opinion. So I guess the two principal … the two main components you already described: the, you know, the transformer [read: attention] layers and the feedforward layers. One way to think about them is, how does information in your context relate to each other and what is every token referring to, for instance, in the case of transformers in language models? 

So by context, we mean, like, the information you feed through the model, that the model keeps continuously generating and appending to. 

BURGER: So like your chat history. 

FUSI: Your prompt. Your what? Your chat history or your particular prompt in a chat session.  

BURGER: OK.  

FUSI: That prompt, which is a sequence of words, gets discretized in a series of tokens. Tokens can be individual words, can be multiple words, kind of, connected together. The way we go from words to tokens typically is through an algorithm that tries to basically collapse as much as possible. Multiple words, like “the dog,” may be just one token as a first, kind of, level of compression to feed into the model. So it just tries to bring things together as efficiently as possible.  

Then there is, you know, within these models, there is a transformer layer. This transformer layer or this attention layer, sorry, tries to basically figure out what the “the” refers to—the term “the” in “the dog,” or “the dog jumps on the table,” “jumps” refers to the dog. So there is this kind of, like, mapping that happens.

And then there is, like, feedforward layers, which in modern large language models, they store a lot of information. Like, that’s kind of, like, where the knowledge typically kind of sits in, the things that the model just knows. You know, that, I don’t know, if you slam your arm against [the] cup of water on your table, that cup of water falls off the table. That’s something that the model, kind of, has baked in through reading a lot about cups falling off of tables when they’re hit. 

So that’s, kind of, those are, for me, the two fundamental components, and the reason why I have an opinionated view is that, you know, honestly, I do believe that RNNs and, you know, even state-space—modern incarnations of state-space models—are good enough to learn over these, you know, language data or whatever or vision data or audio data. 

The good thing about transformers is that they do two things very well. One is they get out of the way. They don’t have this notion of “everything has to be encoded through a state” like recurrent networks. And two, they do that very computationally efficiently as you were saying. There isn’t a computational bottleneck. And so they created this nice overhang where they happen to be the right architecture at the right time to unlock enough flow of information through the model … 

BURGER: Yeah.  

FUSI: … that we could get through these amazing things. 

BURGER: Let me press you on one thing. Like, you know, in the attention blocks, you can figure out which words or which tokens relate to which tokens. So I put in the prompt and it’s finding all the relations and then feeding those relations up to, you know, the feedforward layer—well, the feedforward unit within a layer. And you said that knowledge is encoded there, but then what does it really mean for those maps to then access knowledge, but then you project it back into, you know, the output and then feed it up to the attention block in the next layer?  

FUSI: Again, yeah.  

BURGER: So it seems kind of weird that I’d be, like, accessing knowledge and then taking that knowledge, merging it, and going back to another attention map. 

FUSI: Well, you can see it as a mixing operation that happens in the feedforward part of the layer. You know, like, you’re attending, then you’re mixing, and, kind of, like, reprojecting to some space with higher-information content or, like, a different level of information extraction. And then you’re putting it back into, “OK, so let me do another round of processing” and, kind of, attending and then a mix again. And then I do it again and then I do it again.  

So I think that the information that is present in the prompt and in the, you know, that has been baked into the weights gather further and further refined. Whether that refinement is extraction of structure or aggregation into higher-level concepts, I’m not sure. I think it’s just structure gets extracted and things that are irrelevant get kind of pushed away. But that doesn’t necessarily mean that it gets aggregated through the architecture.  

BURGER: So now I’m going to try to, like, restate what I think I hear you saying. So, you know, we’re adding information and we’re kind of adding information at a higher level but not necessarily throwing away the low-level information, at least that’s not relevant, right?  

FUSI: Yeah. 

BURGER: Because, you know, if the higher-level stuff depends on the low-level stuff, I have to have that first. And so then you get to the top of the encoder block and you’re in the latent space with all of that information kind of maximized. Is that a way to think about it? And if you agree, can you talk about what the encoder block really is and what the latent space is? 

FUSI: I tend to agree, yes. I mean, there is … you’re describing … I think you’re describing what I think is happening, which is there is given the context in your prompt and given the task that the model perceives or, like, figures out that you’re doing, it has to highlight and pull out the relevant information. And it does that not by summarizing layer by layer, but it does it by, you know, increasing the prominence of that information and suppressing other things. So I think that’s ultimately what happens up to the point where you reach this beautiful point in concept space, which identifies both your intent and the things in the prompt and in the knowledge of the model that are necessary to solve it. 

BURGER: And so one last question, and then I want to go to Subutai for a second.  

So now when we go through the decoder stack, are we just going the other way and stripping out the high-level concepts early and then getting down to the granular tokens? Or, you know … because you go up through the encoder stack, those attention blocks and feedforward layers, to get to that magical latent space. And now we’re going to go the other direction. How do you think about that other direction through the decoder stack, which is the same primitives as the encoder stack? 

FUSI: Same primitives. You can think of it as kind of the reverse operation. Like you, you never lost information throughout. You just kind of suppressed or privileged different kinds of information. And now you’re basically just projecting it back out to a space that is, you know, intelligible. And it’s, kind of, where the model gets its … I hesitate to use the term reward because it has a particular implication, but that’s, kind of, where the loss gets computed and then gets pushed back through the model. 

BURGER: Right, as you’re trying to evolve and train all those parameters—the relationship between words, the information in the feedforward layers, the design of that latent space, and the extraction of the knowledge from it. 

FUSI: That’s right. And so in encoder-decoder model, you push through the whole thing, you decode back to a particular token, which for people who don’t know, it’s, like, literally a number out of a vocabulary, like word No. 487. And if it was word No. 1,500, you get, you know, like, … 

BURGER: Something else. 

FUSI: … a bad reward. Yeah. Yeah. And then … and if you got it right, you get a positive signal that then just flows back through the model. 

BURGER: I’d like to go over to Subutai now. So after hearing this, you’ve studied, you know, neuroscience and the neocortex and cortical columns and all of this for a long time, and you and I have had lots of debates. Is the human brain doing something different than that? You know, are we just building latent spaces, then extracting? The architecture is very different, but what’s going on under the hood? 

AHMAD: Yeah, the architecture is very different. You know, as Nicolò was describing what happens throughout a transformer stack, I was trying to relay and relate, you know, what we know in the brain, as well.  

In a typical, you know, transformer model, there is, at the end of the day, there is a single latent space from which the next token is output. That does not happen in the brain. There are thousands and thousands of latent spaces that are, sort of, collaborating together, if you will.  

You know, a lot of what we publish is under the moniker the Thousand Brains Theory of Intelligence. And Jeff has published a book a few years ago on that (opens in new tab). And that, kind of, dates back to discoveries in neuroscience from the ’60s and ’70s by the neuroscientist Vernon Mountcastle (opens in new tab), who was a professor at Johns Hopkins. 

BURGER: Yup. 

AHMAD: And what he discovered … he made this remarkable discovery that, you know, our neocortex, which is the biggest part of our brain—that’s where all intelligent function happens—is actually composed of roughly 100,000 what he called cortical columns (opens in new tab)

BURGER: Right.  

AHMAD: And each cortical column is maybe 50,000 neurons. And there’s a very complex microcircuit and microarchitecture between the neurons in a cortical column.  

But then there’s 100,000 of them, and every part of your brain—whether it’s doing visual processing, auditory processing, language, thought, motor actions—they’re all composed of this, essentially, this same microarchitecture. And this was a remarkable discovery. It says that there’s a universal architecture. It’s not a simple one. It’s complex. But it’s repeated throughout the brain. 

And that’s where this, you know, the idea of the Thousand Brains … each of these cortical columns is actually a complete sensory-motor processing system. It has inputs; it has outputs. It’s getting sensory input. It’s sending outputs to motor systems. And it’s building, in our theory, complete world models. So there isn’t a single latent space. There’s thousands of these latent spaces. 

And each little cortical column is trying to understand its little bit of the world. You know, one cortical column might be getting, at the lowest level, maybe one degree of visual information from the top right-hand corner of your retina. Another one might be focusing on specific frequencies in the auditory range. You know, each one has its own little view of the world, and it’s building its own little world model. 

And then they all collaborate together. There’s no top or bottom here. There’s no homunculus in the brain. Everything is sort of equal. And they’re all simultaneously collaborating and voting and coming up to, you know, what is the, you know, consistent interpretation of all of these sensory inputs that we’re getting? What is the single consistent, you know, concept, if you will, and, based on that, make the motor actions that are most relevant to that. 

So it’s a sensory-motor loop. It’s a, you know, it’s a constantly recurring system; we’re constantly making predictions. As we discussed earlier, you know, we are constantly learning. Every cortical column is constantly updating its connections, constantly updating its weights. It’s building and incrementally improving its world model constantly. So it’s a massively distributed, you know, set of processing elements that we call cortical columns that are, they’re all equal, operating in parallel. 

So I think there are similarities, for sure, between them. But at least the way I described it, I think it’s very different in its operation than what I understand today’s LLMs to be. I don’t know if you agree with that or not. 

FUSI: Yeah, I … To better understand, I had a question, which is, are these cortical columns relying on the fact that these are essentially multiple views of the same process and those multiple views, like, the, you know, the part of the sensory input that gets allocated or subdivided, is it happening at the same time point? So in other words, if you could artificially delay by some time t some cortical columns with respect to the rest, would the learning suffer?  

AHMAD: Yes, absolutely. Yeah.  

FUSI: And so in other words, how important is it that it’s, kind of, on the same schedule? 

AHMAD: [LAUGHS] Yeah, I mean, that’s another … I mean, LLMs today, you know, you get your input, one layer processes it, then the next, then the next, and the other layers are not operating. In the brain, it’s not like that. Everything is operating in parallel asynchronously. And this is important. They’re constantly trying to make predictions and so on. So if you were to artificially slow down some of your cortical columns, you would absolutely suffer. Your thinking would absolutely suffer. 

BURGER: I wanted to interject here just because this is where … this discussion is where, you know, I got super interested in the difference and then spent a bunch of time with Subutai to learn from him. So if I think about my skin, you know, which is an organ, you know, as I understand it, there’s a cortical column attached to each patch of my skin and the size of that patch, kind of, corresponds to the nerve density there.  

AHMAD: That’s right. Yeah. 

BURGER: So in my brain, there is a set of cortical columns that are skin sensors, and I could actually … if I numbered all the cortical columns in the brain, I could draw a map on my skin and say, “This is No. 72 in this patch. This is No. 73 in this patch.” Now are human cortical columns, like, better than, say, what we see in a mouse? And, of course, this is a leading question because I know the answer. 

AHMAD: [LAUGHS] Yeah. So, yes, it, you know, cortical columns in your sensory areas, primary sensory areas, each, you know, pay attention to or get input from a, you know, some patch of your skin somewhere on your body. And there’s many more cortical columns associated with your fingertips than, you know, a square centimeter of your back, for example. So there’s definitely, you know, areas of sensory information that we pay a lot more attention to and devote a lot more physical resources to.  

In terms of a mouse and humans, it’s pretty remarkable that the cortical columns … so all mammals have cortical columns; all mammals have a neocortex. All mammals have cortical columns from a mouse all the way up to humans. And mice have cortical columns that are very, very similar to what a human has. It’s not identical. There are differences. But by and large, the architecture of a cortical column in a mouse is, you know, very, very similar to cortical columns in humans. Human cortical columns are bigger. There are more neurons, and there’s more detail there, but essentially, it’s the same. And …  

BURGER: Maybe just scaled up a little bit.  

AHMAD: Yeah. So evolution basically discovered this structure—that it’s really excellent for processing information and dealing with it—and then through, you know, very fast in evolutionary time, basically figured out that if you could scale up the number of cortical columns, you get more intelligent animals. And that’s what happened very, very fast evolutionarily. 

FUSI: I didn’t know about the unevenness of cortical columns present. Like, this is not … I’m not a neuroscientist, and so this is interesting because one of the biggest frustrations with many modern architectures of models is that they deploy a constant amount of computation no matter what the input is.  

So I go through the same number of layers whether I’m trying to predict the word “dog” after “the” or whether I’m trying to solve, like, give the final answer to a very complicated math question or, you know, whether a theorem was proven or not in the prompt. And so that’s interesting because, like, some current instantiations of modern architecture actually deploy … try to cluster things together such that you have a constant amount of information that you then push together through the model. [LAUGHTER] And so maybe like on my fingertips, I need more processing than I need on my elbow because, like, you know … and so this, kind of, makes sense. 

BURGER: Nicolò is being humble. He was working on this problem two years ago and told me about it. It was one of the things I learned from you that made me think differently. So … 

FUSI: I just like to refer to people who are working on this … [LAUGHS] 

BURGER: Random average people who are not all necessarily brilliant AI scientists.  

So the prediction part of this, though, is really what’s fascinating to me, because, again, something else Subutai and I discussed many years ago, you know, if I’m, like, moving my finger towards the table and…my brain is making predictions because I have a world model. It knows a table is there. And the cortical columns representing that patch of skin, as it’s getting closer, they’re starting to predict that I’m going to feel something that feels like the table. And, yup, there; I hit it. Prediction met.  

But if I touched it and it felt really icy cold or super hot or fluffy or not there—I pass through it—I’d get a flurry of activity because the prediction wouldn’t match the world model, and that’s where learning would happen.  

Subutai, does that sound like the right model and intuition?  

AHMAD: Yeah, that’s definitely a very important component of it. We’re constantly making predictions. And as you said, you know, you’re moving your right fingertip down; you know, perhaps you’ve never sat in this room before or, you know, seen this table before, you would still have a prediction, a very good prediction of it. 

BURGER: Yeah. Because you know what a table is. 

AHMAD: You know what a table is. And if it was different, you would, you know, you would notice it right away. But if your left hand, which you weren’t paying attention to, also felt icy cold, then you would notice that, as well. So you’re actually making not just one prediction; you’re making thousands and thousands of predictions constantly about … 

BURGER: Every cortical column. 

AHMAD: Every cortical column is making predictions. And if something were anomalous, highly anomalous, you would notice it. So this is something, you know, we don’t often realize; we’re making very, very granular predictions constantly. And when things are wrong, we do learn from it.  

And the other interesting thing—and this is, again, possibly different from how LLMs work— you know, if I were to tell you to touch the, you know, the bottom surface of the table, you could without, again, without looking at the table or opening your eyes, you would be able to move your finger in and touch the bottom of your table because you have a, you know, set of reference frames that relate to …  

BURGER: Yup … 

AHMAD: There you go. Yep. You’re able to do it. 

BURGER: I did it! Yeah. Amazing. 

AHMAD: Even though you maybe never have been in this room; maybe you’ve never seen this table before. It doesn’t matter. 

BURGER: I’ve been in this room because we had to prep for the podcast series. But I didn’t touch the underside of the table, that’s for sure. [LAUGHS] 

AHMAD: Yeah, exactly. [LAUGHS] So, you know, we know where things are in relation to each other, where our body is in relation to everything, and we can very, very rapidly learn. And again, if the bottom part of the table was anomalous, you would notice it and potentially remember that. 

FUSI: I’m not going to lie. I was expecting you to find something under that table, [LAUGHTER] like a talk show. 

AHMAD: Or chewing gum or something. 

FUSI: And if you reach under the table, you’re going to find a copy of my paper. [LAUGHS] 

BURGER: [LAUGHS] You know, if I was smarter and better prepared, that’s exactly what would have happened. But, sorry, guys.  

I think you told me something, Subutai, you know, that … and I’ll give a little bit of preamble.  

So, you know, the brain has these dendritic networks in each neuron, and they form synapses. And so a neuron fires, and that, you know, the axon of the neuron that’s firing will propagate a signal through the synapses, which might do a little signal processing to the dendrites of the downstream neurons, and those downstream—the dendrites can then prime the neuron to fire. That’s one of the fundamental mechanisms. And it’s the formation of those synapses, you know, between the upstream and downstream neurons, the dendrites, that seem to be the basis of learning, and to me, that feels a little bit like an attention map. 

AHMAD: Yes.  

BURGER: So maybe the dendritic network is doing something akin to self-attention, and we have some work going on in that direction at MSR. But the thing you told me was that your brain is actually forming an incredibly large number of synapses speculatively. In some sense, sampling the world when something happens in case it will recur. You know, it’s a more … maybe it’s a version of Hebbian learning, right? You know, things that fire together, wire together. 

AHMAD: Exactly. 

BURGER: But then if that pattern doesn’t recur, then they get pruned. And I’m just going to ask, you know, what is the fraction of your synapses that get turned over every three or four days, you know, ballpark? 

AHMAD: OK. Yeah, I remember this. This was an absolute mind-blowing study in [The Journal of] Neuroscience (opens in new tab). So, you know, the way a lot of learning happens in the brain is by adding and dropping connections. 

In AI models, it’s usually strengthening, you know, high-precision floating-point number, making it higher or lower. But you’re not adding and dropping connections. The connections are always—in fact, everything is fully connected, right, between layers. And so in the brain, you’re always adding and dropping connections. That’s a fundamental mechanism by which we learn, one of the fundamental mechanisms.  

What I read in this study is that they looked at adult mice and adult animals, and what they found is that they would look at the number of synapses that were connected over the course of a couple of months—and they were able to trace individual synapses in this particular part of the brain—and what they found is that every four days, 30% of the synapses that were there were no longer there four days later. And there was a new 30%. And there’s a huge number of connections that are constantly being added and constantly being pruned. And my theory of what’s going on there is that we’re always speculatively trying to learn things. 

So, you know, there’s all sorts of random coincidences and things that we are exposed to on a day-to-day basis. We’re constantly forming connections there because we don’t know what’s actually going to be required and what’s real and what’s random. Most of it’s random; most of it’s not necessary. And the stuff that actually is necessary will stay on. But we’re constantly trying to learn. 

This is a part of continuous learning that’s often not appreciated, I think, is that we’re constantly forming new connections, and then we prune the stuff that we don’t need. In an AI model, if you were to do that, it would just go, I don’t know, it would go bananas. [LAUGHTER]  

BURGER: Well, so let’s double-click on that. So when you told me that, the way I … 

AHMAD: This is mind-blowing, this 30%.  

BURGER: It’s crazy.  

AHMAD: Your brain is going to be totally different a few days from now. 

BURGER: It’s so mind-blowing. When you told me that, I spent some time processing it, so a whole bunch of synapses were created and destroyed during that time.  

But it just made me think that we have, you know, we have all of these columns getting all of this input continuously. You know, eyes, hearing, smell, taste, skin, heat, and then, you know, interactions with people, and then planning and experiences, just at every level. And they’re constantly sampling all this noise coming in and basically filtering out the noise. It’s like, kind of, like a low-pass filter. But when something statistically significant recurs, it’s going to lock and then become persistent.  

AHMAD: Yeah, yeah, I think so. There’s so much that’s happening, and you’re constantly learning, and, you know, when you touch a hot stove or something, there’s a flood of dopamine specific to those areas that caused these synapses to strengthen very, very quickly. You know, most of these synapses that are learned are very, very weak synapses.  

BURGER: Yup. 

AHMAD: And so, yeah, you know, when you look … in this study, they also quantified the turnover in, kind of, strong synapses versus weak synapses. And it’s comforting to know that the strong synapses stay there. It’s really these weak synapses that are constantly added and dropped. And then some of them will become strong. 

BURGER: Now I want to go back … return to Nicolò, but with an observation.   

So when I’m training a transformer, it’s also a prediction-based system. You know, I’m running … I have my input in the training set; I have my masked token or the next token I’m trying to predict. I run it through. I look at how successfully did it make that prediction, and the worse it was, the, sort of, the steeper the error, you know, I drive back through the network. So, you know, if it’s spot-on, I don’t learn very much. But if the prediction is way off, I’ve got to change a bunch of stuff. That sounds analogous to what Subutai was just describing with the cortical columns. 

FUSI: No, that’s right. I mean, it connects with, I don’t know, one big pet peeve of mine in pretraining, in particular around pretraining these language models.  

BURGER: OK. 

FUSI: So again, for context, like, language models in particular, but, you know, many other instantiations of large models, are trained in a few phases usually. One of them is pretraining, where you have some ground truth text and you remove, let’s say, just the last word, and then you ask the model to predict the last word. And that’s when you get that loss. Do you get the word right? Do you get the word wrong?  

One of the big problems that I have is that, you know, in human experience, we do not get feedback on every single thought.  

The problem with language models, the way we are training them, at least in pretraining, is that they do a thing called teacher forcing. So they guess the word, then they immediately get the signal, and then the right word gets filled in, and then they predict the next one. 

So when you go through, like, a passage of text, you constantly get this reward. And it’s such a bizarre way to train a model. It’s necessary because you want a lot of flow of supervision. Like, you want, like, a lot of supervision to essentially use all the computation available. But at the same time, it actually makes the models arguably a little bit worse than what they would be if you had enough compute to train them without this. 

I went on a tangent just because it’s a pet peeve. [LAUGHS]  
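
[EDITOR’S NOTE: A minimal sketch of the teacher-forcing setup described above, assuming a toy PyTorch model. It illustrates the general recipe only, not any particular model’s training code: the network is always conditioned on the true prefix and receives a loss signal at every single position.]

```python
import torch
import torch.nn as nn

vocab_size, d_model = 100, 32
embed = nn.Embedding(vocab_size, d_model)
backbone = nn.LSTM(d_model, d_model, batch_first=True)  # simple stand-in for a transformer decoder
head = nn.Linear(d_model, vocab_size)

tokens = torch.randint(0, vocab_size, (1, 16))      # one toy token sequence
inputs, targets = tokens[:, :-1], tokens[:, 1:]     # predict position t+1 from positions <= t

hidden, _ = backbone(embed(inputs))                 # always conditioned on the TRUE prefix,
logits = head(hidden)                               # never on the model's own earlier guesses
loss = nn.functional.cross_entropy(
    logits.reshape(-1, vocab_size), targets.reshape(-1)
)                                                   # a supervision signal at every position
print(loss.item())
```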

BURGER: It’s a really important point, though, because your goal when you’re training a model is to get to your loss target with the minimal cost and time. Or, of course, like, fixed budget and, like, lowest loss target. 

But, you know, biological systems, also, their goal is survival with energy minimization. And so, like, once you’ve built a world model that works, right, like touching the table, touching the underside of the table—nope, still nothing exciting there—like, it takes very little energy to do that. And I think a tragedy is that we all have these supercomputers in our heads. You know, the neocortex is what, about 10 watts? And it’s this amazing thing, right, that can compose symphonies. But once we have a world model, a lot of us just stop learning because it’s comfortable, right. You don’t have to perturb the state. You can go through … and, you know, I mean, how many of us go through every day and all of our predictions succeed [LAUGHTER], and there’s no surprises, you know?  

So all the new synapses get swept away, right. That’s not a goal of pretraining because then you’re just wasting energy. But we’re trying to minimize energy consumption. So it does feel, kind of, aligned to me in some sense. 

So I’ve got a straw man I want to hit you with, but before we do, Nicolò, I want you to talk about your view on compression, like LLMs as compressors, because I know this is something you’re very passionate about and opinionated about. And I’ve learned a lot from you on this, too. 

And then, Subutai, after this, I’d like to hear your biological response. I mean, your response from a biological perspective. [LAUGHTER] And …  

AHMAD: You’ll get both.  

BURGER: That’s right, of course. And then I want to try … I want to throw out this hybrid straw man. So, Nicolò, tell us about compression. 

FUSI: The view is that basically the generative models are compressors in an information theoretic sense, and so trying to come up with a better generative model is equivalent to trying to find the best compressor for some data. And … 

BURGER: Now when you say compressor, do you mean lossless or lossy? 

FUSI: I mean lossless.  

BURGER: OK. 

FUSI: You can basically look at literally my much-maligned objective function that you use for pretraining, which is, you know, next-token prediction, and you can basically draw a complete parallel to what you would do if you were trying to come up with the, you know, try to do compression, which is coming up with the shortest possible code for something that you’re trying to compress. 

And so the two things are the same, and it, kind of, fits into a broader picture that, you know, like, goes back to Occam’s razor and Kolmogorov complexity and Solomonoff’s principle of induction, which is, you want short descriptions for likely things that happen in the world and you want your algorithm that produces those short descriptions to be also short. That’s the minimum description length principle.  

And I do feel like it fits in, kind of, also what you were saying about the concept of you have a good world model, why look for surprise? Because it simultaneously affects both terms, both the algorithm, like your own world model, but also the loss that you incur when something unexpected happens. 

And so if I’m an agent in the world trying to minimize the minimum description length of the world, I’d like to go and seek some in-distribution data such that I don’t bump up my surprise term too much. 

BURGER: Right. And I think you said at some point that, you know, when I’m training a model, even though you took the same loss point, you know, between Model A and Model B, if I have a steeper loss curve in Model A than Model B, you know, it’s getting to a better, sort of, compression-based vocabulary faster, which makes it more general. The shape of that curve matters from a compression perspective. 

FUSI: Yeah. I mean, I think it would help here to expand on what I was talking about in terms of, … 

BURGER: Yes. Please.  

FUSI: … like, minimum description length principle. The minimum description length principle is basically the loss of the model you’re training; that’s one component. And so it’s a sum over the mistakes you make at predicting or, you know, the mistakes you make at predicting each word. And that’s one term. And the other term is how long it takes you in code to describe the model and the training procedure, … 

BURGER: Right. 

FUSI: … to get to that training curve, to produce that training curve.  

BURGER: Right. 

FUSI: So, yes, if you look at it collectively, one term is, kind of, fixed. It’s the amount of code it would take you to write out a language model, for instance, in code. Like, literally implement it, not the weights, just implement the initialization of it and then the training loop. And then on the other side, you have this training loss that gets generated as you start observing data. And, of course, because it’s a sum, you want to minimize really the area, like, you want to minimize the sum. And so, like, a flatter curve is much better than, like, the steeper curve, you know, even if it ends up at the end to be slightly better. 

BURGER: Yeah. Concave is better than convex. 

FUSI: Among other things, yes. [LAUGHTER] 
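
[EDITOR’S NOTE: A rough sketch of the two-term minimum description length idea discussed above: a roughly fixed cost for writing down the model and its training loop, plus the cumulative next-token loss, converted to bits, that accumulates while streaming the data. The numbers below are invented purely for illustration.]

```python
import math

def description_length_bits(model_code_bits, per_token_loss_nats):
    # Term 1: bits to write out the model and training loop (roughly fixed).
    # Term 2: cumulative next-token loss over the data stream, converted from nats to bits.
    data_bits = sum(loss / math.log(2) for loss in per_token_loss_nats)
    return model_code_bits + data_bits

# Two hypothetical loss curves that end at almost the same value. The curve that
# spends more of its time at low loss accumulates a smaller sum, so it yields the
# shorter overall description of the data stream.
curve_a = [4.0, 1.5, 1.0, 0.9, 0.9]
curve_b = [4.0, 3.5, 3.0, 2.0, 0.85]
print(description_length_bits(10_000, curve_a))
print(description_length_bits(10_000, curve_b))
```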

BURGER: Sorry. So, you know, I think that we could do a whole episode on this compression view because it’s really fascinating. And the lossless part of it is what blew my mind. And I think, you know, I’m guessing there are multiple camps here, and you’re squarely in one camp, so I’m guessing we’ll get a bunch of feedback from the other camps. 

So, Subutai, you know, can I think of cortical columns as compressors? 

AHMAD: Yeah, it’s a good question. You know, I, you know, there’s so much in the compression literature that you can draw insight from. You know, if you look at the representations in cortical columns and the population activity that neurons have, you know, some of the things you have to deal with are that the brain doesn’t have a huge nuclear power plant attached to it. 

You know, we only have 12 watts or so to process everything we want to do, and the representations that evolution has discovered are incredibly sparse. And what that means is that you may have thousands and thousands of neurons in a layer, but only about 1% of them will actually be active at a time. And so it’s a very small subset of neurons that are actually active.  

I don’t know about this minimum description length, whether that applies. I can say a couple of things about that. There’s, you know, by and large, the representations are very sparse when you’re predicting well. When you see a surprise, there’s a burst of activity.  

BURGER: Yup. 

AHMAD: When there’s something that’s unusual, there’s a lot more neurons that fire, and … 

BURGER: That’s why learning is tiring!  

AHMAD: That’s why learning [LAUGHTER] … exactly. No, no, that’s right, that’s right.  

And so what we think is happening is that, you know, the actual representation of something is a very small number of neurons. When you’re surprised, there may be many things that are consistent with that surprise, and so your brain represents a union of all of those things at once. 

And when you have a very sparse representation, you can actually have a union of many, many different things without getting confused. So that’s what we think is going on there. So it is a very compressed, very efficient representation. And because it’s such a small percentage of neurons that are firing, we are very, very parsimonious in how we represent things and extremely energy efficient metabolically. 

BURGER: I wanted to get to the efficiency point, but before I do, you know, you talk about this 1, you know, 1 to 2% of the neurons firing. But it’s, actually, the brain is actually much sparser than that at a fine grain, right?  

AHMAD: Yes, yes.  

BURGER: Because, you know, you have 1% of the neurons firing, but they aren’t connected to all the other neurons in the region. 

AHMAD: That’s right. Yeah. 

BURGER: So really the sparsity should be the product of the connectivity fraction times the activity factor. 

AHMAD: Yeah. Yeah. 

BURGER: Right. That’s about one out of 10,000. Something like that. 

AHMAD: Exactly. Yeah. So something like maybe 1% of the neurons are firing at any point in time, and maybe 1% of the connections that are possible are actually there at any point in time. So it’s a very, very small, you know, subnetwork through this massive network that’s actually being activated, a tiny percentage of neurons going through a very, very tiny piece of the full network. 

You know, it’s common to, you know, some people say, “Oh, we’re only using 1% of our brain.” That’s not true. It just means at any point in time, you’re only using 1%, but at other points in time, a different 1% is being used. So, you know, the activity does move around quite a bit. But at any point in time, it’s extremely small. 
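
[EDITOR’S NOTE: The back-of-the-envelope arithmetic behind the sparsity point above, using the ballpark figures from the conversation rather than measured values.]

```python
active_neuron_fraction = 0.01        # ~1% of neurons firing at any instant
existing_connection_fraction = 0.01  # ~1% of possible connections actually present
effective_fraction = active_neuron_fraction * existing_connection_fraction
print(f"Fraction of the full network engaged at any moment: ~{effective_fraction:.0e}")
# Prints ~1e-04, i.e. roughly one part in ten thousand.
```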

BURGER: So, OK, the sparsity, I think, you know, the representation—how the brain is doing this compression biologically—is super fascinating. And I want to go on a little bit of a detour now to efficiency. So I remember in 2017 when in MSR we were building, you know, hardware acceleration for RNNs. 

And then the transformer hit, and they were optimized, you know, to be highly parallelizable across this quadratic attention map for GPUs. The way I would describe it is that that transition to semi-supervised training moved us from an era when we were really data limited, like you had to have good high-quality labeled data, to an era when we were compute limited.  

And when that transition happened, we hockey-sticked from, “I’m building faster machines but I’m limited by data” to the bigger machine I can build, as long as I have enough, you know, unlabeled data of high quality, the better I can do with the model. And so we went on the supercomputing arms race, and now we’re building these, like, just gargantuan machines. 

And really, we’ve kind of been brute-forcing it. I mean, we’ve done a lot of things to optimize, like quantization, you know, and others, and, you know, a better process node, you know, a better, more efficient tensor unit design. But to first order, we’ve been training bigger models by building bigger systems.  

And I just wonder, do you think that the brain at this 10 to 12 watts in the neocortex just has a fundamentally more efficient learning mechanism? Or do we think that, you know, what we’re doing in transformers in the most advanced silicon is as efficient, we’re just building much larger, more capable models? 

AHMAD: Oh, I think without a doubt, transformers are extremely inefficient and very, very brute force. We touched on this a little bit earlier in the attention mechanism, where we’re, you know, transformers are essentially comparing every token to every other token. I mean, there are architectures which reduce that, for sure, but it’s essentially an n-squared operation. And we’re doing this at every layer. 

I mean, there’s nothing like that in the brain. Our processing, you know, in some sense, the context for the very next word I’m about to say is my entire life, right? And the amount of time I take to produce the next word doesn’t depend on the length of the context at all. It’s constant time with respect to the context. 

So it’s a significant, you know, reduction in the compute that’s required. You can kind of think about, like the brain—I think has somewhere around maybe 70 trillion synapses. When I say the brain, I mean the neocortex, has about 70 trillion synapses. And it’s using only 12 watts. And a synapse is roughly equivalent to a parameter. 

And if you were to take the most efficient GPUs today and try to run a 70 trillion parameter model, it would be something like a megawatt of power. It’s tens of thous … it’s orders of magnitude more inefficient than what our brain is doing. So I absolutely believe that. 
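
[EDITOR’S NOTE: The comparison above, written out with the ballpark numbers quoted in the conversation. These are order-of-magnitude illustrations, not measurements.]

```python
neocortex_synapses = 70e12  # ~70 trillion synapses, per the estimate above
neocortex_watts = 12        # ~12 W
gpu_model_params = 70e12    # a hypothetical 70-trillion-parameter model
gpu_watts = 1e6             # ~1 MW of GPU power, per the estimate above

brain = neocortex_synapses / neocortex_watts
gpus = gpu_model_params / gpu_watts
print(f"Neocortex: ~{brain:.1e} synapses per watt")
print(f"GPUs:      ~{gpus:.1e} parameters per watt")
print(f"Gap:       ~{brain / gpus:,.0f}x")   # several orders of magnitude
```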

BURGER: The metric I use, to go back to your point, you know, is, this is something, I think we talked about this back in the day, right? When, you know, after this kicked off for a few years, we were trying to project, like, how far would this go under the current model to inform the research and the directions you took. Which is why I got so interested in sparsity and working with you.  

And we would look at a training run and just say, how many joules did it take to train the whole model? How many parameters do we have? And sort of what’s our parameters per joule? And, by that metric, you know, we were off by many orders of magnitude from where the brain is, but I don’t know that that’s the right metric. So any thoughts on that? 

AHMAD: Yeah. I mean, in some ways, you know, transformers, you know, embody more knowledge in them than any human has.  

BURGER: Right.  

AHMAD: It has memorized, you know, the entire internet’s worth of knowledge, essentially. 

BURGER: All scientific papers … 

AHMAD: All scientific papers. You know, good and bad, whatever, you know, it has memorized everything. So that’s something that, you know, humans just cannot do. So there’s definitely stuff that’s better in transformers than humans.  

But fundamentally, I think, you know, we’re extremely efficient in how we process the next token or the next bit of information that’s coming in. And I think there’s a lot we can learn from the brain and apply to LLMs and future AI models there. 

FUSI: I was going to ask a question related to that because … forget memorizing the internet. But let me give you another example that transformers do really well. And I’m wondering, like, you know, the human aspect of this or the brain aspect of this because transformers, because of the n-square computation, they’re really good at stuff, like a needle in the haystack. 

So I can tell you right now, I can speak, I can talk to you, and I can tell you the password is something silly like “podcast microphone blue,” whatever. That’s the password. And then I can proceed and read the entire Odyssey or a bunch of other books to you out loud for the next 5 or 6 hours. And then I can ask the transformer, what was the password? And the transformer will do this nice n-square computation many times, and it will spit out the password.  

A human, you know, there will be a decay of that password. And then at some point, it won’t remember, and depending on the human, it may be in the first chapter of the Odyssey or like at the end, but … so fundamentally the type of computation that is done is very different. So it always makes me wonder about the efficiency because it’s just, like, it’s a different type of computation. So the efficiency of … like, efficiency is kind of like, what are you doing divided by how good are you at doing it. And so when the things we’re doing are so incomparable in many ways, that always makes me … always troubles me a little bit. I don’t know… I don’t know if there’s any question in there. [LAUGHTER] 

AHMAD: Yeah. I mean, transformers can do the stuff that humans find very, very difficult to do. Absolutely. You know, maybe there’s a way to get the best of both. I don’t know. You know, I don’t know that it’s fundamentally necessary to have such brute-force computation to get all of these features. 

FUSI: That’s right. 

BURGER: Yeah. Yeah, it is a weird thing because, you know, this is why memory palaces work so well. Like, there is a way, though, for a human to remember that my microphone is gray. It’s not actually blue, Nicolò. 

FUSI: Mine is blue. You don’t see it. It’s off camera. You see, your world model …  

BURGER: It’s off camera. Yeah, I know. I was just teasing you.  

But there’s a way, like, if I can just connect it to enough things, get that connectivity graph, then I’ll remember it because it’s captured the signal out of the noise and connected to enough things I can retrieve it. And retrieval would be a whole other topic we don’t have time to get into today.  

But I do … now, I want to go to the straw man. So let’s take continual learning off the table. Let’s imagine that, as I go through my day, I’m just saving all of the sensory data to put in my training set. And now imagine that I take 100,000 little transformer blocks, and I’m training them each with what they’re seeing. 

OK, I replay the day so I don’t have to, again, I don’t have to worry about continuous learning and whatever cross-cortical column, you know, routing feature of the outputs, the inputs, and there’s—Subutai, we’ve talked about this—there’s a complex set of wiring there to bring features from here to there that gets learned. If I replicated that, could a transformer block kind of do what the cortical columns are doing?

Could I just instrument all my sensory patches with little transformer blocks and then wire them up in the right way and have it work? 

AHMAD: I think there’ll be … there’s still a couple of things we need. One is that cortical columns are fundamentally sensory motor. And so they’re actually, each one, each cortical column is initiating actions, as well. So you cannot have a static dataset fundamentally ahead of time. It’s always dynamic because we’re constantly making movements to get the next bit of data. And so … 

BURGER: Couldn’t I tokenize that, though? 

AHMAD: I mean, you could tokenize the input and you can tokenize the output, but, you know, if you were to play the same set of inputs back again to a network that … a cortical column that’s randomly wired differently, it may make a different set of actions. And so as soon as it makes the first action that’s different, that dataset is no longer valid, right? It’s, you know, there is … you can’t fundamentally … you have to have a simulation of an environment rather than a static one-way dataset, if that makes sense.  

So I think that’s one piece that I think’s missing in transformers today, is this, sort of, sensory-motor loop. And then the other piece we talked about is continuous learning. 

BURGER: Yeah. 

AHMAD: I guess you said take it off the table, but … 

BURGER: It’s fundamental.  

AHMAD: Fundamental … different. Yeah, yeah. And maybe one other difference. We talked, you know, much earlier about a single latent space and the prediction that’s being made at the top of the transformer, where you compute the loss function, and that’s back-propagated through the transformer. That’s not how neurons learn. Neurons are making … every neuron is actually making predictions, and every neuron is getting its input. 

And it’s learning independent of anything that happens at the top. And so it’s a much more granular learning signal. And information does flow from the top to bottom. But there’s also many, many other sources of information that it’s learning from. So it’s different in that sense, as well, mechanistically.  

BURGER: The reason I ask, and now I’d like to get into, you know, some of the … the fun speculation because I’ve just … it’s been a phenomenal discussion with the two of you. I think we’ve kind of elucidated the differences. Something I’ve wondered after I’ve talked to both of you … and, you know, Nicolò, kind of learning about this compression view of the world, lossless compression, and, Subutai, just, you know, the Thousand Brains Theory and these cortical columns and the sampling of, you know, the world to capture the signal that you can learn from. 

So let’s say that I was able to design a really small, efficient digital cortical column. Maybe it’s transformer-based with some, you know, a sparse representation and some sensory-motor mechanism built in. Maybe it’s more dendritic-based, you know, mapped into digital hardware. And I put a cortical column on every sensor I have in the world, associated with every person, and wire them up together with some of this and then have a, you know, billions of them that can form higher-level abstractions. Like, what do you think would happen? What could we do? 

AHMAD: That’s a fantastic thought exercise, I think [LAUGHS]. You know, again, assuming the cortical column is faithful and can generate, you know, or suggest motor actions, as well. I mean, in some sense, you could potentially have a super intelligent system, right, that’s far more intelligent than anything else on the planet.  

Now we’re scaling the number of cortical columns, you know, not from a mouse, you know, to a hundred thousand columns that a human might have, but potentially billions of cortical columns and way more. And there’s no reason to think there’s any fundamental limit there. So this sort of a system is, I think, the way that superintelligent systems will eventually be built.  

BURGER: But this is a very different direction … 

AHMAD: It’s a very different … 

BURGER: … than the one we’re currently headed down with, like, these monolithic models where we’re doing tons of RL, you know, to capture, you know, to get high-value human collaboration in distribution. 

AHMAD: Yes. It’s completely different than the direction we’re proceeding.  

So I think they, you know, to go down that path, there needs to be a fundamental rethinking of some of our assumptions, potentially even down to the hardware architectures that are necessary to implement it. The, you know, fundamental learning algorithms, the fundamental training paradigm. We talked about, you know, you can’t have a static dataset. You’re constantly moving around in the world and doing things. So it’s a very, very different way of going about AI than what we’re doing today. 

BURGER: Sounds like a great time to be an AI researcher. 

AHMAD: Absolutely. [LAUGHTER] 

BURGER: Nicolò, what was your reaction to that hypothesis? 

FUSI: It sounds super interesting. I mean, my brain was churning. You know, my background is very different. And so, like, I’m in a much worse position to answer this question. But I was starting to think, OK, so let’s say I do this. What would be my loss function? What, you know, how would information flow through the system? Like, sounds like cortical columns would each have their own loss that then I would aggregate—and then I would add a contribution that is, like, higher level. 

And then back to my question. You know, how is the temporal information coordinated? Because one way to see this is that, you know, the way I’m coming to understand this is that it’s kind of like a multi-view framework. 

You have the same phenomena represented across multiple independent, but simultaneous, views. And so part of me feels like you need to tie together these cortical columns in such a way that they all get that gradient feedback if you’re training with gradient-based methods, for instance. And so that’s, kind of, it feels super, super interesting. 

It is related to a lot of, you know, very superficially, to a lot of ideas in machine learning around, hey, is it better to have one giant super deep network? Is it better to have a bunch of shallow networks? But the difference is also in the way you train them, right? We typically train this bunch of shallow networks on kind of the same objective and the same data and not typically into an experiential cycle. Whereas this sounds like this is a different way to do it.  

BURGER: Right, right. I think … I want to pull this back around to the title of the podcast. And so I’ll share an observation. You know, so I’ve been using some of the latest models to code. You know, they’re getting better really fast. I’ve been using them to kind of relearn some of the physics that I never really understood deeply. 

You know, especially in general relativity, like E=mc². Like, why is c in there at all, right? Just stuff like that. Because now it can actually explain it to me, and I can keep beating at it until I understand it, and then, of course, work. 

And at some point, I asked the model, “Can you describe how I think?” And I was just curious. And it, you know, it gave me a page-long description that made my jaw drop because I said, this thing knows me better than I know myself. I don’t think any human being, including me, could have captured, kind of, the way my approach to learning and my brain works, and I just read it as, like, yep, that’s right. And I learned something about myself.  

So I wouldn’t say that it passed the Turing test because this is way beyond the Turing test. This was like, this thing knows me way better, you know, than I thought any machine ever could. I mean, I’m having a conversation with it. It could be human, but it’s superhuman. So in some sense, it’s, like, intelligent beyond human capabilities with its ability to discern patterns in how someone’s interacting. And yet it’s a tool. You know, it’s not conscious. It doesn’t have agency, embodiment, emotion. It understands a lot of that stuff from the training data. But at the end of the day, it’s a stochastic parrot, right? It’s got, you know, it’s got the weights, and I give it a token, and it outputs a token. So, like, are these machines intelligent or not?  

FUSI: I’ll let Subutai answer first. [LAUGHS] 

AHMAD: OK. You know, you know, it’s definitely a savant, right? It knows a huge amount about the world. It’s absorbed a lot of stuff, and it can articulate that in ways that are just amazing. And, you know, it’s taken your chat history with, you know, presumably thousands of chats and been able to summarize that in a way that’s remarkable. 

At the same time, I think, you know, transformers are not intelligent in the way that a three-year-old is, right? A three-year-old human is very curious, is constantly learning. It can learn almost anything. And, you know, a three-year-old Einstein was able to learn and eventually come up with theories that shook the world. That, you know, E=mc². 

And so, you know, could a transformer do that? I don’t think so. And so I think there’s still a difference. There’s things it can do that are amazing. But there are still basic things that a child can do that transformers cannot do. So I think there’s still a gap there. Exactly how to articulate it, and how to bridge that gap, is, of course, the trillion-dollar question. But it is bridgeable. And there is a gap today. 

BURGER: Right. Nicolò? 

FUSI: You know, I think, from my perspective, they are intelligent. And from my perspective, I go back to the definition of intelligent, which is like, can you achieve your objectives in a variety of environments? It’s a very basic fundamental, but it’s kind of, you know, it can be embodied, a form of embodied intelligence, an agentic intelligence. If I plop you in an environment, and I give you an objective, can you achieve it? And the wilder the environment, the harder the task is.  

And I do think … I agree with Subutai. Like, there is a jaggedness of intelligence we keep describing.  

BURGER: Yup. 

FUSI: Like these things cannot be simultaneously super good, you know, Olympiad-level mathematicians and still give you stupid answers when you’re trying to, I don’t know, you know, figure out which cable goes where in your … in your car’s battery, you know, like, whatever. 

BURGER: [LAUGHS] Well, then it’s better than me. I’m not an Olympiad-level mathematician, and I do stupid stuff all the time. 

FUSI: I know, exactly. Well, you know, whatever that was, that was a bad example. But you get it. But part of it goes back to the compression view. Like, I do believe that intelligence is compression. So the ability to come up with succinct explanations for complex phenomena, and even succinct explanations for complex worlds, implies or leads to your ability to operate within them. And the fact that we have these things that can prove crazy theorems but at the same time fail at fairly rudimentary tasks is a sign that, yes, transformers are great in terms of the inductive biases they put on the world and the computation they do, but we’re ultimately all subject to the No Free Lunch Theorem (opens in new tab). 

You know, across the world, there’s a whole set of tasks that you could be pursuing. You know, you have certain inductive biases that kind of privilege certain tasks at the expense of others. And there isn’t, like, a thing yet that has expanded our set of tasks that are addressable. And so I do think that it’s a matter of rethinking our approach to a few things, which I think is likely both on the architecture front and on the losses and the way we train these systems. I think there is an opportunity to expand the intelligence frontier of these models. But yeah, from my perspective, they are intelligent already, just in a jagged way. 

BURGER: It’s such an interesting question, and I know a lot of people write a lot about this, so I don’t think treading any new ground here. But, you know, there’s the diversity of the tasks you can excel at. You know, are you able to handle nuance and understand things deeply? Are you able to learn continuously? Right now, the systems can’t, right. Are you embodied? I don’t know if that matters. Do you have an objective? Well, we could give them one. Are you conscious? Is that … I mean, that’s a whole other thing.  

So it just feels like there’s a bunch of check boxes, and we’ve checked a bunch of them, and a bunch of them are unchecked. And maybe there’s no consensus on, like, where that threshold is because there are many dimensions of intelligence, and some of which humans don’t even have. 

FUSI: And that’s why we have the terms AGI and ASI, and people are debating the G and the S—what is general, what is specialized. So there is, like, it’s a huge discourse, like, for sure. But that’s why we had to start characterizing. But if you go back to the definition, going back to my schooling, go back to the definition of intelligence from Plato and Aristotle and Descartes, like, in some sense, you see the goalpost moving through the centuries around what we define as intelligent.  

BURGER: Right.  

FUSI: And I feel like we are still doing it. 

BURGER: Yeah. We’ll be doing it for a long time, you know, which in AI velocity is probably another like four or five years.  

Hey, I just want to thank you both for the dialogue. You know, I treasure both of you as, you know, intellects and scholars and friends. It was just a joy to nerd out with you all. So thank you both for taking the time. 

AHMAD: Thank you so much, Doug, for having me.  

FUSI: Thank you for having us. This was great. 

[MUSIC] 

STANDARD OUTRO: You’ve been listening to The Shape of Things to Come, a Microsoft Research Podcast. Check out more episodes of the podcast at aka.ms/researchpodcast or on YouTube and major podcast platforms. 

[MUSIC FADES] 

The post Will machines ever be intelligent?  appeared first on Microsoft Research.

]]>
Systematic debugging for AI agents: Introducing the AgentRx framework http://approjects.co.za/?big=en-us/research/blog/systematic-debugging-for-ai-agents-introducing-the-agentrx-framework/ Thu, 12 Mar 2026 16:38:45 +0000 http://approjects.co.za/?big=en-us/research/?p=1163539 As AI agents transition from simple chatbots to autonomous systems capable of managing cloud incidents, navigating complex web interfaces, and executing multi-step API workflows, a new challenge has emerged: transparency. When a human makes a mistake, we can usually trace the logic. But when an AI agent fails, perhaps by hallucinating a tool output or […]

The post Systematic debugging for AI agents: Introducing the AgentRx framework appeared first on Microsoft Research.

]]>
Three white line icons, showing network, workflow, and bug‑analysis icons, on a blue‑to‑purple gradient background.

At a glance

  • Problem: Debugging AI agent failures is hard because trajectories are long, stochastic, and often multi-agent, so the true root cause gets buried.
  • Solution: AgentRx (opens in new tab) pinpoints the first unrecoverable (“critical failure”) step by synthesizing guarded, executable constraints from tool schemas and domain policies, then logging evidence-backed violations step-by-step.
  • Benchmark + taxonomy: We release AgentRx Benchmark (opens in new tab) with 115 manually annotated failed trajectories across τ-bench, Flash, and Magentic-One, plus a grounded nine-category failure taxonomy.
  • Results + release: AgentRx improves failure localization (+23.6%) and root-cause attribution (+22.9%) over prompting baselines, and we are open-sourcing the framework and dataset.

As AI agents transition from simple chatbots to autonomous systems capable of managing cloud incidents, navigating complex web interfaces, and executing multi-step API workflows, a new challenge has emerged: transparency.

When a human makes a mistake, we can usually trace the logic. But when an AI agent fails, perhaps by hallucinating a tool output or deviating from a security policy ten steps into a fifty-step task, identifying exactly where and why things went wrong is an arduous, manual process.

Today, we are excited to announce the open-source release of AgentRx (opens in new tab), an automated, domain-agnostic framework designed to pinpoint the “critical failure step” in agent trajectories. Alongside the framework, we are releasing the AgentRx Benchmark (opens in new tab), a dataset of 115 manually annotated failed trajectories to help the community build more transparent, resilient agentic systems.

The challenge: Why AI agents are hard to debug

Modern AI agents are often:

  • Long-horizon: They perform dozens of actions over extended periods.
  • Probabilistic: The same input might lead to different outputs, making reproduction difficult.
  • Multi-agent: Failures can be “passed” between agents, masking the original root cause.

Traditional success metrics (like “Did the task finish?”) don’t tell us enough. To build safe agents, we need to identify the exact moment a trajectory becomes unrecoverable and capture evidence for what went wrong at that step.

Introducing AgentRx: An automated diagnostic “prescription”

AgentRx (short for “Agent Diagnosis”) treats agent execution like a system trace that needs validation. Instead of relying on a single LLM to “guess” the error, AgentRx uses a structured, multi-stage pipeline:

  1. Trajectory normalization: Heterogeneous logs from different domains are converted into a common intermediate representation.
  2. Constraint synthesis: The framework automatically generates executable constraints based on tool schemas (e.g., “The API must return a valid JSON response”) and domain policies (e.g., “Do not delete data without user confirmation”).
  3. Guarded evaluation: AgentRx evaluates constraints step-by-step, checking each constraint only when its guard condition applies, and produces an auditable validation log of evidence-backed violations.
  4. LLM-based judging: Finally, an LLM judge uses the validation log and a grounded failure taxonomy to identify the Critical Failure Step—the first unrecoverable error.
The AgentRx workflow: Given a failed trajectory, tool schemas, and domain policy, AgentRx synthesizes guarded constraints, evaluates them step-by-step to produce an auditable violation log with evidence, and uses an LLM judge to predict the critical failure step and root-cause category.
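
To make the guarded-constraint idea concrete, here is a minimal sketch of what step-by-step validation can look like. The class and field names below are illustrative assumptions, not the actual AgentRx interfaces; see the open-source repository for the real implementation.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Constraint:
    name: str
    guard: Callable[[dict], bool]    # when does this constraint apply?
    check: Callable[[dict], bool]    # does the step satisfy it?
    evidence: Callable[[dict], str]  # what to log if it is violated

# Hypothetical constraint derived from a domain policy such as
# "Do not delete data without user confirmation."
constraints = [
    Constraint(
        name="delete_requires_confirmation",
        guard=lambda step: step["tool"] == "delete_record",
        check=lambda step: step.get("user_confirmed", False),
        evidence=lambda step: f"delete_record called without confirmation: {step['args']}",
    ),
]

def validate(trajectory: list[dict]) -> list[dict]:
    """Walk the trajectory step by step and collect evidence-backed violations."""
    violations = []
    for i, step in enumerate(trajectory):
        for c in constraints:
            if c.guard(step) and not c.check(step):
                violations.append({"step": i, "constraint": c.name, "evidence": c.evidence(step)})
    return violations

trajectory = [
    {"tool": "lookup_record", "args": {"id": 42}},
    {"tool": "delete_record", "args": {"id": 42}},  # no confirmation: flagged as a violation
]
print(validate(trajectory))
```

The resulting violation log is the kind of evidence an LLM judge can then use to decide which violation marks the critical failure step.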

A new benchmark for agent failures

To evaluate AgentRx, we developed a manually annotated benchmark consisting of 115 failed trajectories across three complex domains:

  • τ-bench: Structured API workflows for retail and service tasks.
  • Flash: Real-world incident management and system troubleshooting.
  • Magentic-One: Open-ended web and file tasks using a generalist multi-agent system.

Using a grounded-theory approach, we derived a nine-category failure taxonomy that generalizes across these domains. This taxonomy helps developers distinguish between a “Plan Adherence Failure” (where the agent ignored its own steps) and an “Invention of New Information” (hallucination).

  • Plan Adherence Failure: Ignored required steps or performed extra unplanned actions.
  • Invention of New Information: Altered or invented facts not grounded in the trace or tool output.
  • Invalid Invocation: Tool call malformed, missing arguments, or schema-invalid.
  • Misinterpretation of Tool Output: Read tool output incorrectly and acted on wrong assumptions.
  • Intent–Plan Misalignment: Misread the user’s goal or constraints and planned incorrectly.
  • Under-specified User Intent: Could not proceed because required information wasn’t available.
  • Intent Not Supported: No available tool can do what is being asked.
  • Guardrails Triggered: Execution blocked by safety or access restrictions.
  • System Failure: Connectivity or tool endpoint failures.
Analysis of failure density across domains. In multi-agent systems like Magentic-One, trajectories often contain multiple errors, but AgentRx focuses on identifying the first critical breach.

Key results

In our experiments, AgentRx demonstrated significant improvements over existing LLM-based prompting baselines:

  • +23.6% absolute improvement in failure localization accuracy.
  • +22.9% improvement in root-cause attribution.

By providing the “why” behind a failure through an auditable log, AgentRx allows developers to move beyond trial-and-error prompting and toward systematic agentic engineering.

Join the community: Open-source release

We believe that agent reliability is a prerequisite for real-world deployment. To support this, we are open sourcing the AgentRx framework and the complete annotated benchmark.

We invite researchers and developers to use AgentRx to diagnose their own agentic workflows and contribute to the growing library of failure constraints. Together, we can build AI agents that are not just powerful but also auditable and reliable.

Acknowledgements

We would like to thank Avaljot Singh and Suman Nath for contributing to this project.

The post Systematic debugging for AI agents: Introducing the AgentRx framework appeared first on Microsoft Research.

]]>
PlugMem: Transforming raw agent interactions into reusable knowledge http://approjects.co.za/?big=en-us/research/blog/from-raw-interaction-to-reusable-knowledge-rethinking-memory-for-ai-agents/ Tue, 10 Mar 2026 16:00:41 +0000 It seems counterintuitive: giving AI agents more memory can make them less effective. As interaction logs accumulate, they grow large, fill with irrelevant content, and become increasingly difficult to use. More memory means that agents must search through larger volumes of past interactions to find information relevant to the current task. Without structure, these records mix […]

The post PlugMem: Transforming raw agent interactions into reusable knowledge appeared first on Microsoft Research.

]]>
blue and purple gradient background with decorative white icons

At a glance

  • Today’s AI agents store long interaction histories but struggle to reuse them effectively.
  • Raw memory retrieval can overwhelm agents with lengthy, low-value context.
  • PlugMem transforms interaction history into structured, reusable knowledge.
  • A single, general-purpose memory module improves performance across diverse agent benchmarks while using fewer memory tokens.

It seems counterintuitive: giving AI agents more memory can make them less effective. As interaction logs accumulate, they grow large, fill with irrelevant content, and become increasingly difficult to use.

More memory means that agents must search through larger volumes of past interactions to find information relevant to the current task. Without structure, these records mix useful experiences with irrelevant details, making retrieval slower and less reliable. The challenge is not storing more experiences, but organizing them so that agents can quickly identify what matters in the moment.

In our recent paper “PlugMem: A Task-Agnostic Plugin Memory Module for LLM Agents,” we introduce a plug-and-play memory system that transforms raw agent interactions into reusable knowledge. Rather than treating memory as text to retrieve, PlugMem organizes that history into structured knowledge designed to support decisions as the agent acts.

Cognitive science offers a useful framework here. It distinguishes between remembering events, knowing facts, and knowing how to perform tasks. Past events provide context, but effective decisions rely on the facts and skills extracted from those events.

This distinction motivated a shift in how we decided to design memory for AI agents. PlugMem implements this shift by converting the agent’s interaction history, such as dialogues, documents, and web sessions, into structured, compact knowledge units that can be reused across tasks.

How PlugMem works

A key difference between PlugMem and conventional AI memory systems is what gets stored. Traditional approaches store text chunks or named entities (references to people, places, and concepts). PlugMem uses facts and reusable skills as the fundamental building blocks of memory. This design reduces redundancy, increases information density, and improves retrieval precision. It’s built around three core components:

Structure. Raw interactions are standardized and transformed into propositional knowledge (facts) and prescriptive knowledge (reusable skills). These knowledge units are organized into a structured memory graph, enabling knowledge to be stored in a form designed for reuse.

Retrieval. Rather than retrieving long passages of text, PlugMem retrieves knowledge units that are aligned with the current task. High-level concepts and inferred intents serve as routing signals, surfacing the most relevant information for the decision at hand.

Reasoning. Retrieved knowledge is distilled into concise, task-ready guidance before being passed to the base agent, ensuring that only decision-relevant knowledge enters the agent’s context window.

Figure 1 illustrates how these components work together.

Figure 1. PlugMem organizes different types of agent interactions into a knowledge-centric memory graph, enabling structured retrieval and reasoning.
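
To illustrate the difference between retrieving raw logs and retrieving knowledge units, here is a toy sketch of fact and skill units with concept-based routing. The names and fields are illustrative assumptions, not PlugMem’s actual schema; the GitHub repository contains the real implementation.

```python
from dataclasses import dataclass, field

@dataclass
class KnowledgeUnit:
    kind: str          # "fact" (propositional) or "skill" (prescriptive)
    content: str       # compact statement of the fact or procedure
    concepts: set[str] = field(default_factory=set)  # routing signals used at retrieval time

memory = [
    KnowledgeUnit("fact", "The staging database is read-only after 6 pm UTC.",
                  {"staging", "database", "deployment"}),
    KnowledgeUnit("skill", "To retry a failed upload, re-authenticate and resend with the same idempotency key.",
                  {"upload", "retry", "authentication"}),
]

def retrieve(task_concepts: set[str], k: int = 3) -> list[KnowledgeUnit]:
    """Return the units whose concepts overlap most with the current task."""
    scored = sorted(memory, key=lambda u: len(u.concepts & task_concepts), reverse=True)
    return [u for u in scored[:k] if u.concepts & task_concepts]

for unit in retrieve({"upload", "retry"}):
    print(f"[{unit.kind}] {unit.content}")
```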

One memory, any task

Most AI memory systems are built for one job. A conversational memory module is designed around dialogue. A knowledge-retrieval system is tuned to look up facts. A web agent’s memory is optimized for navigating pages. Each performs well in its target setting but rarely transfers without significant redesign.

PlugMem takes a different approach. It is a foundational memory layer that can be attached to any AI agent without needing to modify it for a specific task.

Evaluating PlugMem

To test PlugMem, we evaluated the same memory module on three benchmarks that each make different demands on memory:

  • Answering questions across long multi-turn conversations
  • Finding facts that span multiple Wikipedia articles
  • Making decisions while browsing the web

Across all three, PlugMem consistently outperformed both generic retrieval methods and task-specific memory designs while allowing the AI agent to use significantly less memory token budget in the process.

Measuring memory by utility, not size

We wanted to evaluate whether the right information was reaching the agent at the right moment, without overwhelming the model’s context window, which has limited capacity. To do this, we introduced a metric that measures how much useful, decision-relevant information a memory module contributes relative to how much context it consumes.

When we plotted utility against context consumption, PlugMem consistently came out ahead: it delivered more decision-relevant information while consuming less of the AI agent’s context than other approaches, as shown in Figure 2. These results suggest that transforming experience into knowledge—rather than storing and retrieving raw logs—produces memory that is more useful and efficient.

Figure 2. Across all three benchmarks, PlugMem delivered more useful memory with less of the agent’s context window.
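
As a rough illustration of the quantity being measured, consider the sketch below. The metric in the paper may be defined differently; this only conveys the intuition of decision-relevant content surfaced per unit of context consumed.

```python
def memory_utility(relevant_tokens_used: int, memory_tokens_injected: int) -> float:
    """Fraction of injected memory context that was actually decision-relevant (illustrative)."""
    if memory_tokens_injected == 0:
        return 0.0
    return relevant_tokens_used / memory_tokens_injected

# Hypothetical numbers: a raw-log retriever vs. a structured memory module.
print(memory_utility(relevant_tokens_used=120, memory_tokens_injected=4000))  # ~0.03
print(memory_utility(relevant_tokens_used=110, memory_tokens_injected=600))   # ~0.18
```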

Why general-purpose memory can outperform task-specific designs

General-purpose memory modules can outperform systems tailored to specific tasks because the decisive factor is not specialization but whether memory can surface the right knowledge precisely when the agent needs it. Structure, retrieval, and reasoning each play a distinct role, and getting all three right matters more than optimizing for a single use case.

PlugMem is not meant to replace task-specific approaches. It provides a general memory foundation upon which task adaptations can be layered. Our experiments show that combining PlugMem with task-specific techniques yields further gains.

Toward reusable memory for agents

As AI agents take on longer and more complex tasks, their memory needs to evolve from storing past interactions to actively supplying reusable knowledge. The goal is for agents to carry useful facts and strategies from one task to the next rather than starting from scratch each time.

PlugMem represents a step in that direction, grounding memory design in cognitive principles and treating knowledge as the primary unit of reuse. As agent capabilities expand, knowledge-centric memory may prove to be a critical building block for the next generation of intelligent agents.

Code and experimental results are publicly available on GitHub (opens in new tab) so that others can reproduce the results and conduct their own research.

The post PlugMem: Transforming raw agent interactions into reusable knowledge appeared first on Microsoft Research.

]]>
Phi-4-reasoning-vision and the lessons of training a multimodal reasoning model http://approjects.co.za/?big=en-us/research/blog/phi-4-reasoning-vision-and-the-lessons-of-training-a-multimodal-reasoning-model/ Wed, 04 Mar 2026 18:05:57 +0000 http://approjects.co.za/?big=en-us/research/?p=1163159 We are pleased to announce Phi-4-reasoning-vision-15B, a 15 billion parameter open‑weight multimodal reasoning model, available through Microsoft Foundry (opens in new tab), HuggingFace (opens in new tab) and GitHub (opens in new tab). Phi-4-reasoning-vision-15B is a broadly capable model that can be used for a wide array of vision-language tasks such as image captioning, asking […]

The post Phi-4-reasoning-vision and the lessons of training a multimodal reasoning model appeared first on Microsoft Research.

]]>
White line icons against a blue-green gradient background form an architecture flow chart. In the middle of the chart is a three-by-three matrix of circles and lines within a round-edge square. Above the matrix, three icons in a row – an equation, a person using a desktop, and a head with gears flow by dotted lines to the matrix. To the left of the matrix is an icon representing a stack of files with an arrow pointing to the matrix. To the right of the matrix is a graph with a double headed arrow pointing to the matrix and to itself. Below the matrix is an icon representing a document. A dotted line arrow connects this graph to the matrix, showing the direction flowing from the matrix to the document. To the right of the document icon is an hourglass icon and three list icons with a dotted line connecting the hourglass to the lists.

At a glance

  • Phi-4-reasoning-vision-15B is a compact, capable open‑weight multimodal reasoning model that balances reasoning power, efficiency, and training data needs. It supports natural interaction across a wide array of vision-language tasks and excels at math and science reasoning and at understanding user interfaces.
  • We share lessons learned and best practices for training a multimodal reasoning model, showing the value of careful architecture choices, rigorous data curation, and a mixture of reasoning and non-reasoning data.

We are pleased to announce Phi-4-reasoning-vision-15B, a 15 billion parameter open‑weight multimodal reasoning model, available through Microsoft Foundry (opens in new tab), HuggingFace (opens in new tab) and GitHub (opens in new tab). Phi-4-reasoning-vision-15B is a broadly capable model that can be used for a wide array of vision-language tasks such as image captioning, answering questions about images, reading documents and receipts, helping with homework, reasoning about changes across sequences of images, and much more. Beyond these general capabilities, it excels at math and science reasoning and at understanding and grounding elements on computer and mobile screens. In particular, the model offers appealing value relative to popular open-weight models, pushing the Pareto frontier of the tradeoff between accuracy and compute costs. It achieves performance competitive with much slower models that require ten times or more compute time and tokens, and better accuracy than similarly fast models, particularly on math and science reasoning.

Performance charts comparing Phi-4-Reasoning-Vision-15B against other models (Kimi-VL, Qwen-3, Gemma-3) on accuracy vs. response time and accuracy vs. completion tokens. Phi-4 stands out as being fast and token-efficient while achieving ~75% accuracy.
Figure 1: Phi-4-reasoning-vision-15B presents a compelling option compared to existing models, pushing the Pareto frontier of the tradeoff between accuracy and compute costs. It achieves performance competitive with much slower models that require more time and tokens, and higher accuracy than similarly fast models. These values were computed by averaging accuracy, time, and output token counts for a subset of 4 benchmarks: ChartQA_TEST, MathVista_MINI, MMMU_VAL, and ScreenSpot_v2, where we had logged these values.

In this post, we share the motivations, design choices, experiments, and learnings that informed its development, as well as an evaluation of the model’s performance and guidance on how to use it. Our goal is to contribute practical insight to the community on building smaller, efficient multimodal reasoning models and to share an open-weight model that is competitive with models of similar size at general vision-language tasks, excels at computer use, and excels on scientific and mathematical multimodal reasoning.

A focus on smaller and faster vision–language models

Many popular vision-language models (VLMs) have trended toward growing in parameter count and, in particular, in the number of tokens they consume and generate. This increases training and inference-time cost and latency and impedes their usability for downstream deployment, especially in resource‑constrained or interactive settings.

A growing countertrend towards smaller (opens in new tab) models aims to boost efficiency, enabled by careful model design and data curation – a goal pioneered by the Phi family of models (opens in new tab) and furthered by Phi-4-reasoning-vision-15B. We specifically build on learnings from the Phi-4 and Phi-4-Reasoning language models and show how a multimodal model can be trained to cover a wide range of vision and language tasks without relying on extremely large training datasets, architectures, or excessive inference‑time token generation. Our model is intended to be lightweight enough to run on modest hardware while remaining capable of structured reasoning when it is beneficial. It was trained with far less compute than many recent open-weight VLMs of similar size: we used just 200 billion tokens of multimodal data, building on Phi-4-reasoning (trained with 16 billion tokens), itself based on the core model Phi-4 (400 billion unique tokens), compared to more than 1 trillion tokens used for training multimodal models like Qwen 2.5 VL (opens in new tab) and 3 VL (opens in new tab), Kimi-VL (opens in new tab), and Gemma3 (opens in new tab). We can therefore offer a compelling option compared to existing models, pushing the Pareto frontier of the tradeoff between accuracy and compute costs.

 A travel blog caption task. Given a photo of Iguazu Falls, the model writes a personal, evocative caption referencing the rainbow, the mist, and the emotional experience.
Restaurant bill splitting. Given a photo of a receipt and instructions about who ordered what, the model calculates each person's share including half the tax, and returns the result as JSON.
Laundry care symbol interpretation. The model correctly identifies all five symbols: machine washable, do not bleach, tumble dry low, iron on low heat, do not dry clean.
Figure 2: Phi-4-Reasoning-Vision can help with a wide range of everyday tasks.

Lessons from training a multimodal model

Training a multimodal reasoning model raises numerous questions and requires many nuanced design choices around model architecture, dataset quality and composition, and the interaction between reasoning‑heavy and non-reasoning perception‑focused tasks.

Model architecture: Early- vs mid-fusion

Model architectures for VLMs differ primarily in how visual and textual information is fused. Mid-fusion models use a pretrained vision encoder to convert images into visual tokens that are projected into a pretrained LLM’s embedding space, enabling cross-modal reasoning while leveraging components already trained on trillions of tokens. Early-fusion models process image patches and text tokens in a single transformer, yielding richer joint representations but at significantly higher compute, memory, and data cost. We adopted a mid-fusion architecture, as it offers a practical trade-off for building a performant model with modest resources.
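
As a rough sketch of the mid-fusion pattern described above, the following PyTorch module projects features from a frozen vision encoder into a language model's embedding space and prepends them to the text embeddings. The module name and dimensions are placeholders, not the Phi-4-reasoning-vision-15B implementation.

```python
import torch
import torch.nn as nn

class MidFusionConnector(nn.Module):
    """Illustrative mid-fusion glue: project vision-encoder features into the
    LLM embedding space and concatenate them with text embeddings.
    Dimensions are placeholders, not the actual model's."""

    def __init__(self, vision_dim: int = 1152, llm_dim: int = 5120):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_feats: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        # vision_feats: (batch, num_visual_tokens, vision_dim) from a frozen encoder
        # text_embeds:  (batch, num_text_tokens, llm_dim) from the LLM's embedding table
        visual_tokens = self.proj(vision_feats)
        # Prepend visual tokens so the LLM attends to them alongside the text.
        return torch.cat([visual_tokens, text_embeds], dim=1)

# Example shapes: 729 visual tokens, 32 text tokens.
fused = MidFusionConnector()(torch.randn(1, 729, 1152), torch.randn(1, 32, 5120))
print(fused.shape)  # torch.Size([1, 761, 5120])
```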

Model architecture: Vision encoder and image processing

We build on the SigLIP-2 (opens in new tab) vision encoder and the Phi-4-Reasoning backbone. In previous research, we found that multimodal language models sometimes struggled to solve tasks not because of a lack of reasoning proficiency, but rather because of an inability to extract and select relevant perceptual information from the image. An example would be a high-resolution screenshot that is information-dense with relatively small interactive elements.

Several open-source multimodal language models have adapted their methodologies accordingly, e.g., Gemma3 (opens in new tab) uses pan-and-scan and NVILA (opens in new tab) uses Dynamic S2. However, their trade-offs are difficult to understand across different datasets and hyperparameters. To this end, we conducted an ablation study of several techniques. We trained a smaller 5 billion parameter Phi-4 based proxy model on a dataset of 10 million image-text pairs, primarily composed of computer-use and GUI grounding data. We compared with Dynamic S2, which resizes images to a rectangular resolution that minimizes distortion while admitting a tiling by 384×384 squares; Multi-crop, which splits the image into potentially overlapping 384×384 squares and concatenates their encoded features on the token dimension; Multi-crop with S2, which broadens the receptive field by cropping into 1536×1536 squares before applying S2; and Dynamic resolution using the Naflex variant of SigLIP-2, a natively dynamic-resolution encoder with adjustable patch counts.

Our primary finding is that dynamic resolution vision encoders perform the best and especially well on high-resolution data. It is particularly interesting to compare dynamic resolution with 2048 vs 3600 maximum tokens: the latter roughly corresponds to native HD 720p resolution and enjoys a substantial boost on high-resolution benchmarks, particularly ScreenSpot-Pro. Reinforcing the high-resolution trend, we find that multi-crop with S2 outperforms standard multi-crop despite using fewer visual tokens (i.e., fewer crops overall). The dynamic resolution technique produces the most tokens on average; due to their tiling subroutine, S2-based methods are constrained by the original image resolution and often only use about half the maximum tokens. From these experiments we choose the SigLIP-2 Naflex variant as our vision encoder.

Method | Max Tokens | MathVista | ScreenSpot | ScreenSpot-Pro | V*Bench
Dynamic-S2 | 3096 | 42.9 | 78.4 | 9.4 | 52.9
Multi-crop | 3096 | 43.4 | 67.8 | 5.4 | 51.8
Multi-crop with S2 | 2048 | 43.4 | 79.1 | 10.6 | 57.1
Dynamic resolution | 2048 | 45.2 | 81.5 | 9.2 | 51.3
Dynamic resolution | 3600 | 44.9 | 79.7 | 17.5 | 56.0
Table 1: Results with different resolution handling approaches. The top two configurations on each benchmark are in bold.
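
To make the token budgets in Table 1 concrete, here is a small sketch of how a dynamic-resolution encoder might choose a patch grid under a maximum-token budget. The 16-pixel patch size is an assumption; it is consistent with the note above, since a 1280×720 image then yields exactly 3,600 patches.

```python
import math

def dynamic_patch_grid(width: int, height: int, patch: int = 16,
                       max_tokens: int = 3600) -> tuple:
    """Pick a patch grid (cols, rows) that fits within max_tokens while roughly
    preserving aspect ratio. The 16-pixel patch size is an assumption, not a spec."""
    cols, rows = width // patch, height // patch
    if cols * rows <= max_tokens:
        return cols, rows
    scale = math.sqrt(max_tokens / (cols * rows))   # shrink both sides equally
    return max(1, int(cols * scale)), max(1, int(rows * scale))

for budget in (2048, 3600):
    grid = dynamic_patch_grid(1280, 720, max_tokens=budget)
    print(budget, grid, grid[0] * grid[1])
# 2048 -> roughly a 60x33 grid (~1,980 tokens); 3600 -> the full 80x45 grid (3,600 tokens)
```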

Data: Quality and composition

As with its language backbone Phi-4-Reasoning, Phi-4-reasoning-vision-15B was trained with a deliberate focus on data quality. Our final dataset consists primarily of data from three sources: open-source datasets which were meticulously filtered and improved; high-quality domain-specific internal data; and high-quality data from targeted acquisitions. The overwhelming majority of our data lies in the first category: data which originated as open-source data, which were significantly filtered and improved, whether by removing low-quality datasets or records, programmatically fixing errors in data formatting, or using open-source images as seeds to synthetically generate higher-quality accompanying text.

The process of improving open-source data began by manually reviewing samples from each dataset. Typically, 5 to 10 minutes were sufficient to classify data as excellent-quality, good questions with wrong answers, low-quality questions or images, or high-quality with formatting errors. Excellent data was kept largely unchanged. For data with incorrect answers or poor-quality captions, we re-generated responses using GPT-4o and o4-mini, excluding datasets where error rates remained too high. Low-quality questions proved difficult to salvage, but when the images themselves were high quality, we repurposed them as seeds for new caption or visual question answering (VQA) data. Datasets with fundamentally flawed images were excluded entirely. We also fixed a surprisingly large number of formatting and logical errors across widely used open-source datasets.

We extracted additional value from existing datasets through reformatting, diversification, and using images as seeds for new data generation. We generated detailed image descriptions alongside original QA pairs for math and science data, had data perform “double-duty” by embedding instruction-following requirements directly into domain-specific QA, created “scrambled,” “caption-matching,” and “what’s changed?” records to improve multi-image reasoning and sequential navigation for CUA scenarios, and diversified prompt styles to encourage robustness beyond perfectly structured questions.

To supplement the improved open-source data, we utilize high-quality internal datasets, several math-specific datasets which were acquired during training of the Phi-4 language model, and also some domain-specific curated data; for example, latex-OCR data generated by processing and rendering equations from arXiv documents.

Figure 3: Phi-4-reasoning-vision-15B training data composition and examples

Data: Mathematics vs. computer-use data proportion

One of our goals was to train a model that performs well across general vision-language tasks, while excelling at mathematical and scientific reasoning and computer-use scenarios. How to structure datasets for generalizable reasoning remains an open question—particularly because the relationship between data scale and reasoning performance can lead to starkly different design decisions, such as training a single model on a large dataset versus multiple specialized models with targeted post-training.

Research on long-tailed classification robustness has suggested that balancing or removing data from overrepresented tasks or subgroups (opens in new tab) is an effective method for ensuring good performance. Nevertheless, these insights are not fully utilized or explored when it comes to training VLMs, which at times have favored scale over careful data balancing. To achieve our goals, we conducted a set of experiments to analyze a range of data ratios between our focus domains.

Using the same 5 billion parameter proxy model as for previous experiments, we trained while varying the amount of mathematics and science vs. computer-use data for each run. Each dataset included the same subset of 1 million general image-text pairs as a baseline. For mathematics and science data, we used a subsample of 150,000 records, optionally duplicating each one up to three times. Next, we included up to 450,000 computer-use records, and optionally an additional 400,000 from Phi-Ground.

We found that multimodal mathematics and science performance was not harmed by additional computer-use data, and vice versa. Interestingly, we found that increasing mathematics data by 3x while keeping computer-use data constant improved math, science, and computer-use benchmarks.

General | Math and Science | CUA | Total | MMMU | MathVista | ScreenSpot-V2
1M | 150K | 450K | 1.6M | 44.0 | 37.4 | 48.2
1M | 150K | 850K | 2.0M | 44.1 | 37.3 | 60.0
1M | 450K | 450K | 1.9M | 45.3 | 36.0 | 48.3
1M | 450K | 850K | 2.3M | 43.4 | 38.9 | 63.1
1M | 150K | 150K | 1.3M | 44.2 | 36.9 | 29.8
1M | 150K | 250K | 1.4M | 45.4 | 37.4 | 37.7
Table 2: Varying the ratios of math and CUA data. Increasing math data by 3x while keeping computer-use data constant improves both math and computer-use benchmarks.
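
As a back-of-the-envelope illustration of the mixture sizes explored above, the snippet below simply enumerates record counts; it is not the actual data pipeline, and the option lists are read loosely off Table 2.

```python
import itertools

# Record counts only; an illustrative enumeration of the ablation grid above,
# not the actual dataset construction pipeline.
GENERAL = 1_000_000                                   # shared general image-text pairs
MATH_SCIENCE = (150_000, 450_000)                     # 150K base, optionally duplicated up to 3x
COMPUTER_USE = (150_000, 250_000, 450_000, 850_000)   # with/without additional grounding records

for math_n, cua_n in itertools.product(MATH_SCIENCE, COMPUTER_USE):
    total = GENERAL + math_n + cua_n
    print(f"math+science={math_n:>7,}  computer-use={cua_n:>7,}  total={total / 1e6:.1f}M")
```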

Data: Synthetic data for text-rich visual reasoning

Recent work (opens in new tab) suggests that targeted synthetic data can materially improve multimodal reasoning, particularly for text-rich visual domains such as charts, documents, diagrams, and rendered mathematics. Using images, questions, and answers that are programmatically generated and grounded in the visual structure enables precise control over visual content and supervision quality, resulting in data that avoids many annotation errors, ambiguities, and distributional biases common in scraped datasets. This enables cleaner alignment between visual perception and multi-step inference, which has been shown to translate into measurable gains on reasoning-heavy benchmarks.

Synthetic text-rich images expand coverage of long-tail visual formats that are underrepresented in real data but disproportionately impact reasoning accuracy, improving not only visual grounding but also downstream reasoning by ensuring that failures are less often caused by perceptual errors. We found that programmatically generated synthetic data is a useful augmentation to high-quality real datasets — not a replacement, but a scalable mechanism for strengthening both perception and reasoning that complements the training objectives in compact multimodal models such as Phi-4-reasoning-vision-15B.

Mixing non-reasoning and reasoning as a design objective

In language-only settings, reasoning traces have improved performance on many tasks, but they require additional compute, which adds undesired latency. In multimodal settings, this tradeoff is less clear-cut: for tasks such as image captioning and optical character recognition (OCR), reasoning is often unnecessary and can even be harmful (opens in new tab), while mathematical and scientific problem-solving benefits from multi-step reasoning. Thus, the choice of when to reason or not can be quite nuanced.

Training approaches for multimodal reasoning models

Language-only reasoning models are typically created through supervised fine-tuning (SFT) or reinforcement learning (RL): SFT is simpler but requires large amounts of expensive reasoning trace data, while RL reduces data requirements at the cost of significantly increased training complexity and compute. Multimodal reasoning models follow a similar process, but the design space is more complex. With a mid-fusion architecture, the first decision is whether the base language model is itself a reasoning or non-reasoning model. This leads to several possible training pipelines:

  • Non-reasoning LLM → reasoning multimodal training: Reasoning and multimodal capabilities are trained together.
  • Non-reasoning LLM → non-reasoning multimodal → reasoning multimodal training: Multimodal capabilities are learned first, then reasoning is added.
  • Reasoning LLM → reasoning multimodal training: A reasoning base is used, but all multimodal data must include reasoning traces.
  • Our approach: Reasoning LLM → mixed non-reasoning / reasoning multimodal training. A reasoning-capable base is trained on a hybrid data mixture, learning when to reason and when to respond directly.

Approaches 1 and 2 offer flexibility in designing multimodal reasoning behavior from scratch using widely available non-reasoning LLM checkpoints but place a heavy burden on multimodal training. Approach 1 must teach visual understanding and reasoning simultaneously and requires a large amount of multimodal reasoning data, while Approach 2 can be trained with less reasoning data but risks catastrophic forgetting, as reasoning training may degrade previously learned visual capabilities. Both risk weaker reasoning than starting from a reasoning-capable base. Approach 3 inherits strong reasoning foundations, but like Approach 1, it requires reasoning traces for all training data and produces reasoning traces for all queries, even when not beneficial.

Our approach: A mixed reasoning and non-reasoning model

Phi-4-reasoning-vision-15B adopts the 4th approach listed previously, as it balances reasoning capability, inference efficiency, and data requirements. It inherits a strong reasoning foundation but uses a hybrid approach to combine the strengths of alternatives while mitigating their drawbacks. Our model defaults to direct inference for perception-focused domains where reasoning adds latency without improving accuracy, avoiding unnecessary verbosity and reducing inference costs, and it invokes longer reasoning paths for domains, such as math and science, that benefit from structured multi-step reasoning (opens in new tab).

Our model is trained with SFT, where reasoning samples include explicit thinking sections with chain-of-thought reasoning before the final answer, covering domains like math and science. Non-reasoning samples are tagged with a special token that signals a direct response and cover perception-focused tasks such as captioning, grounding, OCR, and simple VQA. Reasoning data comprises approximately 20% of the total mix. Starting from a reasoning-capable backbone means this data grounds existing reasoning in visual contexts rather than teaching the model to reason from scratch.
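
As a schematic of how such a mixed supervised fine-tuning set could be assembled, the sketch below formats reasoning and non-reasoning samples and draws a roughly 20/80 mixture. The control-token strings are placeholders standing in for the model's actual special tokens, which this post does not spell out.

```python
import random

# Placeholder control tokens: the model's real special tokens are not spelled out
# in this post, so these strings are purely illustrative.
THINK_OPEN, THINK_CLOSE, NO_THINK = "<think>", "</think>", "<no_think>"

def format_reasoning_sample(question: str, chain_of_thought: str, answer: str) -> str:
    """Math/science-style sample: an explicit thinking section precedes the answer."""
    return f"{question}\n{THINK_OPEN}\n{chain_of_thought}\n{THINK_CLOSE}\n{answer}"

def format_direct_sample(question: str, answer: str) -> str:
    """Perception-style sample (captioning, grounding, OCR): respond directly."""
    return f"{question}\n{NO_THINK}\n{answer}"

def sample_mixture(reasoning: list, direct: list, n: int,
                   reasoning_fraction: float = 0.2, seed: int = 0) -> list:
    """Draw a mixture that is roughly 20% reasoning data, matching the split above."""
    rng = random.Random(seed)
    return [rng.choice(reasoning if rng.random() < reasoning_fraction else direct)
            for _ in range(n)]
```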

This approach is not without limitations. The balance between modes is a direct function of design choices we made, informed by recent literature (opens in new tab) and observed model behavior during training—though the boundary between modes can be imprecise as it is learned implicitly from the data distribution. Our model allows control through explicit prompting with its reasoning and non-reasoning control tokens when the user wants to override the default behavior. The 20/80 reasoning-to-non-reasoning data split may not be optimal for all domains or deployment contexts. Evaluating the ideal balance of data and the model’s ability to switch appropriately between modes remains an open problem.

We view this mixed approach not as a definitive solution, but as one practical and well-motivated point in the design space for balancing latency, accuracy, and flexibility in multimodal systems.

Applications

A multi-image reasoning example — five Hubble photos of Saturn from 2018–2022, with the query
Figure 4: Phi-4-Reasoning-Vision can interpret sequences of images 

Phi-4-reasoning-vision-15B is a high-performing model across many vision-language tasks. It sees and understands the world by looking at a photo, document, chart, or screen and making sense of it. In practice that covers an enormous range of applications — just a few examples include describing images and answering questions about them, interpreting changes and trends in image sequences, recognizing objects and landmarks, and transcribing text.

Highlights: Scientific and mathematical reasoning and supporting computer-using agents (CUA)

In addition to general vision and language tasks, Phi-4-reasoning-vision-15B was designed to excel at tasks that combine visual input with structured inference: solving math problems presented in visual form, such as handwritten or diagram-based questions; extracting and reasoning over quantitative information in documents and charts; and supporting multi-step reasoning in educational or scientific analysis contexts.

A physics problem about spring-mass systems, with two diagrams. The model correctly works through the spring constant relationships and arrives at answer B (0.433s).
Figure 5: Phi-4-reasoning-vision-15B is great at math and science 
A handwritten math homework checker. The student made a sign error in the quadratic formula (wrote −8 instead of +8). The model's thinking process catches the error and provides the corrected solution (x = 5 and x = 3).
Figure 6: Phi-4-reasoning-vision-15B can help with written math problems 

In addition, we trained Phi-4-reasoning-vision-15B to have skills that can enable agents to interact with graphical user interfaces by interpreting screen content and selecting actions. With strong high-resolution perception and fine-grained grounding capabilities, Phi-4-reasoning-vision-15B is a compelling option as a base model for training agentic models, such as ones that navigate desktop, web, and mobile interfaces by identifying and localizing interactive elements such as buttons, menus, and text fields. Due to its low inference-time compute needs, it is well suited to interactive environments where low latency and compact model size are essential.

A GUI interaction task. Given a Windows 11 Start Menu screenshot and the query
A Google Shopping screenshot of heels. The model identifies all black heels, provides bounding box coordinates for each, and suggests outfit pairings (little black dress, tailored suit, jumpsuit).
Figure 7: Phi-4-reasoning-vision-15B can help navigate computer UIs

Evaluation

Phi-4-reasoning-vision-15B was evaluated for accuracy and timing using two complementary open-source frameworks to ensure both rigorous and standardized analysis: Eureka ML Insights (opens in new tab) and VLMEvalKit (opens in new tab).

Benchmark | Phi-4-reasoning-vision-15B | Phi-4-reasoning-vision-15B – force nothink | Phi-4-mm-instruct | Kimi-VL-A3B-Instruct | gemma-3-12b-it | Qwen3-VL-8B-Instruct-4K | Qwen3-VL-8B-Instruct-32K | Qwen3-VL-32B-Instruct-4K | Qwen3-VL-32B-Instruct-32K
AI2D_TEST | 84.8 | 84.7 | 68.6 | 84.6 | 80.4 | 82.7 | 83 | 84.8 | 85
ChartQA_TEST | 83.3 | 76.5 | 23.5 | 87 | 39 | 83.1 | 83.2 | 84.3 | 84
HallusionBench | 64.4 | 63.1 | 56 | 65.2 | 65.3 | 73.5 | 74.1 | 74.4 | 74.9
MathVerse_MINI | 44.9 | 43.8 | 32.4 | 41.7 | 29.8 | 54.5 | 57.4 | 64.2 | 64.2
MathVision_MINI | 36.2 | 34.2 | 20 | 28.3 | 31.9 | 45.7 | 50 | 54.3 | 60.5
MathVista_MINI | 75.2 | 68.7 | 50.5 | 67.1 | 57.4 | 77.1 | 76.4 | 82.5 | 81.8
MMMU_VAL | 54.3 | 52 | 42.3 | 52 | 50 | 60.7 | 64.6 | 68.6 | 70.6
MMStar | 64.5 | 63.3 | 45.9 | 60 | 59.4 | 68.9 | 69.9 | 73.7 | 74.3
OCRBench | 76 | 75.6 | 62.6 | 86.5 | 75.3 | 89.2 | 90 | 88.5 | 88.5
ScreenSpot_v2 | 88.2 | 88.3 | 28.5 | 89.8 | 3.5 | 91.5 | 91.5 | 93.7 | 93.9
Table 3: Accuracy comparisons relative to popular open-weight, non-thinking models
Benchmark | Phi-4-reasoning-vision-15B | Phi-4-reasoning-vision-15B – force thinking | Kimi-VL-A3B-Thinking | gemma-3-12b-it | Qwen3-VL-8B-Thinking-4K | Qwen3-VL-8B-Thinking-40K | Qwen3-VL-32B-Thinking-4K | Qwen3-VL-32B-Thinking-40K
AI2D_TEST | 84.8 | 79.7 | 81.2 | 80.4 | 83.5 | 83.9 | 86.9 | 87.2
ChartQA_TEST | 83.3 | 82.9 | 73.3 | 39 | 78 | 78.6 | 78.5 | 79.1
HallusionBench | 64.4 | 63.9 | 70.6 | 65.3 | 71.6 | 73 | 76.4 | 76.6
MathVerse_MINI | 44.9 | 53.1 | 61 | 29.8 | 67.3 | 73.3 | 78.3 | 78.2
MathVision_MINI | 36.2 | 36.2 | 50.3 | 31.9 | 43.1 | 50.7 | 60.9 | 58.6
MathVista_MINI | 75.2 | 74.1 | 78.6 | 57.4 | 77.7 | 79.5 | 83.9 | 83.8
MMMU_VAL | 54.3 | 55 | 60.2 | 50 | 59.3 | 65.3 | 72 | 72.2
MMStar | 64.5 | 63.9 | 69.6 | 59.4 | 69.3 | 72.3 | 75.5 | 75.7
OCRBench | 76 | 73.7 | 79.9 | 75.3 | 81.2 | 82 | 83.7 | 85
ScreenSpot_v2 | 88.2 | 88.1 | 81.8 | 3.5 | 93.3 | 92.7 | 83.1 | 83.1
Table 4: Accuracy comparisons relative to popular open-weight, thinking models

Our model balances thinking and non-thinking performance – on average showing better accuracy in the default “mixed-reasoning” behavior than when forcing thinking or non-thinking. Only in a few cases does forcing a specific mode improve performance (MathVerse_MINI and MMMU_VAL for thinking and ScreenSpot_v2 for non-thinking). Compared to recent popular, open-weight models, our model provides a desirable trade-off between accuracy and cost (as a function of inference-time compute and output tokens), as discussed previously.

Note: All numbers here are the result of running benchmarks ourselves and may be lower than other previously shared numbers. Instead of quoting leaderboards, we performed our own benchmarking, so we could understand scaling performance as a function of output token counts for related models. We made our best effort to run fair evaluations and used recommended evaluation platforms with model-specific recommended settings and prompts provided for all third-party models. For Qwen models we use the recommended token counts and also ran evaluations matching our max output token count of 4096. For Phi-4-reasoning-vision-15B, we used our system prompt and chat template but did not do any custom user-prompting or parameter tuning, and we ran all evaluations with temperature=0.0, greedy decoding, and 4096 max output tokens. These numbers are provided for comparison and analysis rather than as leaderboard claims. For maximum transparency and fairness, we will release all our evaluation logs publicly. For more details on our evaluation methodology, please see our technical report (opens in new tab).
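
For readers who want to reproduce the decoding setup, the sketch below shows how greedy decoding with a 4,096-token output budget maps onto a standard Hugging Face generation call. The repository id and prompt template are assumptions for illustration only; the model card and GitHub examples are the authoritative references for loading code and the chat template.

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

# The repository id and prompt template below are placeholders for illustration;
# consult the official model card for the actual names and recommended usage.
MODEL_ID = "microsoft/Phi-4-reasoning-vision-15B"  # hypothetical identifier

processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)

image = Image.open("chart.png")
prompt = "<|user|><|image_1|>What trend does this chart show?<|end|><|assistant|>"  # placeholder template
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)

# Greedy decoding with a 4,096-token output budget, matching the settings described above.
output_ids = model.generate(**inputs, do_sample=False, max_new_tokens=4096)
new_tokens = output_ids[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(new_tokens, skip_special_tokens=True)[0])
```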

Safety

As with other Phi models, Phi-4-reasoning-vision-15B was developed with safety as a core consideration throughout training and evaluation. The model was trained on a mixture of public safety datasets and internally generated examples designed to elicit behaviors the model should appropriately refuse, in alignment with Microsoft’s Responsible AI Principles. For further details, check out our technical report (opens in new tab).

Open release and community engagement

Phi-4-reasoning-vision-15B is available on Microsoft Foundry (opens in new tab) and HuggingFace (opens in new tab) with additional examples and details on GitHub (opens in new tab). For additional guidance on how to use our model properly and safely, please refer to our Model card (opens in new tab). For further details on the technical aspects of the model, training, and evaluation, see our technical report (opens in new tab).

In line with our goal of supporting future AI development in the community, Phi-4-reasoning-vision-15B is released under a permissive license with model weights, fine‑tuning code, and benchmark logs. We intend this release to complement existing work by providing concrete artifacts that help close gaps in understanding how compact multimodal reasoning models can be built and studied.

Looking forward

Smaller vision–language models with selective, task‑aware reasoning offer one promising direction for making multimodal systems more practical and accessible. We present our model and its learnings to inform ongoing research in multimodal modeling, computer‑using agents, and mathematical and scientific reasoning. We hope these details are useful to researchers exploring similar tradeoffs and invite critical evaluation, replication, and extension by the community. If you’d like to join us and help shape the future of multimodal models, please apply for one of our open roles.

Acknowledgements

We thank Rachel Ward for her extensive work on data collection and curation. We thank the GenDatasets, PhiGround, SimCity, and Fara-7B efforts for invaluable training data. We thank Harkirat Behl, Mojan Javaheripi, and Suriya Gunasekar for providing us with Phi-4 checkpoints and guidance on training with Phi models. We additionally thank Sahaj Agarwal, Ahmed Awadallah, Qi Dai, Gustavo de Rosa, Rafah Hosn, Ece Kamar, Piero Kauffmann, Yash Lara, Chong Luo, Caio César Teodoro Mendes, Akshay Nambi, Craig Presti, Matthew Rosoff, Corby Rosset, Marco Rossi, Kashyap Patel, Adil Salim, Sidhartha Sen, Shital Shah, Pratyusha Sharma, Alexey Taymanov, Vibhav Vineet, John Weiss, Spencer Whitehead, the AI Frontiers Team and Leadership, and Microsoft Research Leadership, for their valuable help, insightful discussions, and continued support throughout this work.

The post Phi-4-reasoning-vision and the lessons of training a multimodal reasoning model appeared first on Microsoft Research.

]]>
Trailer: The Shape of Things to Come http://approjects.co.za/?big=en-us/research/podcast/trailer-the-shape-of-things-to-come/ Tue, 03 Mar 2026 13:00:18 +0000 http://approjects.co.za/?big=en-us/research/?p=1162826 Microsoft research lead Doug Burger introduces his new podcast series, "The Shape of Things to Come", an exploration into the fundamental truths about AI and how the technology will reshape the future.

The post Trailer: The Shape of Things to Come appeared first on Microsoft Research.

]]>

Technical advances are moving at such a rapid pace that it can be challenging to define the tomorrow we’re working toward. In The Shape of Things to Come, Microsoft research leader Doug Burger and experts from across disciplines tease out the thorniest AI issues facing technologists, policymakers, business decision-makers, and other stakeholders today. The goal: to amplify the shared understanding needed to build a future in which the AI transition is a net positive. 

Transcript

[MUSIC] 

DOUG BURGER: AI is going to reshape the future. I don’t think there’s any question about that now. How we reshape it depends on the choices we make, and so it’s important to understand what we think those shapes are. 

This is The Shape of Things to Come. I’m Doug Burger. I manage Microsoft Research’s worldwide labs, and I’m excited to introduce this new Microsoft Research Podcast series.  

I called the podcast The Shape of Things to Come because as researchers, the problems that we choose to solve and the technologies that we develop do change the shape of the future.  

It’s very hard to say whether we’re in an inflection point because I see the advancement of technology accelerating. But I don’t know what the inflection point is because all I’ve seen is a curve going up. And so I do think this technology at the rate that it’s accelerating—and I think it will continue to accelerate—it offers tremendous promise and potential for the human race. But there are also dangers, and this technology is coming so fast and advancing so fast, it’s very hard to see where it will go.  

My goal for the series is for, you know, the people that choose to listen to come away more informed about where we think AI is headed, to have some of the myths dispelled, to have a deeper understanding of the stack and what’s on the cutting edge and where we think some of the unsolved problems are, and really thinking about what this explosion in intelligence means for humanity going forward.  

STANDARD OUTRO: Check out this Microsoft Research Podcast series and other episodes of the Microsoft Research Podcast at aka.ms/researchpodcast (opens in new tab) or on YouTube and major podcast platforms.  

[MUSIC FADES] 

The post Trailer: The Shape of Things to Come appeared first on Microsoft Research.

]]>
CORPGEN advances AI agents for real work http://approjects.co.za/?big=en-us/research/blog/corpgen-advances-ai-agents-for-real-work/ Thu, 26 Feb 2026 17:06:34 +0000 http://approjects.co.za/?big=en-us/research/?p=1162836 By mid-morning, a typical knowledge worker is already juggling a client report, a budget spreadsheet, a slide deck, and an email backlog, all interdependent and all demanding attention at once. For AI agents to be genuinely useful in that environment, they will need to operate the same way, but today’s best models are evaluated one […]

The post CORPGEN advances AI agents for real work appeared first on Microsoft Research.

]]>
decorative icons in white on a blue and green gradient background

At a glance

  • Today’s AI agent benchmarks test one task at a time, while real workplace productivity requires managing dozens of interdependent tasks at once. To reflect this, we created a setting called Multi-Horizon Task Environments (MHTEs).
  • Under multi-task loads, leading computer-using agents degrade sharply, with completion rates dropping from 16.7% to 8.7%.
  • CORPGEN introduces digital employees, with hierarchical planning, memory isolation, and experiential learning, delivering up to 3.5 times higher completion rates than baselines across three independent agent backends.
  • Because CORPGEN is architecture-agnostic and modular, its gains come from system design rather than any single base model, and it benefits directly as underlying models improve.

By mid-morning, a typical knowledge worker is already juggling a client report, a budget spreadsheet, a slide deck, and an email backlog, all interdependent and all demanding attention at once. For AI agents to be genuinely useful in that environment, they will need to operate the same way, but today’s best models are evaluated one task at a time, not dozens at once.

In our paper, “CORPGEN: Simulating Corporate Environments with Autonomous Digital Employees in Multi-Horizon Task Environments,” we propose an agent framework that equips AI with the memory, planning, and learning capabilities to close that gap.

Introducing Multi-Horizon Task Environments

Replicating the reality of workplace multitasking requires a new kind of evaluation environment. In response, we developed Multi-Horizon Task Environments (MHTEs), settings where an agent must manage multiple complex tasks simultaneously. Each task requires 10 to 30 dependent steps within a single session spanning five hours.

To determine what a benchmark would need to test, we ran MHTEs at scale on some of today’s leading AI agents, exposing four weaknesses. First, memory fills up. An agent cannot hold details for multiple active tasks at once. Second, information from one task interferes with reasoning about another. Third, tasks don’t depend on each other in simple sequences. They form complex webs where an agent must constantly check whether upstream work is finished before it can move forward on anything downstream. Fourth, every action cycle requires reprioritizing across all active tasks, not simply resuming where the agent left off.
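
The third and fourth weaknesses can be pictured with a small sketch: a readiness check over a task dependency graph, followed by reprioritization across everything that is currently actionable. The data structures are illustrative, not the benchmark's actual representation.

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    name: str
    depends_on: set = field(default_factory=set)
    priority: int = 0
    done: bool = False

def actionable(tasks: dict) -> list:
    """A task is actionable only when every upstream dependency is finished."""
    return [t for t in tasks.values()
            if not t.done and all(tasks[d].done for d in t.depends_on)]

def next_task(tasks: dict):
    """Re-prioritize across all active tasks on every cycle, not just resume the last one."""
    ready = actionable(tasks)
    return max(ready, key=lambda t: t.priority) if ready else None

tasks = {
    "budget": Task("budget", priority=2, done=True),
    "report": Task("report", depends_on={"budget"}, priority=3),
    "deck":   Task("deck", depends_on={"report"}, priority=5),
    "email":  Task("email", priority=1),
}
print(next_task(tasks).name)  # "report": highest-priority task whose upstream work is done
```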

We also tested three independent agent systems under increasing loads. As the number of concurrent tasks rose from 12 to 46, completion rates fell from 16.7% to 8.7% across all systems.

CORPGEN’s architecture

CORPGEN introduces digital employees: LLM-powered AI agents with persistent identities, role-specific expertise, and realistic work schedules. They operate Microsoft Office applications through GUI automation and perform consistently within MHTEs over hours of continuous activity. Figure 1 illustrates how a digital employee moves through a full workday.

Diagram showing a digital employee's workday in three phases. Day Init on the left, where the agent loads memory and generates a daily plan. Execution Cycles in the center, where the agent repeatedly retrieves context, reasons and acts through a ReAct loop, and persists results across 50+ interleaved tasks. Day End on the right, where the agent generates a reflection and consolidates experience into long-term memory. Below the diagram, labels show the tiered memory architecture and experiential learning components.
Figure 1. Each day begins with a structured plan and memory loaded from previous sessions. The agent then works through overlapping tasks in repeated cycles, storing key outcomes at day’s end to inform the next session.

CORPGEN addresses each of the four weaknesses of concurrent task execution—memory overload, cross-task interference, dependency complexity, and reprioritization—in a targeted way. Hierarchical planning breaks objectives into daily goals and then into moment-to-moment decisions, allowing the agent to act from a structured plan instead of reviewing all available tasks before each step.

Subagents perform complex operations like web research in isolated contexts, preventing cross-task contamination. A tiered memory system enables selective recall of task-related information rather than retaining everything in active context. Adaptive summarization compresses routine observations while preserving critical information, keeping memory growth controlled.
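
As a compact illustration of the tiered-memory and adaptive-summarization ideas, the sketch below keeps a bounded working memory and spills compressed summaries into a long-term store that supports selective recall. It is a toy stand-in; CORPGEN's actual implementation builds on Mem0 and is considerably more involved.

```python
class TieredMemory:
    """Illustrative tiered memory: bounded working memory plus a searchable
    long-term store. Not CORPGEN's actual implementation."""

    def __init__(self, working_capacity: int = 20):
        self.working = []      # recent, verbatim observations
        self.long_term = []    # compressed summaries of older activity
        self.capacity = working_capacity

    def observe(self, note: str) -> None:
        self.working.append(note)
        if len(self.working) > self.capacity:
            self._summarize_oldest()

    def _summarize_oldest(self) -> None:
        # Adaptive-summarization stand-in: compress the oldest half of working
        # memory into a single entry. A real system would summarize with an LLM.
        half = self.capacity // 2
        old, self.working = self.working[:half], self.working[half:]
        self.long_term.append("SUMMARY: " + "; ".join(o[:40] for o in old))

    def recall(self, query: str, k: int = 3) -> list:
        # Selective-recall stand-in: naive keyword match over long-term memory.
        hits = [m for m in self.long_term if query.lower() in m.lower()]
        return hits[:k]
```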

Because these mechanisms are not tied to a specific base model, we tested CORPGEN across three different agents. In each case, we observed consistent gains. The improvements came from the architecture, not from the strength of any particular model. Figure 2 shows how they fit together within CORPGEN’s architecture.

Architecture diagram of the CORPGEN framework. At center is the Digital Employee with persistent identity, execution engine, cognitive tools, sub-agents, and context management. On the left, Hierarchical Planning decomposes strategic objectives into tactical plans and operational actions. On the right, Sub-Agents as Tools shows a Research Agent and Computer-Use agent (UFO2) operating in isolated contexts. At the bottom, the Tiered Memory Architecture spans working memory, structured long-term memory, and semantic memory via Mem0. Experiential Learning in the bottom right captures successful trajectories and routes feedback to UFO2. Multi-Employee Collaboration at the top shows async communication via Email and Teams with no shared state.
Figure 2. Four mechanisms support concurrent task execution in CORPGEN: hierarchical planning, isolated subagents, tiered memory, and adaptive summarization.

How digital employees collaborate

When multiple digital employees operate in the same environment, collaboration takes shape through standard communication channels, without predefined coordination rules. One employee sends an email requesting data; another picks it up in the next cycle, uses its memory to process it, and responds. This exchange mirrors real workplace communication.

There is no shared internal state between agents. Coordination occurs entirely through email and Microsoft Teams, the same channels many workers use. Over time, these independent exchanges form recognizable organizational patterns. Some agents take on leadership roles; others provide support; shared documents become the connective tissue.

When a communication path breaks, such as an email delivery error, agents reroute messages through alternate channels to keep work moving. The result is a virtual organization that behaves like a real one without being explicitly programmed to do so.

Evaluating CORPGEN

We evaluated CORPGEN on a multi-task benchmark that combined up to 46 tasks into a single six-hour session. Three findings stood out.

Baselines degrade as load increases; CORPGEN does not. All three baseline agent systems showed steady performance declines as task load rose. CORPGEN, by contrast, maintained or improved its completion rates at higher loads. At 46 tasks, CORPGEN completed 15.2% of tasks, compared with 4.3% for the baselines, roughly 3.5 times more.

Experiential learning drives the largest gains. We introduced CORPGEN’s components sequentially: first the orchestration layer, then cognitive tools, and finally experiential learning. The first two produced moderate improvements. Experiential learning, in which agents store records of completed tasks and reuse them when they encounter structurally similar work, produced the largest increase, raising completion rates from 8.7% to 15.2%.

Evaluation methodology changes the picture. When we inspected the actual output files produced by agents, the results agreed with human judgements roughly 90% of the time. Evaluation based on screenshots and action logs agreed only about 40% of the time. This gap suggests that common evaluation approaches may underestimate what agents actually accomplish in practice.
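
To make the experiential-learning mechanism from the second finding concrete, the sketch below stores completed-task trajectories and retrieves the most structurally similar one for a new task, using a simple string-similarity stand-in. The class and the similarity measure are illustrative, not CORPGEN's implementation.

```python
from difflib import SequenceMatcher

class ExperienceStore:
    """Illustrative experiential-learning store: keep trajectories of completed
    tasks and surface the most structurally similar one for a new task."""

    def __init__(self):
        self._store = []  # list of (task signature, action trajectory) pairs

    def record(self, task_signature: str, trajectory: list) -> None:
        self._store.append((task_signature, trajectory))

    def most_similar(self, new_task: str):
        if not self._store:
            return None
        _, trajectory = max(self._store,
                            key=lambda item: SequenceMatcher(None, item[0], new_task).ratio())
        return trajectory

store = ExperienceStore()
store.record("summarize Q3 sales report into slide deck",
             ["open report.xlsx", "extract totals", "create deck", "paste chart"])
print(store.most_similar("summarize Q4 sales report into slide deck"))
```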

Implications and looking forward

The results suggest that memory and retrieval, not just raw model capability, may be a key bottleneck in getting agents to work in the real world. The largest gains came from experiential learning. Agents that learn from prior successes and apply those patterns to structurally similar tasks build an advantage over systems that respond to each task in isolation.

CORPGEN also opens a new lens on how AI agents collaborate. Next steps include testing whether agents can maintain memory across multiple workdays and how they coordinate when working in teams. We are also exploring ways to make agents faster and more reliable by combining different methods of interacting with software.


Acknowledgments

This work is a result of a collaboration between the Office of the CTO at Microsoft and the Microsoft AI Development Accelerator Program (MAIDAP). We would like to thank the Microsoft Security Research team for providing resources that supported this research. We also thank the members of the Microsoft UFO2 (opens in new tab) team and the Mem0 (opens in new tab) project for their open-source contributions, which enabled key components of the CORPGEN architecture, and the OSWorld team for the benchmark that served as the foundation for our multi-task evaluation.

Finally, we thank the many contributors to this research: Charlotte Siska, Manuel Raúl Meléndez Luján, Anthony Twum-Barimah, and Mauricio Velazco.

The post CORPGEN advances AI agents for real work appeared first on Microsoft Research.

]]>
Media Authenticity Methods in Practice: Capabilities, Limitations, and Directions http://approjects.co.za/?big=en-us/research/blog/media-authenticity-methods-in-practice-capabilities-limitations-and-directions/ Thu, 19 Feb 2026 16:00:51 +0000 http://approjects.co.za/?big=en-us/research/?p=1162092 As synthetic media grows, verifying what’s real, and the origin of content, matters more than ever. Our latest report explores media integrity and authentication methods, their limits, and practical paths toward trustworthy provenance across images, audio, and video.

The post Media Authenticity Methods in Practice: Capabilities, Limitations, and Directions appeared first on Microsoft Research.

]]>
Insights from Microsoft’s Media Integrity and Authentication: Status, Directions, and Futures report

three white outline icons on a blue-to-pink gradient background: an image with a copyright “CR” badge, an image overlaid with fingerprint-like lines, and an image framed by a cropping grid.

It has become increasingly difficult to distinguish fact from fiction when viewing online images and videos. Resilient, trustworthy technologies can help people determine whether the content they are viewing was captured by a camera or microphone—or generated or modified by AI tools. 

We refer to technologies aimed at helping viewers verify the source and history—that is, the provenance—of digital content as media integrity and authentication (MIA) methods. These techniques, driven by the Coalition for Content Provenance and Authenticity (opens in new tab) (C2PA), a standards body dedicated to scaling these capabilities, together with complementary methods such as watermarks and fingerprinting, have become critically important with the rapid advance of AI systems capable of creating realistic imagery, video, and audio at scale.

A convergence of forces

Our team recognized an inflection point in the evolution of online content integrity, driven by the convergence of four forces:

  • Growing saturation of synthetic media, driven by proliferation of high-fidelity content-generation tools and the explosion of AI generated or modified media online
  • Forthcoming legislation both nationally and internationally seeking to define what “verifiable” provenance should mean in practice
  • Mounting pressure on implementers to ensure authentication signals are clear and helpful, especially as signals increase when legislation goes into effect in 2026
  • Heightened awareness of potential adversarial attacks that attempt to exploit weaknesses in authenticity systems

The usefulness and trustworthiness of provenance signals, whether certifying content as synthetic or as an authentic capture of real-world scenes, will depend not only on advances in technology, but also on how the broader digital ecosystem adopts, implements, and governs these tools. Aligning around implementation choices that promote consistency and clarity is essential to ensure transparency signals strengthen, rather than erode, public confidence.

To address these challenges, we launched a comprehensive evaluation of the real-world limits, edge cases, and emerging “attack surfaces” for MIA methods. Today, we are publishing our findings in the Media Integrity & Authentication: Status, Directions & Futures report. The report distills lessons learned and outlines practical directions for strengthening media integrity in the years ahead.

Findings and directions forward

Our research recognizes that different media integrity and authenticity methods serve differing purposes and offer distinct levels of protection. After defining each method in detail, we focused on secure provenance (C2PA), imperceptible watermarking, and soft hash fingerprinting across images, audio, and video.

Grounded in our evaluation of these MIA methods across modalities, attack categories, and real-world workflows, several new findings emerged, including two new concepts:

  • High-Confidence Provenance Authentication: a critical capability for verifying, under defined conditions, whether claims about the origin of and modifications made to an asset can be validated with high certainty (a conceptual sketch of this validation step follows this list).
  • Sociotechnical Provenance Attacks: attacks aimed at deception and capable of inverting signals, making authentic content appear synthetic, and synthetic content appear authentic.
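
The sketch below is a purely conceptual illustration of that validation step: check that the asset still matches the hash a provenance claim asserts, then check that the claim was signed by a trusted key. It is not the C2PA specification or any Microsoft tooling, and real validation involves full manifests, certificate chains, and trust lists.

```python
import hashlib
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PublicKey

def validate_provenance(asset_bytes: bytes, claimed_hash_hex: str,
                        signature: bytes, signer_public_key: Ed25519PublicKey) -> bool:
    """Conceptual high-confidence check: (1) the asset still matches the hash the
    provenance claim asserts, and (2) the claim was signed by a trusted key.
    Real C2PA validation involves full manifests, certificate chains, and trust lists."""
    digest = hashlib.sha256(asset_bytes).hexdigest()
    if digest != claimed_hash_hex:
        return False  # asset was altered after the claim was made
    try:
        # Simplification: treat the raw hash as the signed payload.
        signer_public_key.verify(signature, bytes.fromhex(claimed_hash_hex))
        return True
    except InvalidSignature:
        return False  # claim was not produced by the expected signer
```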

Drawing on our findings, we identified four promising directions for further strengthening media authentication, along with suggestions to support more effective implementation strategies and future decisions. We’ve summarized the findings and directions below, with additional detail available in the report.

Promising direction: Delivering high-confidence provenance authentication
High-level findings:
– Implementation and display choices may affect the reliability of provenance indicators and how they are interpreted by the public.
– Using a C2PA provenance manifest for media created and signed in a high-security environment enables high-confidence validation.
– High-confidence validation is also possible across a broader volume of images, audio, and video when an imperceptible watermark is linked to a C2PA provenance manifest as an additional layer to recover the provenance information if removed.
– Fingerprinting is not an enabler for high-confidence validation and can involve significant costs when expected at scale. However, it can support manual forensics.

Promising direction: Mitigating confusion from sociotechnical provenance attacks
High-level findings:
– MIA methods are susceptible to sociotechnical attacks on provenance that may mislead the public, resulting in confusion and misplaced trust about an asset’s provenance if there is an overreliance on low-quality signals.
– Layering and linking secure provenance and imperceptible watermarking methods to achieve high-confidence validation also offers a promising option to both deter and mitigate the impact of attacks.
– Unintended consequences may result from the use of methods lacking authentication, such as the use of perceptible watermarks in the absence of secure provenance. Perceptible watermarks may cause confusion in cases of forgery or discourage people from consulting high-confidence provenance information via a validation tool, if such perceptible disclosures are taken at face value.
– UX design that enables users to explore manifest details—such as where edits occurred or region of interest—has the potential to reduce confusion and support forensics and fact checking efforts.

Promising direction: Enabling more trusted provenance on edge devices
High-level findings:
– High-confidence results aren’t feasible when provenance is added by a conventional offline device (e.g., camera or recording device without connectivity).
– Implementing a secure enclave within the hardware layer of offline devices is essential to make the provenance of captured images, audio, and video more trustworthy.

Promising direction: Investing in ongoing research and policy development
High-level findings:
– All three methods offer organizations valuable tools for addressing operational challenges such as fraud prevention, risk management, and digital accountability.
– UX and display are promising directions for research. Important directions include in-stream tools that display provenance information where people are and distinguish between high- and lower-confidence provenance signals.
– Stakeholders should conduct ongoing analysis and red teaming to identify and mitigate weaknesses through technical approaches, policies, and laws.

The journey continues

This report marks the beginning of a new chapter in our media provenance journey (opens in new tab), building on years of foundational work, from developing the very first prototype in 2019 to co-founding the C2PA in 2021 and helping catalyze an ecosystem that has since grown to more than 6,000 members and affiliates (opens in new tab) supporting C2PA Content Credentials. This research represents the next evolution of that long‑standing commitment.

We hope that sharing our learnings will help others prepare for an important wave, especially as generative technologies accelerate and provenance signals multiply. This work is already underway across our products at Microsoft. Together, these directions highlight opportunities for the ecosystem to align, harden, and innovate, so authentication signals are not merely visible, but robust, meaningful, and resilient throughout the content lifecycle.

The post Media Authenticity Methods in Practice: Capabilities, Limitations, and Directions appeared first on Microsoft Research.

]]>