AI Frontiers Articles (Microsoft Research)

Memento: Teaching LLMs to Manage Their Own Context
http://approjects.co.za/?big=en-us/research/articles/memento-teaching-llms-to-manage-their-own-context/
Wed, 08 Apr 2026

The post Memento: Teaching LLMs to Manage Their Own Context appeared first on Microsoft Research.


Vasilis Kontonis, Yuchen Zeng, Shivam Garg, Lingjiao Chen, Hao Tang, Ziyan Wang, Ahmed Awadallah, Eric Horvitz, John Langford, Dimitris Papailiopoulos


We taught models to compress their own chain-of-thought mid-generation. Peak KV cache drops 2–3x, throughput nearly doubles, and the erased reasoning blocks leave traces in the KV cache that the model still uses. Paper, OpenMementos dataset (228K traces), and vLLM fork are all open.

If you’re too busy to read this, here’s what we found:

  1. You can teach a model to segment its own chain-of-thought into blocks, compress each into a dense memento, and reason forward from that. Standard SFT on ~30K examples suffices to teach this to a model.
  2. This cuts peak KV cache by 2–3× and nearly doubles serving throughput, with small accuracy gaps that shrink with scale and close with RL.
  3. Erased blocks don’t fully disappear: their information leaks forward through the KV cache representations, forming an implicit second channel without which accuracy drops significantly.
  4. We are releasing OpenMementos (228K annotated traces built on top of OpenThoughts-v3), the data generation pipeline, and a vLLM fork with native block masking.

The Problem: LLMs Don’t Know How to Manage Their Context

It’s well established at this point that reasoning models can solve hard problems by generating a lot of tokens. Test-time compute works and has led to dramatic advances on competition-level math and coding, but it can also result in a single inference call producing hundreds of thousands of tokens. That is roughly the length of a book. All these tokens stay in memory, attended to at equal cost, whether they lead somewhere or not. The model has no built-in mechanism to compact what it has figured out, keep the conclusions, and move on.

There are ways to manage this externally, e.g., by running a separate summarizer, restarting API calls with condensed context, or building orchestration logic around the model. However, these are all systems bolted onto the model rather than skills the model itself has learned. We think figuring out what to remember and what to forget can and should be a skill that the model learns during training.

Memento teaches language models exactly this. A Memento-trained (aka a mementified) model segments its reasoning into semantically coherent blocks. When a block is complete, the model produces a memento: a terse, information-dense compression of the block’s conclusions, key intermediate values, formulas, and strategic decisions. Think of a memento as a lemma: a minimal record of what future reasoning steps need to continue.

Once a memento is generated, the preceding thinking block is masked from attention and its KV cache entries are flushed away. From that point on, the model sees only past mementos plus whatever block it is currently working through. This means context grows while the model is reasoning through a block, but then it drops sharply once the memento is produced and the block is evicted. This gives rise to a sawtooth pattern where peak memory stays at a fraction of what a standard flat CoT trace would require. Here’s what this looks like:
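To make the sawtooth concrete, here is a toy accounting sketch of KV occupancy under this eviction scheme. The block and memento lengths are made-up numbers, and this is our own simplification (it ignores prompt tokens and counts the in-progress memento only after the block completes), not the engine's actual bookkeeping:

```python
def kv_footprint(block_lens, memento_lens):
    """Token-level KV occupancy under Memento-style eviction: context grows
    within a thinking block, then collapses to the accumulated mementos once
    the block is evicted. Toy accounting, not an engine simulation."""
    held, trace = 0, []
    for b, m in zip(block_lens, memento_lens):
        for _ in range(b):          # thinking tokens accumulate one by one
            held += 1
            trace.append(held)
        held += m                   # memento tokens are written...
        held -= b                   # ...then the whole block is evicted
        trace.append(held)
    return trace

t = kv_footprint([100, 100, 100], [15, 15, 15])
print(max(t), t[-1])  # → 130 45
```

A flat CoT with the same three blocks would end the trace holding all 345 tokens, whereas here the peak stays near one block plus the accumulated mementos.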

Importantly, all of this happens within a single generation call, with no restarts, separate summarizers, or orchestration layers involved. The model segments, compresses, and masks its own reasoning by itself.

We applied Memento to five models: Qwen2.5-7B, Qwen3 8B and 32B, Phi-4 Reasoning (14B), and OLMo3-7B-Think. It works across all of them. Peak KV cache drops by 2-3x with small accuracy gaps that shrink with scale and close further with RL.

Figure: accuracy across benchmarks and peak KV cache (GB) for Qwen3-8B, Phi-4-reasoning, and Qwen3-32B.

We also found something we did not anticipate: the erased blocks, although physically removed from the KV cache, don’t fully disappear from the model’s representations. More on this in a minute!

But before that, how do we train context management in a model?

How do you teach context management? Add it in the training data!

Teaching this behavior requires training data that is not readily available: large-scale, high-quality reasoning traces segmented into blocks, each paired with a memento that captures the block’s conclusions in a way the model can reason forward from. The intuition is straightforward: if you take reasoning traces, segment them, add proper summaries, and SFT on the result, maybe the model learns to do context management on its own.

It sounds simple, but as with many things, several components broke along the way and had to be fixed.

First, we decided to build on top of OpenThoughts: reasoning traces generated by QwQ-32B that are already reasonably high-quality and widely used by the community, which saves us from generating everything from scratch. The question then becomes: how do we go from raw traces to segmented, annotated ones with mementos at each block boundary? The challenge is that reasoning traces have no natural segment boundaries: ideas mix together, calculations span multiple sentences, and where to “cut” the CoT depends far more on meaning than on formatting or any other obvious indicator.

We tried the obvious thing first: paste a trace into a frontier model and ask it to segment and summarize directly. This does not work! Not even if you cut the trace into pieces first, because you don’t quite know where to cut. Finding good partitions requires simultaneously reasoning about block coherence, size balance, and semantic boundaries, which is a tricky combinatorial optimization that LLMs (at least the ones we tried) struggle to do in one shot.

So we factored the problem into parts. First, we segment each trace into atomic units—sentences, code blocks, math equations—that can’t be meaningfully split further. Then an LLM scores each inter-sentence boundary from 0 (mid-thought, would break flow, i.e., bad) to 3 (major transition, natural stopping point, i.e., good). This is a local question and LLMs handle local questions very well. The global optimization of where to actually place boundaries given these scores is then handled by dynamic programming, which maximizes boundary quality while penalizing uneven block sizes. This is the kind of thing that’s (again, in our experience) hard for an LLM to zero-shot, but where good old dynamic programming just works.
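Here is a minimal sketch of that dynamic program. The exact objective is an assumption on our part (sum of chosen boundary scores minus a quadratic size-imbalance penalty); the pipeline's real scoring may differ:

```python
def segment(boundary_scores, sent_lens, target_len=60, lam=0.01):
    """Pick cut points maximizing total boundary quality minus a quadratic
    size-imbalance penalty. boundary_scores[i] is the LLM's 0-3 score for
    cutting after sentence i; sent_lens[i] is sentence i's token length.
    O(n^2) dynamic program over prefix positions."""
    n = len(sent_lens)
    pre = [0]
    for length in sent_lens:
        pre.append(pre[-1] + length)
    best = [float("-inf")] * (n + 1)
    best[0] = 0.0
    back = [0] * (n + 1)
    for i in range(1, n + 1):
        for j in range(i):
            size = pre[i] - pre[j]
            bonus = boundary_scores[i - 1] if i < n else 0.0  # trace end is a free cut
            cand = best[j] + bonus - lam * (size - target_len) ** 2
            if cand > best[i]:
                best[i], back[i] = cand, j
    cuts, i = [], n
    while i > 0:
        cuts.append(i)
        i = back[i]
    return sorted(cuts)  # each entry is the sentence index where a block ends

print(segment([1, 1, 1, 1, 1], [10] * 6, target_len=20, lam=0.01))  # → [2, 4, 6]
```

With uniform boundary scores, as in the example call, the size penalty alone drives the solution toward evenly sized blocks; uneven scores pull the cuts toward the natural transition points.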

Once we have our segmented traces, we now need to compress each block. A compressor LLM produces a memento for each one, and we explicitly explain in the prompt that the task is not vanilla summarization but state compression: produce something compact enough that the model could continue reasoning from the memento alone, without ever seeing the original block. And so, a memento is born!

Then, a separate judge LLM evaluates each memento across six dimensions (formulas extracted, values preserved, methods named, validation included, no hallucinations, result-first structure) and if the score falls short, the judge provides specific, actionable feedback (not “more details needed” but “missing formula: K² − 3K + 3” etc) and the compressor retries.

This iterative refinement turned out to be crucial. Single-pass compression barely hits a 28% pass rate on our rubric, because initial mementos typically miss exact formulas or intermediate values that downstream blocks depend on. Two rounds of judge feedback bring the pass rate to 92%.
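The compress-judge-retry loop described above can be sketched as follows. `llm_compress` and `llm_judge` are hypothetical caller-supplied callables standing in for the actual GPT-5.x calls:

```python
def compress_block(block_text, llm_compress, llm_judge, max_rounds=3):
    """Compress-judge-retry loop. llm_compress(block, feedback) returns a
    candidate memento; llm_judge(block, memento) returns a dict like
    {"pass": bool, "feedback": str}. Both are hypothetical stand-ins for
    the real model calls in the pipeline."""
    memento = llm_compress(block_text, feedback=None)
    for _ in range(max_rounds - 1):
        verdict = llm_judge(block_text, memento)
        if verdict["pass"]:
            break
        # Retry with the judge's specific, actionable feedback.
        memento = llm_compress(block_text, feedback=verdict["feedback"])
    return memento
```

The `max_rounds` default of 3 mirrors the observation that two rounds of judge feedback were enough to reach a 92% pass rate.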

Note: For all LLM calls in the pipeline we used GPT-5.x, but any sufficiently capable model should work. The full pipeline is open and we hope people use it, improve it, and build better datasets than ours.

Here is what the data gen pipeline looks like:

Figure: the Memento data generation pipeline (trace selection, sentence splitting, boundary scoring, DP segmentation, and memento compression).

The final dataset, OpenMementos, contains 228K annotated traces consisting of 54% math, 19% code, 27% science problems. We measured that mementos resulted in roughly 6x trace-level compression: about 11k tokens of reasoning compacted to under 2k tokens of mementos per trace.

Here are some cool compression statistics on OpenMementos.

Figure: distributions of blocks per sample, block size, summary size, and compression ratio across math, code, and science in OpenMementos.

Training: How do we put pressure on the model?

We have annotated traces with block structure and mementos. The obvious next step is: let’s SFT on them; but how exactly? The goal is for the model to eventually reason forward from mementos alone, with the original blocks masked and their KV cache entries removed.

One option is to train with normal causal attention on the annotated traces and simply mask the blocks at inference time. This works to some extent, but training then puts no pressure on the model to actually pack information into its mementos: during training it can always fall back on attending to the full block, and at inference it is suddenly on its own.

We want training to match inference: if blocks will be masked when the model is deployed, they should be masked during training too.

But training directly with block masking from the start also does not work well. The model is trying to learn three things at once: the block-memento format, how to compress under hard constraints, and how to rely only on mementos when generating the next block. It struggles with all three simultaneously.

What we found is that curriculum matters a lot.

Stage 1 uses standard causal attention with loss on all tokens. The model learns the format: when to end a block, how to write a memento, what structure looks like. It can still see everything, so there’s no compression pressure yet. Stage 2 then introduces the hard constraint: after each memento, the preceding thinking block is fully masked from subsequent attention. Now the model has to produce mementos that carry everything future reasoning needs, because the original blocks are gone. This is where the real learning seems to occur: the model is forced to pack more information into its mementos, almost like an RL-style pressure signal that pushes toward self-contained compression. Very cool!
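A toy version of the Stage 2 attention pattern, for illustration only (real inference physically evicts KV entries rather than building a dense mask, and real training uses the framework's attention kernels):

```python
import numpy as np

def memento_mask(roles):
    """Build a toy Stage-2 attention mask. roles[t] = (kind, block_id) with
    kind in {"think", "memento"}. A query token may attend to earlier tokens
    of its own block (normal causal attention) and to memento tokens of
    earlier blocks; thinking tokens of completed blocks are masked out."""
    n = len(roles)
    mask = np.zeros((n, n), dtype=bool)
    for q in range(n):
        _, q_blk = roles[q]
        for k in range(q + 1):                 # causal: keys up to position q only
            k_kind, k_blk = roles[k]
            mask[q, k] = (k_blk == q_blk) or (k_kind == "memento")
    return mask
```

Note that a memento's own tokens still attend to their full block, since masking applies only to queries issued after the memento is complete; that ordering is what gives rise to the implicit KV channel discussed in the dual-information-stream section.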


Multi-stage SFT ablation on AIME 2024, AIME 2025, and GPQA-Diamond (Pass@1, n=8, Qwen2.5-7B). OT = OpenThoughts only; OM/Full = OpenMementos full attention; OM/Mem = OpenMementos memento attention; 2-Stg = OT → OM/Full; 3-Stg = OT → OM/Full → OM/Mem (ours). Training directly on OpenMementos from the base model (OM variants) substantially underperforms vanilla SFT (OT). Our three-stage pipeline enables block masking while retaining strong performance.

We also found, consistent with other work on teaching skills, that training on a small subset of the OpenMementos data is enough. Even ~30K samples from the 228K pool, trained for 5 epochs per stage at 32K sequence length, were sufficient for models to pick up the skill.

For models that already reason well (Qwen3, OLMo3, Phi-4-reasoning), two stages suffice; non-reasoning base models like Qwen2.5-7B need a preliminary round of standard reasoning SFT first. Memento doesn’t require qualitatively more data than standard reasoning SFT; it just requires a different kind of data.


Training data scaling. Pass@1 accuracy on AIME24 and AIME25 for Qwen2.5-7B-Instruct fine-tuned on 1K–100K examples. All methods improve monotonically with data size.

How does compaction affect accuracy?

The first obvious concern with Memento is that attending to fewer tokens should hurt accuracy. And when we first looked at the numbers, there was indeed a drop. Where does it come from?

Our initial reaction was that it must be due to compaction and sparsity: the model is seeing far less context, so of course it gets worse. But then we ran control studies, and the picture turned out to be more interesting.

The key insight is that we train on OpenThoughts traces generated by QwQ-32B, which is a different and often weaker model than the ones we are fine-tuning. Several of our target models were released after QwQ and are arguably stronger. So we ran a control: take each base model, SFT it on the same raw OpenThoughts traces (no block structure, no mementos), and measure the accuracy drop from that alone. It turns out that just doing SFT on another model’s reasoning traces already costs you something. When we compare Memento against that control rather than the untouched baseline, the additional drop from compression is small, and in some cases negligible.

Table: accuracy, peak KV, and AUC KV for base, control, Memento, and Memento + RL on Qwen3-8B.

But we were still curious about whatever accuracy gap remained. So we asked: can the model still solve the same problems?

To test that, we generated 64 completions per problem across all three model families on AIME 2024/25/26, and the answer is overwhelmingly yes. The overlap between the problems solved by the base model and by Memento averages 96.4%, hitting 100% in some settings. The model retains the capability to solve these problems; what drops is the consistency of solving them on any single attempt.

This is an important distinction because it likely implies the gap is closable. For example, we found that even majority voting at k=3 is enough for the Memento model to match not just the control but the original baseline. This confirms that the capability is still there in the distribution.
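The majority-voting check is the standard trick; a minimal version of the vote itself:

```python
from collections import Counter

def majority_vote(answers):
    """Pick the most common final answer among k sampled completions.
    Ties break toward the earliest answer (one simple convention)."""
    counts = Counter(answers)
    top = max(counts.values())
    for a in answers:            # earliest answer with the top count
        if counts[a] == top:
            return a

print(majority_vote(["17", "42", "17"]))  # → 17
```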

Figure: Memento models match base-model accuracy with majority voting at small k on AIME 2026.

The natural next step was RL. And unsurprisingly it works: fine-tuning the Qwen3-8B Memento checkpoint with CISPO recovers AIME’26 and GPQA-Diamond scores (sometimes actually exceeding the vanilla baseline), while the KV savings remain substantial after RL.

Figure: base, Memento, and RL-finetuned accuracy and peak KV across three model families.

Scale also helps independently, even without RL. Going from Qwen3-8B to 32B, the gap shrinks considerably even though both models are trained on the same QwQ-32B traces: the larger model handles the distribution mismatch and the effects of compression more gracefully.

So the bottom line is: compression preserves capability, any consistency loss traces primarily to training data mismatch rather than a fundamental limitation, and both RL and scale close the gap further.

Figure: RL training and validation accuracy curves for mementified Qwen3-8B, converging to 66.2% on AIME 2025.

The Dual Information Stream

Early in the project, there were a lot of discussions about how inference should actually work. The simplest approach, and the one that would make our lives much easier, is restarts: every time a memento is produced, kill the KV cache and start a fresh API call with just the accumulated memento text. No need to implement non-causal sparse attention inside vLLM, which turned out to be a huge pain. Just restart the call.

But we kept coming back to a concern: when a memento is generated, the model can still see the full thinking block and therefore memento tokens attend to block tokens during their own generation. The block is only masked after the memento is complete. This means the KV cache entries for the memento were computed as a function of the block’s content. So even after the block text is gone, something from it survives in the memento’s KV representations. If the next block attends to the memento, it’s attending in an indirect way to this implicit, soft representation of what came before. In a restart setup, you throw all of that away.

That is to say, there is non-trivial information about masked blocks that survives in the KV cache representations, beyond what the actual memento tokens capture. So a question kept bothering us: does this implicit information channel actually matter for accuracy, or are we overthinking it?

And so we ran an ablation: take the same Qwen3-8B checkpoint and compare normal Memento inference (mask blocks but keep the memento KV states intact) against restart mode (recompute the entire KV cache from scratch at each memento boundary, so the mementos themselves never attended to their blocks). Restart mode drops AIME’24 from 66.1% to 50.8%. That is fifteen percentage points, which no reasonable observer would register as noise; it strongly suggests that the side information channel flowing through the KV representations matters a great deal. Just to be sure, we wanted to test this hypothesis further.

So we designed a simple experiment: take a model, inject a random 5-digit passcode into a target block, mask that block, and train linear probes on the KV states of downstream mementos that never directly attended to the masked block. Can you recover a piece of information that exists only in the implicit KV channel, not in any memento text?

Oh, yes, yes you can! The probes reconstruct the passcode well above chance, precisely because of information leakage that happens through the KV states. This leakage concentrates in deeper layers, decays with distance from the target block, but remains detectable even seven blocks away, and scales with model capacity.

Figure: KV leakage probe results (passcode recovery accuracy by layer depth and by distance from the masked block).

We also verified this on a small controlled toy transformer (4 layers, 810K parameters), where the leakage is constant across training checkpoints even as task accuracy improves from 77% to 95%. This is an architectural consequence of residual connections, causal attention, and in-place masking.

We believe this distinguishes Memento from approaches like InftyThink and Accordion-Thinking, which discard the original tokens and rebuild context from summary text alone, losing this implicit channel entirely. It is also what convinced us to do the hard infrastructure work of implementing proper block masking inside vLLM rather than taking the easier restart path.

Making Memento work in vLLM

Memento’s block masking is data-dependent and keeps changing during generation, since which tokens to mask depends on what the model produces. No production inference framework supported this out of the box, unfortunately. We started with a HuggingFace backend, which was enough to validate that block masking and keeping everything in a single inference call actually helps, but once we were convinced, it was clear we needed to build this properly inside vLLM.

That turned out to be painful but, in the end, doable. The key design choice was physical KV cache compaction rather than logical masking: when a block completes, its KV entries are physically flushed and the freed slots are returned to the KV pool. This means standard FlashAttention and paged-attention kernels work completely unmodified as they never see the evicted tokens. The implementation operates purely at the vLLM Python level and can be installed as a patch on top of an existing vLLM installation.
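A toy model of the physical-compaction idea (our own sketch, not vLLM's actual block manager): freed KV blocks go straight back to a shared free list, so other requests can claim them immediately and the attention kernels never see evicted tokens.

```python
class PagedKVPool:
    """Toy paged KV pool: fixed-size blocks of KV slots are leased per
    request and returned to a free list the moment a reasoning block is
    evicted. Real vLLM paging is far more involved; sketch only."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.owned = {}                           # request id -> list of block ids

    def allocate(self, req, k):
        if len(self.free) < k:
            raise MemoryError("KV pool exhausted")
        blocks = [self.free.pop() for _ in range(k)]
        self.owned.setdefault(req, []).extend(blocks)
        return blocks

    def evict_reasoning_block(self, req, blocks):
        # Physical compaction: freed slots immediately serve other requests.
        for b in blocks:
            self.owned[req].remove(b)
            self.free.append(b)

pool = PagedKVPool(num_blocks=8)
held = pool.allocate("req0", 6)                   # a long thinking block
pool.evict_reasoning_block("req0", held[:5])      # memento emitted, block flushed
print(len(pool.free))  # → 7
```

This is the property behind the throughput gains: in a KV-bound regime, returning slots mid-generation lets the engine keep more requests in flight.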

On a single B200 GPU with 240 concurrent requests (Qwen3-8B, 32K max tokens), Memento sustains 4,290 tok/s versus 2,447 for vanilla (1.75× throughput) and completes the batch in 693s versus 1,096s. The gains come from freeing KV entries as blocks complete, allowing the engine to sustain higher batch sizes in regimes where vanilla vLLM becomes KV-cache-bound.

This infrastructure also turned out to be essential for RL: generating 32K-token training rollouts requires block masking during generation, with each rollout producing and compacting blocks on the fly. Without the vLLM fork, RL at this scale would not have been feasible.

What’s next?

Two things seem natural from here. First, scaling the RL recipe: our results with Qwen3-8B are early, and the pass@64 analysis makes it clear there is a lot of headroom for improvement. Larger models with more RL compute should take us to interesting places.

Second, and more importantly to us: agents. Memento was built for mathematical, coding, and science reasoning as a test case, not because we think single-turn math and coding are the most interesting applications. The block-and-compress pattern maps onto any setting where a model accumulates a long trajectory of intermediate state and limited context windows become the bottleneck. Terminal and CLI agents are inherently multi-turn: each action-observation cycle forms a natural block, and the ability to selectively remember and forget is exactly what seems to be missing (at least from OSS models/agents). Recent work on context compaction in agentic settings (e.g., from Anthropic and OpenAI) points in the same direction, and we think there is a ton of room to explore here.

Coda

Memento started as an attempt to teach models to compact their own reasoning. That indeed works: 2-3x KV reduction, accuracy largely preserved while throughput nearly doubled. But we came away from this project with two insights that feel more important than the efficiency gains.

The first is that context management can be taught through standard training on the right data. A model that had no concept of blocks or summaries can, after SFT on ~30K examples, learn to segment its own reasoning, compress each segment, and continue from the compressed version. This is a non-trivial, non-causal skill, involving sparse attention, selective forgetting, and state compression, yet it was acquired through entirely conventional training. We think there is in fact a much wider space of unconventional capabilities that can be taught this way.

The second is the dual information stream supported by both hard tokens and their KV representations. When you mask a block inside a single forward pass, the block’s information doesn’t quite vanish: it persists in the KV representations of the mementos that were computed while the block was still visible. This is both useful and architecturally unavoidable, and we don’t yet know how far this implicit channel can be pushed, especially with RL.

These two pieces point in the same direction: memory management should be a learned capability, and models can learn it with less effort than we expected.

We think Memento is a first step, and there is a long way to go, with better training data, stronger RL, and agent applications. We are continuing work across all of these, and along the way we are releasing OpenMementos (228K annotated reasoning traces), our full data generation pipeline, and the vLLM fork with native block masking.

In the meantime, stop flushing your KV cache. Your model remembers more than you think.

Paper · Code · Dataset

Phi-Reasoning: Once again redefining what is possible with small and efficient AI
http://approjects.co.za/?big=en-us/research/articles/phi-reasoning-once-again-redefining-what-is-possible-with-small-and-efficient-ai/
Tue, 08 Jul 2025

The post Phi-Reasoning: Once again redefining what is possible with small and efficient AI  appeared first on Microsoft Research.

Phi-4-reasoning is a 14-billion parameter model specialized in complex reasoning tasks. It is trained using supervised finetuning (SFT) on diverse prompts and reasoning demonstrations from o3-mini. The model generates detailed reasoning chains and leverages inference-time compute effectively. Phi-4-reasoning-plus, an enhanced version with reinforcement learning (RL), delivers even higher performance by generating longer reasoning traces. 

Despite their smaller size (14B parameters), Phi-4-reasoning and Phi-4-reasoning-plus are competitive with, or exceed, much larger open-weight (QwQ-32B, DeepSeek-R1-Distill-Llama-70B, DeepSeek-R1) and closed (o1-mini, Claude Sonnet 3.7) reasoning models across several benchmarks, as shown in Figures 1 and 3 and Tables 1 and 2. Our extensive benchmarks span math and scientific reasoning, coding, algorithmic problem solving, planning, and spatial understanding. Interestingly, we observe a non-trivial transfer of improvements to general-purpose benchmarks as well.

Figure 1. Performance comparison on representative reasoning benchmarks spanning mathematics (HMMT, AIME 25, OmniMath), scientific (GPQA), and coding (LiveCodeBench 8/24-1/25) domains.

Notably, Phi-4-reasoning and Phi-4-reasoning-plus achieve better performance than o1-mini and DeepSeek-R1-Distill-Llama-70B on most benchmarks, and achieve performance comparable to the full DeepSeek-R1 model (with 671B parameters) on AIME 2025 (the 2025 qualifier for the USA Math Olympiad). They also outperform Claude 3.7 Sonnet and Gemini 2 Flash Thinking on all tasks except GPQA (PhD-level STEM questions) and Calendar Planning.

More Potential with Parallel Test-time Scaling: As shown in Figure 2, our small-ish model nearly saturates performance on AIME 2025 with increasing parallel test-time compute (e.g., Majority @N), surpassing the pass@1 of the teacher (o3-mini). 

Figure 2: Effects of parallel test-time compute on AIME 2025 (average pass@1 accuracy).

Key contributors to best-in-class performance 

Below we summarize the core contributions that led to the superior performance of Phi-4-reasoning models. We provide more comprehensive technical details and experiments surrounding each bullet point in our tech report [1].

  • Careful Data Curation: our reasoning prompts are specifically filtered to cover a range of difficulty levels and to lie at the boundary of the base model capabilities. Our approach aligns closely with data-centric methods of earlier Phi and Orca models [2,3,4,5,6,7,8], demonstrating that meticulous data curation and high-quality synthetic datasets allow smaller models to compete with larger counterparts. The datasets used in supervised finetuning include topics in STEM (science, technology, engineering, and mathematics), coding, and safety-focused tasks. Our reinforcement learning is conducted on a small set of high-quality math-focused problems with verifiable solutions. 
  • Benefits of Supervised Finetuning (SFT): Phi-4-reasoning after the SFT stage already performs strongly across diverse benchmarks. Interestingly, the improvement in performance generalizes to tasks not directly targeted in the training data, such as calendar planning and general-purpose benchmarks (Table 2). We highlight the critical role of data mixture and training recipe in unlocking reasoning capabilities during the SFT stage, which goes hand-in-hand with our data selection and filtering.  
  • Boost with Reinforcement Learning: we are encouraged by the gains achieved through a short round of outcome-based reinforcement learning (RL) and the potential of combining distillation/SFT and reinforcement learning. We observe that the model after RL provides higher accuracy on math while using approximately 1.5x more tokens than the SFT model on average, offering a trade-off between accuracy and inference-time compute. 

Reasoning is a meta skill 

We think that reasoning is a transferable meta-skill that can be learned through supervised finetuning alone and further enhanced with reinforcement learning. To test the generalization of the models’ reasoning capabilities, we evaluate them on multiple new reasoning benchmarks that require algorithmic problem solving and planning, including 3SAT (3-literal Satisfiability Problem), TSP (Traveling Salesman Problem), and BA-Calendar planning. These reasoning tasks are nominally out-of-domain for the models as the training process did not target these skills, but the models show strong generalization to these tasks as shown in Figure 3. 

Figure: average pass@1 accuracy on general-purpose benchmarks.

This generalized improvement in capabilities also goes beyond reasoning. Without explicit training on non-reasoning tasks, we saw significant improvements on IFEval, FlenQA, and internal PhiBench as shown in Table 2. And despite limited coding data during the SFT stage (and none during RL), the model performs well, scoring at o1-mini level on LiveCodeBench (LCB) and Codeforces as shown in Table 1. We plan to emphasize coding further in our future versions.

Figure 3. Average Pass@1 performance on reasoning benchmarks, averaged across five runs. Except for GPQA, other benchmarks are out-of-distribution with respect to Phi-4-reasoning’s training data. 

Lessons on Evaluating Reasoning Models 

Language models exhibit large generation nondeterminism, i.e., they may produce substantially different answers given the same prompts and inference hyperparameters (e.g., temperature). To account for this stochastic nature, we study the accuracy distribution on AIME 2025, approximated by kernel density estimation of 50 independent runs with the same prompt and temperature. We have found several interesting observations as illustrated in Figure 4: 

  1. All models show high accuracy variance. For example, accuracy of answers generated by DeepSeek-R1-Distill-Llama-70B ranges from 30% to 70%, while o3-mini’s accuracy ranges from 70% to 100%. This suggests that any comparison among models using a single run can easily produce misleading conclusions.  
  2. Models at the two extremes of average accuracy demonstrate more robust accuracy. For example, Phi-4-reasoning-plus and Phi-4 have relatively narrower accuracy ranges compared to DeepSeek-R1-Distill-Llama-70B and Phi-4-reasoning.  
  3. The accuracy distribution further indicates the competitive performance of Phi-4-reasoning-plus, largely intersecting with o3-mini’s distribution and being almost disjoint from DeepSeek-R1-Distill-Llama-70B’s distribution.  
Figure 4: accuracy distributions on AIME 2025 (kernel density estimates over 50 independent runs).

Phi-4-Reasoning in action

Below we provide some interesting example responses from Phi-4-reasoning that showcase its intelligent behavior. 

Example: calendar planning
Example: riddle

Prompt: “Generate a website for steves pc repairs using a single html script”

Prompt: “write a Python program that shows a ball bouncing inside a spinning triangle. The ball must bounce off the rotating walls realistically and should not leave the triangle”

References

[1] “Phi-4-reasoning Technical Report.” arXiv preprint arXiv:2504.21318 (2025). 

[2] “Phi-4 technical report.” arXiv preprint arXiv:2412.08905 (2024). 

[3] “Phi-3 technical report: A highly capable language model locally on your phone.” arXiv preprint arXiv:2404.14219 (2024).  

[4] “Phi-2: The surprising power of small language models.” Microsoft Research Blog (2023). 

[5] “Textbooks are all you need.” arXiv preprint arXiv:2306.11644 (2023). 

[6] “Agentinstruct: Toward generative teaching with agentic flows.” arXiv preprint arXiv:2407.03502 (2024).  

[7] “Orca 2: Teaching small language models how to reason.” arXiv preprint arXiv:2311.11045 (2023).   

[8] “Orca: Progressive learning from complex explanation traces of gpt-4.” arXiv preprint arXiv:2306.02707 (2023). 

The post Phi-Reasoning: Once again redefining what is possible with small and efficient AI  appeared first on Microsoft Research.

]]>
Eureka Inference-Time Scaling Insights: Where We Stand and What Lies Ahead http://approjects.co.za/?big=en-us/research/articles/eureka-inference-time-scaling-insights-where-we-stand-and-what-lies-ahead/ Tue, 29 Apr 2025 08:04:07 +0000 http://approjects.co.za/?big=en-us/research/?post_type=msr-blog-post&p=1137949 Understanding and measuring the potential of inference-time scaling for reasoning. The new Eureka study tests nine state-of-the-art models on eight diverse reasoning tasks.

The post Eureka Inference-Time Scaling Insights: Where We Stand and What Lies Ahead appeared first on Microsoft Research.

]]>
Authors: Vidhisha Balachandran, Jingya Chen, Lingjiao Chen, Shivam Garg, Neel Joshi, Yash Lara, John Langford, Besmira Nushi, Vibhav Vineet, Yue Wu, Safoora Yousefi

Do the reasoning capabilities of large reasoning models extend to complex reasoning skills beyond math? What is their advantage when compared to conventional, autoregressive models? What is left to harvest in the reasoning space, and how far can we go from here? Do longer, extended CoT scratchpads always translate to higher accuracy? This blog summarizes answers to these questions using insights from the recent Eureka report on inference-time scaling: “Inference-Time Scaling for Complex Tasks: Where We Stand and What Lies Ahead”.

To extract these insights, the study runs experiments on eight diverse complex reasoning tasks with nine state-of-the-art models at the frontier of Artificial Intelligence today. The tasks include:

  • Math reasoning (Benchmarks: AIME 2025, AIME 1983-2024, OmniMATH)  
  • Science reasoning (Benchmarks: GPQA)
  • Planning and scheduling (Benchmarks: BA Calendar)
  • NP-hard algorithmic reasoning (Benchmarks: TSP for traveling salesman minimal paths and 3SAT on 3-literal satisfiability)
  • Spatial understanding (Benchmarks: Spatial Understanding and Maze)

All these tasks were used to test conventional models (Claude 3.5 Sonnet, Gemini 2.0 Pro, GPT-4o, and Llama 3.1 405B) as well as reasoning models (Claude 3.7 Sonnet, DeepSeek R1, Gemini 2.0 Flash Thinking, O1, and O3-mini).

To estimate the future potential of all models, we ran every experiment several times, following two different scaling approaches. In the parallel approach, we make N independent calls to the model and aggregate the results via different aggregators: average, majority vote, best-of-N, and worst-of-N. In the sequential approach, the model sequentially attempts to solve the problem; if an attempt is incorrect, it receives feedback from another model inference call, until the context budget is exhausted or N trials are done.
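The parallel aggregators can be sketched in a few lines. This is an illustrative helper over hypothetical per-run answers, not code from Eureka ML Insights:

```python
# Sketch of the parallel-scaling aggregators described above, applied to
# hypothetical answers from N independent runs on one problem.
from collections import Counter
from statistics import mean

def aggregate_parallel(answers, correct_answer):
    """answers: list of N independent model answers for the same prompt."""
    scores = [a == correct_answer for a in answers]
    majority = Counter(answers).most_common(1)[0][0]
    return {
        "average": mean(scores),                  # expected pass@1
        "majority_vote": majority == correct_answer,
        "best_of_n": any(scores),                 # oracle verifier picks any correct run
        "worst_of_n": all(scores),
    }

# Example: 3 of 5 runs answered "42" correctly.
result = aggregate_parallel(["42", "41", "42", "42", "17"], "42")
print(result)
```

The gap between best-of-N and worst-of-N is one simple way to read the variance that repeated runs expose.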

All experiment implementations and data are available on Eureka ML Insights, which is an open-source framework for standardizing evaluations of large foundation models, and for extracting insights beyond single-score reporting and rankings.

Finding 1: There exists a large gap between conventional models and models trained for inference-time compute (aka reasoning models) on complex tasks, indicating a major update on the state of the art. Improved reasoning also extends and generalizes to algorithmic and planning problems beyond math.

In math benchmarks, reasoning models surpass their conventional counterparts, often by more than 50 percentage points in accuracy. It is also interesting to see major improvements in algorithmic problems, such as the NP-hard Satisfiability (3SAT) and Traveling Salesman Path Optimization (TSP) problems, as well as calendar planning. Improvements in spatial understanding (Maze and SpatialMap) and scientific reasoning, however, are less pervasive across model families, but still often exceed 20 percentage points.

Figure 1 – Performance of the best and worst models on different reasoning benchmarks. The red frontier shows the performance of the worst model. The green frontier shows the performance of the best model, indicating the best-known result with current technology. The blue horizon between the best model and the maximum performance shows the room for improvement for mastering the capability. The best-performance sets indicated by the green border include all models that perform within 2% of the best observed result.

Finding 2: The effectiveness of inference-time scaling varies between domains and tasks, with diminishing returns as task complexity increases.

As shown in Figure 2, an in-depth analysis of the GPQA benchmark for scientific problems reveals that while reasoning models all achieve an accuracy higher than 90% for Physics, they still lag behind on Biology and Chemistry. In algorithmic problems and other problems that have a notion of difficulty, model accuracy drops, even for the best models, as difficulty increases and the length of reasoning traces saturates.

Figure 2 – Breakdown of performance for GPQA (scientific reasoning, left) and TSP (NP-hard Traveling Salesman Path Optimization, right). Improvements of reasoning models are smaller on Chemistry and Biology, and accuracy also drops as the problem gets more difficult for TSP. L1 corresponds to graphs with 6 nodes; L8 graphs have 13 nodes.

Finding 3: A reasoning model that uses more tokens on a given problem is not always the more accurate one. Even for the same model, longer generations are on average less accurate than shorter ones.

There is high variability in token use, even across models with similar accuracies on a task. For example, in Figure 3, we can observe that often there exist pairs of models that have similar accuracy but one of them uses a lot more tokens (e.g. for AIME 25, DeepSeek-R1 and Claude 3.7 Sonnet Thinking have an average accuracy across five repeats within a < 3% range, but Claude 3.7 Sonnet Thinking uses at least 2.5 times more tokens).

Figure 3 – Tradeoff between accuracy and token usage for all benchmarks. The standard deviation for accuracy (vertical, filled line) is computed across 5 different repetitions. The standard deviation for token usage (horizontal, dotted line) is computed by first taking the standard deviation per data instance, and then averaging by the size of the benchmark, to show the variability per instance.

Figure 4 illustrates the average accuracy over generation lengths for the DeepSeek R1 model and O3-mini high on the GPQA task.

Figure 4 – Longer CoT solutions are less accurate on average for reasoning models (shown for DeepSeek R1 and O3-mini high). Example on the GPQA benchmark.

Finding 4: Repeated queries to the same model can yield highly variable token usage, introducing cost nondeterminism for developers and users, even when the model consistently provides correct answers.

The horizontal whiskers in Figure 3 are a measure of cost nondeterminism, as they show the variability within a single prompt (data instance). In Table 1, we summarize these charts and show the average actual cost in dollars for 1000 prompts, at today's prices per provider. This shows that variability in token length can translate to up to 40% variability in actual cost for almost all reasoning models.
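As a back-of-the-envelope sketch of how token variability becomes dollar variability, assume a hypothetical price of $10 per million output tokens (a placeholder, not any vendor's actual rate) and a 1000-prompt batch whose per-prompt token usage swings by 40%:

```python
# Back-of-the-envelope sketch: token-count variability translated into
# dollar-cost variability for a batch of prompts. Prices are placeholders.
def batch_cost(token_counts, price_per_million_tokens):
    tokens = sum(token_counts)
    return tokens / 1_000_000 * price_per_million_tokens

price = 10.0  # hypothetical $/1M output tokens
low_run  = batch_cost([4_000] * 1000, price)  # every prompt on the short end
high_run = batch_cost([5_600] * 1000, price)  # same prompts, 40% more tokens
print(low_run, high_run, (high_run - low_run) / low_run)
```

The same batch of prompts can therefore land at very different invoice totals from run to run, which is what "cost nondeterminism" means in practice.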

Table 1 – Average accuracy and average output price for 1000 prompts picked randomly from our benchmarks. Output token prices are computed based on the original vendors' prices (OpenAI, Anthropic, DeepSeek). For Llama 3.1 405B, prices are computed based on Azure pricing for serverless deployments.
Figure 5 – Average accuracy vs. average output price for 1000 prompts picked randomly from our benchmarks.

Finding 5: There exists untapped potential for improving both conventional models and models trained for inference-time compute.

To conduct this analysis, we run all our experiments 5 times and check whether a correct inference path exists, using a "perfect" verifier that has access to ground truth. See examples of results in Figure 6. The existence of such an inference path shows that it is possible to extract that skill or knowledge from the model with better fine-tuning and RL techniques. This emphasizes the importance of building improved and generalizable verifiers that can be used for further development. In fact, investments in better verifiers for different domains can become the distinguishing factor in the current AI space that determines the speed of progress in generalizing reasoning to a broad number of use cases.
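The perfect-verifier check amounts to a best-of-k test against ground truth: a correct inference path exists if and only if any of the repeated runs is correct. A minimal sketch, with hypothetical run outputs:

```python
# Sketch of the "perfect verifier" check: with access to ground truth,
# a correct inference path exists iff any of the k repeated runs is correct.
def best_of_k_exists(run_answers, ground_truth):
    return any(a == ground_truth for a in run_answers)

# Hypothetical example: the model is wrong on 3 of 5 repeats, but a correct
# path exists, so better training or verifiers could in principle recover it.
runs = ["120", "96", "120", "128", "128"]
print(best_of_k_exists(runs, "128"))
```

The gap between this oracle best-of-5 accuracy and average pass@1 is exactly the "untapped potential" that Figure 6 visualizes.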

Figure 6 – Results on Omni-MATH & TSP with different aggregations by parallel scaling on 5 runs. The red line indicates the lowest best-of-5 accuracy observed across all models. The blue line represents the highest average pass@1 accuracy. Better inference paths exist for all models, and there is even potential for further improvement on reasoning models.

Finding 6: Current reasoning models (in our case O1) improve more efficiently upon receiving feedback on their solutions than conventional models on the most complex tasks.

Figure 7 shows results from experiments that simulate sequential iterations with O1 and GPT-4o, where the model first attempts a solution and then, if the previous attempt was incorrect, receives feedback from another judge (of the same type) to make another attempt, until the context length is depleted. Here, O1 improves much faster than GPT-4o, and its improvements with sequential feedback are even faster than with parallel scaling.
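The sequential protocol can be sketched as a simple attempt/judge/feedback loop. `toy_solve` and `toy_judge` below are hypothetical stand-ins for the model and judge inference calls, not real model APIs:

```python
# Sketch of the sequential-scaling protocol: attempt, check with a judge,
# feed the critique back, repeat until correct or the trial budget is spent.
def sequential_attempts(solve, judge, problem, max_trials):
    feedback = None
    for trial in range(1, max_trials + 1):
        answer = solve(problem, feedback)
        ok, feedback = judge(problem, answer)
        if ok:
            return answer, trial
    return answer, max_trials

# Toy stand-ins: the "model" reaches the right answer after one critique.
def toy_solve(problem, feedback):
    return problem["target"] if feedback == "too low" else 0

def toy_judge(problem, answer):
    if answer == problem["target"]:
        return True, None
    return False, "too low"

answer, trials = sequential_attempts(toy_solve, toy_judge, {"target": 7},
                                     max_trials=5)
print(answer, trials)
```

In the real experiments the budget is the context length rather than a trial count, but the control flow is the same.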

Figure 7 – Parallel (independent) and sequential scaling with feedback on the hardest TSP tasks (graphs of 13 nodes) for O1 and GPT-4o.

The post Eureka Inference-Time Scaling Insights: Where We Stand and What Lies Ahead appeared first on Microsoft Research.

]]>
AutoGen v0.4: Reimagining the foundation of agentic AI for scale, extensibility, and robustness http://approjects.co.za/?big=en-us/research/articles/autogen-v0-4-reimagining-the-foundation-of-agentic-ai-for-scale-extensibility-and-robustness/ Tue, 25 Feb 2025 19:36:13 +0000 http://approjects.co.za/?big=en-us/research/?post_type=msr-blog-post&p=1123776 Gagan Bansal introduces a transformative update to the AutoGen framework that builds on user feedback and redefines modularity, stability, and flexibility to empower the next generation of agentic AI research and applications.

The post AutoGen v0.4: Reimagining the foundation of agentic AI for scale, extensibility, and robustness appeared first on Microsoft Research.

]]>
Presented by Gagan Bansal at Microsoft Research Forum, February 2025

headshot of Gagan Bansal

“When we released AutoGen, one of the first things that the developers absolutely loved about it was its simplicity and the many pre-built agents and teams that it provided, such as the user proxy agent and the assistant agent, and the group chat between multiple agents. With the AutoGen AgentChat layer, we are maintaining these features and adding tons of more essential features such as streaming support, serialization, state management and memory for agents, and finally full type support for a better development experience.”

Gagan Bansal, Senior Researcher, Microsoft Research AI Frontiers

Transcript: Lightning Talk

AutoGen v0.4: Reimagining the foundation of agentic AI for scale, extensibility, and robustness

Gagan Bansal, Senior Researcher, Microsoft Research AI Frontiers

This talk introduces a transformative update to the AutoGen framework that builds on user feedback and redefines modularity, stability, and flexibility to empower the next generation of agentic AI research and applications.

Microsoft Research Forum, February 25, 2025

FRIEDERIKE NIEDTNER, Principal Technical Research Program Manager, Microsoft Research AI Frontiers: The following talk invites us to follow the journey of AutoGen from a leading open-source framework for multi-agent applications to a complete redesign that lays the foundation for the future of agentic AI research and applications with the release of AutoGen 0.4. The framework’s new layered architecture provides flexibility and scalability and includes an ecosystem of extensions and applications, some created by the same team, such as Magentic-One, a team of generalist agents, and AutoGen Studio, a low-code developer tool. AutoGen 0.4 is also a story about collaboration between MSR, partners within Microsoft, and a vibrant open-source community.

GAGAN BANSAL: Hi, I am Gagan Bansal and I am a researcher at Microsoft Research AI Frontiers. And today I’ll talk about some exciting technical updates to AutoGen, a leading open-source framework for agentic AI. And although I am presenting, this is joint work with many incredible colleagues and interns at Microsoft over the last year.

AutoGen is a leading open-source framework for multi-agent applications that we released in fall 2023. It enables developers and researchers to create intelligent applications using large language models, tool use, and multi-agent collaboration patterns. With AutoGen, our goal has been to lead the innovation in agentic AI research. When we first launched AutoGen in Fall 2023, it quickly became the leading open-source framework for agentic AI, and it continues to empower developers and researchers in many, many domains, including business process automation, marketing, finance, security, and others. 

Since AutoGen’s launch, we’ve not just been maintaining it. We’ve been listening closely to feedback from developers and researchers, and in this rapidly evolving landscape of AI progress, their expectations were high. Users told us that they needed greater modularity and the ability to reuse agents seamlessly. They also asked for better support for debugging and scaling their agentic solutions. And finally, there were many asks to enhance the code quality and maturity of the platform.

Pursuing these needs required us to question our assumptions and even possibly reimagine the platform. So, in early 2024, we used these learnings to experiment with alternate architectures, and we ended up adopting an actor model for multi-agent orchestration. The actor model is a well-known programming model for concurrent programming and high-use systems. Here, actors are the computational building blocks that can exchange messages and also perform work. In Fall 2024, we announced a preview of this version and this new year, we’re thrilled to announce a full release. In summary, AutoGen v0.4 is our response to address our users’ feedback in this evolving landscape of AI research. AutoGen is now not just a framework, but it’s a whole ecosystem for agentic AI. It provides you with a framework that lets you build sophisticated agents and multi-agent applications, and it also provides you with developer tools and many well-defined applications.

Let me first tell you about the AutoGen framework. At the heart of this release is a layered architecture that is designed for flexibility and scalability. At the base is AutoGen Core. This layer implements the actor model for agents. Building on core is AutoGen AgentChat. This layer provides a simple and easy to use API that is perfect for rapid prototyping. And building on Core and AgentChat is Extensions. 

This layer provides advanced clients, agents and teams, and integrations with third party software. This layered architecture is nice because whether you are an advanced developer or a researcher prototyping new ideas, AutoGen provides you with the tools you need for your project’s stage of development. The Core implements an actor model for agentic AI. At the highest level, this implementation provides two key features. 

The first is asynchronous message exchange between agents. It does so by providing a runtime, and then it also provides event-driven agents that perform computations in response to these messages. There are several implications of this design, and one of them is that it decouples how the messages are delivered between the agents from how the agents handle them. This naturally improves the modularity and scalability of agentic workflows built with AutoGen, especially for deployment. 

The Core’s event-driven architecture provides several other benefits. For example, it provides affordances to observe and control agent behavior, which is crucial for responsible development of agentic technology. It also enables running multiple agents on different processes and even implementing them using different languages. Finally, it enables developers to implement a large class of multi-agent patterns, including static and dynamic workflows. 
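The event-driven actor pattern described above can be sketched with plain asyncio: a runtime delivers messages to agent inboxes, and each agent reacts only to the events it receives, decoupling delivery from handling. The `Agent`/`Runtime` classes below are hypothetical illustrations, not the actual AutoGen Core API:

```python
# Minimal actor-model sketch (illustrative, not the AutoGen Core API):
# event-driven agents react to messages delivered by a runtime.
import asyncio

class Agent:
    def __init__(self, name, runtime):
        self.name, self.runtime = name, runtime
        self.inbox = asyncio.Queue()

    async def run(self):
        while True:
            sender, payload = await self.inbox.get()
            if payload == "stop":
                break
            # React to the event; here we just echo an acknowledgement.
            await self.runtime.send(self.name, sender, f"ack:{payload}")

class Runtime:
    def __init__(self):
        self.agents = {}
    def register(self, agent):
        self.agents[agent.name] = agent
    async def send(self, sender, recipient, payload):
        # Delivery is just an enqueue; handling happens on the agent's turn.
        await self.agents[recipient].inbox.put((sender, payload))

async def main():
    rt = Runtime()
    a, b = Agent("a", rt), Agent("b", rt)
    rt.register(a); rt.register(b)
    task = asyncio.create_task(b.run())
    await rt.send("a", "b", "hello")
    await rt.send("a", "b", "stop")
    await task
    return a.inbox.get_nowait()  # b's reply landed in a's inbox

print(asyncio.run(main()))
```

Because agents only see messages, the runtime can be swapped (single process, multi-process, even multi-language) without touching the agents themselves, which is the modularity benefit described above.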

When we released AutoGen, one of the first things that the developers absolutely loved about it was its simplicity and the many pre-built agents and teams that it provided, such as the user proxy agent and the assistant agent, and the group chat between multiple agents. With the AutoGen AgentChat layer, we are maintaining these features and adding tons of more essential features such as streaming support, serialization, state management and memory for agents, and finally full type support for a better development experience.

Please check out the link below for the migration guide. Finally, the Extension layers provide advanced runtimes, tools, clients, and ecosystem integrations that continuously expand the framework’s capabilities. In addition to the framework, this new release also provides upgrades to essential developer tools and applications built using AutoGen. And here I’ll briefly mention two of them. In late 2023, we also released AutoGen Studio, which is a low code tool for authoring multi-agent applications. 

And we are excited to announce that with version 0.4, Studio has received massive upgrades. It now supports a drag and drop, multi-agent builder. It supports real time updates as agents solve tasks, flow visualizations and execution controls, so that the users remain in control, and component galleries so that the community can discover and build on each other’s work. We’ve always believed that the framework should enable state-of-the-art applications for solving complex tasks with agents, which is why we’ve been building applications with the framework ourselves and using that to guide the framework’s development. 

Last year, we released Magentic-One, a state-of-the-art multi-agent team for solving file- and web-related tasks built using AutoGen. Its developer API and general capabilities, such as sophisticated orchestrators and specialized agents like the web surfer and the file surfer, are now available in the AutoGen ecosystem. For us, this new ecosystem is only the beginning and sets the stage for future innovation in agentic AI.

Over the past two years, our team has made early progress in AI agents and we continue to deeply think about the changing landscape of current AI research and continue to invest in taking steps to help lead the innovation on agents. And by the way, we’re also working closely with our colleagues at Semantic Kernel, to provide an enterprise ready multi-agent runtime for AutoGen. 

Thank you for attending Microsoft Research Forum. Please check out these links to learn more about AutoGen.

The post AutoGen v0.4: Reimagining the foundation of agentic AI for scale, extensibility, and robustness appeared first on Microsoft Research.

]]>
Belief state transformers http://approjects.co.za/?big=en-us/research/articles/belief-state-transformers/ Tue, 25 Feb 2025 19:33:38 +0000 http://approjects.co.za/?big=en-us/research/?post_type=msr-blog-post&p=1123773 John Langford talks about a new transformer architecture that generates compact belief states for goal-conditioned planning, enhancing planning algorithms' efficiency and effectiveness.

The post Belief state transformers appeared first on Microsoft Research.

]]>
Presented by John Langford at Microsoft Research Forum, February 2025

Portrait of John Langford

“That ability to condition on generation, rather than evaluate the generation, ends up being amazingly useful in terms of giving you a more honest valuation of the generated text.”

John Langford, Partner Research Manager, Microsoft Research AI Frontiers

Transcript: Lightning Talk

Belief state transformers

John Langford, Partner Research Manager, Microsoft Research AI Frontiers

This talk showcases a new transformer architecture that generates compact belief states for goal-conditioned planning, enhancing planning algorithms’ efficiency and effectiveness.

Microsoft Research Forum, February 25, 2025

FRIEDERIKE NIEDTNER, Principal Technical Research Program Manager, Microsoft Research AI Frontiers: Transformer models have brought us a revolution in language modeling with their capability to generate impressive language with many emergent properties. At the same time, LLMs have a number of weaknesses, one being that they are not very good at evaluating their own output. Let’s hear how the new Belief State Transformer architecture unlocks new abilities by combining a standard GPT-style architecture of a forward encoder for token prediction with an additional backward encoder.

JOHN LANGFORD: I’m John Langford. I’d like to tell you about belief state transformers, which is a new paper we have on arXiv, and which is also accepted at ICLR [International Conference on Learning Representations]. There are many coauthors on this paper. I’d like to thank them, particularly Edward, who did much of the work here.

To start with, let’s talk about standard GPT-style transformers. In standard GPT style transformers, you have a sequence of symbols which are going into a forward encoder, and then the forward encoder outputs some information to the output head, and then the output head predicts the final token. So, this is a straightforward approach and yet amazingly powerful. It’s kind of the key backbone behind GPT-4 and other language models.  

For the purposes of research, though, we need to have something to think about, to complain about, and I’m going to complain about self-evaluation. Often these language models can’t be used to evaluate their own output too well, because the generation of the next token is done by exactly the mechanism you would use to evaluate it in that output head. So, this is kind of like grading yourself, and like grading yourself you can miss things that an independent grader would actually see pretty well. 

Right, so a belief state transformer changes the architecture. And so, it’s taking two transformers and grafting them together. One of them is going to be the standard forward encoder on the prefix. And then we’re also going to have another transformer, which is a backward encoder on the suffix. These are both going to put out some information, which goes to the output head. And the output head is going to predict the next token and the previous token. So, it’s the next token of the prefix and the previous token of the suffix. Something to worry about with these transformers is the computation. So, these are transformers obviously doing more computation. But it turns out that this “more computation” is only in a constant factor of more computation. 

And the key observation here is that in the forward encoder, just doing the attention, what you’re going to use in the GPT-style transformer, is already order N-squared [N2]. Every token looks at every previous token in order to figure out what information is necessary to predict the next token. In the belief state transformer, that happens twice. You have two different transformers, each with their own attention, and so you pay a factor of two.  

And then, in addition, you’re going to pay because the number of times you evaluate the head, the output head, is order N-squared, because there are order N-squared prefix/suffix pairs. So, there’s a constant-factor increase in computation, which is problematic, but it’s not like the end of the world. You can subsample or things like that. And what you get in return is order N-squared gradients rather than order N gradients.

In a standard GPT-style transformer, you only have order N gradients because you only have order N symbols, and you get one gradient per symbol. Here you get order N-squared gradients because you have order N-squared prefix/suffix pairs. That means there’s many more ways to get information out of a sequence. And that unlocks the possibility of learning new things that were previously unlearnable. 
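The counting argument can be made concrete with a toy enumeration. This is an illustrative sketch of the prefix/suffix bookkeeping, not the paper's training code:

```python
# Toy illustration of the counting argument: a GPT-style model gets one
# next-token target per position (order N), while prefix/suffix pairs give
# order N-squared (next-token, previous-token) targets.
def gpt_targets(seq):
    # One (prefix, next token) pair per position.
    return [(seq[:i], seq[i]) for i in range(1, len(seq))]

def belief_state_targets(seq):
    pairs = []
    n = len(seq)
    for i in range(1, n):            # prefix = seq[:i]
        for j in range(i, n):        # suffix = seq[j + 1:] (may be empty)
            # Targets: next token of the prefix, previous token of the suffix.
            pairs.append((seq[:i], seq[j + 1:], seq[i], seq[j]))
    return pairs

seq = "abcde"
print(len(gpt_targets(seq)), len(belief_state_targets(seq)))
```

For a length-5 sequence this yields 4 GPT-style targets versus 10 prefix/suffix targets, and the gap grows quadratically with sequence length.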

Okay, so now let’s go on to the belief state. Why are we talking about a belief state when we say belief state transformer? Well, it turns out you can prove a theorem. And this theorem says that the output of the forward encoder is a belief state for the prefix. So what that means is that the output of the forward encoder will converge to all the information necessary to predict the future, that is, all symbols after the prefix. So, that ability to create a compact belief state is new with belief state transformers, something that previously we only really knew how to do with state space machines.

Okay, so let’s try this out, looking at TinyStories. TinyStories is a dataset of children’s stories generated by GPT-4.

We’re going to feed a prefix and a suffix into our system, and it’s going to fill in the middle, which is what happens in blue. And then for a baseline, we’re going to compare the fill-in-the-middle approach to using GPT-style transformers. So the way the fill-in-the-middle approach works with GPT-style transformers is you take the prefix, and then you add the suffix, and then you just predict the tokens after that. 

So that works reasonably well. This is very commonly used. And now if we have these two different approaches the question is how do we actually value these different approaches? Which one is better? So, the way we’re going to judge this is we’re going to ask GPT-4 which is better in various ways: syntax, style, and so forth. And then we’ll ask it for a summary judgment, which is a standard technique. 

We looked at what it was doing, and it seemed very reasonable. And in doing this, we end up with the belief state transformer winning about a factor of three more often than the GPT-style transformer. So that’s huge. It’s so huge that you really want to understand why. And it seems like the key here is self-evaluation. So, under the hood, we’re actually running each of these, say 120 times, using a beam search. The code for that is on the right. So, given the beam search, you have several different possible completions. And now how do you choose which completion to actually use? Because you have to pick one of these. You’re trying to pick a completion. And for the GPT-style transformer, there’s only one way to really do this. The way is you take the next head, and you use it as a probability function, and you look at the probability of the sequence of tokens which is produced.  

That works reasonably well. It actually does improve picking out a high-probability sequence of tokens versus a lower probability sequence of tokens. But it’s not as much as you get with the belief state transformer. And the reason why is the self-grading issue that I was talking about earlier. There’s many ways that a system could be blind to its own mistakes. With the belief state transformer, though, you have another option, because the next head can instead condition on the generated data and run over the suffix in order to value the generated data. 
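The GPT-style selection rule described here, scoring each beam-search candidate by the probability the model's own next-token head assigns to it, can be sketched as follows. `toy_probs` is a hypothetical stub standing in for a real model:

```python
# Sketch of GPT-style candidate selection: rank completions by the
# log-probability the model's own next-token head assigns to them.
import math

def sequence_logprob(prefix, completion, next_token_probs):
    logp, ctx = 0.0, list(prefix)
    for tok in completion:
        logp += math.log(next_token_probs(ctx)[tok])
        ctx.append(tok)
    return logp

# Toy model: always prefers token "a" over "b" (0.9 vs 0.1).
def toy_probs(ctx):
    return {"a": 0.9, "b": 0.1}

candidates = ["aa", "ab", "bb"]
best = max(candidates, key=lambda c: sequence_logprob("x", c, toy_probs))
print(best)
```

This is the self-grading mechanism: the same head that generated the tokens also scores them, which is exactly the blindness the belief state transformer's backward conditioning avoids.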

So, that ability to condition on generation, rather than evaluate the generation, ends up being amazingly useful in terms of giving you a more honest valuation of the generated text. All right, so just to summarize, we have this belief state transformer. This learns a compact belief state, which is a new thing in transformers. It gives us a way to have a simple set of values, which summarize all information we need to predict the future. 

And this seems to provide a very strong form of self-evaluation, which is potentially very useful in many situations where you’re trying to use test-time compute, or even using test-time compute to further create training data. There’s more in the paper: there are some other things you can do with this transformer that are kind of new.

I think the biggest question in my mind is what happens when you scale this up? And, of course, we’re working on that. That’s one of the great things about being in MSR [Microsoft Research]. They have some GPUs to scale this up to much larger datasets. So, stay tuned. And, thank you. 

The post Belief state transformers appeared first on Microsoft Research.

]]>
OmniParser V2: Turning Any LLM into a Computer Use Agent http://approjects.co.za/?big=en-us/research/articles/omniparser-v2-turning-any-llm-into-a-computer-use-agent/ Wed, 12 Feb 2025 18:31:35 +0000 http://approjects.co.za/?big=en-us/research/?post_type=msr-blog-post&p=1129176 Yadong Lu, Senior Researcher; Thomas Dhome-Casanova (opens in new tab), Software Engineer; Jianwei Yang, Principal Researcher; Ahmed Awadallah, Partner Research Manager Graphic User interface (GUI) automation requires agents with the ability to understand and interact with user screens. However, using general purpose LLM models to serve as GUI agents faces several challenges: 1) reliably identifying […]

The post OmniParser V2: Turning Any LLM into a Computer Use Agent appeared first on Microsoft Research.

]]>
Yadong Lu, Senior Researcher; Thomas Dhome-Casanova (opens in new tab), Software Engineer; Jianwei Yang, Principal Researcher; Ahmed Awadallah, Partner Research Manager

Graphical user interface (GUI) automation requires agents that can understand and interact with user screens. However, using general-purpose LLMs as GUI agents faces several challenges: 1) reliably identifying interactable icons within the user interface, and 2) understanding the semantics of various elements in a screenshot and accurately associating the intended action with the corresponding region on the screen. OmniParser closes this gap by ‘tokenizing’ UI screenshots from pixel space into structured elements that are interpretable by LLMs. This enables LLMs to do retrieval-based next-action prediction given a set of parsed interactable elements.
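To make the ‘tokenizing’ idea concrete, here is a minimal sketch of what a parsed element and its text rendering for an LLM prompt might look like. The `ParsedElement` type, field names, and `to_prompt` helper are our own illustration, not the actual OmniParser API:

```python
from dataclasses import dataclass

@dataclass
class ParsedElement:
    """One interactable element recovered from a screenshot (illustrative)."""
    element_id: int  # index the LLM can refer to when choosing an action
    bbox: tuple      # (x1, y1, x2, y2) in pixel coordinates
    caption: str     # functional description from the captioning model

def to_prompt(elements):
    """Render parsed elements as text an LLM can reason over."""
    return "\n".join(
        f"[{e.element_id}] {e.caption} @ {e.bbox}" for e in elements
    )

elements = [
    ParsedElement(0, (12, 40, 96, 72), "Search button"),
    ParsedElement(1, (110, 40, 420, 72), "Search text input"),
]
print(to_prompt(elements))
```

Given such a list, next-action prediction reduces to the LLM picking an element ID and an action, which is far easier to ground than raw pixels.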

OmniParser V2 takes this capability to the next level. Compared to its predecessor, it achieves higher accuracy in detecting smaller interactable elements and faster inference, making it a useful tool for GUI automation. In particular, OmniParser V2 is trained with a larger set of interactive-element detection data and icon functional-caption data. By decreasing the image size used by the icon caption model, OmniParser V2 reduces latency by 60% compared to the previous version. Notably, OmniParser+GPT-4o achieves a state-of-the-art average accuracy of 39.6 on the recently released grounding benchmark ScreenSpot Pro (opens in new tab), which features high-resolution screens and tiny target icons. This is a substantial improvement on GPT-4o’s original score of 0.8.

Figure: ScreenSpot Pro benchmark performance.

To enable faster experimentation with different agent settings, we created OmniTool, a dockerized Windows system that incorporates a suite of essential tools for agents. Out of the box, we enable OmniParser to be used with a variety of state-of-the-art LLMs: OpenAI (4o/o1/o3-mini), DeepSeek (R1), Qwen (2.5VL), and Anthropic (Sonnet), combining the screen-understanding, grounding, action-planning, and execution steps.

Risks and Mitigations

To align with the Microsoft AI principles and Responsible AI practices, we mitigate risk by training the icon caption model with Responsible AI data, which helps the model avoid, as much as possible, inferring sensitive attributes (e.g., race, religion) of individuals who happen to appear in icon images. At the same time, we encourage users to apply OmniParser only to screenshots that do not contain harmful content. For OmniTool, we conducted a threat-model analysis using the Microsoft Threat Modeling Tool (Microsoft Threat Modeling Tool overview – Azure | Microsoft Learn (opens in new tab)). We provide a sandboxed Docker container, safety guidance, and examples in our GitHub repository. We also advise keeping a human in the loop to minimize risk.


The post OmniParser V2: Turning Any LLM into a Computer Use Agent appeared first on Microsoft Research.

]]>
Magentic-One: A Generalist Multi-Agent System for Solving Complex Tasks http://approjects.co.za/?big=en-us/research/articles/magentic-one-a-generalist-multi-agent-system-for-solving-complex-tasks/ Tue, 05 Nov 2024 02:35:39 +0000 http://approjects.co.za/?big=en-us/research/?post_type=msr-blog-post&p=1100601 By Adam Fourney, Principal Researcher; Gagan Bansal, Senior Researcher; Hussein Mozannar, Senior Researcher; Victor Dibia, Principal Research Software Engineer; Saleema Amershi, Partner Research Manager Contributors: Adam Fourney, Gagan Bansal, Hussein Mozannar, Cheng Tan, Eduardo Salinas, Erkang (Eric) Zhu, Friederike Niedtner, Grace Proebsting, Griffin Bassman, Jack Gerrits, Jacob Alber, Peter Chang, Ricky Loynd, Robert West, Victor […]

The post Magentic-One: A Generalist Multi-Agent System for Solving Complex Tasks appeared first on Microsoft Research.

]]>
By Adam Fourney, Principal Researcher; Gagan Bansal, Senior Researcher; Hussein Mozannar, Senior Researcher; Victor Dibia, Principal Research Software Engineer; Saleema Amershi, Partner Research Manager

Contributors: Adam Fourney, Gagan Bansal, Hussein Mozannar, Cheng Tan, Eduardo Salinas, Erkang (Eric) Zhu, Friederike Niedtner, Grace Proebsting, Griffin Bassman, Jack Gerrits, Jacob Alber, Peter Chang, Ricky Loynd, Robert West, Victor Dibia, Ahmed Awadallah, Ece Kamar, Rafah Hosn, Saleema Amershi

An illustrated workflow of Magentic-One completing a complex task from the GAIA agentic benchmark. The workflow starts with a description of the Task which reads “The attached image contains a Python script. Run the Python code against an array of strings, listed below. Output of the script is a URL containing C++ source code, compile, run and return the sum of the third and fifth integers…” The task description is shown flowing to the Orchestrator agent which then creates a dynamic/task-specific plan. The rest of the workflow lists the steps of the task being executed by the other agents on the Magentic-One team. First, the File Surfer accesses the image provided in the task description and extracts the code. Second, the Coder agent analyzes the Python code from the image. Third, the Computer Terminal executes the code provided by the Coder agent, outputting an url string. Fourth, the Web Surfer agent navigates to the url and extracts the C++ code shown on the page. Fifth, the Coder agent analyzes the C++ code. Sixth, the Computer Terminal executes the C++ code. Finally, the Orchestrator determines the task is complete and outputs the final result.
We are introducing Magentic-One, our new generalist multi-agent system for solving open-ended web and file-based tasks across a variety of domains. Magentic-One represents a significant step towards developing agents that can complete tasks that people encounter in their work and personal lives. We are also releasing an open-source implementation of Magentic-One on Microsoft AutoGen, our popular open-source framework for developing multi-agent applications.

The future of AI is agentic. AI systems are evolving from having conversations to getting things done—this is where we expect much of AI’s value to shine. It’s the difference between generative AI recommending dinner options to agentic assistants that can autonomously place your order and arrange delivery. It’s the shift from summarizing research papers to actively searching for and organizing relevant studies in a comprehensive literature review.

Modern AI agents, capable of perceiving, reasoning, and acting on our behalf, are demonstrating remarkable performance in areas such as software engineering, data analysis, scientific research, and web navigation. Still, to fully realize the long-held vision of agentic systems that can enhance our productivity and transform our lives, we need advances in generalist agentic systems. These systems must reliably complete complex, multi-step tasks across a wide range of scenarios people encounter in their daily lives.

Introducing Magentic-One (opens in new tab), a high-performing generalist agentic system designed to solve such tasks. Magentic-One employs a multi-agent architecture where a lead agent, the Orchestrator, directs four other agents to solve tasks. The Orchestrator plans, tracks progress, and re-plans to recover from errors, while directing specialized agents to perform tasks like operating a web browser, navigating local files, or writing and executing Python code.

Magentic-One achieves statistically competitive performance to the state-of-the-art on multiple challenging agentic benchmarks, without requiring modifications to its core capabilities or architecture. Built on AutoGen (opens in new tab), our popular open-source multi-agent framework, Magentic-One’s modular, multi-agent design offers numerous advantages over monolithic single-agent systems. By encapsulating distinct skills in separate agents, it simplifies development and reuse, similar to object-oriented programming. Magentic-One’s plug-and-play design further supports easy adaptation and extensibility by enabling agents to be added or removed without needing to rework the entire system—unlike single-agent systems, which often struggle with inflexible workflows.

We’re making Magentic-One open-source (opens in new tab) for researchers and developers. While Magentic-One shows strong generalist capabilities, it’s still far from human-level performance and can make mistakes. Moreover, as agentic systems grow more powerful, their risks—like taking undesirable actions or enabling malicious use-cases—can also increase. While we’re still in the early days of modern agentic AI, we’re inviting the community to help tackle these open challenges and ensure our future agentic systems are both helpful and safe. To this end, we’re also releasing AutoGenBench (opens in new tab), an agentic evaluation tool with built-in controls for repetition and isolation to rigorously test agentic benchmarks and tasks while minimizing undesirable side-effects.

How it works

A diagram illustrating Magentic-One’s multi-agent architecture. The diagram depicts the inner working of the Orchestrator agent at the top and points to the other agents on the team at the bottom. Within the Orchestrator, an outer and inner loop are depicted. The outer loop shows a task ledger, which contains facts, guesses, and the current plan, and a pointer into and out of an inner loop. The inner loop shows a progress ledger, which tracks the current task progress and assignments for each agent, pointing to a decision node with the text “Task complete?”. If “Yes” the diagram shows the flow breaking out of the Orchestrator and pointing to a “Task Complete” termination node. If “No” the diagram shows the flow pointing to another decision node with the text “Progress being made?”. If “Yes” the flow points out of the Orchestrator toward one of the other agents on the team, indicating a handoff of control. If “No”, the flow points to third decision node with the text “Stall count > 2”. If “Yes” the flow goes back to the outer loop’s Task Ledger which is updated before the agents try again. If “No”, the flow again points out of the Orchestrator toward one of the other agents. The other agents depicted at the bottom of the diagram are named and described as follows: a Coder (“Write code and reason to solve tasks”), Computer Terminal (“Execute code written by the coder agent”), WebSurfer (“Browse the internet (navigate pages, fill forms, etc)”), and a FileSurfer (“Navigate files (e.g., PDFs, pptx, WAV, etc)”).
Magentic-One features an Orchestrator agent that implements two loops: an outer loop and an inner loop. The outer loop (lighter background with solid arrows) manages the task ledger (containing facts, guesses, and plan) and the inner loop (darker background with dotted arrows) manages the progress ledger (containing current progress, task assignment to agents).

Magentic-One is built on a multi-agent architecture in which a lead Orchestrator agent is responsible for high-level planning, directing the other agents, and tracking task progress. The Orchestrator begins by creating a plan to tackle the task, gathering needed facts and educated guesses in a Task Ledger that it maintains. At each step of its plan, the Orchestrator creates a Progress Ledger where it self-reflects on task progress and checks whether the task is completed. If the task is not yet completed, it assigns one of Magentic-One’s other agents a subtask to complete. After the assigned agent completes its subtask, the Orchestrator updates the Progress Ledger and continues in this way until the task is complete. If the Orchestrator finds that progress is not being made for enough steps, it can update the Task Ledger and create a new plan. This is illustrated in the figure above; the Orchestrator’s work is thus divided into an outer loop, where it updates the Task Ledger, and an inner loop, where it updates the Progress Ledger.
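The nested-loop control flow can be sketched as follows. This is a toy illustration of the idea, not the AutoGen implementation: `plan`, `reflect`, the `Progress` record, and the stall threshold are stand-in names for the Task Ledger construction, the Progress Ledger self-reflection, and the re-planning trigger described above:

```python
from dataclasses import dataclass

@dataclass
class Progress:
    """Illustrative stand-in for one Progress Ledger self-reflection."""
    complete: bool = False
    making_progress: bool = True
    next_agent: str = "coder"
    instruction: str = ""
    answer: str = ""

def orchestrate(task, agents, plan, reflect, max_rounds=30, stall_limit=2):
    ledger = plan(task)                 # outer loop: build the Task Ledger
    stalls = 0
    for _ in range(max_rounds):
        p = reflect(ledger)             # inner loop: update Progress Ledger
        if p.complete:
            return p.answer
        stalls = 0 if p.making_progress else stalls + 1
        if stalls > stall_limit:        # stalled too long: re-plan
            ledger = plan(task)
            stalls = 0
            continue
        # Hand the next subtask to the chosen specialized agent.
        ledger.append(agents[p.next_agent](p.instruction))
    raise TimeoutError("max rounds exceeded")

# Minimal demo: one agent call, then the reflection reports completion.
steps = iter([Progress(instruction="compute 2+2"),
              Progress(complete=True, answer="4")])
result = orchestrate(
    task="add two numbers",
    agents={"coder": lambda instr: f"ran: {instr}"},
    plan=lambda t: [t],
    reflect=lambda ledger: next(steps),
)
print(result)
```

The key design point is that only the outer loop rewrites the plan; the inner loop merely tracks progress and dispatches subtasks, which keeps recovery from errors cheap.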

Magentic-One consists of the following agents:

  • Orchestrator: The lead agent responsible for task decomposition, planning, directing other agents in executing subtasks, tracking overall progress, and taking corrective actions as needed
  • WebSurfer: An LLM-based agent proficient in commanding and managing the state of a Chromium-based web browser. For each request, the WebSurfer performs actions such as navigation (e.g., visiting URLs, performing searches), interacting with webpages (e.g., clicking, typing), and reading actions (e.g., summarizing, answering questions). It then reports on the new state of the webpage. The WebSurfer relies on the browser’s accessibility tree and set-of-marks prompting to perform its tasks.
  • FileSurfer: An LLM-based agent that commands a markdown-based file preview application to read local files. It can also perform common navigation tasks such as listing directory contents and navigating through them.
  • Coder: An LLM-based agent specialized in writing code, analyzing information collected from the other agents, and creating new artifacts.
  • ComputerTerminal: Provides access to a console shell for executing programs and installing new libraries.

Together, Magentic-One’s agents equip the Orchestrator with the tools and capabilities it needs to solve a wide range of open-ended problems and autonomously adapt to, and act in, dynamic and ever-changing web and file-system environments.

While the default multimodal LLM used for all agents is GPT-4o, Magentic-One is model-agnostic, allowing the integration of heterogeneous models to support different capabilities or meet different cost requirements. For example, different LLMs and SLMs or specialized versions can power different agents. For the Orchestrator, we recommend a strong reasoning model, like GPT-4o. In a different configuration, we also experimented with using OpenAI o1-preview for the Orchestrator’s outer loop and for the Coder, while other agents continued to use GPT-4o.

Evaluation

To rigorously evaluate Magentic-One’s performance, we introduce AutoGenBench, an open-source standalone tool for running agentic benchmarks that allows repetition and isolation, e.g., to control for variance of stochastic LLM calls and side-effects of agents taking actions in the world. AutoGenBench facilitates agentic evaluation and allows adding new benchmarks. Using AutoGenBench, we can evaluate Magentic-One on a variety of benchmarks. Our criterion for selecting benchmarks is that they should involve complex multi-step tasks, with at least some steps requiring planning and tool use, including using web browsers to act on real or simulated webpages. We consider three benchmarks in this work that satisfy this criterion: GAIA, AssistantBench, and WebArena.

In the figure below, we show the performance of Magentic-One on the three benchmarks and compare it with GPT-4 operating on its own, as well as with the highest-performing open-source and non-open-source benchmark-specific baselines for each benchmark, according to the public leaderboards as of October 21, 2024. Magentic-One (GPT-4o, o1) achieves statistically comparable performance to previous SOTA methods on both GAIA and AssistantBench, and competitive performance on WebArena. Note that GAIA and AssistantBench have a hidden test set while WebArena does not, so the WebArena results are self-reported. Together, these results establish Magentic-One as a strong generalist agentic system for completing complex tasks.

A bar chart showing evaluation results of Magentic-One on the GAIA, AssistantBench, and WebArena benchmarks. The bars are grouped along the x-axis by benchmark, with bars corresponding to: GPT-4, Benchmark specific non-open source SOTA, Benchmark specific open-source SOTA, Magentic-One (GPT-4o), Magentic-One (GPT-4o, o1-preview), and Human performance, in that order for each benchmark. The y-axis shows “Accuracy (%)” from 0-100%. The chart shows GPT-4 performing worst on all benchmarks (around 7%,16%, and 15%, respectively) while the human level performance (only available for GAIA and WebArena) achieves around 92% and 78%, respectively. The chart shows Magentic-One perform comparably to the SOTA solutions on all benchmarks, aside from the Benchmark specific non-OS SOTA results on WebArena. An asterisk is shown in this case to depict that the non-open-source solutions provide no documentation or implementation for the community.
Evaluation results of Magentic-One on the GAIA, AssistantBench and WebArena. Error bars indicate 95% confidence intervals. Note that WebArena results are self-reported.

Risks and mitigations

Agentic systems like Magentic-One mark a significant shift in both the opportunities and risks associated with AI. Magentic-One interacts with a digital world designed for humans, taking actions that can change states and potentially lead to irreversible consequences. These inherent and undeniable risks were evident during our testing, where several emerging issues surfaced. For example, during development, a misconfiguration led agents to repeatedly attempt and fail to log into a WebArena website. This resulted in the account being temporarily suspended. The agents then tried to reset the account’s password. Even more concerning were cases in which agents, until explicitly stopped, attempted to recruit human assistance by posting on social media, emailing textbook authors, or even drafting a freedom of information request to a government entity. In each case, the agents were unsuccessful due to a lack of the required tools or accounts, or because human observers intervened.

Aligned with the Microsoft AI principles and Responsible AI practices, we worked to identify, measure, and mitigate potential risks before deploying Magentic-One. Specifically, we conducted red-teaming exercises to assess risks related to harmful content, jailbreaks, and prompt injection attacks, finding no increased risk from our design. Additionally, we provide cautionary notices and guidance for using Magentic-One safely, including examples and appropriate default settings. Users are advised to keep humans in the loop for monitoring, and ensure that all code execution examples, evaluations, and benchmarking tools are run in sandboxed Docker containers to minimize risks.

Recommendations and looking forward

We recommend using Magentic-One with models that have strong alignment, pre- and post-generation filtering, and closely monitored logs during and after execution. In our own use, we follow the principles of least privilege and maximum oversight. Minimizing risks associated with agentic AI will require new ideas and extensive research, as much work is still needed to understand these emerging risks and develop effective mitigations. We are committed to sharing our learnings with the community and evolving Magentic-One in line with the latest safety research.

As we look ahead, there are valuable opportunities to improve agentic AI, particularly in safety and Responsible AI research. Agents acting on the public web may be vulnerable to phishing, social engineering, and misinformation threats, much like human users. To counter these risks, an important direction is to equip agents with the ability to assess the reversibility of their actions—distinguishing between those that are easily reversible, those that require effort, and those that are irreversible. Actions like deleting files, sending emails, or filing forms are often difficult or impossible to undo. Systems should therefore be designed to pause and seek human input before proceeding with such high-risk actions.
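As a concrete illustration of the reversibility idea, an agent runtime might tag each action type with a reversibility level and defer irreversible ones to a human. The action kinds and categories below are hypothetical examples, not part of any released Magentic-One API:

```python
# Hypothetical reversibility table; unknown actions default to the
# worst case so that new tools are gated until explicitly classified.
REVERSIBILITY = {
    "navigate_page": "reversible",
    "fill_form": "effortful",      # undoable, but with some work
    "send_email": "irreversible",
    "delete_file": "irreversible",
}

def approve(action_kind, ask_human):
    """Allow reversible actions; defer irreversible ones to a human."""
    level = REVERSIBILITY.get(action_kind, "irreversible")
    if level == "irreversible":
        return ask_human(action_kind)  # pause and seek human input
    return True

print(approve("navigate_page", ask_human=lambda a: False))
print(approve("send_email", ask_human=lambda a: False))
```

A guard like this is deliberately conservative: it trades some autonomy for the guarantee that high-risk actions never execute without oversight.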

We invite the community to collaborate with us in ensuring that future agentic systems are both helpful and safe.

For further information, results and discussion, please see our technical report. (opens in new tab)

decorative image and the text

Azure AI Foundry Labs

Get a glimpse of potential future directions for AI, with these experimental technologies from Microsoft Research.

The post Magentic-One: A Generalist Multi-Agent System for Solving Complex Tasks appeared first on Microsoft Research.

]]>
OmniParser for pure vision-based GUI agent http://approjects.co.za/?big=en-us/research/articles/omniparser-for-pure-vision-based-gui-agent/ Tue, 08 Oct 2024 22:31:18 +0000 http://approjects.co.za/?big=en-us/research/?post_type=msr-blog-post&p=1091139 By Yadong Lu, Senior Researcher; Jianwei Yang, Principal Researcher; Yelong Shen, Principal Research Manager; Ahmed Awadallah, Partner Research Manager Recent advancements in large vision-language models (VLMs), such as GPT-4V and GPT-4o, have demonstrated considerable promise in driving intelligent agent systems that operate within user interfaces (UI). However, the full potential of these multimodal models remains […]

The post OmniParser for pure vision-based GUI agent appeared first on Microsoft Research.

]]>
By Yadong Lu, Senior Researcher; Jianwei Yang, Principal Researcher; Yelong Shen, Principal Research Manager; Ahmed Awadallah, Partner Research Manager

Recent advancements in large vision-language models (VLMs), such as GPT-4V and GPT-4o, have demonstrated considerable promise in driving intelligent agent systems that operate within user interfaces (UI). However, the full potential of these multimodal models remains underexplored in real-world applications, particularly when it comes to acting as general agents across diverse operating systems and applications with only vision input. One of the primary limiting factors is the absence of a robust technique for screen parsing which is capable of 1) reliably identifying interactable icons within the user interface, and 2) understanding the semantics of various elements in a screenshot and accurately associate the intended action with the corresponding region on the screen.

Meet OmniParser, a compact screen-parsing module that can convert UI screenshots into structured elements. OmniParser can be used with a variety of models to create agents capable of taking actions on UIs. When used with GPT-4V, it significantly improves the agent’s ability to generate precisely grounded actions for interface regions.

An agent using OmniParser and GPT-4V achieved the best performance on the recently released WindowsAgentArena (opens in new tab) benchmark.

We are making OmniParser publicly available on GitHub, along with a report describing the training procedure to encourage research on creating agents that can act on different applications and environments.

Creating OmniParser

Curating Specialized Datasets–The development of OmniParser began with the creation of two datasets:

  • An interactable icon detection dataset, which was curated from popular web pages and annotated to highlight clickable and actionable regions.
  • An icon description dataset, designed to associate each UI element with its corresponding function. This dataset serves as a key component for training models to understand the semantics of detected elements.

Fine-Tuning Detection and Captioning Models–OmniParser leverages two complementary models:

  • A detection model, fine-tuned on the interactable icon dataset, which reliably identifies actionable regions within a screenshot.
  • A captioning model, trained on the icon description dataset, which extracts the functional semantics of the detected elements, generating contextually accurate descriptions of their intended actions.
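The two fine-tuned models compose into a simple pipeline: detect interactable regions, then caption each one. The sketch below is illustrative; `detect` and `caption` are placeholders for the fine-tuned detection and captioning models, not real OmniParser functions:

```python
def parse_screenshot(image, detect, caption):
    """Two-stage parse: find interactable regions, then describe them.

    detect(image) -> list of bounding boxes for interactable regions;
    caption(image, box) -> a short functional description of that region.
    """
    elements = []
    for i, box in enumerate(detect(image)):
        elements.append({"id": i, "bbox": box, "caption": caption(image, box)})
    return elements

# Toy stand-ins for the fine-tuned detection and captioning models.
demo = parse_screenshot(
    image="screenshot.png",
    detect=lambda img: [(10, 10, 50, 30), (60, 10, 120, 30)],
    caption=lambda img, box: f"button at {box}",
)
print(len(demo))
```

Because the two stages are decoupled, either model can be retrained or swapped independently, which is what the fine-tuning steps above exploit.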

Benchmark performance

We demonstrate that, with the parsed results, the performance of GPT-4V is greatly improved on the ScreenSpot benchmarks. On Mind2Web, OmniParser+GPT-4V achieves better performance than a GPT-4V agent that uses extra information extracted from HTML. On the AITW benchmark, OmniParser outperforms GPT-4V augmented with a specialized Android icon-detection model trained with view-hierarchy data. It also achieves the best performance on the new WindowsAgentArena (opens in new tab) benchmark!

Figure: OmniParser’s average performance across benchmarks.
Figure: OmniParser as a plugin for other vision-language models.

To further demonstrate that OmniParser is a plug-in choice for off-the-shelf vision-language models, we show the ScreenSpot benchmark performance of OmniParser combined with the recently announced vision-language models Phi-3.5-V and Llama-3.2-V. We hope OmniParser can serve as a general, easy-to-use tool for parsing user screens across both PC and mobile platforms, without any dependency on extra information such as HTML or the Android view hierarchy.

The post OmniParser for pure vision-based GUI agent appeared first on Microsoft Research.

]]>
Direct Nash Optimization: Teaching language models to self-improve with general preferences http://approjects.co.za/?big=en-us/research/articles/direct-nash-optimization-teaching-language-models-to-self-improve-with-general-preferences/ Tue, 03 Sep 2024 19:07:10 +0000 http://approjects.co.za/?big=en-us/research/?post_type=msr-blog-post&p=1079043 This talk discusses teaching language models to self-improve using a preference oracle like GPT-4, framing it as a two-player game to find an optimal policy at a Nash equilibrium, and achieving state-of-the-art win rates against GPT-4 Turbo on benchmarks such as AlpacaEval and MT-Bench.

The post Direct Nash Optimization: Teaching language models to self-improve with general preferences appeared first on Microsoft Research.

]]>
Presented by Corby Rosset at Microsoft Research Forum, September 2024

Corby Rosset

“The traditional way to fine-tune an LLM for post-training … basically tells the model to emulate good behaviors, but it does not target or correct any mistakes or bad behaviors that it makes explicitly. … Self-improving post-training explicitly identifies and tries to correct bad behaviors or mistakes that the model makes.”

Corby Rosset, Senior Researcher, Microsoft Research AI Frontiers

Transcript: Lightning Talk

Direct Nash Optimization: Teaching language models to self-improve with general preferences

Corby Rosset, Senior Researcher, Microsoft Research AI Frontiers

This talk discusses teaching language models to self-improve using a preference oracle like GPT-4, framing it as a two-player game to find an optimal policy at a Nash equilibrium, and achieving state-of-the-art win rates against GPT-4 Turbo on benchmarks such as AlpacaEval and MT-Bench.

Microsoft Research Forum, September 3, 2024

CORBY ROSSET: Hi, I’m Corby. I’m a scientist in Microsoft Research. Today, we’re going to be talking about Direct Nash Optimization, which is a technique to help language models self-improve.

We all know that there are two main ways to improve language models. One is to scale up the number of parameters or to scale up the amount of training data. Both of these approaches are costly even for the post-training techniques. The traditional way to fine-tune an LLM for post-training is using SFT. SFT basically tells the model to emulate good behaviors, but it does not target or correct any mistakes or bad behaviors that it makes explicitly. More advanced post-training techniques such as RLHF use a fixed reward model, which can be easily hacked or go stale during training and involves much more complex reinforcement learning, which can be unstable. Self-improving post-training explicitly identifies and tries to correct bad behaviors or mistakes that the model makes.

Before we move on, we want to give a concrete example of what we mean by self-improving behavior. Here’s a simple geometry problem where a base model that was already SFTed makes a simple arithmetic error on the left-hand side. After our self-improving technique, the model is able to correct this mistake.

Here we give a simple overview of how Direct Nash Optimization works. One of the properties of generative LLMs is that you can sample multiple outputs from them. This is advantageous because what we can do is, given an input, we can take our language model and sample, in this case, two outputs—answer A and answer B—and we can have them scored or rated by a preference function oracle, which tells us which response is better. Then we can use a contrastive training mechanism, such as DPO or IPO or others to update the parameters of the language model to hopefully improve it. In the next iteration, timestep t+1, we repeat the process over again. The key insight of this technique is how we define reward. Typically, in the RLHF framework, we want to maximize the reward of a language model policy against some given external reward model. Here, we redefine “reward” as the expected win rate against your own behavior as judged by a preference function P. What this means is that for a given response y to an input x, the reward of that response is defined as the expected win rate against y primes sampled from the policy itself. Hence, rewards are maximized by responses that are preferred over other responses.
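The loop just described can be sketched in a few lines. This is an illustrative simplification, not the paper's implementation: `sample` and `prefer` stand in for on-policy generation and the GPT-4 preference oracle, and the loss shown is the standard DPO-style contrastive objective on one (winner, loser) pair:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO-style contrastive loss on one (winner, loser) pair.

    logp_* are total log-probabilities of the preferred (w) and
    dispreferred (l) responses under the current policy; ref_logp_*
    are the same under the frozen reference policy.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)

def dno_iteration(prompts, sample, prefer):
    """One DNO round (illustrative): sample response pairs on-policy and
    rank them with the preference oracle to build contrastive pairs."""
    pairs = []
    for x in prompts:
        a, b = sample(x), sample(x)  # two on-policy responses
        winner, loser = (a, b) if prefer(x, a, b) else (b, a)
        pairs.append((x, winner, loser))
    return pairs  # fed to a DPO/IPO-style update of the policy

# Toy demo with a length-based stand-in for the preference oracle.
responses = iter(["4", "The answer is 4, because 2+2=4."])
pairs = dno_iteration(
    prompts=["What is 2+2?"],
    sample=lambda x: next(responses),
    prefer=lambda x, a, b: len(a) > len(b),
)
print(pairs[0][1])
```

Note that when both responses have equal policy and reference log-probabilities, the margin is zero and the loss sits at its indifference value of log 2; training pushes the margin positive for preferred responses.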

When you start comparing the y primes, or the model’s own outputs to each other, this incentivizes a self-improving behavior because you’re basically competing against yourself. You can formulate this in a game theoretic manner where, in this game, you have a single player which is competing against itself, and the payoffs are given by the preference function. In this game, a Nash equilibrium is achieved by the best possible π* whose responses are preferred over any other competing policy in its class.
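In symbols, a sketch consistent with the talk's description (the notation here is ours) is:

```latex
% Reward of response y to prompt x under the current policy \pi_t and
% preference oracle \mathcal{P}: the expected win rate against the
% policy's own samples.
r_t(x, y) \;=\; \mathbb{E}_{y' \sim \pi_t(\cdot \mid x)}
    \big[\, \mathcal{P}(y \succ y' \mid x) \,\big]

% A policy \pi^* sits at the Nash equilibrium of this self-play game
% when no competing policy \pi is preferred over it in expectation:
\mathbb{E}_{x}\, \mathbb{E}_{y \sim \pi^*,\; y' \sim \pi}
    \big[\, \mathcal{P}(y \succ y' \mid x) \,\big] \;\ge\; \tfrac{1}{2}
    \quad \text{for all competing policies } \pi .
```

In words: a response earns high reward exactly when the oracle prefers it over the policy's own alternatives, so maximizing reward is self-play against the current policy.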

At a high level, Direct Nash Optimization has many advantages. Firstly, it optimizes towards a more general preference function directly rather than a point-wise reward model, which is limited in its expressiveness since it cannot model intransitive preferences. Secondly, it is an iterative algorithm, meaning it is much simpler to implement. We use a contrastive update as the loss, which does not involve any policy gradients or heavy reinforcement learning machinery. We also sample on-policy outputs from the model and compare them to each other in a self-play framework. We use a powerful preference annotator—in this case, GPT-4—to rank or judge the best response among them. This approach is also flexible since we can compare the responses to each other but also to outputs from a more powerful teacher such as GPT-4, which provides even bigger improvements. Most importantly, this algorithm is theoretically guaranteed to monotonically approach the Nash equilibrium, hence the name Direct Nash Optimization.

If you implement this algorithm correctly, you will find state-of-the-art results on several benchmarks, including this one, which is AlpacaEval2. This benchmark basically measures how well language models follow instructions and align with human expectations. This benchmark computes a win rate of the language model’s outputs versus a powerful reference—in this case, GPT-4—in a side-by-side comparison. The y-axis is the win rate, and the x-axis is the amount of iterations of training. We see that the dark blue line, which is DNO, the vanilla implementation, outperforms two important baselines. The red line is SFT, and the orange and yellow lines are offline contrastive algorithms, such as DPO and KTO. Hence, we see that self-improving post-training is better than offline contrastive training and SFT. Notably, DNO is also able to outperform similar training techniques from other models, which were 10 times as large, namely the gray line, which was a 70 billion parameter Llama model. We are also encouraged to see that these results do not saturate, and with more training in the purple line over more iterations, we see even better results.

We hope this work inspires other researchers to continue to investigate self-improving post-training as an effective method for aligning language models with human expectations. Thank you for watching.

The post Direct Nash Optimization: Teaching language models to self-improve with general preferences appeared first on Microsoft Research.

]]>
AutoGen Update: Complex Tasks and Agents http://approjects.co.za/?big=en-us/research/articles/autogen-update-complex-tasks-and-agents/ Tue, 04 Jun 2024 18:08:31 +0000 http://approjects.co.za/?big=en-us/research/?post_type=msr-blog-post&p=1035039 Adam Fourney discusses the effectiveness of using multiple agents, working together, to complete complex multi-step tasks. He will showcase their capability to outperform previous single-agent solutions on benchmarks like GAIA, utilizing customizable arrangements of agents that collaborate, reason, and utilize tools to achieve complex outcomes.

The post AutoGen Update: Complex Tasks and Agents appeared first on Microsoft Research.

]]>
Presented by Adam Fourney at Microsoft Research Forum, June 2024

Adam Fourney

“Agents are a very, very powerful abstraction over things like task decomposition, specialization, tool use, etc. Really, you think about which roles you need on your team, and you put together your team of agents, and you get them to talk to one another, and then you start making progress on your task.”

Adam Fourney, Principal Researcher, Microsoft Research AI Frontiers

Transcript: Lightning Talk

AutoGen Update: Complex Tasks and Agents

Adam Fourney, Principal Researcher, Microsoft Research AI Frontiers

Adam Fourney discusses the effectiveness of using multiple agents working together to complete complex multi-step tasks. He showcases their capability to outperform previous single-agent solutions on benchmarks like GAIA, using customizable arrangements of agents that collaborate, reason, and use tools to achieve complex outcomes.

Microsoft Research Forum, June 4, 2024

ADAM FOURNEY: Hello, my name is Adam Fourney, and today, I’ll be presenting our work on completing complex tasks with agents. And though I’m presenting, I’m sharing the contributions of many individuals as listed below. All right, so let’s just dive in.

So in this presentation, I’ll share our goal, which is to reliably accomplish long-running complex tasks using large foundational models. I’ll explain the bet that we’re taking on using multi-agent workflows as the platform or the vehicle to get us there, and I’ll share a little bit about our progress in using a four-agent workflow to achieve state-of-the-art performance on a recent benchmark.

So what exactly is a complex task? Well, if we take a look at the following example from the GAIA benchmark for General AI Assistants, it reads, “How many nonindigenous crocodiles were found in Florida from the years 2000 through 2020?” Well, to solve this task, we might begin by performing a search and discovering that the U.S. Geological Survey maintains an online database for nonindigenous aquatic species. If we access that resource, we can form an appropriate query, and we’ll get back results for two separate species. If we open the collection reports for each of those species, we’ll find that in one instance, five crocodiles were encountered, and in the other, just a single crocodile was encountered, giving a total of six separate encounters during those years. So this is an example of a complex task, and it has the characteristics typical of tasks of this nature: it benefits strongly from planning, acting, observing, and reflecting over multiple steps, where those steps are doing more than just generating tokens. Maybe they’re executing code. Maybe they’re using tools or interacting with the environment. And the observations they make add information that was previously unavailable. So these are the types of tasks that we’re interested in here. And as I mentioned before, we’re betting on multi-agent workflows as the vehicle to get us there.

So why multi-agents? Well, first of all, the whole setup feels very agentic from, sort of, a first-principles point of view. The agents are reasoning, they’re acting, and then they’re observing the outcomes of their actions. So this is very natural. But more generally, agents are a very, very powerful abstraction over things like task decomposition, specialization, tool use, etc. Really, you think about which roles you need on your team, and you put together your team of agents, and you get them to talk to one another, and then you start making progress on your task. So to do all this, to build all this, we are producing a platform called AutoGen, which is open source and available on GitHub. And I encourage you to check this out at the link below.

All right, so now let’s talk about the progress we’ve been making using this approach. So if you recall that question about crocodiles from the beginning, that’s from the GAIA benchmark for General AI Assistants. And we put together four agents to work on these types of problems. It consists of a general assistant, a computer terminal that can run code or execute programs, a web surfer that can browse the internet, and an orchestrator to, sort of, organize and oversee their work. Now with that team of four agents, we were actually able, in March, to achieve the top result on the GAIA leaderboard for that benchmark, leading by about 8 points. But what’s perhaps more exciting to us is that we were able to more than double the performance on the hardest set of questions, the Level 3 questions, which the authors of that work describe as questions for a perfect general assistant, requiring arbitrarily long sequences of actions, the use of any number of tools, and access to the world in general. So this is all very exciting, and I want to share a little bit more about what those agents are actually doing.

So this is the loop or the plan that they are following. So it begins with the question or the prompt, and then we produce a ledger, which is like a working memory that consists of given or verified facts; facts that we need to look up, for example, on the internet; facts that we need to derive, perhaps through computation; and educated guesses. Now these educated guesses turn out to be really important because they give the language models space to speculate in a constrained environment without some of the downstream negative effects of hallucination. So once we have that ledger, we assign the tasks to the independent agents, and then we go into this inner loop, where we ask first, are we done? If not, well, are we still making progress? As long as we’re making progress, we’ll go ahead and we’ll delegate the next step to the next agent. But if we’re not making progress, we’ll note that down. We might still delegate one other step, but if that stall occurs for three rounds, then we will actually go back, update the ledger, come up with a new set of assignments for the agents, and then start over.
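The outer and inner loops just described can be sketched in Python. Every name here is illustrative, standing in for the orchestrator's actual logic, and none of this is AutoGen's real API:

```python
def orchestrate(task, make_ledger, is_done, making_progress, delegate_next_step,
                max_stalls=3):
    """Outer loop: (re)build the ledger and agent assignments.
    Inner loop: delegate steps to agents until the task is done or
    progress has stalled for max_stalls rounds, then re-plan."""
    ledger = make_ledger(task, previous=None)  # facts given, to look up, to derive, guesses
    while not is_done(ledger):
        stalls = 0
        while stalls < max_stalls:
            if is_done(ledger):
                return ledger
            # a stall resets only when progress resumes
            stalls = 0 if making_progress(ledger) else stalls + 1
            # even after noting a stall, we may still delegate one more step
            ledger = delegate_next_step(ledger)
        # stalled for max_stalls rounds: update the ledger, reassign, start over
        ledger = make_ledger(task, previous=ledger)
    return ledger
```

With toy callables for the ledger and agents, the loop terminates either by completing steps or by re-planning after a stall, mirroring the three-round stall rule described above.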

All right, so this is the configuration that’s been working well for us, and it’s all I have time to share with you today. But I mentioned our goal, our bet, and our progress, and I want to conclude by sharing our plans for the future. So already we’re starting to tackle increasingly more complex benchmarks and real-world scenarios with this configuration. And we’re really excited about opportunities to introduce new agents that, for example, learn and self-improve with experience; that understand images and screenshots a little better for maybe more effective web surfing or use of interfaces; and that are maybe a bit more systematic about exploring that solution space. So rather than just updating that ledger and then restarting when they get stuck, they can be a bit more pragmatic about the strategies that they’re employing.

All right, well, thank you for your attention, and thank you for attending the Microsoft Research Forum, and we look forward to you joining us next time.

The post AutoGen Update: Complex Tasks and Agents appeared first on Microsoft Research.

]]>