{"id":1162678,"date":"2026-02-20T09:39:02","date_gmt":"2026-02-20T17:39:02","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-blog-post&p=1162678"},"modified":"2026-02-20T10:38:33","modified_gmt":"2026-02-20T18:38:33","slug":"experiential-reinforcement-learning","status":"publish","type":"msr-blog-post","link":"https:\/\/www.microsoft.com\/en-us\/research\/articles\/experiential-reinforcement-learning\/","title":{"rendered":"Experiential Reinforcement Learning"},"content":{"rendered":"\n
By Taiwei Shi, Sihao Chen<\/a>, Longqi Yang<\/a>, Jaime Teevan<\/a><\/p>\n\n\n\n Reinforcement Learning is at the core of building and improving frontier AI models and products. Yet most state-of-the-art RL methods learn primarily from outcomes: a scalar reward signal that says whether an attempt worked, not why it failed. When an agent writes code that doesn\u2019t compile, for example, it may receive only a 0\/1 score (\u201cfailed\u201d vs. \u201cworked\u201d). Lacking an explanation or a concrete path to correction, the agent must try again, often thousands of times, until incremental parameter updates eventually produce a successful solution. This is particularly problematic for collaborative scenarios that are social, long-horizon, and hard to score.<\/p>\n\n\n\n Humans don\u2019t learn that way. When you get better at collaborating, for example, it\u2019s rarely by seeing success or failure alone; you talk through what went wrong, share context, and adjust together. Teamwork improves through reflection, not just outcomes. Today\u2019s AI agents largely lack this reflective loop. <strong>Experiential Reinforcement Learning (ERL)<\/strong> asks: what if an agent could pause, reflect on its mistakes, and use those insights to improve?<\/p>\n\n\n\n<h3>The core idea: learning through experience<\/h3>\n\n\n\n Rather than relying on imitation or blind retries, ERL teaches an agent to turn feedback into <strong>structured behavioral revision<\/strong>. The method follows five steps:<\/p>\n\n\n\n<ol>\n<li>The agent makes a first attempt at the task.<\/li>\n<li>It receives outcome feedback from the environment.<\/li>\n<li>It reflects on what went wrong and why.<\/li>\n<li>It produces a revised second attempt informed by that reflection.<\/li>\n<li>Successful corrections are internalized: distilled back into the policy so that future first attempts improve.<\/li>\n<\/ol>\n\n\n\n<h3>How ERL mirrors how humans learn from experience<\/h3>\n\n\n\n The underlying idea isn\u2019t new; it\u2019s new in this context. In the 1980s, education researcher David Kolb argued that people learn most effectively by cycling through experience and reflection: you have a concrete experience, reflect on what happened, form a revised understanding, and then try again. That cycle (experience, reflect, conceptualize, experiment) helps explain why one student learns from a failed exam while another simply retakes it. 
ERL can be seen as a computational version of Kolb\u2019s cycle: the first attempt is the concrete experience; the reflection is the reflective observation; the revised second attempt puts a new conceptualization into practice. Finally, the internalization step, where successful corrections are distilled back into the policy, mirrors how people eventually stop needing to consciously work through the cycle because the lesson becomes automatic.<\/p>\n\n\n\n<h3>Results<\/h3>\n\n\n\n Across agentic reasoning and tool-use tasks, ERL consistently outperforms standard RL. The largest gains appear in settings with minimal upfront instruction: environments where the agent must infer the \u201crules of the game\u201d through interaction. In these open-ended regimes, reflection and revision become a primary driver of learning, and ERL is most valuable precisely where outcome-only RL tends to struggle.<\/p>\n\n\n\n <strong>Experience-driven learning could become a core primitive for future intelligent systems, shifting AI from optimizing outcomes to accumulating understanding through interaction.<\/strong><\/p>\n\n\n\n<h3>Looking ahead: learning through interaction in human-AI collaboration<\/h3>\n\n\n\n The real promise of ERL points to a future where AI learns to collaborate with people. Human collaboration isn\u2019t a fixed environment with a clean reward signal; it\u2019s fluid, social, and deeply contextual. A good collaborator reads the room, adapts to a partner\u2019s working style, recovers gracefully from misunderstandings, and builds a shared history of what works.<\/p>\n\n\n\n Today\u2019s AI agents don\u2019t do much of that; they often treat each interaction as if it\u2019s the first. With ERL, an agent could reflect on why a conversation went sideways, revise its approach, and internalize the lesson for next time. Over time, it might learn that one user prefers concise answers, while another values detailed reasoning, and it could adapt accordingly. 
In effect, the agent\u2019s way of working with you could become more personalized and reliable, like a trusted colleague.<\/p>\n\n\n\n ERL offers a concrete mechanism, not just a vision, for how AI might get there: not by hard-coding social rules, but by learning them the way people do, through experience.<\/p>\n\n\n\n
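As a concrete illustration, the five-step attempt\u2013feedback\u2013reflect\u2013revise\u2013internalize loop can be sketched in a few lines of Python. Everything here is a hypothetical stand-in, not the actual implementation: the toy <code>run_task<\/code> environment, the <code>reflect<\/code> and <code>revise<\/code> helpers, and the experience buffer are placeholders for a real environment, a language model conditioned on its own reflection, and a distillation dataset, respectively.

```python
# Minimal sketch of an Experiential RL (ERL) episode. All names here are
# illustrative stand-ins, not part of any published ERL codebase.

def run_task(attempt: str) -> tuple[bool, str]:
    """Toy environment: the attempt succeeds only if it mentions 'sorted'."""
    if "sorted" in attempt:
        return True, "ok"
    return False, "output was not in sorted order"

def reflect(attempt: str, feedback: str) -> str:
    """Step 3: turn raw outcome feedback into a natural-language lesson."""
    return f"Previous attempt failed because: {feedback}. Address this directly."

def revise(attempt: str, reflection: str) -> str:
    """Step 4: produce a second attempt conditioned on the reflection.

    A real agent would re-generate with the reflection in its context;
    here we apply the lesson mechanically to keep the sketch runnable.
    """
    return attempt + " (return the sorted result)"

def erl_episode(first_attempt: str, experience_buffer: list) -> bool:
    # Steps 1-2: first attempt plus outcome feedback from the environment.
    success, feedback = run_task(first_attempt)
    if success:
        return True
    # Step 3: reflect on why the attempt failed.
    lesson = reflect(first_attempt, feedback)
    # Step 4: revised second attempt informed by the reflection.
    second_attempt = revise(first_attempt, lesson)
    success, _ = run_task(second_attempt)
    if success:
        # Step 5: internalize -- record the successful correction so it can
        # later be distilled back into the policy (e.g. via fine-tuning).
        experience_buffer.append(
            {"before": first_attempt, "after": second_attempt, "lesson": lesson}
        )
    return success

buffer: list = []
ok = erl_episode("write a function that returns the list", buffer)
print(ok, len(buffer))  # -> True 1
```

The key design point the sketch captures is that the reflection is an explicit, inspectable artifact rather than a gradient: the lesson travels with the successful correction into the buffer, which is exactly what makes the later distillation step possible.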
<h3>Learn More<\/h3>\n\n\n\n