{"id":1162678,"date":"2026-02-20T09:39:02","date_gmt":"2026-02-20T17:39:02","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-blog-post&p=1162678"},"modified":"2026-02-20T10:38:33","modified_gmt":"2026-02-20T18:38:33","slug":"experiential-reinforcement-learning","status":"publish","type":"msr-blog-post","link":"https:\/\/www.microsoft.com\/en-us\/research\/articles\/experiential-reinforcement-learning\/","title":{"rendered":"Experiential Reinforcement Learning"},"content":{"rendered":"\n
By Taiwei Shi, Sihao Chen<\/a>, Longqi Yang<\/a>, Jaime Teevan<\/a><\/p>\n\n\n\n Reinforcement Learning is at the core of building and improving frontier AI models and products. Yet most state-of-the-art RL methods learn primarily from outcomes: a scalar reward signal that says whether an attempt worked, not why it failed. When an agent writes code that doesn\u2019t compile, for example, it may receive only a 0\/1 score (\u201cfailed\u201d vs. \u201cworked\u201d). Lacking an explanation or a concrete path to correction, the agent must try again, often thousands of times, until incremental parameter updates eventually produce a successful solution. This is particularly problematic for collaborative scenarios that are social, long-horizon, and hard to score.<\/p>\n\n\n\n Humans don\u2019t learn that way. When you get better at collaborating, for example, it\u2019s rarely by seeing success or failure alone; you talk through what went wrong, share context, and adjust together. Teamwork improves through reflection, not just outcomes. Today\u2019s AI agents largely lack this reflective loop. <strong>Experiential Reinforcement Learning (ERL)<\/strong> asks: what if an agent could pause, reflect on its mistakes, and use those insights to improve?<\/p>\n\n\n\n<h3>The core idea: learning through experience<\/h3>\n\n\n\n Rather than relying on imitation or blind retries, ERL teaches an agent to turn feedback into <strong>structured behavioral revision<\/strong>. The method follows five steps:<\/p>\n\n\n\n<ol>\n<li>The agent makes a first attempt at the task.<\/li>\n<li>It receives outcome feedback from the environment.<\/li>\n<li>It reflects on what went wrong and why.<\/li>\n<li>It produces a revised second attempt informed by that reflection.<\/li>\n<li>Successful corrections are internalized: distilled back into the policy so that future first attempts improve.<\/li>\n<\/ol>\n\n\n\n<h3>How ERL mirrors how humans learn from experience<\/h3>\n\n\n\n The underlying idea isn\u2019t new; it\u2019s new in this context. In the 1980s, education researcher David Kolb argued that people learn most effectively by cycling through experience and reflection: you have a concrete experience, reflect on what happened, form a revised understanding, and then try again. That cycle (experience, reflect, conceptualize, experiment) helps explain why one student learns from a failed exam while another simply retakes it. 
ERL can be seen as a computational version of Kolb\u2019s cycle: the first attempt is the concrete experience; the reflection is the reflective observation; the revised second attempt puts a new conceptualization into practice. Finally, the internalization step, where successful corrections are distilled back into the policy, mirrors how people eventually stop needing to consciously work through the cycle because the lesson becomes automatic.<\/p>\n\n\n\n<h3>Results<\/h3>\n\n\n\n Across agentic reasoning and tool-use tasks, ERL consistently outperforms standard RL. The largest gains appear in settings with minimal upfront instruction: environments where the agent must infer the \u201crules of the game\u201d through interaction. In these open-ended regimes, reflection and revision become a primary driver of learning, and ERL is most valuable precisely where outcome-only RL tends to struggle.<\/p>\n\n\n\n <strong>Experience-driven learning could become a core primitive for future intelligent systems, shifting AI from optimizing outcomes to accumulating understanding through interaction.<\/strong><\/p>\n\n\n\n<h3>Looking ahead: learning through interaction in human-AI collaboration<\/h3>\n\n\n\n The real promise of ERL points to a future where AI learns to collaborate with people. Human collaboration isn\u2019t a fixed environment with a clean reward signal; it\u2019s fluid, social, and deeply contextual. A good collaborator reads the room, adapts to a partner\u2019s working style, recovers gracefully from misunderstandings, and builds a shared history of what works.<\/p>\n\n\n\n Today\u2019s AI agents don\u2019t do much of that; they often treat each interaction as if it\u2019s the first. With ERL, an agent could reflect on why a conversation went sideways, revise its approach, and internalize the lesson for next time. Over time, it might learn that one user prefers concise answers, while another values detailed reasoning, and it could adapt accordingly. 
In effect, the agent\u2019s way of working with you could become more personalized and reliable, like a trusted colleague.<\/p>\n\n\n\n ERL offers a concrete mechanism, not just a vision, for how AI might get there: not by hard-coding social rules, but by learning them the way people do, through experience.<\/p>\n\n\n\n
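As a concrete illustration, the five-step attempt\u2013feedback\u2013reflect\u2013revise\u2013internalize loop can be sketched in a few lines of Python. Everything here is a hypothetical stand-in, not the actual implementation: the toy <code>run_task<\/code> environment, the <code>reflect<\/code> and <code>revise<\/code> helpers, and the experience buffer are placeholders for a real environment, a language model conditioned on its own reflection, and a distillation dataset, respectively.

```python
# Minimal sketch of an Experiential RL (ERL) episode. All names here are
# illustrative stand-ins, not part of any published ERL codebase.

def run_task(attempt: str) -> tuple[bool, str]:
    """Toy environment: the attempt succeeds only if it mentions 'sorted'."""
    if "sorted" in attempt:
        return True, "ok"
    return False, "output was not in sorted order"

def reflect(attempt: str, feedback: str) -> str:
    """Step 3: turn raw outcome feedback into a natural-language lesson."""
    return f"Previous attempt failed because: {feedback}. Address this directly."

def revise(attempt: str, reflection: str) -> str:
    """Step 4: produce a second attempt conditioned on the reflection.

    A real agent would re-generate with the reflection in its context;
    here we apply the lesson mechanically to keep the sketch runnable.
    """
    return attempt + " (return the sorted result)"

def erl_episode(first_attempt: str, experience_buffer: list) -> bool:
    # Steps 1-2: first attempt plus outcome feedback from the environment.
    success, feedback = run_task(first_attempt)
    if success:
        return True
    # Step 3: reflect on why the attempt failed.
    lesson = reflect(first_attempt, feedback)
    # Step 4: revised second attempt informed by the reflection.
    second_attempt = revise(first_attempt, lesson)
    success, _ = run_task(second_attempt)
    if success:
        # Step 5: internalize -- record the successful correction so it can
        # later be distilled back into the policy (e.g. via fine-tuning).
        experience_buffer.append(
            {"before": first_attempt, "after": second_attempt, "lesson": lesson}
        )
    return success

buffer: list = []
ok = erl_episode("write a function that returns the list", buffer)
print(ok, len(buffer))  # -> True 1
```

The key design point the sketch captures is that the reflection is an explicit, inspectable artifact rather than a gradient: the lesson travels with the successful correction into the buffer, which is exactly what makes the later distillation step possible.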
<h3>Learn More<\/h3>\n\n\n\n