{"id":1168112,"date":"2026-04-08T13:18:08","date_gmt":"2026-04-08T20:18:08","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-blog-post&p=1168112"},"modified":"2026-04-08T14:23:13","modified_gmt":"2026-04-08T21:23:13","slug":"memento-teaching-llms-to-manage-their-own-context","status":"publish","type":"msr-blog-post","link":"https:\/\/www.microsoft.com\/en-us\/research\/articles\/memento-teaching-llms-to-manage-their-own-context\/","title":{"rendered":"Memento: Teaching LLMs to Manage Their Own Context"},"content":{"rendered":"
Vasilis Kontonis, Yuchen Zeng, Shivam Garg, Lingjiao Chen, Hao Tang, Ziyan Wang, Ahmed Awadallah, Eric Horvitz, John Langford, Dimitris Papailiopoulos<\/p>\n\n\n\n We taught models to compress their own chain-of-thought mid-generation. Peak KV cache drops 2\u20133x, throughput nearly doubles, and the erased reasoning blocks leave traces in the KV cache that the model still uses. Paper, OpenMemento dataset (228K traces), and vLLM fork are all open.<\/p>\n\n\n\n If you’re too busy to read this, here’s what we found:<\/p>\n\n\n\n It’s well established at this point that reasoning models can solve hard problems by generating a lot of tokens. Test-time compute works and has led to dramatic advances on competition-level math and coding, but it can also result in a single inference call producing hundreds of thousands of tokens. That is roughly the length of a book. All these tokens stay in memory, attended to at equal cost, whether they lead somewhere or not. The model has no built-in mechanism to compact what it has figured out, keep the conclusions, and move on.<\/p>\n\n\n\n There are ways to manage this externally, e.g., by running a separate summarizer, restarting API calls with condensed context, or building orchestration logic around the model. However, these are all systems bolted around the model rather than skills the model itself has learned. We think figuring out what to remember and what to forget can and should be a skill that the model learns during training<\/em>.<\/p>\n\n\n\n Memento<\/strong> teaches language models exactly this. A Memento-trained (aka a mementified<\/em>) model segments its reasoning into semantically coherent blocks. When a block is complete, the model produces a memento: a terse, information-dense compression of the block’s conclusions, key intermediate values, formulas, and strategic decisions. 
Think of a memento as a lemma: a minimal record of what future reasoning steps need to continue.<\/p>\n\n\n\n Once a memento is generated, the preceding thinking block is masked from attention and its KV cache entries are freed. From that point on, the model sees only past mementos plus whatever block it is currently working through. Context therefore grows while the model reasons through a block, then drops sharply once the memento is produced and the block is evicted. The result is a sawtooth pattern in which peak memory stays at a fraction of what a standard flat CoT trace would require. Here’s what this looks like:<\/p>\n\n\n\n
<\/figure>\n\n\n\n\n
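The memory dynamics above can be reproduced with a few lines of Python. This is a toy simulation, not the released implementation: the function name, the block sizes, and the ~5% memento compression ratio are all illustrative assumptions.<\/p>\n\n\n\n

```python
# Toy simulation of Memento-style block eviction (illustrative only;
# names, block sizes, and the 5% compression ratio are assumptions,
# not numbers from the paper).

def run_with_mementos(block_sizes, compress, prompt_len=0):
    """Track live context length as reasoning proceeds block-by-block.

    block_sizes: token count of each thinking block
    compress:    maps a block's token count to its memento's token count
    Returns (peak_context, final_context) in tokens.
    """
    context = prompt_len
    peak = context
    for block in block_sizes:
        memento = compress(block)
        # While the model finishes a block and writes its memento,
        # both the block and the memento are live in the KV cache.
        context += block + memento
        peak = max(peak, context)
        # Once the memento is emitted, the thinking block is masked
        # from attention and its KV cache entries are freed.
        context -= block
    return peak, context

peak, final = run_with_mementos(
    block_sizes=[4000, 3500, 5000, 2500],
    compress=lambda n: n // 20,  # assume mementos are ~5% of a block
    prompt_len=500,
)
flat = 500 + sum([4000, 3500, 5000, 2500])  # flat CoT keeps everything
print(peak, final, flat)  # prints 6125 1250 15500
```

With these assumed numbers, peak context (6,125 tokens) sits at well under half of the flat-CoT length (15,500 tokens), matching the 2\u20133x peak-KV-cache reduction range reported above; the exact ratio depends on block sizes and how aggressively mementos compress.<\/p>\n\n\n\n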
The Problem: LLMs Don’t Know How to Manage Their Context<\/h2>\n\n\n\n