{"id":1030206,"date":"2024-05-08T09:00:00","date_gmt":"2024-05-08T16:00:00","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?p=1030206"},"modified":"2024-05-08T10:52:37","modified_gmt":"2024-05-08T17:52:37","slug":"llm-profiling-guides-kv-cache-optimization","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/llm-profiling-guides-kv-cache-optimization\/","title":{"rendered":"LLM profiling guides KV cache optimization"},"content":{"rendered":"\n

This research paper was presented at the 12th International Conference on Learning Representations (ICLR 2024), the premier conference dedicated to the advancement of deep learning.

\"White<\/figure>\n\n\n\n

Large language models (LLMs) rely on internal mechanisms that often require more memory than is available on standard devices. One such mechanism is the key-value (KV) cache, which stores the keys and values computed for previously processed tokens so the model can generate responses quickly without recalculating them. This speed comes at a cost: keeping all of that data readily accessible consumes a substantial amount of memory, and the KV cache can grow prohibitively large as task complexity increases, sometimes requiring up to 320 GB for a single operation. To address this, we developed FastGen, a novel method for reducing the memory demands of LLMs.
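To make the mechanism concrete, here is a minimal sketch of KV caching during autoregressive decoding. It is simplified to a single attention head in plain NumPy; the function names and shapes are illustrative assumptions, not FastGen's implementation.

```python
# A minimal, illustrative sketch of KV caching during autoregressive decoding.
# Names (decode_step) and single-head shapes are hypothetical simplifications;
# this is not FastGen itself.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def decode_step(x_t, W_q, W_k, W_v, k_cache, v_cache):
    """Process one new token: compute its key/value once, append them to the
    cache, and attend over all cached keys/values instead of recomputing them."""
    q_t = x_t @ W_q                      # query for the new token only
    k_cache.append(x_t @ W_k)            # cache the new key ...
    v_cache.append(x_t @ W_v)            # ... and the new value
    K = np.stack(k_cache)                # (t, d): keys for every token so far
    V = np.stack(v_cache)                # (t, d): values for every token so far
    scores = softmax(q_t @ K.T / np.sqrt(K.shape[-1]))
    return scores @ V                    # attention output for the new token

# Usage: decode a short sequence step by step.
rng = np.random.default_rng(0)
d = 8
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))
k_cache, v_cache = [], []
for _ in range(5):                       # five decoding steps
    x_t = rng.standard_normal(d)         # embedding of the newly generated token
    out = decode_step(x_t, W_q, W_k, W_v, k_cache, v_cache)
```

Because the cache holds one key and one value per token for every layer and attention head, its memory footprint grows with sequence length and model size, which is why it can dominate memory use for long, complex tasks.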
