{"id":1030206,"date":"2024-05-08T09:00:00","date_gmt":"2024-05-08T16:00:00","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?p=1030206"},"modified":"2024-05-08T10:52:37","modified_gmt":"2024-05-08T17:52:37","slug":"llm-profiling-guides-kv-cache-optimization","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/llm-profiling-guides-kv-cache-optimization\/","title":{"rendered":"LLM profiling guides KV cache optimization"},"content":{"rendered":"\n
This research paper was presented at the 12th International Conference on Learning Representations (ICLR 2024), the premier conference dedicated to the advancement of deep learning.

Large language models (LLMs) rely on complex internal mechanisms that require more memory than is typically available on standard devices. One such mechanism is the key-value (KV) cache, which stores and retrieves previously computed data, helping the model generate responses quickly without recalculating information it has already processed. The cache uses a substantial amount of memory because it keeps this data readily accessible to boost the model's speed and efficiency. Consequently, the KV cache can become prohibitively large as task complexity increases, sometimes requiring up to 320 GB for a single operation. To address this, we developed FastGen, a novel method for reducing the memory demands of LLMs.

In our paper, "Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs," presented at ICLR 2024, we describe how FastGen optimizes the way LLMs store and access data, potentially cutting memory use by half while preserving their efficiency. This approach represents a significant step toward making sophisticated AI tools more accessible and affordable for broader applications. We are delighted to share that the paper received an Honorable Mention for the Outstanding Paper Award.

The development of FastGen is underpinned by our observations of how the KV cache functions. We first observed that not all the data in the KV cache is needed for LLMs to complete their required tasks, as shown in Figure 1. By giving the KV cache a mechanism to discard unnecessary data, it is possible to significantly cut memory use. For example, some LLM modules don't require broad contexts to process input; for these, it is possible to construct a KV cache that removes data containing less important long-range contexts, such as several sentences or paragraphs. Other modules attend primarily to special tokens, such as punctuation, for which a KV cache can retain only those tokens. Finally, some modules broadly need all tokens, and for these we can employ the standard KV cache and store all words.

Another key observation in our study is that attention modules in different layers and positions in the LLM behave differently and need different preferences for their KV cache, as shown on the right in Figure 1. The simplified code sketches below illustrate the basic caching mechanism and these per-module compression preferences.
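To make the caching mechanism described above concrete, here is a minimal, self-contained NumPy sketch of a single-head KV cache. It is an illustration of the general technique, not FastGen or a production attention implementation: the dimensions, the random stand-in hidden states, and the class and function names are all hypothetical. It shows why caching avoids recomputing earlier keys and values while its memory footprint grows with every generated token.

```python
# Minimal single-head KV cache sketch (illustrative only; toy dimensions).
# Each decoding step appends the new token's key and value to the cache, so
# attention never recomputes earlier keys/values -- at the cost of memory
# that grows linearly with the number of processed tokens.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

class KVCache:
    def __init__(self, head_dim):
        self.keys = np.empty((0, head_dim))    # grows by one row per token
        self.values = np.empty((0, head_dim))

    def append(self, k, v):
        self.keys = np.vstack([self.keys, k])
        self.values = np.vstack([self.values, v])

def attend(query, cache):
    # Attention over every cached token: scores -> weights -> weighted values.
    scores = cache.keys @ query / np.sqrt(query.shape[-1])
    weights = softmax(scores)
    return weights @ cache.values

# Toy decoding loop: random vectors stand in for real model activations.
head_dim, rng = 64, np.random.default_rng(0)
cache = KVCache(head_dim)
for step in range(8):
    k, v, q = rng.normal(size=(3, head_dim))
    cache.append(k, v)          # memory grows with every generated token
    context = attend(q, cache)  # but no earlier key/value is recomputed
print(cache.keys.shape)  # (8, 64): one cached key per processed token
```

The final print shows an (8, 64) key cache, one row per processed token; this per-token growth across every layer and head is exactly what FastGen's adaptive compression targets.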
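The three cache preferences described above (keep only special tokens, keep only nearby context, keep everything) can be pictured with a second small sketch. Again, this is a hedged illustration rather than FastGen's actual algorithm: the compress_cache function, the special-token ID set, and the window size are invented for the example. FastGen itself selects among such policies based on how each attention module behaves, as the paragraphs above describe.

```python
# Illustrative per-module cache policies (not FastGen's implementation).
# Token IDs, the special-token set, and the window size are placeholders.
import numpy as np

SPECIAL_TOKENS = {0, 1, 2}   # e.g., hypothetical BOS/EOS/punctuation IDs

def compress_cache(keys, values, token_ids, policy, window=4):
    """Return the subset of cached (keys, values) a given module keeps."""
    n = len(token_ids)
    if policy == "special_only":      # module attends mostly to special tokens
        keep = [i for i, t in enumerate(token_ids) if t in SPECIAL_TOKENS]
    elif policy == "local":           # module ignores long-range context
        keep = list(range(max(0, n - window), n))
    else:                             # "full": module needs the whole context
        keep = list(range(n))
    return keys[keep], values[keep]

# Example: a 10-token cache compressed under each policy.
rng = np.random.default_rng(0)
keys = rng.normal(size=(10, 64))
values = rng.normal(size=(10, 64))
token_ids = [0, 17, 42, 2, 99, 101, 1, 55, 64, 2]

for policy in ("special_only", "local", "full"):
    k, v = compress_cache(keys, values, token_ids, policy)
    print(policy, k.shape[0], "of", len(token_ids), "entries kept")
```

Running the example keeps 4, 4, and 10 of the 10 cached entries under the three policies, illustrating how much memory a module that only needs special tokens or local context can give back compared with the standard full cache.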
Observations of the KV cache