{"id":987321,"date":"2023-12-07T09:00:00","date_gmt":"2023-12-07T17:00:00","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?p=987321"},"modified":"2023-12-07T13:47:00","modified_gmt":"2023-12-07T21:47:00","slug":"llmlingua-innovating-llm-efficiency-with-prompt-compression","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/llmlingua-innovating-llm-efficiency-with-prompt-compression\/","title":{"rendered":"LLMLingua: Innovating LLM efficiency with prompt compression"},"content":{"rendered":"\n

This research paper was presented at the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP 2023), the premier conference on natural language processing and artificial intelligence.

\"EMNLP<\/figure>\n\n\n\n

As large language models (LLMs) advance and their potential becomes increasingly apparent, it is becoming clear that the quality of their output is directly related to the nature of the prompt given to them. This has driven the rise of prompting technologies, such as chain-of-thought (CoT) and in-context learning (ICL), which tend to increase prompt length. In some instances, prompts now extend to tens of thousands of tokens, or units of text, and beyond. While longer prompts hold considerable potential, they also introduce a host of issues, such as exceeding the chat window's maximum limit, a reduced capacity for retaining contextual information, and an increase in API costs, both in monetary terms and computational resources.

To address these challenges, we introduce a prompt-compression method in our paper, "LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models," presented at EMNLP 2023. Using a well-trained small language model, such as GPT-2 small or LLaMA-7B, LLMLingua identifies and removes unimportant tokens from prompts. This compression technique enables closed LLMs to run inference on the compressed prompt. Although the token-level compressed prompts may be difficult for humans to understand, they prove highly effective for LLMs. This is illustrated in Figure 1.
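For readers who want to try prompt compression directly, the sketch below shows how the open-source llmlingua Python package is typically used. It assumes the package is installed (pip install llmlingua) and that its PromptCompressor class and compress_prompt method behave as in the project's public examples; exact argument names and defaults may differ across versions.

```python
# Minimal usage sketch (assumes `pip install llmlingua`); argument names
# follow the project's public examples and may differ across versions.
from llmlingua import PromptCompressor

# The small language model used for scoring tokens is loaded internally by
# PromptCompressor; relying on its default checkpoint is an assumption here.
compressor = PromptCompressor()

long_prompt = "..."  # e.g., several in-context examples plus a question
result = compressor.compress_prompt(
    long_prompt,
    instruction="",   # optional task instruction, kept intact
    question="",      # optional question, kept intact
    target_token=200, # rough token budget for the compressed prompt
)

# The compressed prompt is then sent to a closed LLM (e.g., GPT-3.5-Turbo).
print(result["compressed_prompt"])
```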

\"This
Figure 1. LLMLingua\u2019s framework<\/figcaption><\/figure>\n\n\n\n

LLMLingua's method and evaluation

To develop LLMLingua's framework, we employed a budget controller to balance the sensitivities of different modules in the prompt, preserving the language's integrity. Our two-stage process began with coarse-grained prompt compression: we first streamlined the prompt by eliminating certain sentences and then compressed the remaining tokens individually. To preserve coherence, we employed an iterative token-level compression approach, refining the relationships between individual tokens. Additionally, we fine-tuned the smaller model to capture the distribution information of different closed LLMs by aligning it with patterns in the LLMs' generated data. We did this through instruction tuning.
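To make the token-level idea concrete, here is a simplified, illustrative sketch of perplexity-style token scoring with a small causal language model using Hugging Face transformers. It is not the paper's implementation, which works iteratively, segment by segment, under a budget controller; it only conveys the intuition of dropping tokens to which the small model assigns high probability (low surprise). The model name, keep ratio, and one-pass pruning rule are assumptions for illustration.

```python
# Illustrative sketch only: score tokens with a small causal LM and drop the
# most "predictable" ones. LLMLingua itself compresses iteratively under a
# budget controller; this one-pass version just shows the intuition.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # assumption: any small causal LM works for the sketch
tok = AutoTokenizer.from_pretrained(model_name)
lm = AutoModelForCausalLM.from_pretrained(model_name).eval()

def compress(prompt: str, keep_ratio: float = 0.5) -> str:
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = lm(ids).logits
    # Negative log-likelihood of each token given its prefix (token 0 is kept).
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    nll = -log_probs.gather(1, ids[0, 1:].unsqueeze(1)).squeeze(1)
    # Keep the most "surprising" tokens (highest NLL), preserving original order.
    k = max(1, int(keep_ratio * nll.numel()))
    keep = torch.zeros(ids.shape[1], dtype=torch.bool)
    keep[0] = True
    keep[1:][nll.topk(k).indices] = True
    return tok.decode(ids[0][keep])

print(compress("Question: Natalia sold clips to 48 of her friends in April ...", 0.5))
```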

To assess LLMLingua's performance, we tested compressed prompts on four datasets, GSM8K, BBH, ShareGPT, and Arxiv-March23, covering ICL, reasoning, summarization, and conversation. Our approach delivered impressive results, achieving up to 20x compression while preserving the original prompt's capabilities, particularly in ICL and reasoning. LLMLingua also significantly reduced system latency.

During our tests, we used LLaMA-7B as the small language model and GPT-3.5-Turbo-0301, one of OpenAI's LLMs, as the closed LLM. The results show that LLMLingua maintains the original reasoning, summarization, and dialogue capabilities of the prompt, even at a maximum compression ratio of 20x, as reflected in the evaluation metric (EM) columns in Tables 1 and 2. At the same time, other compression methods failed to retain key semantic information in prompts, especially logical reasoning details. For a more in-depth discussion of these results, refer to section 5.2 of the paper.

\"These
Table 1. Performance of different methods at different target compression ratios on the GSM8K and BBH datasets.<\/figcaption><\/figure>\n\n\n\n
\"These
Table 2. Performance of different methods at different target compression ratios for conversation and summarization tasks.<\/figcaption><\/figure>\n\n\n\n

LLMLingua is robust, cost-effective, efficient, and recoverable

LLMLingua also showed impressive results across various small language models and different closed LLMs. When using GPT-2 small, LLMLingua achieved a strong performance score of 76.27 under the ¼-shot constraint, close to LLaMA-7B's result of 77.33 and surpassing the standard prompt result of 74.9. Similarly, even without aligning Claude-v1.3, one of the most powerful LLMs, LLMLingua's score was 82.61 under the ½-shot constraint, outperforming the standard prompt result of 81.8.

LLMLingua also proved effective in reducing response length, leading to significant reductions in latency in the LLM's generation process, with reductions ranging from 20 to 30 percent, as shown in Figure 2.

\"The
Figure 2. The distribution of token lengths generated at varying compression ratios.<\/figcaption><\/figure>\n\n\n\n

What makes LLMLingua even more impressive is its recoverability. When we used GPT-4 to restore the compressed prompts, it successfully recovered all key reasoning information from the full nine-step chain-of-thought (CoT) prompting, which enables LLMs to address problems through sequential intermediate steps. The recovered prompt was almost identical to the original, and its meaning was retained. This is shown in Tables 3 and 4.
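As a rough illustration of this recovery step, the snippet below asks a strong LLM to reconstruct readable text from a compressed prompt via the OpenAI Python client. The instruction wording and model name are our own assumptions for the example; the paper simply reports that GPT-4 can perform this restoration.

```python
# Illustrative only: ask a strong LLM to expand a token-level compressed prompt
# back into fluent text. The instruction wording here is an assumption; the
# paper reports that GPT-4 can restore the key reasoning steps.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
compressed_prompt = "..."  # output of LLMLingua, often hard for humans to read

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "user",
         "content": "Please restore the following compressed text into fluent, "
                    "complete sentences without adding new information:\n\n"
                    + compressed_prompt},
    ],
)
print(response.choices[0].message.content)
```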

\"This
Table 3. Latency comparison on GSM8K. LLMLingua can accelerate LLMs’ end-to-end inference by a factor of 1.7\u20135.7x. <\/figcaption><\/figure>\n\n\n\n
\"This
Table 4. Recovering the compressed prompt from GSM8K using GPT-4.<\/figcaption><\/figure>\n\n\n\n

Enhancing the user experience and looking ahead

LLMLingua is already proving its value through practical application. It has been integrated into LlamaIndex, a widely adopted retrieval-augmented generation (RAG) framework. Currently, we are collaborating with product teams to reduce the number of tokens required in LLM calls, particularly for tasks like multi-document question answering. Here, our goal is to significantly improve the user experience with LLMs.
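To show where compression fits in a RAG workflow of the kind LlamaIndex supports, here is a framework-agnostic sketch: retrieved documents are compressed with LLMLingua before the final LLM call. The retrieval and generation functions are placeholders, and the compress_prompt arguments follow the llmlingua package's public examples; this is not LlamaIndex's actual integration API.

```python
# Framework-agnostic sketch of compression inside a RAG pipeline. The LLM call
# is a placeholder; only the compression step uses the llmlingua package, with
# arguments as in its public examples (they may differ across versions).
from llmlingua import PromptCompressor

compressor = PromptCompressor()

def answer(question: str, retrieved_docs: list[str]) -> str:
    # Compress the retrieved context, keeping the question itself intact.
    result = compressor.compress_prompt(
        retrieved_docs,            # list of retrieved context passages
        question=question,
        target_token=500,          # assumed token budget for the context
    )
    final_prompt = (
        result["compressed_prompt"]
        + "\n\nQuestion: " + question + "\nAnswer:"
    )
    return call_llm(final_prompt)  # placeholder for the closed-LLM call

def call_llm(prompt: str) -> str:
    raise NotImplementedError("Send `prompt` to your LLM of choice here.")
```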

For the long term, we have proposed LongLLMLingua, a prompt-compression technique designed for long-context scenarios, such as retrieval-augmented question-answering tasks in applications like chatbots, where information evolves dynamically over time. It's also geared toward tasks like summarizing online meetings. LongLLMLingua's primary objective is to enhance LLMs' ability to perceive key information, making it suitable for numerous real-world applications, notably information-based chatbots. We're hopeful that this innovation paves the way for more sophisticated and user-friendly interactions with LLMs.

Learn more about our work on the LLMLingua page.
