{"id":987321,"date":"2023-12-07T09:00:00","date_gmt":"2023-12-07T17:00:00","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?p=987321"},"modified":"2023-12-07T13:47:00","modified_gmt":"2023-12-07T21:47:00","slug":"llmlingua-innovating-llm-efficiency-with-prompt-compression","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/llmlingua-innovating-llm-efficiency-with-prompt-compression\/","title":{"rendered":"LLMLingua: Innovating LLM efficiency with prompt compression"},"content":{"rendered":"\n

This research paper was presented at the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP 2023), the premier conference on natural language processing and artificial intelligence.

\"EMNLP<\/figure>\n\n\n\n

As large language models (LLMs) advance and their potential becomes increasingly apparent, it is becoming clear that the quality of their output is directly related to the nature of the prompt given to them. This has driven the rise of prompting technologies, such as chain-of-thought (CoT) and in-context learning (ICL), which tend to increase prompt length. In some instances, prompts now extend to tens of thousands of tokens, or units of text, and beyond. While longer prompts hold considerable potential, they also introduce a host of issues, such as exceeding the chat window's maximum limit, a reduced capacity for retaining contextual information, and an increase in API costs, both in monetary terms and computational resources.

To address these challenges, we introduce a prompt-compression method in our paper, "LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models," presented at EMNLP 2023. Using a well-trained small language model, such as GPT-2 small or LLaMA-7B, LLMLingua identifies and removes unimportant tokens from prompts. This compression technique enables closed LLMs to run inference on the compressed prompt. Although the token-level compressed prompts may be difficult for humans to understand, they prove highly effective for LLMs. This is illustrated in Figure 1.
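For readers who want to try prompt compression directly, the sketch below shows how the open-source llmlingua Python package is typically used. It assumes the package is installed (pip install llmlingua) and that its PromptCompressor class and compress_prompt method behave as in the project's public examples; exact argument names and defaults may differ across versions.

```python
# Minimal usage sketch (assumes `pip install llmlingua`); argument names
# follow the project's public examples and may differ across versions.
from llmlingua import PromptCompressor

# The small language model used for scoring tokens is loaded internally by
# PromptCompressor; relying on its default checkpoint is an assumption here.
compressor = PromptCompressor()

long_prompt = "..."  # e.g., several in-context examples plus a question
result = compressor.compress_prompt(
    long_prompt,
    instruction="",   # optional task instruction, kept intact
    question="",      # optional question, kept intact
    target_token=200, # rough token budget for the compressed prompt
)

# The compressed prompt is then sent to a closed LLM (e.g., GPT-3.5-Turbo).
print(result["compressed_prompt"])
```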

\"This
Figure 1. LLMLingua\u2019s framework<\/figcaption><\/figure>\n\n\n\n

LLMLingua's method and evaluation

To develop LLMLingua's framework, we employed a budget controller to balance the sensitivities of different modules in the prompt, preserving the language's integrity. Our two-stage process began with coarse-grained prompt compression: we first streamlined the prompt by eliminating certain sentences and then compressed the remaining tokens individually. To preserve coherence, we employed an iterative token-level compression approach, refining the relationships between individual tokens. Additionally, we fine-tuned the smaller model to capture the distribution information of different closed LLMs by aligning it with patterns in the LLMs' generated data. We did this through instruction tuning.
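To make the token-level idea concrete, here is a simplified, illustrative sketch of perplexity-style token scoring with a small causal language model using Hugging Face transformers. It is not the paper's implementation, which works iteratively, segment by segment, under a budget controller; it only conveys the intuition of dropping tokens to which the small model assigns high probability (low surprise). The model name, keep ratio, and one-pass pruning rule are assumptions for illustration.

```python
# Illustrative sketch only: score tokens with a small causal LM and drop the
# most "predictable" ones. LLMLingua itself compresses iteratively under a
# budget controller; this one-pass version just shows the intuition.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # assumption: any small causal LM works for the sketch
tok = AutoTokenizer.from_pretrained(model_name)
lm = AutoModelForCausalLM.from_pretrained(model_name).eval()

def compress(prompt: str, keep_ratio: float = 0.5) -> str:
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = lm(ids).logits
    # Negative log-likelihood of each token given its prefix (token 0 is kept).
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    nll = -log_probs.gather(1, ids[0, 1:].unsqueeze(1)).squeeze(1)
    # Keep the most "surprising" tokens (highest NLL), preserving original order.
    k = max(1, int(keep_ratio * nll.numel()))
    keep = torch.zeros(ids.shape[1], dtype=torch.bool)
    keep[0] = True
    keep[1:][nll.topk(k).indices] = True
    return tok.decode(ids[0][keep])

print(compress("Question: Natalia sold clips to 48 of her friends in April ...", 0.5))
```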

To assess LLMLingua's performance, we tested compressed prompts on four datasets, GSM8K, BBH, ShareGPT, and Arxiv-March23, covering ICL, reasoning, summarization, and conversation. Our approach delivered impressive results, achieving up to 20x compression while preserving the original prompt's capabilities, particularly in ICL and reasoning. LLMLingua also significantly reduced system latency.

During our tests, we used LLaMA-7B as the small language model and GPT-3.5-Turbo-0301, one of OpenAI's LLMs, as the closed LLM. The results show that LLMLingua maintains the original reasoning, summarization, and dialogue capabilities of the prompt, even at a maximum compression ratio of 20x, as reflected in the evaluation metric (EM) columns in Tables 1 and 2. At the same time, other compression methods failed to retain key semantic information in prompts, especially logical reasoning details. For a more in-depth discussion of these results, refer to section 5.2 of the paper.

\"These
Table 1. Performance of different methods at different target compression ratios on the GSM8K and BBH datasets.<\/figcaption><\/figure>\n\n\n\n
\"These
Table 2. Performance of different methods at different target compression ratios for conversation and summarization tasks.<\/figcaption><\/figure>\n\n\n\n

LLMLingua is robust, cost-effective, efficient, and recoverable

LLMLingua also showed impressive results across various small language models and different closed LLMs. When using GPT-2 small, LLMLingua achieved a strong performance score of 76.27 under the ¼-shot constraint, close to LLaMA-7B's result of 77.33 and surpassing the standard prompt result of 74.9. Similarly, even without aligning Claude-v1.3, one of the most powerful LLMs, LLMLingua's score was 82.61 under the ½-shot constraint, outperforming the standard prompt result of 81.8.

LLMLingua also proved effective in reducing response length, leading to significant reductions in latency in the LLM's generation process, with reductions ranging from 20 to 30 percent, as shown in Figure 2.

\"The
Figure 2. The distribution of token lengths generated at varying compression ratios.<\/figcaption><\/figure>\n\n\n\n

What makes LLMLingua even more impressive is its recoverability. When we used GPT-4 to restore the compressed prompts, it successfully recovered all key reasoning information from the full nine-step chain-of-thought (CoT) prompting, which enables LLMs to address problems through sequential intermediate steps. The recovered prompt was almost identical to the original, and its meaning was retained. This is shown in Tables 3 and 4.
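As a rough illustration of this recovery step, the snippet below asks a strong LLM to reconstruct readable text from a compressed prompt via the OpenAI Python client. The instruction wording and model name are our own assumptions for the example; the paper simply reports that GPT-4 can perform this restoration.

```python
# Illustrative only: ask a strong LLM to expand a token-level compressed prompt
# back into fluent text. The instruction wording here is an assumption; the
# paper reports that GPT-4 can restore the key reasoning steps.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
compressed_prompt = "..."  # output of LLMLingua, often hard for humans to read

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "user",
         "content": "Please restore the following compressed text into fluent, "
                    "complete sentences without adding new information:\n\n"
                    + compressed_prompt},
    ],
)
print(response.choices[0].message.content)
```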

\"This
Table 3. Latency comparison on GSM8K. LLMLingua can accelerate LLMs’ end-to-end inference by a factor of 1.7\u20135.7x. <\/figcaption><\/figure>\n\n\n\n
\"This
Table 4. Recovering the compressed prompt from GSM8K using GPT-4.<\/figcaption><\/figure>\n\n\n\n

Enhancing the user experience and looking ahead

LLMLingua is already proving its value through practical application. It has been integrated into LlamaIndex, a widely adopted retrieval-augmented generation (RAG) framework. Currently, we are collaborating with product teams to reduce the number of tokens required in LLM calls, particularly for tasks like multi-document question answering. Here, our goal is to significantly improve the user experience with LLMs.
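To show where compression fits in a RAG workflow of the kind LlamaIndex supports, here is a framework-agnostic sketch: retrieved documents are compressed with LLMLingua before the final LLM call. The retrieval and generation functions are placeholders, and the compress_prompt arguments follow the llmlingua package's public examples; this is not LlamaIndex's actual integration API.

```python
# Framework-agnostic sketch of compression inside a RAG pipeline. The LLM call
# is a placeholder; only the compression step uses the llmlingua package, with
# arguments as in its public examples (they may differ across versions).
from llmlingua import PromptCompressor

compressor = PromptCompressor()

def answer(question: str, retrieved_docs: list[str]) -> str:
    # Compress the retrieved context, keeping the question itself intact.
    result = compressor.compress_prompt(
        retrieved_docs,            # list of retrieved context passages
        question=question,
        target_token=500,          # assumed token budget for the context
    )
    final_prompt = (
        result["compressed_prompt"]
        + "\n\nQuestion: " + question + "\nAnswer:"
    )
    return call_llm(final_prompt)  # placeholder for the closed-LLM call

def call_llm(prompt: str) -> str:
    raise NotImplementedError("Send `prompt` to your LLM of choice here.")
```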

For the long term, we have proposed LongLLMLingua, a prompt-compression technique designed for long-context scenarios, such as retrieval-augmented question-answering tasks in applications like chatbots, where information evolves dynamically over time. It's also geared toward tasks like summarizing online meetings. LongLLMLingua's primary objective is to enhance LLMs' ability to perceive key information, making it suitable for numerous real-world applications, notably information-based chatbots. We're hopeful that this innovation paves the way for more sophisticated and user-friendly interactions with LLMs.

Learn more about our work on the LLMLingua page.
