Effectively Deliver Information to LLMs via Prompt Compression
Inspired by the concept of “LLMs as Compressors,” we designed a series of works that build a compressed language for LLMs via prompt compression. This approach accelerates model inference, reduces costs, and improves downstream performance, while revealing patterns in LLM context utilization and intelligence. LLMLingua achieves up to a 20x compression ratio with minimal performance loss, and LongLLMLingua delivers a 17.1% performance improvement at 4x compression. LLMLingua-2, a small yet powerful prompt compression model trained via data distillation from GPT-4 for token classification with a BERT-level encoder, excels at task-agnostic compression: it surpasses LLMLingua on out-of-domain data while running 3x-6x faster.
This page is for research demonstration purposes only.
For more details, please refer to the project pages: LLMLingua, LongLLMLingua, and LLMLingua-2.
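All three methods are released in the open-source llmlingua Python package (pip install llmlingua). The snippet below is a minimal usage sketch that assumes the PromptCompressor interface shown in the project's published examples; argument names such as target_token may differ between releases.

```python
# pip install llmlingua
from llmlingua import PromptCompressor

# Loads a small causal language model used to score token importance for compression.
llm_lingua = PromptCompressor()

result = llm_lingua.compress_prompt(
    context=["<long context: retrieved documents, CoT demonstrations, ...>"],
    instruction="Answer the question based on the given context.",
    question="What is prompt compression?",
    target_token=200,  # rough token budget for the compressed prompt
)

# Send the compressed prompt to the black-box LLM instead of the original one.
print(result["compressed_prompt"])
print(result["origin_tokens"], "->", result["compressed_tokens"])
```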
LLMLingua

Paper: https://arxiv.org/abs/2310.05736

Demo: https://huggingface.co/spaces/microsoft/LLMLingua

Project Page: https://llmlingua.com/llmlingua.html
For more details, please refer to the LLMLingua paper.
Building on this intuition, LLMLingua uses a small language model's perplexity to measure redundancy within a prompt. It consists of three modules that assign different compression rates to different segments of the prompt, taking the conditional dependencies between compressed tokens and the remaining tokens into account so that the compressed prompt better preserves the original distribution. Moreover, to make the small model better attuned to various black-box models, LLMLingua introduces an alignment mechanism that brings the small model closer to the semantic distribution of the target LLM.
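The sketch below illustrates only the underlying signal: a small causal LM assigns each token a self-information score (negative log-likelihood), and the most predictable tokens are dropped first. This is a simplified, hypothetical illustration written against Hugging Face transformers, not the actual LLMLingua implementation, which additionally performs budget control, iterative segment-wise compression, and distribution alignment.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Any small causal LM works for this illustration; gpt2 keeps the example lightweight.
MODEL = "gpt2"
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL).eval()

def token_self_information(text: str):
    """Return (token, -log p(token | prefix)) pairs scored by the small model."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Logits at position i predict token i + 1, so shift by one.
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
    nll = -logprobs.gather(1, ids[0, 1:, None]).squeeze(1)
    tokens = tok.convert_ids_to_tokens(ids[0, 1:].tolist())
    return list(zip(tokens, nll.tolist()))

def naive_compress(text: str, keep_ratio: float = 0.5) -> str:
    """Drop the most predictable (lowest-information) tokens."""
    scored = token_self_information(text)
    k = max(1, int(len(scored) * keep_ratio))
    threshold = sorted(score for _, score in scored)[-k]
    kept = [tok_str for tok_str, score in scored if score >= threshold]
    return tok.convert_tokens_to_string(kept)

print(naive_compress("Answer the question step by step, and please make sure "
                     "that the final answer is placed on the last line.", 0.5))
```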
LLMLingua offers several advantages, most notably up to a 20x compression ratio with minimal loss in downstream task performance.
BibTeX

```bibtex
@inproceedings{jiang-etal-2023-llmlingua,
    title = "{LLML}ingua: Compressing Prompts for Accelerated Inference of Large Language Models",
    author = "Huiqiang Jiang and Qianhui Wu and Chin-Yew Lin and Yuqing Yang and Lili Qiu",
    editor = "Bouamor, Houda and Pino, Juan and Bali, Kalika",
    booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing",
    month = dec,
    year = "2023",
    address = "Singapore",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.emnlp-main.825",
    doi = "10.18653/v1/2023.emnlp-main.825",
    pages = "13358--13376",
}
```
LongLLMLingua

Paper: https://arxiv.org/abs/2310.06839

Project Page: https://llmlingua.com/longllmlingua.html
In long-context scenarios, large language models face three main challenges: higher computational cost, performance degradation, and position bias. Research indicates that LLM performance hinges on the density and position of key information in the input prompt. Inspired by these findings, we propose LongLLMLingua, a prompt compression method that improves LLMs' perception of key information and addresses all three challenges at once. Our extensive evaluation across various long-context scenarios demonstrates that LongLLMLingua not only enhances performance but also significantly reduces cost and latency. For instance, on the NaturalQuestions benchmark, LongLLMLingua boosts GPT-3.5-Turbo performance by up to 21.4% with around 4x fewer tokens, leading to substantial cost savings, and it achieves a 94.0% cost reduction on the LooGLE benchmark. Moreover, when compressing prompts of about 10k tokens at ratios of 2x-6x, LongLLMLingua accelerates end-to-end latency by 1.4x-2.6x.
Insights
- Natural language is redundant, and the amount of information varies across tokens.
- LLMs can understand compressed prompts.
- There is a trade-off between language completeness and compression ratio. (LLMLingua)
- GPT-4 can recover all the key information from a compressed prompt, an emergent ability. (LLMLingua)
- The density and position of key information in a prompt affect the performance of downstream tasks. (LongLLMLingua)
For more details, please refer to the LongLLMLingua paper.
Why LongLLMLingua?
In long-context scenarios, the distribution of key information is generally very sparse. Previous work has found that the density and placement of relevant information significantly impact the performance of large language models, even for highly capable models like GPT-4-Turbo. LongLLMLingua capitalizes on these distribution characteristics through prompt compression and reorganization. This strategy schedules and uses the limited but powerful context window of LLMs more efficiently, effectively mitigating the “lost in the middle” issue. As a result, LongLLMLingua achieves up to a 21.4% improvement on the NQ multi-document QA task while using only 1/4 of the tokens.
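In the llmlingua package, this question-aware, coarse-to-fine behaviour is exposed through extra arguments to compress_prompt. The sketch below mirrors the arguments shown in the project's published examples (e.g. rank_method="longllmlingua", reorder_context, dynamic_context_compression_ratio); treat the exact names and defaults as version-dependent assumptions.

```python
from llmlingua import PromptCompressor

llm_lingua = PromptCompressor()  # small causal LM used to score tokens and documents

documents = ["<doc 1>", "<doc 2>", "<doc 3>"]  # e.g. retrieved passages
question = "Who proposed the theory of general relativity?"

result = llm_lingua.compress_prompt(
    documents,
    question=question,
    target_token=500,                         # overall token budget
    rank_method="longllmlingua",              # question-aware coarse-grained ranking
    reorder_context="sort",                   # put important documents first (position bias)
    dynamic_context_compression_ratio=0.3,    # give higher-ranked documents more budget
    condition_in_question="after_condition",  # question-aware fine-grained scoring
    condition_compare=True,                   # use contrastive perplexity
)
print(result["compressed_prompt"])
```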
Our main contributions are five-fold:
- We propose a question-aware coarse-to-fine compression method to improve the density of key information in the prompt.
- We introduce a document reordering strategy to minimize position bias in LLMs (see the sketch after this list).
- We establish dynamic compression ratios for precise control between coarse and fine compression levels.
- We propose a post-compression subsequence recovery strategy to improve the integrity of the key information.
- We evaluate LongLLMLingua across five benchmarks, i.e., NaturalQuestions, LongBench, ZeroSCROLLS, MuSiQue, and LooGLE, covering a variety of long-context scenarios. Experimental results reveal that LongLLMLingua's compressed prompts outperform the original prompts in terms of performance, cost efficiency, and system latency.
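As a purely illustrative sketch (not the paper's exact allocation scheme), the reordering and dynamic-ratio ideas can be pictured as follows: rank documents by a question-aware relevance score, place the most relevant ones first to counter position bias, and give higher-ranked documents a larger share of the token budget. All names below are hypothetical.

```python
def reorder_and_budget(documents, relevance, total_budget, min_share=0.05):
    """Illustrative only: sort documents by question-aware relevance (most relevant
    first, countering 'lost in the middle') and assign each a token budget that
    decays with rank (a simple stand-in for dynamic compression ratios)."""
    ranked = sorted(zip(documents, relevance), key=lambda p: p[1], reverse=True)
    weights = [len(ranked) - i for i in range(len(ranked))]  # rank 0 gets the most
    total_w = sum(weights)
    return [
        {
            "document": doc,
            "relevance": rel,
            "token_budget": max(int(total_budget * w / total_w),
                                int(total_budget * min_share)),
        }
        for (doc, rel), w in zip(ranked, weights)
    ]

# Relevance scores could come from the question-aware coarse-grained measure.
plan = reorder_and_budget(["doc A", "doc B", "doc C"], [0.12, 0.87, 0.45], total_budget=600)
for entry in plan:
    print(entry["document"], entry["relevance"], entry["token_budget"])
```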
Empirical Studies of Question-aware Compression
To test the effectiveness of our question-aware coarse-grained and fine-grained compression methods, we conducted an empirical study along two dimensions. First, we analyzed the question-aware coarse-grained approach by comparing it with several state-of-the-art (SoTA) retrieval methods in real Retrieval-Augmented Generation (RAG) scenarios. Our method not only surpasses traditional retrieval methods such as BM25 and Gzip, but also outperforms embedding-based methods such as OpenAI embeddings, Jina, and BGE, as well as reranker methods such as the Cohere reranker and BGE-Reranker. Second, we assessed the question-aware fine-grained approach by comparing perplexity and contrastive perplexity across various document-context scenarios. Contrastive perplexity effectively captures key information in documents, while plain perplexity struggles to identify relevant information.
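The sketch below conveys the contrast being measured: the question is prepended as a condition, and a token's importance is taken as the drop in its surprisal once the question is known. This is a simplified, hypothetical illustration; the actual LongLLMLingua formulation differs in detail (e.g. where the question is placed and how scores are aggregated).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"  # small causal LM; gpt2 adds no special tokens, which keeps indexing simple
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL).eval()

def doc_token_nll(doc: str, condition: str = ""):
    """Per-token negative log-likelihood of `doc`, optionally conditioned on `condition`.
    The first document token is skipped so scores align with and without the condition."""
    cond_ids = tok(condition).input_ids if condition else []
    doc_ids = tok(doc).input_ids
    ids = torch.tensor([cond_ids + doc_ids])
    with torch.no_grad():
        logits = model(ids).logits
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
    nll = -logprobs.gather(1, ids[0, 1:, None]).squeeze(1)
    return nll[len(cond_ids):], tok.convert_ids_to_tokens(doc_ids[1:])

doc = "The Eiffel Tower was completed in 1889 and is about 330 metres tall."
question = "How tall is the Eiffel Tower?\n"

plain_nll, tokens = doc_token_nll(doc)                # surprisal without the question
cond_nll, _ = doc_token_nll(doc, condition=question)  # surprisal given the question
contrastive = plain_nll - cond_nll  # large => token becomes far more predictable once
                                    # the question is known, i.e. question-relevant

for t, s in sorted(zip(tokens, contrastive.tolist()), key=lambda p: -p[1])[:5]:
    print(t, round(s, 2))
```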
BibTeX
```bibtex
@inproceedings{jiang-etal-2024-longllmlingua,
    title = "{L}ong{LLML}ingua: Accelerating and Enhancing {LLM}s in Long Context Scenarios via Prompt Compression",
    author = "Huiqiang Jiang and Qianhui Wu and Xufang Luo and Dongsheng Li and Chin-Yew Lin and Yuqing Yang and Lili Qiu",
    editor = "Ku, Lun-Wei and Martins, Andre and Srikumar, Vivek",
    booktitle = "Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = aug,
    year = "2024",
    address = "Bangkok, Thailand",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.acl-long.91",
    pages = "1658--1677",
}
```
LLMLingua-2

Paper: https://arxiv.org/abs/2403.12968

Project Page: https://llmlingua.com/llmlingua2.html

Demo: https://huggingface.co/spaces/microsoft/llmlingua-2
Why LLMLingua-2?
Challenges encountered in information-entropy-based methods:
- 🤔 Perplexity or information entropy may be suboptimal for prompt trimming: it is not aligned with the prompt compression objective.
- 🤖 How can we identify or build a suitable dataset to align a small language model (SLM) towards effective prompt compression?
- ➡️ The importance of a token is context-dependent. Causal LMs only leverage unidirectional context, which may fail to capture all essential information within the context.
- 🔄 How can we design a compression algorithm that effectively leverages the full bidirectional context? (See the sketch below.)
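LLMLingua-2 answers these questions by casting compression as token classification with a BERT-level bidirectional encoder trained on GPT-4-distilled data. The following sketch shows how it is invoked through the llmlingua package; the model name and the use_llmlingua2 flag follow the project's published examples and may change between releases.

```python
from llmlingua import PromptCompressor

# Bidirectional encoder that classifies every token as "keep" or "drop".
compressor = PromptCompressor(
    model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank",
    use_llmlingua2=True,
)

result = compressor.compress_prompt(
    "<your long prompt: a meeting transcript, retrieved passages, ...>",
    rate=0.33,                 # keep roughly one third of the tokens
    force_tokens=["\n", "?"],  # tokens that must never be dropped
)
print(result["compressed_prompt"])
```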
Why Data Distillation?
Shortcomings of existing text compression datasets:
- 😢 Most text compression datasets are abstractive, which leads to a slow autoregressive decoding process and may produce hallucinated content.
- 🤷‍♂️ Extractive compression datasets such as SentComp and DebateSum are mainly created for the summarization task and often lack detailed information. For prompt compression, we should retain as much essential information as possible.
BibTeX
```bibtex
@inproceedings{pan-etal-2024-llmlingua,
    title = "{LLML}ingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression",
    author = "Zhuoshi Pan and Qianhui Wu and Huiqiang Jiang and Menglin Xia and Xufang Luo and Jue Zhang and Qingwei Lin and Victor Ruhle and Yuqing Yang and Chin-Yew Lin and H. Vicky Zhao and Lili Qiu and Dongmei Zhang",
    editor = "Ku, Lun-Wei and Martins, Andre and Srikumar, Vivek",
    booktitle = "Findings of the Association for Computational Linguistics ACL 2024",
    month = aug,
    year = "2024",
    address = "Bangkok, Thailand and virtual meeting",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.findings-acl.57",
    pages = "963--981",
}
```