{"id":978333,"date":"2023-10-23T01:48:37","date_gmt":"2023-10-23T08:48:37","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-project&p=978333"},"modified":"2024-08-21T23:08:14","modified_gmt":"2024-08-22T06:08:14","slug":"llmlingua","status":"publish","type":"msr-project","link":"https:\/\/www.microsoft.com\/en-us\/research\/project\/llmlingua\/","title":{"rendered":"LLMLingua Series"},"content":{"rendered":"
\n\t
\n\t\t
\n\t\t\t\"background\t\t<\/div>\n\t\t\n\t\t
\n\t\t\t\n\t\t\t
\n\t\t\t\t\n\t\t\t\t
\n\t\t\t\t\t\n\t\t\t\t\t
\n\t\t\t\t\t\t
\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\n\n

LLMLingua<\/h1>\n\n\n\n

Effectively Deliver Information to LLMs via Prompt Compression<\/strong><\/p>\n\n\t\t\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t<\/div>\n\t\t<\/div>\n\t<\/div>\n<\/section>\n\n\n\n\n\n

<h3>LLMLingua</h3>

<p>Identify and remove non-essential tokens in prompts using perplexity scores from a small language model (SLM); an illustrative sketch of this idea follows the overview below.</p>

<h3>LongLLMLingua</h3>

<p>Enhance long-context information via query-aware compression and reorganization.</p>

<h3>LLMLingua-2</h3>

<p>Use data distillation to learn compression targets for efficient and faithful task-agnostic compression.</p>
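<p>The LLMLingua entry above relies on a simple mechanism: a small causal language model scores every prompt token, and the tokens it finds easiest to predict (lowest surprisal) are dropped. The snippet below is a minimal, illustrative sketch of that idea rather than the project's actual implementation; the use of GPT-2 as the scoring model, the keep ratio, and the helper name are assumptions made for the example.</p>

<pre><code>
# Illustrative sketch of perplexity-based prompt compression (not the official
# LLMLingua code). A small causal LM scores each token's surprisal; the most
# predictable tokens are dropped. GPT-2 and keep_ratio=0.5 are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def compress_by_surprisal(prompt, keep_ratio=0.5):
    """Keep only the tokens the small LM finds hardest to predict."""
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(input_ids).logits
    # Surprisal of each token given its left context (token 0 has no context).
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    surprisal = -log_probs.gather(1, input_ids[0, 1:].unsqueeze(1)).squeeze(1)
    k = max(1, int(surprisal.numel() * keep_ratio))
    keep = torch.topk(surprisal, k).indices.sort().values + 1  # shift past token 0
    kept_ids = [input_ids[0, 0].item()] + input_ids[0, keep].tolist()
    return tokenizer.decode(kept_ids)

print(compress_by_surprisal("Please answer the question using the passage provided below."))
</code></pre>

<p>The released methods are considerably more sophisticated than this sketch (coarse-to-fine budget allocation and iterative token-level compression, for example), but the surprisal signal above is the core ingredient.</p>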

<p>Large language models (LLMs) have demonstrated remarkable capabilities and have been applied across various fields. Advances in techniques such as Chain-of-Thought (CoT), In-Context Learning (ICL), and Retrieval-Augmented Generation (RAG) have led to increasingly lengthy prompts, sometimes exceeding tens of thousands of tokens. Longer prompts, however, can result in (1) increased API response latency, (2) exceeded context window limits, (3) loss of contextual information, (4) expensive API bills, and (5) performance issues such as “lost in the middle.”</p>

<p>Inspired by the concept of “LLMs as Compressors,” we designed a series of works that aim to build a language for LLMs via prompt compression. This approach accelerates model inference, reduces costs, and improves downstream performance, while also revealing patterns in how LLMs utilize context. <strong>LLMLingua</strong> achieves up to a <em>20x compression ratio</em> with minimal performance loss, and <strong>LongLLMLingua</strong> delivers a 17.1% performance improvement at <em>4x compression</em>. <strong>LLMLingua-2</strong>, a small yet powerful prompt compression model trained via data distillation from GPT-4 as a token-classification task on a BERT-level encoder, excels at task-agnostic compression: it surpasses LLMLingua on out-of-domain data while running 3x-6x faster.</p>
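<p>At the usage level, these methods are packaged in the open-source <strong>llmlingua</strong> Python library. The snippet below is a hedged sketch following the project’s public README; exact argument names and default models can differ across releases, and the documents, question, and token budget shown are placeholders.</p>

<pre><code>
# Usage sketch based on the open-source `llmlingua` package (pip install llmlingua).
# Argument names follow the public README and may differ between releases.
from llmlingua import PromptCompressor

compressor = PromptCompressor()  # loads a small language model by default
result = compressor.compress_prompt(
    ["First long retrieved document ...", "Second long retrieved document ..."],
    instruction="Answer the question using the documents.",
    question="What does LLMLingua do?",  # enables query-aware (LongLLMLingua-style) scoring
    target_token=300,                    # rough budget for the compressed prompt
)
print(result["compressed_prompt"])

# LLMLingua-2 variant (task-agnostic, BERT-level encoder). The model name and
# use_llmlingua2 flag below also follow the README and are assumptions here.
compressor2 = PromptCompressor(
    model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank",
    use_llmlingua2=True,
)
result2 = compressor2.compress_prompt("A long meeting transcript ...", rate=0.33)
print(result2["compressed_prompt"])
</code></pre>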

<p>This page is for <strong>research demonstration purposes</strong> only.</p>

<p>If you are interested in our ideas, please feel free to use <strong>LLMLingua</strong> and get in touch with us.</p>

<h3><strong>News</strong></h3>