{"id":983583,"date":"2023-11-17T11:07:15","date_gmt":"2023-11-17T19:07:15","guid":{"rendered":""},"modified":"2024-01-17T12:20:28","modified_gmt":"2024-01-17T20:20:28","slug":"skeleton-of-thought-parallel-decoding-speeds-up-and-improves-llm-output","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/skeleton-of-thought-parallel-decoding-speeds-up-and-improves-llm-output\/","title":{"rendered":"Skeleton-of-Thought: Parallel decoding speeds up and improves LLM output"},"content":{"rendered":"\n
\"A<\/figure>\n\n\n\n

<p><strong><em>This research was accepted by the 2024 International Conference on Learning Representations.</em></strong></p>

<p>Large language models (LLMs) such as LLaMA and OpenAI&rsquo;s GPT-4 are revolutionizing technology. However, one of the common complaints about LLMs is their speed, <em>or lack thereof</em>. In many cases, it takes a long time to get an answer from them. This limits LLMs&rsquo; applications and their usefulness in latency-critical functions, such as chatbots, copilots, and industrial controllers.</p>

<p><strong>Publication:</strong> Skeleton-of-Thought: Large Language Models Can Do Parallel Decoding</p>

<p>To address this problem, researchers from Microsoft Research and Tsinghua University proposed Skeleton-of-Thought (SoT), a new approach to accelerate the generation of LLMs. Unlike most prior methods, which require modifications to the model, system, or hardware, SoT treats LLMs as black boxes and can therefore be applied to any off-the-shelf open-source model (e.g., LLaMA) or even API-based model (e.g., OpenAI&rsquo;s GPT-4). Our evaluation shows that not only does SoT considerably accelerate content generation across the 12 LLMs examined, it can also improve answer quality in some cases. For example, on OpenAI&rsquo;s GPT-3.5 and GPT-4, SoT provides a 2x speed-up while improving answer quality on benchmark datasets.</p>

<p>Our code and demo are open-sourced at <a href="https://github.com/imagination-research/sot/">https://github.com/imagination-research/sot/</a>.</p>

\"SoT
Figure 1: Compared to the vanilla approach (left), SoT (right) provides 3.72x speed-up on answering the question: \u201cHow can I improve my time management techniques?\u201d with LLaMA-2-7b model on one NVIDIA A100 GPU.<\/figcaption><\/figure>\n\n\n\n

<h2>SoT: Encouraging structured thinking in LLMs</h2>

<p>The idea of SoT stems from the difference in how LLMs and humans process information. LLMs generate answers <em>sequentially</em>. For example, to answer <em>&ldquo;How can I improve my time management techniques?&rdquo;</em> in Figure 1 (left), the LLM finishes one point before moving to the next. In contrast, humans do not always think about questions and write answers sequentially. In many cases, humans first derive the skeleton of the answer and then add details to explain each point. For example, to answer the same question in Figure 1, a person might first think of a list of relevant time management techniques before digging into the details of each. This is especially true for tasks like offering consultancy, taking tests, and writing papers.</p>

<p>Can we make LLMs process information more dynamically and less linearly? As illustrated in Figure 2, SoT does the trick. Instead of generating the answer sequentially, SoT decomposes generation into two stages: (1) SoT first asks the LLM to derive a skeleton of the answer, and then (2) asks the LLM to expand each point in the skeleton. This opens a new opportunity for acceleration, because the answers to the separate points in stage 2 can be generated in parallel. This works both for local models whose weights are accessible to users (e.g., LLaMA) and for API-based models that can only be accessed through APIs (e.g., OpenAI&rsquo;s GPT-4).</p>
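<p>The two-stage pipeline above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the released implementation: <code>llm</code> is a stand-in for any black-box completion call (local model or API), and the prompt wording only approximates the templates used in the paper.</p>

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical prompt templates, loosely modeled on the SoT idea.
SKELETON_PROMPT = (
    "You are an organizer. Give only the skeleton (not the full content) "
    "for answering the question, as a short numbered list of points.\n"
    "Question: {question}\nSkeleton:"
)
POINT_PROMPT = (
    "Continue and only continue the writing of point {index}: {point}\n"
    "Question: {question}\nWrite 1-2 sentences."
)

def parse_skeleton(text):
    """Extract 'N. point' style lines from the stage-1 skeleton answer."""
    points = []
    for line in text.splitlines():
        line = line.strip()
        if line and line[0].isdigit():
            # Drop the leading "N." numbering, keep the point text.
            points.append(line.split(".", 1)[-1].strip())
    return points

def skeleton_of_thought(llm, question, max_workers=8):
    # Stage 1: one sequential call produces the skeleton of the answer.
    skeleton = llm(SKELETON_PROMPT.format(question=question))
    points = parse_skeleton(skeleton)
    # Stage 2: expand every point in parallel. For API-based models this is
    # simply concurrent requests; for local models it can be a batched decode.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        expansions = list(pool.map(
            lambda ip: llm(POINT_PROMPT.format(
                index=ip[0], point=ip[1], question=question)),
            enumerate(points, 1),
        ))
    # Reassemble the final answer in skeleton order.
    return "\n".join(
        f"{i}. {point} {detail}"
        for i, (point, detail) in enumerate(zip(points, expansions), 1)
    )
```

<p>Because stage 1 is short and stage 2 runs concurrently, end-to-end latency is roughly the skeleton time plus the time of the single longest point expansion, rather than the sum of all points, which is where the observed speed-ups come from.</p>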