{"id":983583,"date":"2023-11-17T11:07:15","date_gmt":"2023-11-17T19:07:15","guid":{"rendered":""},"modified":"2024-01-17T12:20:28","modified_gmt":"2024-01-17T20:20:28","slug":"skeleton-of-thought-parallel-decoding-speeds-up-and-improves-llm-output","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/skeleton-of-thought-parallel-decoding-speeds-up-and-improves-llm-output\/","title":{"rendered":"Skeleton-of-Thought: Parallel decoding speeds up and improves LLM output"},"content":{"rendered":"\n
\"A<\/figure>\n\n\n\n

<p><strong><em>This research was accepted by the 2024 International Conference on Learning Representations.</em></strong></p>

<p>Large language models (LLMs) such as LLaMA and OpenAI&rsquo;s GPT-4 are revolutionizing technology. However, one of the common complaints about LLMs is their speed, <em>or lack thereof</em>. In many cases, it takes a long time to get an answer from them. This limits LLMs&rsquo; applications and their usefulness in latency-critical functions, such as chatbots, copilots, and industrial controllers.</p>

<p><strong>Publication:</strong> Skeleton-of-Thought: Large Language Models Can Do Parallel Decoding</p>

<p>To address this problem, researchers from Microsoft Research and Tsinghua University proposed Skeleton-of-Thought (SoT), a new approach to accelerate the generation of LLMs. Unlike most prior methods, which require modifications to the model, system, or hardware, SoT treats LLMs as black boxes and can therefore be applied to any off-the-shelf open-source model (e.g., LLaMA) or even API-based model (e.g., OpenAI&rsquo;s GPT-4). Our evaluation shows that not only does SoT considerably accelerate content generation across the 12 LLMs examined, it can also improve answer quality in some cases. For example, on OpenAI&rsquo;s GPT-3.5 and GPT-4, SoT provides a 2x speed-up while improving answer quality on benchmark datasets.</p>

<p>Our code and demo are open-sourced at <a href="https://github.com/imagination-research/sot/">https://github.com/imagination-research/sot/</a>.</p>

\"SoT
Figure 1: Compared to the vanilla approach (left), SoT (right) provides 3.72x speed-up on answering the question: \u201cHow can I improve my time management techniques?\u201d with LLaMA-2-7b model on one NVIDIA A100 GPU.<\/figcaption><\/figure>\n\n\n\n

<h2>SoT: Encouraging structured thinking in LLMs</h2>

<p>The idea of SoT stems from the difference in how LLMs and humans process information. LLMs generate answers <em>sequentially</em>. For example, to answer <em>&ldquo;How can I improve my time management techniques?&rdquo;</em> in Figure 1 (left), the LLM finishes one point before moving to the next. In contrast, humans do not always think about questions and write answers sequentially. In many cases, humans first derive the skeleton of the answer and then add details to explain each point. For example, to answer the same question in Figure 1, a person might first think of a list of relevant time management techniques before digging into the details of each. This is especially true for tasks like offering consultancy, taking tests, and writing papers.</p>

<p>Can we make LLMs process information more dynamically and less linearly? As illustrated in Figure 2, SoT does the trick. Instead of generating the answer sequentially, SoT decomposes generation into two stages: (1) SoT first asks the LLM to derive a skeleton of the answer, and then (2) asks the LLM to expand each point in the skeleton. This opens a new opportunity for acceleration, because the answers to the separate points in stage 2 can be generated in parallel. This works both for local models whose weights are accessible to users (e.g., LLaMA) and for API-based models that can only be accessed through APIs (e.g., OpenAI&rsquo;s GPT-4).</p>
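<p>The two-stage pipeline above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the released implementation: <code>llm</code> is a stand-in for any black-box completion call (local model or API), and the prompt wording only approximates the templates used in the paper.</p>

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical prompt templates, loosely modeled on the SoT idea.
SKELETON_PROMPT = (
    "You are an organizer. Give only the skeleton (not the full content) "
    "for answering the question, as a short numbered list of points.\n"
    "Question: {question}\nSkeleton:"
)
POINT_PROMPT = (
    "Continue and only continue the writing of point {index}: {point}\n"
    "Question: {question}\nWrite 1-2 sentences."
)

def parse_skeleton(text):
    """Extract 'N. point' style lines from the stage-1 skeleton answer."""
    points = []
    for line in text.splitlines():
        line = line.strip()
        if line and line[0].isdigit():
            # Drop the leading "N." numbering, keep the point text.
            points.append(line.split(".", 1)[-1].strip())
    return points

def skeleton_of_thought(llm, question, max_workers=8):
    # Stage 1: one sequential call produces the skeleton of the answer.
    skeleton = llm(SKELETON_PROMPT.format(question=question))
    points = parse_skeleton(skeleton)
    # Stage 2: expand every point in parallel. For API-based models this is
    # simply concurrent requests; for local models it can be a batched decode.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        expansions = list(pool.map(
            lambda ip: llm(POINT_PROMPT.format(
                index=ip[0], point=ip[1], question=question)),
            enumerate(points, 1),
        ))
    # Reassemble the final answer in skeleton order.
    return "\n".join(
        f"{i}. {point} {detail}"
        for i, (point, detail) in enumerate(zip(points, expansions), 1)
    )
```

<p>Because stage 1 is short and stage 2 runs concurrently, end-to-end latency is roughly the skeleton time plus the time of the single longest point expansion, rather than the sum of all points, which is where the observed speed-ups come from.</p>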