{"id":991758,"date":"2023-12-12T06:40:31","date_gmt":"2023-12-12T14:40:31","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?p=991758"},"modified":"2023-12-14T13:13:08","modified_gmt":"2023-12-14T21:13:08","slug":"steering-at-the-frontier-extending-the-power-of-prompting","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/steering-at-the-frontier-extending-the-power-of-prompting\/","title":{"rendered":"Steering at the Frontier: Extending the Power of Prompting"},"content":{"rendered":"\n
\"three<\/figure>\n\n\n\n

We're seeing exciting capabilities of frontier foundation models, including intriguing powers of abstraction, generalization, and composition across numerous areas of knowledge and expertise. Even seasoned AI researchers have been impressed with the ability to steer the models with straightforward, zero-shot prompts. Beyond basic, out-of-the-box prompting, we've been exploring new prompting strategies, showcased in our Medprompt work, to evoke the powers of specialists.

Today, we're sharing information on Medprompt and other approaches to steering frontier models in promptbase, a collection of resources on GitHub. Our goal is to give engineers and customers the information and tools they need to evoke the best performance from foundation models. We'll start by including scripts that enable replication of our results using the prompting strategies presented here, and we'll be adding more sophisticated general-purpose tools and information over the coming weeks.

As an illustration of the capabilities of frontier models, and of the opportunity to harness and extend recent efforts to reach state-of-the-art (SoTA) results by steering GPT-4, we review SoTA results on the benchmarks that Google chose for evaluating Gemini Ultra. Our end-to-end exploration, prompt design, and computation of performance took just a couple of days.


Let's focus on the well-known MMLU (Measuring Massive Multitask Language Understanding) challenge, which was established as a test of the general knowledge and reasoning powers of large language models. The complete MMLU benchmark contains tens of thousands of challenge problems of different forms across 57 areas, from basic mathematics to United States history, law, computer science, engineering, medicine, and more.
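For concreteness, here is a minimal sketch of how a multiple-choice question in the style of MMLU can be rendered as a single prompt. The subject, question, choices, and exact wording below are invented for illustration; they are not benchmark items and are not the prompts used in our evaluation scripts.

```python
# Illustrative only: render an MMLU-style multiple-choice question as a prompt.
# The subject, question, and choices are made up; they are not benchmark items.

def format_mc_prompt(subject: str, question: str, choices: list[str]) -> str:
    """Build a zero-shot prompt asking for the letter of the correct choice."""
    letters = "ABCD"
    lines = [f"The following is a multiple-choice question about {subject}.",
             "",
             question]
    lines += [f"{letters[i]}. {choice}" for i, choice in enumerate(choices)]
    lines.append("Answer with the letter of the correct choice.")
    return "\n".join(lines)

print(format_mc_prompt(
    "college physics",
    "Which quantity is conserved in a perfectly elastic collision?",
    ["Kinetic energy only",
     "Momentum only",
     "Both kinetic energy and momentum",
     "Neither kinetic energy nor momentum"],
))
```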

In our Medprompt study, we focused on medical challenge problems but found that the prompting strategy has more general-purpose application, and we examined its performance on several out-of-domain benchmarks despite the work's roots in medical challenges. Today, we report that steering GPT-4 with a modified version of Medprompt achieves the highest score ever achieved on the complete MMLU.

\"A
Figure1. Reported performance of multiple models and methods on the MMLU benchmark.<\/figcaption><\/figure>\n\n\n\n

In our explorations, we initially found that applying the original Medprompt to GPT-4 on the comprehensive MMLU achieved a score of 89.1%. By increasing the number of ensembled calls in Medprompt from five to 20, performance by GPT-4 on the MMLU further increased to 89.56%. To achieve a new SoTA on MMLU, we extended Medprompt to Medprompt+ by adding a simpler prompting method and formulating a policy for deriving a final answer by integrating outputs from both the base Medprompt strategy and the simple prompts. The synthesis of a final answer is guided by a control strategy governed by GPT-4 and inferred confidences of candidate answers. More details on Medprompt+ are provided in the promptbase repo. A related method for coupling complex and simple queries was harnessed by the Google Gemini team. GPT-4 steered with the modified Medprompt+ reaches a record score of 90.10%. We note that Medprompt+ relies on accessing confidence scores (logprobs) from GPT-4. These are not publicly available via the current API but will be enabled for all in the near future.
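The exact control policy is documented in the promptbase repo; the snippet below is only a rough, self-contained sketch of the general idea of integrating a candidate answer from the ensembled strategy with one from a simple prompt using inferred confidences. The Candidate fields, the 0.85 threshold, and the decision rule are assumptions made for illustration, not the published Medprompt+ policy.

```python
# Rough sketch (not the actual Medprompt+ policy): combine an answer from the
# ensembled "complex" strategy with an answer from a simple prompt, using
# model-inferred confidences (e.g., derived from logprobs). The 0.85 threshold
# and the Candidate fields are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Candidate:
    answer: str        # the choice the strategy settled on, e.g. "C"
    confidence: float  # probability mass the model assigned to that choice

def integrate(complex_cand: Candidate, simple_cand: Candidate,
              simple_override: float = 0.85) -> str:
    """Default to the ensembled answer; defer to the simple prompt only when
    it agrees or is markedly more confident."""
    if simple_cand.answer == complex_cand.answer:
        return complex_cand.answer
    if (simple_cand.confidence >= simple_override
            and simple_cand.confidence > complex_cand.confidence):
        return simple_cand.answer
    return complex_cand.answer

print(integrate(Candidate("B", 0.62), Candidate("C", 0.91)))  # prints "C"
```

In practice, the inferred confidences come from token logprobs returned by the model, which is why Medprompt+ depends on logprob access.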

While systematic prompt engineering can yield maximal performance, we continue to explore the out-of-the-box performance of frontier models with simple prompts. It's important to keep an eye on the native power of GPT-4 and how we can steer the model with zero- or few-shot prompting strategies. As demonstrated in Table 1, starting with simple prompting is useful for establishing baseline performance before layering in more sophisticated and expensive methods.

| Benchmark | GPT-4 Prompt | GPT-4 Results | Gemini Ultra Results |
| --- | --- | --- | --- |
| MMLU | Medprompt+ | 90.10% | 90.04% |
| GSM8K | Zero-shot | 95.27% | 94.4% |
| MATH | Zero-shot | 68.42% | 53.2% |
| HumanEval | Zero-shot | 87.8% | 74.4% |
| BIG-Bench-Hard | Few-shot + CoT* | 89.0% | 83.6% |
| DROP | Zero-shot + CoT | 83.7% | 82.4% |
| HellaSwag | 10-shot** | 95.3%** | 87.8% |

* Followed the norm of evaluations and used standard few-shot examples from dataset creators.
** Source: Google.

Table 1: Model, strategies, and results
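As a concrete illustration of the simple baselines referenced in Table 1, the snippet below shows how a zero-shot chain-of-thought prompt and a few-shot prompt can be assembled. The arithmetic question and the phrasing are invented for the example; the exact prompts used in our runs are in the promptbase repo.

```python
# Illustrative only: the kinds of simple baseline prompts referenced in Table 1.
# The example question is made up; the evaluation prompts live in promptbase.

def zero_shot_cot(question: str) -> str:
    # Zero-shot chain of thought: ask the model to reason before answering.
    return f"{question}\nLet's think step by step, then state the final answer."

def few_shot(examples: list[tuple[str, str]], question: str) -> str:
    # Few-shot: prepend worked (question, answer) pairs before the test question.
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
    return f"{shots}\n\nQ: {question}\nA:"

print(zero_shot_cot(
    "A pen costs $3 and a notebook costs twice as much as the pen. "
    "How much do two notebooks and one pen cost together?"
))
```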

We encourage you to check out the promptbase repo on GitHub for more details about prompting techniques and tools. This area of work is evolving, with much to learn and share. We're excited about the directions and possibilities ahead.
