Advances in run-time strategies for next-generation foundation models

An illustration of the additive contributions of Medprompt components to MedQA accuracy: zero-shot (81.7%), random few-shot (83.9%), random few-shot with chain-of-thought (87.3%), kNN few-shot with chain-of-thought (88.4%), and ensemble with choice shuffle (90.2%).

Frontier language models are advancing rapidly, delivering gains in accuracy and reliability that make generalist models highly effective in specialized domains. As part of our ongoing exploration of foundation model capabilities, we developed Medprompt last year, a novel approach that maximizes model performance on specialized domains and tasks without fine-tuning. Through multiphase prompting, Medprompt optimizes inference by identifying the most effective chain-of-thought (CoT) examples at run time and drawing on multiple calls to refine its output. When deployed with GPT-4, Medprompt achieved an impressive 90.2% accuracy on the MedQA benchmark (USMLE-style questions), outperforming all other methods at the time.
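
To make the mechanics concrete, here is a minimal Python sketch of the dynamic few-shot step, assuming a pool of training questions that each carry `question`, `cot`, and `answer` fields. The bag-of-words similarity and the prompt wording are stand-ins for the real embedding model and templates used in Medprompt; the full method then layers a choice-shuffle ensemble over multiple such calls, sketched later in this post.

```python
from collections import Counter
import math

def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a real pipeline would use a neural text-embedding model."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb + 1e-9)

def knn_exemplars(question: str, pool: list[dict], k: int = 5) -> list[dict]:
    """Pick the k pool questions (with stored chain-of-thought solutions) most similar to the test question."""
    q = embed(question)
    return sorted(pool, key=lambda ex: -cosine(embed(ex["question"]), q))[:k]

def build_prompt(question: str, options: list[str], exemplars: list[dict]) -> str:
    """Few-shot chain-of-thought prompt: solved exemplars first, the test question last."""
    shots = "\n\n".join(
        f"Question: {ex['question']}\nReasoning: {ex['cot']}\nAnswer: {ex['answer']}"
        for ex in exemplars
    )
    opts = "\n".join(f"- {o}" for o in options)
    return f"{shots}\n\nQuestion: {question}\n{opts}\nThink step by step, then state the best option."
```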

A line chart plotting MedQA test accuracy (y-axis) over time (x-axis). The OpenAI o1-preview model achieves the highest result at 96.0% accuracy, followed by Med-Gemini at 91.1%, GPT-4 (Medprompt) at 90.2%, Med-PaLM 2 at 86.5%, GPT-4 base at 86.1%, Med-PaLM at 67.2%, GPT-3.5 base at 60.2%, BioMedLM at 50.3%, DRAGON at 47.5%, BioLinkBERT at 45.1%, and PubMedBERT at 38.1%.
Figure 1. Comparative analyses of performance of multiple models on MedQA.

Less than a year later, our tests show that OpenAI's o1-preview model surpasses GPT-4 with Medprompt, reaching 96% on the same benchmark (Figure 1) without sophisticated prompt guidance and control. This advancement is driven by the model's integration of run-time strategies at its core, enabling state-of-the-art results on medical licensing exams in the United States and Japan, on the medical subsets of the Massive Multitask Language Understanding (MMLU) benchmark, and on nursing exams (NCLEX), as shown in Figure 2.

A spider chart plotting the performance of OpenAI o1-preview (0-shot ensemble) against GPT-4 (Medprompt) and GPT-4 (5-shot) on medical challenge problems. o1-preview achieves state-of-the-art results on MedQA US (4-option), JMLE-2024, MedMCQA Dev, MMLU Anatomy, MMLU Medical Genetics, MMLU Professional Medicine, MMLU College Biology, MMLU College Medicine, and NCLEX. GPT-4 (Medprompt) performed better than OpenAI o1-preview (0-shot ensemble) on MMLU Clinical Knowledge.
Figure 2. Comparisons on a wide range of medical challenge benchmarks.

These results are notable, prompting us to publish our recent study, findings, and analyses in From Medprompt to o1: Exploration of Run-Time Strategies for Medical Challenge Problems and Beyond. But the numbers are only part of the story. In this blog, we discuss prompting strategies for getting the most out of o1-preview models, other factors to consider, and directions forward for run-time strategies.

Is o1-preview “just” fancy prompting? 

The introduction of the OpenAI o1 model series marks a significant shift from prior GPT models. Unlike the GPT series, o1 models are trained with reinforcement learning (RL) techniques that enable them to "think" before generating outputs. While Medprompt relies on a cascade of GPT-4 calls at run time, guided by a multistage prompt, the o1 series builds this run-time reasoning directly into its RL-based design. This built-in capability allows the o1 models to significantly outperform even the best results achieved with GPT-4 and Medprompt. The performance gains come with a notable tradeoff: o1-preview's per-token cost was approximately six times that of GPT-4o at the time of our evaluation. While GPT-4o with Medprompt falls short of o1-preview's performance, the combination offers a more cost-effective alternative. These cost-benefit tradeoffs are highlighted in the following figure, with the x-axis presented on a logarithmic scale.

A line chart plotting accuracy on the MedQA test (y-axis) versus total cost on a logarithmic scale (x-axis). OpenAI o1-preview with 5x, 10x, and 15x ensembles hovers around a total cost of 1,000; o1-preview with Tailored Prompt, Minimal Prompt, Few-shot, and kNN Few-shot runs sits around 100. GPT-4o with Medprompt is just below 100; its kNN Few-shot CoT, Few-shot CoT, and Few-shot runs are at about 10; Zero-shot is at about 1. GPT-4 Turbo with Medprompt is at about 200; its kNN Few-shot CoT, Few-shot CoT, and Few-shot runs hover near 50; Zero-shot is near 5.
Figure 3. Pareto frontier showing accuracy versus total API cost (log scale) on the MedQA benchmark (1273 questions total). o1-preview (Sep 2024) is compared with GPT-4o (Aug 2024) and GPT-4 Turbo (Nov 2023).
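
For a rough sense of how points on a plot like Figure 3 come about, the sketch below multiplies per-call token counts by per-token prices and by the number of calls per question. Every number in it is an illustrative assumption except the roughly 6x price ratio noted above; substitute measured token usage and current pricing for real estimates.

```python
# Rough cost model for the accuracy-vs-cost tradeoff in Figure 3.
# All token counts and prices below are illustrative placeholders, except that
# o1-preview is priced at roughly 6x GPT-4o per token, as noted in the text.

GPT4O_PRICE_PER_1K = 0.005              # assumed placeholder, USD per 1K tokens
O1_PRICE_PER_1K = 6 * GPT4O_PRICE_PER_1K

def run_cost(tokens_per_call: int, calls_per_question: int,
             n_questions: int, price_per_1k: float) -> float:
    """Total API cost for one benchmark configuration."""
    return tokens_per_call / 1000 * price_per_1k * calls_per_question * n_questions

n = 1273  # MedQA test questions, per the Figure 3 caption

configs = {
    # name: (tokens per call, calls per question, price per 1K tokens) -- token counts are assumptions
    "GPT-4o zero-shot":        (800,  1,  GPT4O_PRICE_PER_1K),
    "GPT-4o Medprompt (5x)":   (2500, 5,  GPT4O_PRICE_PER_1K),
    "o1-preview minimal":      (3000, 1,  O1_PRICE_PER_1K),
    "o1-preview ensemble 15x": (3000, 15, O1_PRICE_PER_1K),
}

for name, (tokens, calls, price) in configs.items():
    print(f"{name:26s} ~${run_cost(tokens, calls, n, price):,.0f}")
```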

Can we prompt engineer o1-preview?

The o1-preview model exhibits distinct run-time behaviors compared to the GPT series. While some of our more dynamic prompting strategies performed better than expected with o1-preview models, our most tried-and-true strategy, few-shot prompting, was anything but consistent throughout our evaluation. Figure 4 captures specific performance results for Tailored Prompt, Ensembling, and Few-Shot Prompting on o1-preview. Here's a summary of our findings:

  1. Tailored Prompt: While minimal prompting (a brief, one-sentence task description followed by the question) offered strong baseline performance, detailed, tailored task descriptions were best for eliciting accurate responses.
  2. Ensembling: Generating multiple answers per question and taking a majority vote across the different reasoning paths boosted reliability, and shuffling the answer choices between runs produced richer reasoning chains and improved outcomes. Ensembling continues to yield consistent performance improvements (a minimal sketch of this kind of zero-shot ensembling appears after Figure 4).
  3. Few-Shot Prompting: Guiding the model with a few examples produced inconsistent results and, on average, decreased performance relative to the zero-shot baseline, in contrast to the reliable gains few-shot examples provide for GPT models.

Three charts show the accuracy of o1-preview with Tailored Prompt, Ensemble (15x), and 5-shot kNN relative to an average baseline across medical benchmarks. Tailored Prompt improves accuracy from 94.2% to 94.7%; Ensemble (15x) improves accuracy from 94.2% to 95.5%; 5-shot kNN decreases accuracy from 94.2% to 93.7%.
Figure 4. Tests of different prompting strategies across benchmark datasets.
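
As a concrete illustration of the first two strategies, here is a minimal sketch of a tailored zero-shot prompt paired with a shuffled-choice ensemble and majority vote. The prompt text, the `ask_model` helper, and the ensemble size are illustrative assumptions rather than the exact templates from the paper (those appear in the paper's appendix).

```python
import random
from collections import Counter

MINIMAL_PROMPT = "Answer the following multiple-choice medical question.\n\n{body}"

TAILORED_PROMPT = (
    "You are a careful medical expert taking a licensing exam. "
    "Read the question, weigh each option against current clinical knowledge, "
    "and reply with the single best option's text on the final line.\n\n{body}"
)

def ask_model(prompt: str) -> str:
    """Stub for a single zero-shot model call returning the chosen option text (assumption)."""
    raise NotImplementedError("Replace with a call to your model API of choice.")

def zero_shot_ensemble(question: str, options: list[str],
                       template: str = TAILORED_PROMPT, runs: int = 15) -> str:
    """Run the same question several times with shuffled options and majority-vote the answers."""
    votes = Counter()
    for _ in range(runs):
        shuffled = random.sample(options, len(options))    # choice shuffling per run
        body = question + "\n" + "\n".join(f"- {o}" for o in shuffled)
        votes[ask_model(template.format(body=body))] += 1  # vote on option text, not position
    return votes.most_common(1)[0][0]
```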

Do results stand in another language? 

Two bar charts measuring accuracy (y-axis) on short and long questions (x-axis) from the Japanese Medical Licensing Examination. For o1-preview (0-shot ensemble), the short-question bar is slightly higher than the long-question bar; for o1-preview (0-shot), the short-question bar is about two points lower; for GPT-4o (Medprompt), the short-question bar is about a point higher; for GPT-4o (0-shot), the short-question bar is about a point lower.
Figure 5. JMLE-2024: National medical licensing exam held in Japan (Feb 2024).

We expanded our research to include a new multilingual benchmark based on the Japanese national medical licensing exam. The JMLE (Japanese Medical Licensing Examination) is written in Japanese and was administered in February 2024, after the o1-preview model's knowledge cutoff. Even without translating the exam into English, the o1-preview model achieved a remarkable 98.2% accuracy (Figure 5), well above the exam's minimum passing score of approximately 80%.

Do reasoning tokens improve performance? 

For fun, we conducted tests to determine whether increasing the number of reasoning tokens could improve performance. We found that adjusting the prompt let us consistently increase the number of reasoning tokens used by o1-preview, and that the increase was generally correlated with improved performance, as shown in Figure 6.

A chart plotting the impact of reasoning tokens on accuracy under a quick-response prompt versus an extended-reasoning prompt:

Benchmark               Quick Response Prompt   Extended Reasoning Prompt
JMLE                    95.3%                   96.7%
MMLU                    94.9%                   94.7%
MedQA                   94.3%                   95.1%
USMLE Sample Exam       92.6%                   93.1%
USMLE Self Assessment   91.8%                   92.2%
Figure 6. The effect of two prompting strategies that elicit variable length reasoning chains across benchmark datasets.
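
Readers who want to try this kind of comparison could start from a sketch like the one below. It assumes the OpenAI Python SDK and that the response exposes hidden reasoning-token counts under `usage.completion_tokens_details.reasoning_tokens`; the two prompt suffixes are illustrative, not the exact wording used in our experiments.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

QUICK_SUFFIX = "Answer quickly with minimal deliberation."
EXTENDED_SUFFIX = "Take your time and reason through the problem very carefully before answering."

def answer_with_reasoning_tokens(question: str, suffix: str, model: str = "o1-preview"):
    """Ask one question and report how many hidden reasoning tokens the model spent."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": f"{question}\n\n{suffix}"}],
    )
    details = getattr(response.usage, "completion_tokens_details", None)
    reasoning_tokens = getattr(details, "reasoning_tokens", None) if details else None
    return response.choices[0].message.content, reasoning_tokens

# Example (commented out to avoid an accidental API call):
# answer, quick_tokens = answer_with_reasoning_tokens(question, QUICK_SUFFIX)
# answer, extended_tokens = answer_with_reasoning_tokens(question, EXTENDED_SUFFIX)
```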

What’s the takeaway? 

Bottom line: There’s a little something for everyone when it comes to run-time strategies. We’re excited by the performance gains from GPT models to o1-preview models. While these improvements are significant, so is the cost. For those needing proven accuracy on a budget, Medprompt with calls to GPT-4 remains a viable option for medicine and beyond. We summarize the relative performance of prompting strategies in Figure 7 to help determine the best option, or check out the paper for a detailed breakdown of every dataset, experimental configuration, and prompt template.

A matrix showing the relative performance of prompting strategies over the zero-shot baseline on each benchmark (first row: baseline accuracy; subsequent rows: change relative to baseline):

Strategy                           JMLE    MMLU    MedMCQA  MedQA   USMLE Sample Exam  USMLE Self Assessment
Zero-shot (baseline)               95.6%   94.6%   81.4%    94.9%   94.0%              91.8%
5-shot Random                      +1.2%   -1.1%    0.0%    -1.4%   -0.4%              -1.0%
5-shot KNN                         +0.6%   -0.1%   +1.2%    -2.2%   -0.3%              -0.6%
Bootstrap Ensemble (5x)            +1.5%   +0.1%   +1.3%    +0.7%   +1.3%              +1.0%
Bootstrap Ensemble (10x)           +1.4%   +0.6%   +1.5%    +0.7%   +1.3%              +1.1%
Ensemble (15x)                     +1.5%   +0.6%   +2.0%    +1.1%   +2.0%              +1.3%
Tailored Prompt                    +1.6%   +0.4%   +0.9%    +0.2%   +0.0%              +0.4%
Tailored Bootstrap Ensemble (5x)   +2.2%   +0.7%   +1.8%    +0.8%   +0.9%              +1.1%
Tailored Bootstrap Ensemble (10x)  +2.3%   +0.7%   +2.1%    +0.9%   +0.9%              +1.2%
Tailored Ensemble (15x)            +2.5%   +0.4%   +2.6%    +1.1%   +0.9%              +1.4%
Figure 7. Heatmap showing absolute accuracy and relative performance over the baseline zero-shot prompt (in parentheses) across all benchmark datasets.

Anything more to consider?

We highlighted several considerations in the paper that are worth checking out. Here are three opportunities that are top of mind:

  • Research on run-time strategies. The research community has largely relied on boosting model capabilities with data, compute, and model size, achieving predictable gains by way of scaling laws. A promising new direction is inference-time scaling: investing in additional computation and machinery to guide inference at run time. In the paper, we highlight opportunities to direct run-time allocations to boost efficiency, accuracy, and intellectual capability, including meta-reasoning and reflection in real time and learning during the “idle” time between problem-solving sessions. We see a great deal of opportunity for new research and development on real-time and “offline” reasoning, learning, and reflection.
  • Benchmark saturation. With the rapid advancement of state-of-the-art models, many existing medical benchmarks are reaching saturation: models now perform extremely well on standing medical competency challenges that were considered extremely difficult just a few years ago. Current benchmarks, such as the USMLE and JMLE, were designed to assess the performance of medical students and clinicians and are increasingly inadequate for evaluating cutting-edge AI models. To deepen our understanding of these models and to guide research, we need to design more challenging medical benchmarks.
  • From benchmarks to clinical applications. While benchmarks offer valuable insights into performance and accuracy, they often fail to capture the complexities and nuances of real-world clinical decision making and healthcare delivery more broadly. Conducting clinical trials to rigorously evaluate the impact of AI applications on patient care is far more difficult than benchmarking models against challenge problems drawn from medical competency exams. Yet studies of AI deployments in realistic clinical settings are essential for understanding model capabilities and for guiding the effective integration of AI into healthcare.
